CN108108813A

CN108108813A - A kind of method that big classification deep learning GPU accelerates parallel

Info

Publication number: CN108108813A
Application number: CN201711251410.3A
Authority: CN
Inventors: 石宇; 徐卉; 程诚; 周祥东
Original assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Current assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2018-06-01

Abstract

The present invention provides a method for GPU parallel acceleration of large-scale deep learning, comprising: using a model to train the model parameters of the softmax layer in the deep neural network structure in parallel, each GPU trains its own model fragmentation, and the softmax layer of each GPU Through the data characteristics of interactive model parameters, deep learning is completed; the present invention adopts a hybrid architecture, that is, all levels before the softmax layer still use data parallelism, and the softmax layer adopts model parallelism, which breaks through the parallelism of large-scale deep learning The bottleneck of computing overcomes the problem of high communication cost and communication time spent on parameter interaction on the last layer of full-link layer in the deep neural network structure. Improve model learning efficiency and reduce GPU usage.

Description

A Method for GPU Parallel Acceleration of Large Category Deep Learning

技术领域technical field

本发明涉及计算机及其应用领域，尤其涉及一种大类别深度学习GPU并行加速的方法。The invention relates to the field of computers and applications thereof, in particular to a method for parallel acceleration of large-scale deep learning GPUs.

背景技术Background technique

目前，深度学习在几个主要领域都获得了突破性的进展：在语音识别领域，在图像识别领域，在自然语言处理领域。可以说到目前为止，深度学习是最接近人类大脑的智能学习方法。但是深度学习模型参数多，计算量大，训练数据的规模也更大，需要消耗很多计算资源。如果可以让训练加速，工作效率会明显提升，对于大规模的训练数据和模型来说，更可以将难以完成的任务变成可能。At present, deep learning has achieved breakthroughs in several major fields: in the field of speech recognition, in the field of image recognition, and in the field of natural language processing. It can be said that so far, deep learning is the closest intelligent learning method to the human brain. However, the deep learning model has many parameters, a large amount of calculation, and a larger scale of training data, which consumes a lot of computing resources. If the training can be accelerated, the work efficiency will be significantly improved, and for large-scale training data and models, difficult tasks can be made possible.

随着GPU的大规模并行架构支持的不断推进，面向通用计算的GPU(General-Purposed GPU,GPGPU)已成为加速可并行应用程序的重要手段。得益于GPU众核(many-core)体系结构，程序在GPU系统上的运行速度相较于单核CPU往往提升几十倍乃至上千倍。利用GPU来训练深度神经网络，可以充分发挥其数以千计计算核心的高效并行计算能力，在使用海量训练数据的场景下，所耗费的时间大幅缩短，占用的服务器也更少。With the continuous advancement of GPU's massively parallel architecture support, general-purpose computing-oriented GPU (General-Purposed GPU, GPGPU) has become an important means of accelerating parallel applications. Thanks to the many-core architecture of the GPU, the running speed of the program on the GPU system is often dozens or even thousands of times higher than that of a single-core CPU. Using GPUs to train deep neural networks can give full play to the efficient parallel computing capabilities of thousands of computing cores. In scenarios where massive training data is used, the time spent is greatly shortened and fewer servers are occupied.

目前大部分服务器都有8个或更多的GPU。原则上，使用更多的GPU可以大幅度地提升效率，但实现起来有一定困难，处理器之间需要交互大量的数据，并且花费更多的时间进行通信而非计算。传统的深度学习并行方法都是数据并行，将数据划分为几个分片，每个GPU处理其中一份，并且进行参数交互。但对于类别数很多的数据，在深度神经网络结构中的最后一层全链接层上，进行参数交互的通信成本太高，花费的通信时间远高过参数计算的时间，成为大类别深度学习并行运算的瓶颈，因此，需要一种新的技术手段，能够在保持原有深度学习效果的同时，大幅度提升模型学习效率，减少GPU占用率。Most servers today have 8 or more GPUs. In principle, using more GPUs can greatly improve efficiency, but it is difficult to implement. A large amount of data needs to be exchanged between processors, and more time is spent on communication rather than calculation. The traditional deep learning parallel methods are data parallel, which divides the data into several slices, and each GPU processes one of them, and performs parameter interaction. However, for data with a large number of categories, the communication cost for parameter interaction is too high on the last full-link layer in the deep neural network structure. Therefore, a new technical means is needed, which can greatly improve the model learning efficiency and reduce the GPU occupancy rate while maintaining the original deep learning effect.

发明内容Contents of the invention

鉴于以上所述现有技术的缺点，本发明提供一种大类别深度学习GPU并行加速的方法，以解决上述技术问题。In view of the shortcomings of the prior art described above, the present invention provides a method for GPU parallel acceleration of large-scale deep learning to solve the above technical problems.

本发明提供的一种大类别深度学习GPU并行加速的方法，包括：A method for parallel acceleration of large-scale deep learning GPUs provided by the present invention includes:

采用模型并行对深度神经网络结构中的softmax层的模型参数进行训练；Using model parallelism to train the model parameters of the softmax layer in the deep neural network structure;

每个GPU训练各自的模型分片，获取模型参数的数据特征；Each GPU trains its own model slice and obtains the data characteristics of the model parameters;

各GPU的softmax层之间通过交互模型参数的数据特征，完成深度学习。The softmax layer of each GPU completes deep learning by interacting with the data characteristics of model parameters.

进一步，采用混合架构对深度神经网络结构中的模型参数进行训练，所述混合结构包括采用模型并行对深度神经网络结构中的softmax层的模型参数进行训练，采用数据并行对深度神经网络结构中的其他层的模型参数进行训练。Further, a hybrid architecture is used to train the model parameters in the deep neural network structure. The hybrid structure includes using model parallelism to train the model parameters of the softmax layer in the deep neural network structure, and using data parallelism to train the model parameters in the deep neural network structure. The model parameters of other layers are trained.

进一步，所述模型并行包括将完整模型划分为若干模型分片，每个模型分片分别在不同的GPU上进行参数训练。Further, the model parallelization includes dividing the complete model into several model slices, and each model slice performs parameter training on a different GPU.

进一步，将深度神经网络结构中的softmax层划分为若干个模型分片，分别在不同的GPU上进行参数训练，每个GPU计算各自的模型分片，并获取对应模型分片的参数数据特征，所述模型分片的数量与GPU的数量一致。Further, the softmax layer in the deep neural network structure is divided into several model slices, and parameter training is performed on different GPUs, each GPU calculates its own model slice, and obtains the parameter data characteristics of the corresponding model slice, The number of model slices is consistent with the number of GPUs.

进一步，所述数据并行包括，根据GPU数量对训练数据进行切分，通过不同的GPU对切分后的训练数据分别进行训练，获取训练数据特征组，各GPU之间通过训练数据特征数组进行交互，所述训练数据为传输图像数据。Further, the data parallelism includes segmenting the training data according to the number of GPUs, training the segmented training data through different GPUs, obtaining the training data feature group, and interacting between each GPU through the training data feature array , the training data is transmission image data.

进一步，每个GPU完成自己的模型分片计算后，将所有GPU上的模型分配组合为一个完整的模型。Further, after each GPU completes its own model slice calculation, the model assignments on all GPUs are combined into a complete model.

进一步，通过all-gather算法，将数据传输至每块GPU上。Further, the data is transmitted to each GPU through the all-gather algorithm.

本发明的有益效果：本发明中的大类别深度学习GPU并行加速的方法，采用混合式架构，突破了大类别深度学习并行运算的瓶颈，克服了在深度神经网络结构中的最后一层全链接层上，进行参数交互的通信成本和花费的通信时间过高的问题，能够在保持原有深度学习效果的同时，大幅度提升模型学习效率，减少GPU占用率。Beneficial effects of the present invention: the method for GPU parallel acceleration of large-scale deep learning in the present invention adopts a hybrid architecture, breaks through the bottleneck of parallel computing for large-scale deep learning, and overcomes the last full link in the deep neural network structure At the layer level, the communication cost and communication time for parameter interaction are too high, which can greatly improve the model learning efficiency and reduce the GPU occupancy rate while maintaining the original deep learning effect.

附图说明Description of drawings

图1是本发明实施例中的大类别深度学习GPU并行加速的方法流程示意图。FIG. 1 is a schematic flowchart of a method for GPU parallel acceleration of large-scale deep learning in an embodiment of the present invention.

图2是本发明实施例中的大类别深度学习GPU并行加速的方法原理示意图。Fig. 2 is a schematic diagram of the principle of the GPU parallel acceleration method for large-scale deep learning in an embodiment of the present invention.

具体实施方式Detailed ways

以下通过特定的具体实例说明本发明的实施方式，本领域技术人员可由本说明书所揭露的内容轻易地了解本发明的其他优点与功效。本发明还可以通过另外不同的具体实施方式加以实施或应用，本说明书中的各项细节也可以基于不同观点与应用，在没有背离本发明的精神下进行各种修饰或改变。需说明的是，在不冲突的情况下，以下实施例及实施例中的特征可以相互组合。Embodiments of the present invention are described below through specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific implementation modes, and various modifications or changes can be made to the details in this specification based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, in the case of no conflict, the following embodiments and features in the embodiments can be combined with each other.

需要说明的是，以下实施例中所提供的图示仅以示意方式说明本发明的基本构想，遂图式中仅显示与本发明中有关的组件而非按照实际实施时的组件数目、形状及尺寸绘制，其实际实施时各组件的型态、数量及比例可为一种随意的改变，且其组件布局型态也可能更为复杂。It should be noted that the diagrams provided in the following embodiments are only schematically illustrating the basic ideas of the present invention, and only the components related to the present invention are shown in the diagrams rather than the number, shape and shape of the components in actual implementation. Dimensional drawing, the type, quantity and proportion of each component can be changed arbitrarily during actual implementation, and the component layout type may also be more complicated.

如图1所示，本实施例中的大类别深度学习GPU并行加速的方法，包括：As shown in Figure 1, the method for parallel acceleration of large-scale deep learning GPU in this embodiment includes:

传统深度学习采用的GPU数据并行，假设一个深度神经网络结构有7层，共有4个GPU，图像数据的batch_size为64，图像特征的维度为128，类别数为100万，最后一层softmax层需要处理的数据参数如下：每个GPU处理的数据参数量为64*128，模型参数量为128*100W，每个GPU针对不同的数据，在同一模型结构上进行训练，GPU间需要交互的参数量为4*128*100W。即，每个GPU利用部分数据，训练完整的模型，但由于GPU间数据有差别，所以最后需要将每个GPU上计算的模型参数汇总，由此可见，百万级的参数量已经严重影响模型的学习效率，在实际的工作过程中，深度学习模型参数多，计算量大，训练数据的规模也更大，需要消耗很多计算资源，对于类别数很多的数据，在深度神经网络结构中的最后一层全链接层上，花费的通信时间远高过参数计算的时间。The GPU data parallelism used in traditional deep learning assumes that a deep neural network structure has 7 layers, a total of 4 GPUs, the batch_size of image data is 64, the dimension of image features is 128, and the number of categories is 1 million. The last layer of softmax layer needs The data parameters to be processed are as follows: the amount of data parameters processed by each GPU is 64*128, and the amount of model parameters is 128*100W. Each GPU is trained on the same model structure for different data, and the amount of parameters that need to be interacted between GPUs It is 4*128*100W. That is, each GPU uses part of the data to train a complete model, but due to differences in data between GPUs, it is necessary to summarize the model parameters calculated on each GPU in the end. It can be seen that the number of million-level parameters has seriously affected the model In the actual work process, the deep learning model has many parameters, the amount of calculation is large, and the scale of training data is also larger, which needs to consume a lot of computing resources. For data with a large number of categories, the last step in the deep neural network structure On a full-link layer, the communication time spent is much higher than the parameter calculation time.

在本实施例中，采用混合架构对深度神经网络结构进行训练，对于网络结构中的其他层，需要交互的参数量不多，在GPU并行过程中，产生的通信成本不高，可以采用数据并行方法，本实施例在最后一层softmax层中采用模型并行的方式，由于模型参数的数量与数据类别数息息相关，针对大类别的深度学习，最后一层的参数交互就成为了算法性能的瓶颈。本实施例中的混合架构将最后一层softmax层由数据并行改为模型并行，可以大幅度提升算法性能，本实施例利用GPU的特性，将softmax层改为模型并行，即每块GPU在softmax层不再将模型参数共享给其他GPU，而是将参数特征通信给其他GPU，大大降低了通信成本，每块GPU不是拥有完全的模型参数，而是只有各自计算的这一部分参数，这部分参数称为一个模型分片。In this embodiment, a hybrid architecture is used to train the deep neural network structure. For other layers in the network structure, there are not many parameters that need to be interacted with. In the GPU parallel process, the communication cost generated is not high, and data parallelism can be used. Method, this embodiment uses model parallelism in the last layer of softmax layer. Since the number of model parameters is closely related to the number of data categories, for deep learning of large categories, the parameter interaction of the last layer becomes the bottleneck of algorithm performance. The hybrid architecture in this embodiment changes the last layer of softmax layer from data parallel to model parallel, which can greatly improve the performance of the algorithm. This embodiment uses the characteristics of the GPU to change the softmax layer to model parallel, that is, each GPU is at softmax The layer no longer shares the model parameters with other GPUs, but communicates the parameter features to other GPUs, which greatly reduces the communication cost. Each GPU does not have complete model parameters, but only this part of the parameters calculated by itself. This part of the parameters Called a model shard.

模型并行包括将完整模型划分为若干分片，每个分片分别在不同的GPU上进行运行，在本实施例中，将深度神经网络结构中的softmax层划分为若干个模型分片，分别在不同的GPU上进行参数训练，每个GPU计算各自的模型分片，并获取对应模型分片的参数数据特征，模型分片的数量与GPU的数量一致，每个GPU完成自己的模型分片计算后，再将所有GPU上的模型分配组合为一个完整的模型。Model parallelism includes dividing the complete model into several slices, and each slice is run on a different GPU respectively. In this embodiment, the softmax layer in the deep neural network structure is divided into several model slices, respectively in Parameter training is performed on different GPUs. Each GPU calculates its own model slice and obtains the parameter data characteristics of the corresponding model slice. The number of model slices is consistent with the number of GPUs. Each GPU completes its own model slice calculation. After that, the model assignments on all GPUs are combined into a complete model.

如图2所示，本实施例在softmax层之前先将各GPU上的数据(即图像特征)通过all-gather的方式通信，达到每块GPU上都有全部的数据信息。接下来再通过模型并行的方式来训练softmax层。在本实施例中，以4核GPU为例，将最后一层改为模型并行，则将一个完整模型划分为4个模型分片，每个模型分片分别在4个GPU上进行计算，各个GPU不再需要进行模型参数的通信，而是通过将参数特征通信给其他GPU。本实施例通过采用GPU的all-gather算法，使每个GPU上都能获得所有的数据信息。这样，每个GPU都可以根据所有的数据信息来训练自己的模型分片，保证了模型的每个部分都是由所有的数据学习得到的，在本实施例中，A,B,C,D视为每个GPU上处理的数据部分(图片特征)，数据量即为上述的64*128，通过all-gather算法，将数据部分(即图片特征)传输至每块GPU上，以保证4个GPU上的数据完全一致。每个GPU只计算自己的模型分片，而传输的信息由模型参数变为特征参数，数据量上由4*128*100W变为4*64*128。最后再将4个GPU上的模型分片组合为一个完整的模型，极大幅度的降低了softmax层通信的数据量，本发明利用GPU的特征对深度学习的架构进行优化，尤其适用于大类别情况，通过将深度学习架构优化成混合模式，从而大大提高了实际应用中学习效率。As shown in FIG. 2 , in this embodiment, the data (that is, image features) on each GPU is communicated in an all-gather manner before the softmax layer, so that each GPU has all data information. Next, the softmax layer is trained in a model-parallel manner. In this embodiment, taking a 4-core GPU as an example, changing the last layer to model parallelism, a complete model is divided into 4 model slices, and each model slice is calculated on 4 GPUs, each GPUs no longer need to communicate model parameters, but by communicating parameter features to other GPUs. In this embodiment, all data information can be obtained on each GPU by using the all-gather algorithm of the GPU. In this way, each GPU can train its own model slices according to all data information, ensuring that each part of the model is learned from all data. In this embodiment, A, B, C, D It is regarded as the data part (picture feature) processed on each GPU, and the amount of data is the above 64*128. Through the all-gather algorithm, the data part (ie picture feature) is transmitted to each GPU to ensure 4 The data on the GPU is completely consistent. Each GPU only calculates its own model slices, and the transmitted information changes from model parameters to feature parameters, and the amount of data changes from 4*128*100W to 4*64*128. Finally, the model slices on the 4 GPUs are combined into a complete model, which greatly reduces the amount of data in the softmax layer communication. The present invention uses the characteristics of the GPU to optimize the architecture of deep learning, especially for large categories. In this case, by optimizing the deep learning architecture into a hybrid mode, the learning efficiency in practical applications is greatly improved.

上述实施例仅例示性说明本发明的原理及其功效，而非用于限制本发明。任何熟悉此技术的人士皆可在不违背本发明的精神及范畴下，对上述实施例进行修饰或改变。因此，举凡所属技术领域中具有通常知识者在未脱离本发明所揭示的精神与技术思想下所完成的一切等效修饰或改变，仍应由本发明的权利要求所涵盖。The above-mentioned embodiments only illustrate the principles and effects of the present invention, but are not intended to limit the present invention. Anyone skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas disclosed in the present invention shall still be covered by the claims of the present invention.

Claims

1. A method for large-scale deep learning GPU parallel acceleration, characterized in that, comprising:

Using model parallelism to train the model parameters of the softmax layer in the deep neural network structure;

Each GPU trains its own model slice and obtains the data characteristics of the model parameters;

The softmax layer of each GPU completes deep learning by interacting with the data characteristics of model parameters.

2. the method for large-scale deep learning GPU parallel acceleration according to claim 1, is characterized in that, adopts hybrid structure to train the model parameter in deep neural network structure, and described hybrid structure comprises adopting model parallel to deep neural network The model parameters of the softmax layer in the structure are trained, and the model parameters of other layers in the deep neural network structure are trained in parallel with data.

3. The method for GPU parallel acceleration of large-scale deep learning according to claim 2, wherein the model parallelism includes dividing the complete model into several model slices, and each model slice is performed on different GPUs respectively. parameter training.

4. the method for the parallel acceleration of large-scale deep learning GPU according to claim 3, is characterized in that, the softmax layer in deep neural network structure is divided into several model slices, carries out parameter training on different GPUs respectively, Each GPU calculates its own model slice, and obtains the parameter data features of the corresponding model slice, and the number of the model slices is consistent with the number of GPUs.

5. The method for GPU parallel acceleration of large-scale deep learning according to claim 4, wherein the data parallelism comprises, segmenting the training data according to the number of GPUs, and using different GPUs to segment the training data The training is performed separately to obtain the training data feature group, and the GPUs interact through the training data feature array, and the training data is the transmission image data.

6. The method for parallel acceleration of large-scale deep learning GPUs according to claim 5, wherein after each GPU completes its own model slicing calculation, the model assignments on all GPUs are combined into a complete model.

7. The method according to claim 4, characterized in that the data is transmitted to each GPU through an all-gather algorithm.