CN108108813A - Method for parallel GPU acceleration of large-category deep learning - Google Patents
Method for parallel GPU acceleration of large-category deep learning
- Publication number
- CN108108813A (application CN201711251410.3A)
- Authority
- CN
- China
- Prior art keywords
- gpu
- model
- parallel
- deep learning
- parameter
- Prior art date
- 2017-12-01
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The present invention provides a method for parallel GPU acceleration of large-category deep learning, comprising: training the model parameters of the softmax layer of a deep neural network with model parallelism, each GPU training its own model shard, and exchanging the data features of the model parameters between the softmax layers of the individual GPUs to complete deep learning. The present invention adopts a hybrid architecture: all layers before the softmax layer still use data parallelism, while the softmax layer uses model parallelism. This breaks through the bottleneck of parallel computation in large-category deep learning and overcomes the problem that, on the last fully connected layer of the deep neural network, the communication cost of parameter exchange and the communication time it consumes are excessively high. The method can significantly improve model training efficiency and reduce GPU occupancy while maintaining the original deep learning effect.
Description
Technical field
The present invention relates to the field of computers and their applications, and more particularly to a method for parallel GPU acceleration of large-category deep learning.
Background Art
At present, deep learning has achieved breakthrough progress in several major domains: speech recognition, image recognition, and natural language processing. It can be said that, up to now, deep learning is the learning method closest to the intelligence of the human brain. However, deep learning models have many parameters, the amount of computation is large, and the scale of the training data keeps growing, so a great deal of computing resources must be consumed. If training can be accelerated, work efficiency improves markedly, and for large-scale training data and models, tasks that were previously infeasible become possible.
With the continuous advance of massively parallel GPU architectures, general-purpose computing on GPUs (General-Purpose GPU, GPGPU) has become an important means of accelerating parallelizable applications. Thanks to the many-core architecture of the GPU, programs running on a GPU system are often tens or even thousands of times faster than on a single-core CPU. Training deep neural networks on GPUs can fully exploit the efficient parallel computing capability of their thousands of compute cores; under scenarios with massive training data, the training time is greatly shortened and fewer servers are occupied.
Most servers currently have eight or more GPUs. In principle, using more GPUs can significantly improve efficiency, but this is difficult in practice: a large amount of data must be exchanged between the processors, and more time is spent on communication rather than computation. The traditional parallel method for deep learning is data parallelism: the data is divided into several shards, each GPU processes one of them, and the parameters are exchanged. However, when the number of classes is large, the communication cost of exchanging parameters on the last fully connected layer of the deep neural network is too high, and the communication time far exceeds the parameter computation time, becoming the bottleneck of parallel computation in large-category deep learning. A new technical means is therefore needed that can significantly improve model training efficiency and reduce GPU occupancy while maintaining the original deep learning effect.
Summary of the Invention
In view of the above-described shortcomings of the prior art, the present invention provides a method for parallel GPU acceleration of large-category deep learning in order to solve the above technical problems.
The method for parallel GPU acceleration of large-category deep learning provided by the invention comprises:
training the model parameters of the softmax layer of a deep neural network with model parallelism;
each GPU training its own model shard and obtaining the data features of the model parameters;
exchanging the data features of the model parameters between the softmax layers of the individual GPUs to complete deep learning.
Further, the model parameters of the deep neural network are trained with a hybrid architecture, the hybrid architecture comprising training the model parameters of the softmax layer of the deep neural network with model parallelism and training the model parameters of the other layers of the deep neural network with data parallelism.
Further, the model parallelism comprises dividing the complete model into several model shards, each model shard being trained on a different GPU.
Further, the softmax layer of the deep neural network is divided into several model shards, which are trained on different GPUs; each GPU computes its own model shard and obtains the parameter data features of the corresponding shard, and the number of model shards equals the number of GPUs.
Further, the data parallelism comprises partitioning the training data according to the number of GPUs, training the partitions separately on the different GPUs to obtain groups of training-data features, and exchanging the training-data feature arrays between the GPUs; the training data are the transmitted image data.
Further, after each GPU finishes computing its own model shard, the model shards on all GPUs are combined into one complete model.
Further, the data are sent to every GPU by an all-gather algorithm.
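To make the hybrid scheme above concrete, the following is a minimal sketch (not part of the original disclosure) of how the layer placement might look. It assumes a PyTorch-style setup with one process per GPU and torch.distributed already initialized; all module names and sizes are illustrative, and the claims do not mandate any particular framework.

```python
# Sketch only: hybrid data/model parallelism as summarized above.
import torch
import torch.nn as nn
import torch.distributed as dist

FEATURE_DIM = 128           # image-feature dimension (example value from the description)
NUM_CLASSES = 1_000_000     # "large-category" setting

rank = dist.get_rank()
world_size = dist.get_world_size()   # number of GPUs == number of model shards

# Data parallelism for every layer before the softmax layer: the backbone is
# replicated on each GPU, and the training data are partitioned so that each
# GPU trains on its own slice of the batch.
backbone = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, FEATURE_DIM), nn.ReLU(),
).cuda(rank)

# Model parallelism for the softmax layer: each GPU stores only its own slice of
# the class dimension, so no single GPU holds the full 128 x 1,000,000 weight.
classes_per_gpu = NUM_CLASSES // world_size
softmax_shard = nn.Linear(FEATURE_DIM, classes_per_gpu, bias=False).cuda(rank)
```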
Beneficial effects of the present invention: the method for parallel GPU acceleration of large-category deep learning in the present invention uses a hybrid architecture, breaks through the bottleneck of parallel computation in large-category deep learning, and overcomes the problem that, on the last fully connected layer of the deep neural network, the communication cost of parameter exchange and the communication time it consumes are excessively high; it can significantly improve model training efficiency and reduce GPU occupancy while maintaining the original deep learning effect.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the method for parallel GPU acceleration of large-category deep learning in an embodiment of the present invention.
Fig. 2 is a schematic diagram of the principle of the method for parallel GPU acceleration of large-category deep learning in an embodiment of the present invention.
Detailed Description of the Embodiments
The embodiments of the present invention are illustrated below by way of specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention may also be implemented or applied through other different specific embodiments, and the details in this specification may be modified or changed from different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments may be combined with each other.
It should be noted that the drawings provided with the following embodiments only illustrate the basic idea of the present invention in a schematic way; the drawings show only the components related to the present invention rather than being drawn according to the number, shape and size of the components in actual implementation. In actual implementation, the form, quantity and proportion of each component may vary, and the component layout may be more complex.
As shown in Fig. 1, the method for parallel GPU acceleration of large-category deep learning in this embodiment comprises:
training the model parameters of the softmax layer of a deep neural network with model parallelism;
each GPU training its own model shard and obtaining the data features of the model parameters;
exchanging the data features of the model parameters between the softmax layers of the individual GPUs to complete deep learning.
Consider first the GPU data parallelism used in conventional deep learning. Suppose a deep neural network has 7 layers, there are 4 GPUs in total, the batch_size of the image data is 64, the dimension of the image feature is 128, and the number of classes is 1,000,000. The data and parameters to be handled by the last softmax layer are then as follows: the amount of data processed by each GPU is 64*128, and the amount of model parameters is 128*1,000,000. Each GPU trains the same model structure on different data, so the amount of parameters that has to be exchanged between the GPUs is 4*128*1,000,000. In other words, each GPU trains the complete model on part of the data, but because the data differ between GPUs, the model parameters computed on each GPU must finally be aggregated. A parameter volume of this magnitude seriously degrades model training efficiency. In practice, deep learning models have many parameters, the amount of computation is large, and the scale of the training data keeps growing, so a great deal of computing resources must be consumed; when the number of classes is large, the communication time spent on the last fully connected layer of the deep neural network far exceeds the parameter computation time.
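For reference, the communication volumes quoted above can be checked with a short calculation (illustrative only, not part of the original disclosure):

```python
# Worked check of the data-parallel baseline figures given above.
num_gpus, batch_size, feature_dim, num_classes = 4, 64, 128, 1_000_000

data_per_gpu = batch_size * feature_dim         # 64 * 128 = 8,192 feature values per GPU
model_params = feature_dim * num_classes        # 128 * 1,000,000 = 128,000,000 parameters
params_exchanged = num_gpus * model_params      # 4 * 128 * 1,000,000 = 512,000,000 values

print(data_per_gpu, model_params, params_exchanged)
```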
In this embodiment, the deep neural network is trained with a hybrid architecture. For the other layers of the network, the amount of parameters to be exchanged is small, so the communication cost generated during parallel GPU training is not high and data parallelism can be used. For the last softmax layer, this embodiment uses model parallelism: because the number of model parameters is closely tied to the number of data classes, in large-category deep learning the parameter exchange of the last layer becomes the bottleneck of algorithm performance. By changing the last softmax layer from data parallelism to model parallelism, the hybrid architecture in this embodiment can significantly improve algorithm performance. Exploiting the characteristics of the GPU, the softmax layer is made model-parallel: at the softmax layer, each GPU no longer shares its model parameters with the other GPUs but instead communicates parameter features to them, which greatly reduces the communication cost. Each GPU does not hold the complete model parameters but only the part of the parameters it computes itself; this part of the parameters is called a model shard.
Model parallelism comprises dividing the complete model into several shards, each of which runs on a different GPU. In this embodiment, the softmax layer of the deep neural network is divided into several model shards, which are trained on different GPUs; each GPU computes its own model shard and obtains the parameter data features of that shard, and the number of model shards equals the number of GPUs. After each GPU finishes computing its own model shard, the model shards on all GPUs are combined into one complete model.
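The splitting and later recombination of the softmax layer can be pictured with the following sketch (assumed PyTorch tensor operations; the shard layout along the class dimension and the function names are illustrative, not mandated by the patent):

```python
import torch

def shard_softmax_weight(full_weight: torch.Tensor, num_gpus: int):
    """Split the complete softmax weight (num_classes x feature_dim) into one
    model shard per GPU along the class dimension."""
    return torch.chunk(full_weight, num_gpus, dim=0)

def combine_shards(shards):
    """After every GPU has finished computing its model shard, reassemble the
    shards into a single complete model."""
    return torch.cat(list(shards), dim=0)
```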
As shown in Fig. 2, before the softmax layer this embodiment first communicates the data on each GPU (i.e., the image features) by means of all-gather, so that every GPU holds the complete data information; the softmax layer is then trained with model parallelism. Taking 4 GPUs as an example, the last layer is changed to model parallelism: the complete model is divided into 4 model shards, and each shard is computed on one of the 4 GPUs. The GPUs no longer need to communicate model parameters; instead they communicate parameter features to one another. By using the GPU all-gather algorithm, all data information is made available on every GPU. In this way each GPU can train its own model shard with all the data information, which ensures that every part of the model is learned from all the data. In this embodiment, A, B, C and D denote the data portions (image features) processed on the individual GPUs, each with a data volume of 64*128 as above. Through the all-gather algorithm, the data portions (i.e., the image features) are transmitted to every GPU so that the data on the 4 GPUs are completely consistent. Each GPU computes only its own model shard, and the information transmitted changes from model parameters to feature parameters, with the data volume dropping from 4*128*1,000,000 to 4*64*128. Finally, the model shards on the 4 GPUs are combined again into one complete model. This drastically reduces the communication volume of the softmax layer. The present invention uses the characteristics of the GPU to optimize the deep learning architecture; it is particularly suitable for large-category classification scenarios, and by optimizing the deep learning architecture into a hybrid mode it greatly improves practical training efficiency.
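A minimal sketch of the all-gather step and the per-shard computation in this embodiment is given below. It assumes PyTorch with torch.distributed (one process per GPU); the function and variable names are illustrative, and the full cross-shard softmax normalization and loss are omitted.

```python
import torch
import torch.distributed as dist

def sharded_softmax_logits(local_features: torch.Tensor,
                           weight_shard: torch.Tensor) -> torch.Tensor:
    """local_features: (64, 128) image features A/B/C/D held by this GPU.
    weight_shard:      (classes_per_gpu, 128) this GPU's slice of the softmax layer."""
    world_size = dist.get_world_size()

    # all-gather: afterwards every GPU holds the features of all GPUs
    # (4 * 64 * 128 values) instead of exchanging model parameters
    # (4 * 128 * 1,000,000 values), which is what shrinks the communication volume.
    gathered = [torch.empty_like(local_features) for _ in range(world_size)]
    dist.all_gather(gathered, local_features)
    all_features = torch.cat(gathered, dim=0)        # (4 * 64, 128)

    # Each GPU computes logits only for its own model shard, over all the data,
    # so every part of the model is still trained on all of the data. (Computing
    # the actual softmax / cross-entropy would additionally require reducing the
    # per-shard maxima and sums across GPUs, which is omitted in this sketch.)
    return all_features @ weight_shard.t()           # (4 * 64, classes_per_gpu)
```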
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.
Claims (7)
1. A method for parallel GPU acceleration of large-category deep learning, characterized by comprising:
training the model parameters of the softmax layer of a deep neural network with model parallelism;
each GPU training its own model shard and obtaining the data features of the model parameters;
exchanging the data features of the model parameters between the softmax layers of the individual GPUs to complete deep learning.
2. The method for parallel GPU acceleration of large-category deep learning according to claim 1, characterized in that the model parameters of the deep neural network are trained with a hybrid architecture, the hybrid architecture comprising training the model parameters of the softmax layer of the deep neural network with model parallelism and training the model parameters of the other layers of the deep neural network with data parallelism.
3. The method for parallel GPU acceleration of large-category deep learning according to claim 2, characterized in that the model parallelism comprises dividing the complete model into several model shards, each model shard being trained on a different GPU.
4. The method for parallel GPU acceleration of large-category deep learning according to claim 3, characterized in that the softmax layer of the deep neural network is divided into several model shards which are trained on different GPUs, each GPU computing its own model shard and obtaining the parameter data features of the corresponding model shard, the number of model shards being equal to the number of GPUs.
5. The method for parallel GPU acceleration of large-category deep learning according to claim 4, characterized in that the data parallelism comprises partitioning the training data according to the number of GPUs, training the partitioned training data separately on the different GPUs to obtain groups of training-data features, and exchanging the training-data feature arrays between the GPUs, the training data being the transmitted image data.
6. The method for parallel GPU acceleration of large-category deep learning according to claim 5, characterized in that, after each GPU finishes computing its own model shard, the model shards on all GPUs are combined into one complete model.
7. The method for parallel GPU acceleration of large-category deep learning according to claim 4, characterized in that the data are sent to every GPU by an all-gather algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711251410.3A CN108108813A (en) | 2017-12-01 | 2017-12-01 | Method for parallel GPU acceleration of large-category deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108108813A (en) | 2018-06-01 |
Family
ID=62208007
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711251410.3A (CN108108813A, pending) | Method for parallel GPU acceleration of large-category deep learning | 2017-12-01 | 2017-12-01 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108108813A (en) |
2017-12-01: CN application CN201711251410.3A, published as CN108108813A (en), status: Pending
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109657793A (en) * | 2018-12-26 | 2019-04-19 | 广州小狗机器人技术有限公司 | Model training method and device, storage medium and electronic equipment |
WO2020164338A1 (en) * | 2019-02-13 | 2020-08-20 | 阿里巴巴集团控股有限公司 | Method, apparatus and device for updating convolutional neural network using gpu cluster |
TWI716102B (en) * | 2019-02-13 | 2021-01-11 | 開曼群島商創新先進技術有限公司 | Method, device and equipment for updating convolutional neural network using GPU cluster |
US11640531B2 (en) | 2019-02-13 | 2023-05-02 | Advanced New Technologies Co., Ltd. | Method, apparatus and device for updating convolutional neural network using GPU cluster |
CN112966829A (en) * | 2021-03-03 | 2021-06-15 | 山东英信计算机技术有限公司 | Deep learning model training method, device, equipment and readable medium |
CN114004730A (en) * | 2021-11-03 | 2022-02-01 | 奥特贝睿(天津)科技有限公司 | Deep neural network multi-model parallel reasoning method based on graphics processor |
CN114004730B (en) * | 2021-11-03 | 2024-09-17 | 奥特贝睿(天津)科技有限公司 | Deep neural network multi-model parallel reasoning method based on graphic processor |
CN115934181A (en) * | 2022-11-07 | 2023-04-07 | 北京百度网讯科技有限公司 | Data loading method and device, electronic equipment and storage medium |
CN115934181B (en) * | 2022-11-07 | 2023-10-13 | 北京百度网讯科技有限公司 | Data loading method, device, electronic equipment and storage medium |
Similar Documents
Publication | Title |
---|---|
CN108108813A | Method for parallel GPU acceleration of large-category deep learning | |
CN109101948B (en) | Multi-attention machine mechanism video description method based on space-time and channel | |
EP3336760A1 (en) | Combined adversarial learning of inverse image manipulation operations | |
CN107066239A (en) | A kind of hardware configuration for realizing convolutional neural networks forward calculation | |
JP2020508504A (en) | Iterative multi-scale image generation using neural networks | |
CN109284812B (en) | Video game simulation method based on improved DQN | |
CN110503641A (en) | A kind of method and apparatus improving continuous casting billet face crack | |
CN108734288A (en) | A kind of operation method and device | |
CN107392835A (en) | A kind of processing method and processing device of particIe system | |
CN109597965A (en) | Data processing method, system, terminal and medium based on deep neural network | |
CN107515736A (en) | A kind of method for accelerating depth convolutional network calculating speed on embedded device | |
Chwif et al. | Discrete event simulation model reduction: A causal approach | |
CN111738276A (en) | Image processing method, device and equipment based on multi-core convolutional neural network | |
CN104268356A (en) | Airplane model assembling method for lean production | |
CN113079216A (en) | Cloud application implementation method and device, electronic equipment and readable storage medium | |
CN110488973A (en) | A kind of virtual interactive message leaving system and method | |
CN109840597A (en) | A kind of model prediction method, apparatus, electronic equipment and storage medium | |
Naseh et al. | Enabling Intelligent Vehicular Networks Through Distributed Learning in the Non-Terrestrial Networks 6G Vision | |
CN103678888B (en) | The flowing of a kind of heart blood based on Euler's fluid simulation algorithm schematically shows method | |
CN105069450A (en) | Quick multi-character recognition method | |
CN117009070A (en) | Method, device and equipment for constructing power-calculation scheduling knowledge graph and readable storage medium | |
Baig et al. | Bit rate reduction in cloud gaming using object detection technique | |
CN104156999A (en) | Three-dimensional scene rendering method | |
CN115292044A (en) | Data processing method and device, electronic equipment and storage medium | |
Ziaee | Job shop scheduling with makespan objective: A heuristic approach |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2018-06-01 |