CN108108813A - A GPU parallel acceleration method for large-category deep learning - Google Patents

A GPU parallel acceleration method for large-category deep learning

Info

Publication number
CN108108813A
CN108108813A (application CN201711251410.3A)
Authority
CN
China
Prior art keywords
gpu
model
parallel
deep learning
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711251410.3A
Other languages
Chinese (zh)
Inventor
石宇
徐卉
程诚
周祥东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Institute of Green and Intelligent Technology of CAS
Original Assignee
Chongqing Institute of Green and Intelligent Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute of Green and Intelligent Technology of CAS filed Critical Chongqing Institute of Green and Intelligent Technology of CAS
Priority to CN201711251410.3A priority Critical patent/CN108108813A/en
Publication of CN108108813A publication Critical patent/CN108108813A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a GPU parallel acceleration method for large-category deep learning, comprising: training the model parameters of the softmax layer in a deep neural network using model parallelism, with each GPU training its own model shard and the softmax layers of the GPUs exchanging the data features of the model parameters to complete deep learning. The invention uses a hybrid architecture: all layers before the softmax layer are still trained in data-parallel mode, while the softmax layer is trained in model-parallel mode. This breaks through the bottleneck of parallel computation in large-category deep learning and overcomes the problem that, on the last fully connected layer of a deep neural network, the communication cost and communication time of parameter exchange are excessively high. The method can significantly improve model training efficiency and reduce GPU occupancy while preserving the original deep learning accuracy.

Description

A GPU parallel acceleration method for large-category deep learning
Technical field
The present invention relates to the field of computers and their applications, and more particularly to a GPU parallel acceleration method for large-category deep learning.
Background technology
At present, deep learning has achieved breakthrough progress in several major domains: speech recognition, image recognition, and natural language processing. It can be said that, up to now, deep learning is the learning method closest to the intelligence of the human brain. However, deep learning models have many parameters, are computationally intensive, and require ever larger training data, consuming substantial computing resources. If training can be accelerated, work efficiency can be markedly improved, and for large-scale training data and models, previously infeasible tasks become possible.
With the continuous advance of massively parallel GPU architectures, general-purpose computing on GPUs (General-Purpose GPU, GPGPU) has become an important means of accelerating parallelizable applications. Benefiting from the many-core architecture of GPUs, programs often run tens or even thousands of times faster on a GPU system than on a single-core CPU. Training deep neural networks on GPUs can fully exploit the efficient parallel computing capability of their thousands of compute cores; in scenarios with massive training data, training time is significantly shortened and fewer servers are occupied.
Most servers currently have 8 or more GPUs. In principle, using more GPUs can significantly improve efficiency, but implementation has certain difficulties: large amounts of data must be exchanged between processors, and more time is spent on communication rather than computation. Traditional deep learning parallelization is all data-parallel: the data is divided into several shards, each GPU processes one of them, and the GPUs exchange parameters. However, when the number of classes is large, the communication cost of parameter exchange on the last fully connected layer of the deep neural network is too high; the communication time far exceeds the parameter computation time and becomes the bottleneck of parallel computation for large-category deep learning. Therefore, a new technique is needed that can significantly improve model training efficiency and reduce GPU occupancy while preserving the original deep learning accuracy.
Summary of the invention
In view of the above deficiencies of the prior art, the present invention provides a GPU parallel acceleration method for large-category deep learning, to solve the above technical problems.
The GPU parallel acceleration method for large-category deep learning provided by the invention comprises:
training the model parameters of the softmax layer in a deep neural network using model parallelism;
each GPU training its own model shard and obtaining the data features of the model parameters;
exchanging the data features of the model parameters between the softmax layers of the GPUs to complete deep learning.
Further, the model parameters in the deep neural network are trained using a hybrid architecture, which comprises training the model parameters of the softmax layer in the deep neural network using model parallelism, and training the model parameters of the other layers in the deep neural network using data parallelism.
Further, the model parallelism comprises dividing the complete model into several model shards, each model shard being trained on a different GPU.
Further, the softmax layer of the deep neural network is divided into several model shards that are trained on different GPUs; each GPU computes its own model shard and obtains the parameter data features of the corresponding shard, and the number of model shards equals the number of GPUs.
Further, the data parallelism comprises cutting the training data according to the number of GPUs, training the cut slices on different GPUs to obtain training data feature groups, and exchanging the training data feature arrays between the GPUs; the training data is image data.
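The data-parallel slicing described above can be sketched in a few lines of Python. This is only an illustration: the helper name and the even, contiguous slicing policy are assumptions, since the patent does not specify how the cut is made.

```python
import math

def split_for_gpus(training_data, num_gpus):
    """Cut the training data into num_gpus contiguous slices, one per GPU
    (the data-parallel step described above)."""
    shard = math.ceil(len(training_data) / num_gpus)
    return [training_data[i * shard:(i + 1) * shard] for i in range(num_gpus)]

# Example: 10 samples across 4 GPUs -> slices of size 3, 3, 3, 1.
slices = split_for_gpus(list(range(10)), 4)
print([len(s) for s in slices])  # [3, 3, 3, 1]
```

Each GPU would then train on its own slice and exchange only the resulting feature arrays, as the text describes.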
Further, after each GPU completes the computation of its own model shard, the model shards on all GPUs are combined into one complete model.
Further, the data is sent to every GPU by an all-gather algorithm.
Beneficial effects of the invention: the GPU parallel acceleration method for large-category deep learning uses a hybrid architecture that breaks through the bottleneck of parallel computation in large-category deep learning, overcomes the problem of excessive communication cost and communication time for parameter exchange on the last fully connected layer of a deep neural network, and can significantly improve model training efficiency and reduce GPU occupancy while preserving the original deep learning accuracy.
Description of the drawings
Fig. 1 is a schematic flow diagram of the GPU parallel acceleration method for large-category deep learning in an embodiment of the invention.
Fig. 2 is a schematic diagram of the principle of the GPU parallel acceleration method for large-category deep learning in an embodiment of the invention.
Detailed description of the embodiments
The embodiments of the present invention are described below by way of specific examples; those skilled in the art can easily understand other advantages and effects of the invention from the content disclosed in this specification. The invention may also be implemented or applied through other different specific embodiments, and the details in this specification may be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the invention. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments may be combined with one another.
It should be noted that the drawings provided in the following embodiments only schematically illustrate the basic concept of the invention; they show only the components related to the invention rather than the actual number, shape, and size of the components. The form, quantity, and proportion of each component may vary in actual implementation, and the component layout may be more complex.
As shown in Fig. 1, the GPU parallel acceleration method for large-category deep learning in this embodiment comprises:
training the model parameters of the softmax layer in a deep neural network using model parallelism;
each GPU training its own model shard and obtaining the data features of the model parameters;
exchanging the data features of the model parameters between the softmax layers of the GPUs to complete deep learning.
Consider the GPU data parallelism used in conventional deep learning. Assume a deep neural network has 7 layers, there are 4 GPUs in total, the batch_size of the image data is 64, the image feature dimension is 128, and the number of classes is 1,000,000. The data to be handled by the last softmax layer is as follows: the data volume processed by each GPU is 64*128, and the model parameter volume is 128*1,000,000. Each GPU trains the same model structure on different data, so the parameter volume that must be exchanged between GPUs is 4*128*1,000,000. That is, each GPU trains the complete model on part of the data, but because the data differs between GPUs, the model parameters computed on each GPU must finally be aggregated. It can be seen that this million-scale parameter volume severely degrades model training efficiency. In practice, deep learning models have many parameters, are computationally intensive, and require large-scale training data, consuming substantial computing resources; when the number of classes is large, the communication time on the last fully connected layer of the deep neural network far exceeds the parameter computation time.
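The parameter-volume arithmetic in the paragraph above can be checked directly ("100W" in the original text is Chinese shorthand for 1,000,000). This is just the counting argument, not training code:

```python
# Configuration stated in the embodiment.
num_gpus, batch_size, feature_dim, num_classes = 4, 64, 128, 1_000_000

# Data processed per GPU and softmax weight-matrix size.
per_gpu_data = batch_size * feature_dim        # 64*128 = 8,192 values
model_params = feature_dim * num_classes       # 128*1,000,000 values

# Data-parallel softmax: every GPU must exchange its full weight matrix.
data_parallel_exchange = num_gpus * model_params    # 4*128*1,000,000

# Model-parallel softmax (this embodiment): only the image features
# are all-gathered between GPUs.
model_parallel_exchange = num_gpus * per_gpu_data   # 4*64*128

print(data_parallel_exchange)   # 512000000
print(model_parallel_exchange)  # 32768
print(data_parallel_exchange // model_parallel_exchange)  # 15625x fewer values
```

The 15625x reduction in exchanged values is what the embodiment refers to as drastically reducing the communication volume of the softmax layer.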
In this embodiment, the deep neural network is trained using a hybrid architecture. For the other layers of the network, the parameter volume to be exchanged is small and the communication cost incurred during GPU parallel training is low, so data parallelism can be used. For the last softmax layer, this embodiment uses model parallelism: since the number of model parameters is closely tied to the number of classes, for large-category deep learning the parameter exchange of the last layer becomes the performance bottleneck of the algorithm. The hybrid architecture in this embodiment changes the last softmax layer from data parallelism to model parallelism, which can significantly improve performance. Exploiting the characteristics of the GPU, each GPU no longer shares its softmax-layer model parameters with the other GPUs, but instead communicates the parameter features to them, greatly reducing communication cost. Each GPU does not hold the complete model parameters, only the portion it computes itself; this portion of the parameters is called a model shard.
Model parallelism consists of dividing the complete model into several shards, each of which runs on a different GPU. In this embodiment, the softmax layer of the deep neural network is divided into several model shards that are trained on different GPUs; each GPU computes its own model shard and obtains the parameter data features of the corresponding shard. The number of model shards equals the number of GPUs, and after each GPU completes the computation of its own model shard, the shards on all GPUs are combined into one complete model.
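A minimal sketch of the shard-and-recombine step described above, using plain Python lists for the (feature_dim x num_classes) softmax weight matrix. The contiguous column split is an assumption for illustration; the patent does not specify how the class dimension is partitioned.

```python
def shard_model(weights, num_gpus):
    """Split the class (column) dimension of the softmax weight matrix
    into num_gpus shards, one per GPU."""
    num_classes = len(weights[0])
    size = num_classes // num_gpus  # assumes num_classes divisible by num_gpus
    return [[row[g * size:(g + 1) * size] for row in weights]
            for g in range(num_gpus)]

def combine_shards(shards):
    """Recombine the per-GPU shards into one complete model."""
    num_rows = len(shards[0])
    return [sum((shard[r] for shard in shards), []) for r in range(num_rows)]

# Round trip: sharding and then combining restores the full model.
W = [[1, 2, 3, 4], [5, 6, 7, 8]]  # 2 features x 4 classes
assert combine_shards(shard_model(W, 2)) == W
```

In a real system the shards would live in separate GPU memories; here the lists merely make the bookkeeping of "each GPU holds only its own slice of the weights" concrete.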
As shown in Fig. 2, before the softmax layer, this embodiment first communicates the data (i.e., the image features) on each GPU by all-gather, so that every GPU holds the complete data. The softmax layer is then trained in model-parallel fashion. In this embodiment, taking 4 GPUs as an example, the last layer is changed to model parallelism: the complete model is divided into 4 model shards, each computed on one of the 4 GPUs. The GPUs no longer need to communicate model parameters; instead, they communicate parameter features to one another. By using the GPU all-gather algorithm, every GPU obtains all of the data, so each GPU can train its own model shard on all of the data, ensuring that every part of the model is learned from all of the data. In this embodiment, A, B, C, and D denote the data portions (image features) processed on each GPU, each of volume 64*128 as above. Through all-gather, the data portions are transmitted to every GPU to ensure that the data on the 4 GPUs is completely consistent. Each GPU computes only its own model shard, and the information transmitted changes from model parameters to feature parameters: the data volume drops from 4*128*1,000,000 to 4*64*128. Finally, the model shards on the 4 GPUs are combined into one complete model, which drastically reduces the communication volume of the softmax layer. The invention optimizes the deep learning architecture using the characteristics of the GPU and is particularly suitable for large-category classification scenarios; by optimizing the deep learning architecture into a hybrid mode, it substantially improves practical training efficiency.
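The all-gather-then-shard pipeline of Fig. 2 can be simulated on a single machine as a sketch. The function names here are illustrative assumptions; a real implementation would use a collective-communication library (e.g. NCCL) rather than list copies.

```python
def all_gather(per_gpu_features):
    """After all-gather, every GPU holds the concatenation of all GPUs'
    feature batches (the A, B, C, D portions in the text above)."""
    full = [row for batch in per_gpu_features for row in batch]
    return [list(full) for _ in per_gpu_features]

def shard_logits(features, weight_shard):
    """Each GPU multiplies the full feature matrix by only its own
    weight shard, producing its slice of the softmax logits."""
    return [[sum(f * w for f, w in zip(row, col))
             for col in zip(*weight_shard)]
            for row in features]

# Toy setup: 2 GPUs, 1 image each, 2-dim features; each GPU owns 1 of 2 classes.
batches = [[[1.0, 2.0]], [[3.0, 4.0]]]
shards = [[[1.0], [0.0]], [[0.0], [1.0]]]  # weight shards for class 0, class 1
gathered = all_gather(batches)
logits = [shard_logits(g, s) for g, s in zip(gathered, shards)]
print(logits)  # [[[1.0], [3.0]], [[2.0], [4.0]]]
```

Note what is communicated: only the feature batches move between GPUs (the all-gather), while each weight shard stays local, which is exactly the communication saving the embodiment claims.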
The above embodiments merely illustrate the principles and effects of the invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the invention. Therefore, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall be covered by the claims of the invention.

Claims (7)

1. A GPU parallel acceleration method for large-category deep learning, characterized by comprising:
training the model parameters of the softmax layer in a deep neural network using model parallelism;
each GPU training its own model shard and obtaining the data features of the model parameters;
exchanging the data features of the model parameters between the softmax layers of the GPUs to complete deep learning.
2. The GPU parallel acceleration method for large-category deep learning according to claim 1, characterized in that the model parameters in the deep neural network are trained using a hybrid architecture, the hybrid architecture comprising training the model parameters of the softmax layer in the deep neural network using model parallelism, and training the model parameters of the other layers in the deep neural network using data parallelism.
3. The GPU parallel acceleration method for large-category deep learning according to claim 2, characterized in that the model parallelism comprises dividing the complete model into several model shards, each model shard being trained on a different GPU.
4. The GPU parallel acceleration method for large-category deep learning according to claim 3, characterized in that the softmax layer of the deep neural network is divided into several model shards that are trained on different GPUs, each GPU computing its own model shard and obtaining the parameter data features of the corresponding model shard, the number of model shards being equal to the number of GPUs.
5. The GPU parallel acceleration method for large-category deep learning according to claim 4, characterized in that the data parallelism comprises cutting the training data according to the number of GPUs, training the cut slices on different GPUs to obtain training data feature groups, and exchanging the training data feature arrays between the GPUs, the training data being image data.
6. The GPU parallel acceleration method for large-category deep learning according to claim 5, characterized in that after each GPU completes the computation of its own model shard, the model shards on all GPUs are combined into one complete model.
7. The GPU parallel acceleration method for large-category deep learning according to claim 4, characterized in that the data is sent to every GPU by an all-gather algorithm.
CN201711251410.3A 2017-12-01 2017-12-01 A kind of method that big classification deep learning GPU accelerates parallel Pending CN108108813A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711251410.3A CN108108813A (en) 2017-12-01 2017-12-01 A kind of method that big classification deep learning GPU accelerates parallel

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711251410.3A CN108108813A (en) 2017-12-01 2017-12-01 A kind of method that big classification deep learning GPU accelerates parallel

Publications (1)

Publication Number Publication Date
CN108108813A (en) 2018-06-01

Family

ID=62208007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711251410.3A Pending CN108108813A (en) 2017-12-01 2017-12-01 A kind of method that big classification deep learning GPU accelerates parallel

Country Status (1)

Country Link
CN (1) CN108108813A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657793A (en) * 2018-12-26 2019-04-19 广州小狗机器人技术有限公司 Model training method and device, storage medium and electronic equipment
WO2020164338A1 (en) * 2019-02-13 2020-08-20 阿里巴巴集团控股有限公司 Method, apparatus and device for updating convolutional neural network using gpu cluster
TWI716102B (en) * 2019-02-13 2021-01-11 開曼群島商創新先進技術有限公司 Method, device and equipment for updating convolutional neural network using GPU cluster
US11640531B2 (en) 2019-02-13 2023-05-02 Advanced New Technologies Co., Ltd. Method, apparatus and device for updating convolutional neural network using GPU cluster
CN112966829A (en) * 2021-03-03 2021-06-15 山东英信计算机技术有限公司 Deep learning model training method, device, equipment and readable medium
CN114004730A (en) * 2021-11-03 2022-02-01 奥特贝睿(天津)科技有限公司 Deep neural network multi-model parallel reasoning method based on graphics processor
CN114004730B (en) * 2021-11-03 2024-09-17 奥特贝睿(天津)科技有限公司 Deep neural network multi-model parallel reasoning method based on graphic processor
CN115934181A (en) * 2022-11-07 2023-04-07 北京百度网讯科技有限公司 Data loading method and device, electronic equipment and storage medium
CN115934181B (en) * 2022-11-07 2023-10-13 北京百度网讯科技有限公司 Data loading method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108108813A (en) A kind of method that big classification deep learning GPU accelerates parallel
CN109101948B (en) Multi-attention machine mechanism video description method based on space-time and channel
EP3336760A1 (en) Combined adversarial learning of inverse image manipulation operations
CN107066239A (en) A kind of hardware configuration for realizing convolutional neural networks forward calculation
JP2020508504A (en) Iterative multi-scale image generation using neural networks
CN109284812B (en) Video game simulation method based on improved DQN
CN110503641A (en) A kind of method and apparatus improving continuous casting billet face crack
CN108734288A (en) A kind of operation method and device
CN107392835A (en) A kind of processing method and processing device of particIe system
CN109597965A (en) Data processing method, system, terminal and medium based on deep neural network
CN107515736A (en) A kind of method for accelerating depth convolutional network calculating speed on embedded device
Chwif et al. Discrete event simulation model reduction: A causal approach
CN111738276A (en) Image processing method, device and equipment based on multi-core convolutional neural network
CN104268356A (en) Airplane model assembling method for lean production
CN113079216A (en) Cloud application implementation method and device, electronic equipment and readable storage medium
CN110488973A (en) A kind of virtual interactive message leaving system and method
CN109840597A (en) A kind of model prediction method, apparatus, electronic equipment and storage medium
Naseh et al. Enabling Intelligent Vehicular Networks Through Distributed Learning in the Non-Terrestrial Networks 6G Vision
CN103678888B (en) The flowing of a kind of heart blood based on Euler's fluid simulation algorithm schematically shows method
CN105069450A (en) Quick multi-character recognition method
CN117009070A (en) Method, device and equipment for constructing power-calculation scheduling knowledge graph and readable storage medium
Baig et al. Bit rate reduction in cloud gaming using object detection technique
CN104156999A (en) Three-dimensional scene rendering method
CN115292044A (en) Data processing method and device, electronic equipment and storage medium
Ziaee Job shop scheduling with makespan objective: A heuristic approach

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20180601