CN109919310A - GPU memory optimization method and system for deep learning training tasks - Google Patents

GPU memory optimization method and system for deep learning training tasks Download PDF

Info

Publication number
CN109919310A
CN109919310A
Authority
CN
China
Prior art keywords
swapping
data
gpu
memory
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910035753.9A
Other languages
Chinese (zh)
Other versions
CN109919310B (en)
Inventor
刘万涛
郭锦荣
虎嵩林
韩冀中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201910035753.9A priority Critical patent/CN109919310B/en
Publication of CN109919310A publication Critical patent/CN109919310A/en
Application granted granted Critical
Publication of CN109919310B publication Critical patent/CN109919310B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention relates to a GPU memory optimization method and system for deep learning training tasks. The method comprises: (1) designing the basic swap-out and swap-in operations; (2) performing static data acquisition before training starts; (3) training several epochs first without the swapping strategy, during which dynamic data acquisition is performed; (4) establishing a performance model of the swapping strategy and making explicit the constraint relationships among GPU computation, GPU memory, and PCIe communication; (5) determining the optimal policy from the performance model; (6) training the remaining epochs to completion with the optimal minibatch size and its matching swapping strategy. The invention solves the problem that ultra-deep neural network models either cannot be trained at all or can only be trained with a minibatch size too small for efficient training; it fully utilizes GPU resources and improves the training efficiency of ultra-deep neural network models.

Description

GPU memory optimization method and system for deep learning training tasks
Technical field
The invention belongs to the field of deep learning and addresses the problem that ordinary GPU devices cannot train ultra-deep neural network models because of their limited memory. A GPU memory optimization method is proposed that, without introducing extra time overhead, makes the training task executable on a single GPU while making maximal use of GPU compute and memory resources, thereby substantially improving training efficiency.
Background technique
In recent years, with the successful application of deep learning in computer vision, speech recognition, natural language processing, machine translation, and many other fields, the scale of neural network models has grown steadily. Taking computer vision (CV) as an example, models for the classic image-classification task have evolved from the 8-layer AlexNet to the 152-layer ResNet: model depth increased 18-fold and accuracy improved by nearly 13%. Many studies show that deeper networks help the model extract richer, multi-level features from the input data, have stronger representational capacity, and achieve higher accuracy on more complex tasks. However, training ultra-deep neural network models is often very difficult: the large number of neurons and the accompanying floating-point operations place high demands on the compute power of the hardware, and the memory requirement grows linearly with both network depth and minibatch size. The many-core architecture of today's general-purpose GPUs is favored by deep learning tasks for its high parallelism, but GPU memory is relatively scarce. When training ultra-deep neural networks, two situations commonly arise: 1) memory overflows and training is impossible, or 2) training is possible but only with a very small minibatch size, so GPU compute resources are underutilized and training is slow and inefficient. It is therefore necessary to optimize GPU memory usage by exploiting the characteristics of ultra-deep neural network models: first solve the cannot-train problem, then select the optimal minibatch size to exploit GPU compute power and ultimately improve training efficiency.
Existing GPU memory optimization methods fall into two categories: recomputation and swapping. Recomputation discards some intermediate results during the forward pass instead of saving them; when the backward pass needs such a result, the corresponding forward computation is re-executed to recover it. The drawback is an extra 20-30% recomputation time, which is hard to accept for deep learning tasks that are already very time-consuming. The main idea of swapping is to use host-side CPU memory as backing store for GPU memory, analogous to the cache/DRAM relationship in computer architecture: GPU memory holds only the data relevant to the layer currently being computed, while all other data reside in CPU memory. Concretely, during forward propagation, data that have been produced on the GPU and will not be used by the next layer's computation are swapped out from GPU memory to CPU memory; during backward propagation, data previously swapped out to the CPU are swapped back into GPU memory before the next layer's computation needs them. Swapping is carried out over the PCIe bus.
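The swap schedule described above (swap each layer's data out during the forward pass once the next layer no longer needs it, and swap it back in just before its backward computation) can be sketched as follows. The layer names and the list encoding are purely illustrative, not part of the patent:

```python
# Build an illustrative operation schedule for a 4-layer network.
forward_order = ['L1', 'L2', 'L3', 'L4']

schedule = []
# Forward pass: compute each layer, then swap its data out to CPU memory,
# except the last layer, whose output is immediately needed by backward.
for i, layer in enumerate(forward_order):
    schedule.append(('compute_fwd', layer))
    if i + 1 < len(forward_order):
        schedule.append(('swap_out', layer))
# Backward pass: swap each layer's data back in just before its backward step.
for layer in reversed(forward_order):
    if layer != forward_order[-1]:
        schedule.append(('swap_in', layer))
    schedule.append(('compute_bwd', layer))
```

The resulting schedule keeps only one layer's working set resident on the GPU at a time, which is the source of the memory saving.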
A swapping strategy can migrate most of the data produced during training into CPU memory, reducing the GPU-side memory requirement from the scale of the whole network to the scale of a single layer, and can thus make most ultra-deep neural networks trainable. However, current swapping strategies are mostly designed around heuristic rules fixed in advance (which data to swap and when to perform the swap operations); they cannot make explicit the constraint relationships among computation, memory, and communication at run time, so late communication frequently blocks GPU computation and introduces extra time overhead. Moreover, existing techniques address only the cannot-train problem and ignore the performance question, i.e., training efficiency once training becomes possible.
Summary of the invention
In view of the problems and shortcomings of the prior art described above, the technical problem to be solved by the present invention is to provide a GPU memory optimization method and system for deep learning training tasks; specifically, a dynamic swapping strategy between CPU and GPU memory that not only solves the problem that ultra-deep neural network models cannot be trained, but, once training is possible, further searches for the optimal training configuration, providing an efficient solution for training ultra-deep neural network models.
To solve the above problems, the present invention adopts the following technical solution:
A GPU memory optimization method for deep learning training tasks, with the following specific steps:
(1) Design the basic swap-out and swap-in operations, including which data to swap and when to swap them.
(2) Before training starts, first perform static data acquisition to characterize the network model to be trained, including: the global operation sequence (GOS), the byte size of each operand in the sequence, and the floating-point operations (FLOP) required by each layer's computation.
(3) Start training without the swapping strategy and first train several epochs (an epoch is one training pass in which every sample in the dataset is used exactly once), with the minibatch size set to small typical values. During this phase perform dynamic data acquisition to characterize the performance of the current hardware environment, including: the time required by each layer's computation within one iteration under different minibatch sizes, and the time required by each data-swap operation.
(4) From the collected static and dynamic information, establish the performance model of the swapping strategy and make explicit the constraint relationships among GPU computation, GPU memory, and PCIe communication.
(5) Determine the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy.
(6) Train the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
The design of the basic swap-out and swap-in operations in step (1) proceeds as follows:
(1-1) Design the swap-out operation. The data that can be swapped out during neural network training mainly comprise the parameters and their gradients, and the intermediate results and their gradients. After the GPU finishes computing the current layer, if none of the data associated with the current layer will be used by the next layer to be executed, the swap-out to CPU memory is performed immediately; otherwise it is deferred until after the next layer's computation. Swap-out is performed by a dedicated thread, independent of the swap-in operation;
(1-2) Design the swap-in operation. The data that can be swapped in are the same as those swapped out. As long as the GPU currently has free memory space, the swap-in is performed immediately; otherwise it waits until the swap-out thread has released sufficient space. Swap-in is likewise performed by a dedicated thread, independent of the swap-out operation.
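A minimal sketch of the two independent swap threads and the "swap in only when space is free" rule of (1-1)/(1-2). All names here are hypothetical; a real implementation would issue asynchronous `cudaMemcpyAsync` transfers on dedicated CUDA streams rather than tracking bytes in Python:

```python
import threading

class SwapManager:
    """Illustrative bookkeeping for the swap-out / swap-in threads."""

    def __init__(self, gpu_capacity_bytes):
        self.capacity = gpu_capacity_bytes
        self.used = 0
        self.lock = threading.Lock()
        self.space_freed = threading.Condition(self.lock)

    def swap_out(self, tensor_bytes):
        # Runs on its own thread: releasing GPU memory, then waking any
        # swap-in thread waiting for space.
        with self.space_freed:
            self.used -= tensor_bytes
            self.space_freed.notify_all()

    def swap_in(self, tensor_bytes):
        # Runs on its own thread: performed immediately if free memory
        # exists, otherwise blocks until the swap-out thread frees space.
        with self.space_freed:
            while self.used + tensor_bytes > self.capacity:
                self.space_freed.wait()
            self.used += tensor_bytes
```

The two operations share only the memory counter, matching the patent's requirement that the threads be mutually independent.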
The static data acquisition in step (2) proceeds as follows:
(2-1) Analyze the dependency relationships of each datum in the network model to be trained. Because a neural network executes its computation layer by layer, the order in which each datum is accessed and released within one training iteration can be determined; each datum has corresponding swap-in and swap-out operations, from which the global operation sequence (GOS) is constructed;
(2-2) Data in a neural network are usually represented as floating-point tensors, so the byte size of each datum is obtained directly from its dimension information multiplied by the number of bytes per floating-point number. This byte size equals both the space the datum occupies in memory and the traffic volume of its swap-in/swap-out communication;
(2-3) Computation in a neural network is essentially tensor operations, similar to matrix operations, with clearly defined dimension relationships; the floating-point operation count (FLOP) of each layer's computation is obtained by analyzing the corresponding tensor-operation relationship.
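As a worked instance of (2-2) and (2-3), consider a hypothetical fully connected layer computing y = x @ W with float32 data; all dimensions are made up for illustration:

```python
# Assumed shapes for one dense layer: x is (batch, d_in), W is (d_in, d_out).
batch, d_in, d_out = 32, 1024, 4096
dtype_bytes = 4  # float32

# (2-2) byte size of each operand = product of its dimensions times the
# bytes per float; this is both its GPU-memory footprint and the traffic
# volume of its swap-in/swap-out transfer.
x_bytes = batch * d_in * dtype_bytes
w_bytes = d_in * d_out * dtype_bytes
y_bytes = batch * d_out * dtype_bytes

# (2-3) FLOP of the layer from its tensor-operation shape: a
# (batch x d_in) @ (d_in x d_out) matmul costs 2 * batch * d_in * d_out.
layer_flop = 2 * batch * d_in * d_out
```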
The dynamic data acquisition in step (3) proceeds as follows:
(3-1) Using Nvidia's profiling tool nvprof, detect and time the kernel-function calls to obtain the execution time required by each layer of the network;
(3-2) Similarly, use nvprof to detect and time the command that copies data between CPU and GPU memory, cudaMemcpyAsync, to obtain the PCIe communication time required when each datum performs a swap-in or swap-out operation.
The establishment of the performance model of the swapping strategy in step (4) proceeds as follows:
(4-1) Establish the GPU computation model, mainly using the information collected in steps (2-3) and (3-1). By fitting each layer's FLOP against its measured computation time, the computation performance (FLOP/s) of the current GPU is obtained. Further, for any minibatch size, each layer's FLOP is obtained from step (2-3); combined with the GPU performance curve, first the computation time of each layer is obtained, then the execution time of one iteration is obtained by summing over all layers of the network, and finally the execution time of the whole training process is obtained by summing over the iterations;
(4-2) Establish the GPU memory model, mainly using the information collected in steps (2-1) and (2-2). Following the global operation sequence, swap-in and swap-out operations are executed in parallel by two independent threads: swap-out releases GPU memory and swap-in occupies it. At any moment, for any minibatch size, the memory size occupied by each swap operation's data is obtained from step (2-2), and the size of the data resident in GPU memory (swapped in minus swapped out) must be less than or equal to the given GPU memory capacity;
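The memory constraint of (4-2) amounts to replaying the global operation sequence and checking that resident data never exceed GPU capacity. The GOS encoding and names below are assumptions for illustration only:

```python
def memory_feasible(gos, sizes, gpu_capacity):
    """Replay a global operation sequence and check the memory constraint.

    gos: list of ('swap_in' | 'swap_out', tensor_name) in execution order.
    sizes: byte size of each tensor, from static analysis (2-2).
    """
    resident = 0
    for op, name in gos:
        if op == 'swap_in':
            resident += sizes[name]       # swap-in occupies GPU memory
            if resident > gpu_capacity:   # constraint of (4-2) violated
                return False
        else:
            resident -= sizes[name]       # swap-out releases GPU memory
    return True
```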
(4-3) Establish the PCIe communication model, mainly using the information collected in steps (2-2) and (3-2). From the size of each swapped datum and its corresponding transfer time, the effective communication bandwidth of the current PCIe link is obtained. For any minibatch size, the traffic volume of each swap operation is obtained from step (2-2); combined with the effective bandwidth, the communication time required by each swap-in or swap-out operation is obtained;
(4-4) Model the constraint relationships among GPU computation, memory, and PCIe communication, mainly using the three sub-models of steps (4-1) to (4-3). Taking non-blocking GPU computation as the goal, each datum must be guaranteed to have been swapped into GPU memory before its computation calls for it: the time at which a swap-in completes equals the time at which it starts plus its PCIe communication time, and the time at which it can start depends on the progress of layer-by-layer computation and on whether the swap-out thread has vacated enough GPU memory to accommodate the datum being swapped in.
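A minimal sketch of sub-models (4-1) and (4-3) and the non-blocking condition of (4-4). The profiling numbers are invented stand-ins for real nvprof measurements, and a simple ratio fit replaces whatever fitting procedure the patent actually uses:

```python
# (4-1) compute model: effective FLOP/s fitted from profiled layers.
layer_flops   = [1e9, 2e9, 4e9, 8e9]                  # static analysis (2-3)
layer_times_s = [0.11e-3, 0.21e-3, 0.41e-3, 0.80e-3]  # nvprof timings (3-1)
effective_flops = sum(layer_flops) / sum(layer_times_s)

def predict_compute_time(flop):
    return flop / effective_flops

# (4-3) communication model: effective PCIe bandwidth from profiled copies.
xfer_bytes   = [64e6, 128e6]      # swap sizes from (2-2)
xfer_times_s = [5.4e-3, 10.7e-3]  # cudaMemcpyAsync timings (3-2)
bandwidth = sum(xfer_bytes) / sum(xfer_times_s)

def predict_comm_time(nbytes):
    return nbytes / bandwidth

# (4-4) non-blocking condition for one datum: its swap-in, started at
# swap_start_s, must finish before compute reaches its layer at layer_start_s.
def compute_not_blocked(swap_start_s, nbytes, layer_start_s):
    return swap_start_s + predict_comm_time(nbytes) <= layer_start_s
```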
The determination of the optimal policy in step (5) proceeds as follows:
Based on the performance model proposed in step (4), the training process is modeled as an optimization problem with the minibatch size as the optimization variable: the objective is the shortest training time, subject to two constraints, limited GPU memory and non-blocking GPU computation. Solving it yields the optimal minibatch size, and its matching swapping strategy, for training the current network under the current hardware configuration.
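Step (5) can be sketched as a search over candidate minibatch sizes, where `feasible()` stands in for the combined memory and non-blocking constraints of (4-2)/(4-4) and `epoch_time_model()` for the computation model of (4-1); both callables are placeholders, not the patent's actual solver:

```python
def choose_minibatch(candidates, epoch_time_model, feasible):
    """Return (best minibatch size, predicted time), or None if infeasible."""
    best = None
    for mb in candidates:
        if not feasible(mb):          # violates memory or blocking constraint
            continue
        t = epoch_time_model(mb)      # predicted epoch time from the model
        if best is None or t < best[1]:
            best = (mb, t)
    return best
```

Usage under toy assumptions: with a model where larger batches amortize fixed cost (`1000/mb + 0.01*mb`) and memory capping feasibility at 256, the search picks 256.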
Corresponding to the above method, the present invention also provides a GPU memory optimization system for deep learning training tasks, comprising:
a basic swap-operation design module, responsible for designing the basic swap-out and swap-in operations, including which data to swap and when to swap them;
a static data acquisition module, responsible for performing static data acquisition before training starts, to characterize the network model to be trained;
a dynamic data acquisition module, responsible for training several epochs without the swapping strategy after training starts, during which dynamic data acquisition is performed to characterize the performance of the current hardware environment;
a performance-model construction module, responsible for establishing the performance model of the swapping strategy from the collected static and dynamic data, and making explicit the constraint relationships among GPU computation, memory, and PCIe communication;
an optimal-policy determination module, responsible for determining the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy;
a continued-training module, responsible for training the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
The beneficial effects of the present invention are: using the swapping-based GPU memory optimization strategy, it first solves the problem that ultra-deep neural network models either cannot be trained or can only be trained with a minibatch size too small for efficient training; further, compared with the prior art, the present invention designs a dynamic swapping strategy, replacing heuristic design with precise modeling, and obtains by optimization the best training performance under the current hardware environment, fully utilizing GPU resources to improve the training efficiency of ultra-deep neural network models.
Detailed description of the invention
Fig. 1 is a flowchart of the GPU memory optimization method for deep learning training tasks of the present invention.
Fig. 2 is a flowchart of solving for the optimal minibatch size and its matching swapping strategy in step (5) of the embodiment.
Specific embodiment
The present invention is described further below with reference to the drawings and a specific embodiment.
Referring to Fig. 1, a GPU memory optimization method for deep learning training tasks of the present invention comprises the following steps:
(1) Design the basic swap-out and swap-in operations, including which data to swap and when to swap them. The specific steps are as follows:
(1-1) Design the swap-out operation. The data that can be swapped out during neural network training mainly comprise the parameters and their gradients, and the intermediate results and their gradients. After the GPU finishes computing the current layer, if none of the data associated with the current layer will be used by the next layer to be executed, the swap-out to CPU memory is performed immediately; otherwise it is deferred until after the next layer's computation. Swap-out is performed by a dedicated thread, independent of the swap-in operation;
(1-2) Design the swap-in operation. The data that can be swapped in are the same as those swapped out. As long as the GPU currently has free memory space, the swap-in is performed immediately; otherwise it waits until the swap-out thread has released sufficient space. Swap-in is likewise performed by a dedicated thread, independent of the swap-out operation.
(2) Before training starts, first perform static data acquisition to characterize the network model to be trained, including: the global operation sequence (GOS), the byte size of each operand in the sequence, and the floating-point operations (FLOP) required by each layer's computation. The specific steps are as follows:
(2-1) Analyze the dependency relationships of each datum in the network model to be trained. Because a neural network executes its computation layer by layer, the order in which each datum is accessed and released within one training iteration can be determined; each datum has corresponding swap-in and swap-out operations, from which the global operation sequence (GOS) is constructed;
(2-2) Data in a neural network are usually represented as floating-point tensors, so the byte size of each datum is obtained directly from its dimension information multiplied by the number of bytes per floating-point number. This byte size equals both the space the datum occupies in memory and the traffic volume of its swap-in/swap-out communication;
(2-3) Computation in a neural network is essentially tensor operations, similar to matrix operations, with clearly defined dimension relationships; the floating-point operation count (FLOP) of each layer's computation is obtained by analyzing the corresponding tensor-operation relationship.
(3) Start training without the swapping strategy and first train 4 epochs, with the minibatch size set successively to 4, 8, 16, and 32. During this phase perform dynamic data acquisition to characterize the performance of the current hardware environment, including: the time required by each layer's computation within one iteration under different minibatch sizes, and the time required by each data-swap operation. The specific steps are as follows:
(3-1) Using Nvidia's profiling tool nvprof, detect and time the kernel-function calls to obtain the execution time required by each layer of the network;
(3-2) Similarly, use nvprof to detect and time the command that copies data between CPU and GPU memory, cudaMemcpyAsync, to obtain the PCIe communication time required when each datum performs a swap-in or swap-out operation.
(4) From the collected static and dynamic information, establish the performance model of the swapping strategy and make explicit the constraint relationships among GPU computation, memory, and PCIe communication. The specific steps are as follows:
(4-1) Establish the GPU computation model, mainly using the information collected in steps (2-3) and (3-1). By fitting each layer's FLOP against its measured computation time, the computation performance (FLOP/s) of the current GPU is obtained. Further, for any minibatch size, each layer's FLOP is obtained from step (2-3); combined with the GPU performance curve, first the computation time of each layer is obtained, then the execution time of one iteration is obtained by summing over all layers of the network, and finally the execution time of the whole training process is obtained by summing over the iterations;
(4-2) Establish the GPU memory model, mainly using the information collected in steps (2-1) and (2-2). Following the global operation sequence, swap-in and swap-out operations are executed in parallel by two independent threads: swap-out releases GPU memory and swap-in occupies it. At any moment, for any minibatch size, the memory size occupied by each swap operation's data is obtained from step (2-2), and the size of the data resident in GPU memory (swapped in minus swapped out) must be less than or equal to the given GPU memory capacity;
(4-3) Establish the PCIe communication model, mainly using the information collected in steps (2-2) and (3-2). From the size of each swapped datum and its corresponding transfer time, the effective communication bandwidth of the current PCIe link is obtained. For any minibatch size, the traffic volume of each swap operation is obtained from step (2-2); combined with the effective bandwidth, the communication time required by each swap-in or swap-out operation is obtained;
(4-4) Model the constraint relationships among GPU computation, memory, and PCIe communication, mainly using the three sub-models of steps (4-1) to (4-3). Taking non-blocking GPU computation as the goal, each datum must be guaranteed to have been swapped into GPU memory before its computation calls for it: the time at which a swap-in completes equals the time at which it starts plus its PCIe communication time, and the time at which it can start depends on the progress of layer-by-layer computation and on whether the swap-out thread has vacated enough GPU memory to accommodate the datum being swapped in.
(5) Determine the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy. The specific steps are as follows:
Based on the performance model proposed in step (4), the training process is modeled as an optimization problem with the minibatch size as the optimization variable: the objective is the shortest training time, subject to two constraints, limited GPU memory and non-blocking GPU computation. The solution procedure is shown in Fig. 2; it finally yields the optimal minibatch size, and its matching swapping strategy, for training the current network under the current hardware configuration.
(6) Train the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
Another embodiment of the present invention provides a GPU memory optimization system for deep learning training tasks, comprising:
a basic swap-operation design module, responsible for designing the basic swap-out and swap-in operations, including which data to swap and when to swap them;
a static data acquisition module, responsible for performing static data acquisition before training starts, to characterize the network model to be trained;
a dynamic data acquisition module, responsible for training several epochs without the swapping strategy after training starts, during which dynamic data acquisition is performed to characterize the performance of the current hardware environment;
a performance-model construction module, responsible for establishing the performance model of the swapping strategy from the collected static and dynamic data, and making explicit the constraint relationships among GPU computation, memory, and PCIe communication;
an optimal-policy determination module, responsible for determining the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy;
a continued-training module, responsible for training the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
The specific implementation of each module above is described in the explanation of the method of the present invention.
Experimental data: the experimental environment is one NVIDIA Tesla M40 GPU card using the ImageNet dataset. AlexNet (256) denotes training AlexNet with minibatch size set to 256; the remaining network configurations follow the same convention. The results are shown in Table 1.
Table 1. Experimental results
The above embodiment is intended only to illustrate, not to limit, the technical solution of the present invention. Those of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from its principle and scope; the protection scope of the present invention shall be defined by the claims.

Claims (10)

1. A GPU memory optimization method for deep learning training tasks, characterized by comprising the following steps:
(1) designing the basic swap-out and swap-in operations, including which data to swap and when to swap them;
(2) before training starts, first performing static data acquisition to characterize the network model to be trained;
(3) starting training without the swapping strategy and first training several epochs, during which dynamic data acquisition is performed to characterize the performance of the current hardware environment;
(4) from the collected static and dynamic data, establishing the performance model of the swapping strategy and making explicit the constraint relationships among GPU computation, memory, and PCIe communication;
(5) determining the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy;
(6) training the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
2. The method according to claim 1, characterized in that step (1) comprises:
(1-1) designing the swap-out operation: the data to be swapped out mainly comprise the parameters and their gradients, and the intermediate results and their gradients; after the GPU finishes computing the current layer, if none of the data associated with the current layer will be used by the next layer to be executed, the swap-out to CPU memory is performed immediately, otherwise it is deferred until after the next layer's computation; swap-out is performed by a dedicated thread, independent of the swap-in operation;
(1-2) designing the swap-in operation: the data to be swapped in are the same as those swapped out; as long as the GPU currently has free memory space, the swap-in is performed immediately, otherwise it waits until the swap-out thread has released sufficient space; swap-in is likewise performed by a dedicated thread, independent of the swap-out operation.
3. The method according to claim 1, characterized in that the data collected by the static data acquisition of step (2) comprise: the global operation sequence, the byte size of each operand in the sequence, and the floating-point operations required by each layer's computation.
4. The method according to claim 3, characterized in that step (2) comprises:
(2-1) analyzing the dependency relationships of each datum in the network model to be trained; since a neural network executes its computation layer by layer, determining the order in which each datum is accessed and released within one training iteration, each datum having corresponding swap-in and swap-out operations, and thereby constructing the global operation sequence;
(2-2) data in the neural network being represented as floating-point tensors, obtaining the byte size of each datum directly from its dimension information multiplied by the number of bytes per floating-point number, this byte size equaling both the space the datum occupies in memory and the traffic volume of its swap communication;
(2-3) obtaining the floating-point operation count (FLOP) of each layer's computation in the neural network by analyzing the corresponding tensor-operation relationship.
5. method according to claim 1 or 4, which is characterized in that when step (3) starts to train, do not take swapping in and out Strategy first trains 4 epoches, minibatch size to be successively set as 4,8,16,32, carry out dynamic data during this period Acquisition.
6. the method according to claim 1, wherein step (3) described Dynamic Data Acquiring, the data of acquisition It include: to calculate required time, each data exchange operation for each layer in the different next iteration of minibatch size The required time.
7. The method according to claim 6, wherein step (3) comprises:
(3-1) using Nvidia's profiling tool nvprof to detect and time the kernel function calls, thereby obtaining the execution time required by each layer of the network;
(3-2) using nvprof to detect and time the cudaMemcpyAsync calls that copy data between CPU and GPU memory, thereby obtaining the PCIe communication time required by each datum's swap-in and swap-out operations.
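Step (3-2) amounts to extracting host-to-device and device-to-host copy durations from a profiler trace. The sketch below parses a made-up trace in the spirit of `nvprof --print-gpu-trace --csv` output; the sample rows are fabricated for illustration and are not real profiler output.

```python
import csv
import io

# Illustrative trace; the rows below are invented, not real nvprof output.
trace = """Start,Duration,Size,Name
1.000ms,0.820ms,64.000MB,[CUDA memcpy HtoD]
2.100ms,0.050ms,,volta_sgemm_128x64_nn
2.900ms,0.790ms,64.000MB,[CUDA memcpy DtoH]
"""

def memcpy_times_ms(trace_csv):
    """Sum the duration of every host<->device copy, i.e. the PCIe time
    of swap-in (HtoD) and swap-out (DtoH) operations."""
    times = {"HtoD": 0.0, "DtoH": 0.0}
    for row in csv.DictReader(io.StringIO(trace_csv)):
        for direction in times:
            if direction in row["Name"]:
                times[direction] += float(row["Duration"].rstrip("ms"))
    return times

print(memcpy_times_ms(trace))  # {'HtoD': 0.82, 'DtoH': 0.79}
```

Kernel rows (here the sgemm line) are skipped, so the same trace also yields the per-layer execution times of step (3-1) by filtering on kernel names instead.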
8. The method according to claim 1, wherein step (4) comprises:
(4-1) establishing a GPU computation model: fitting the FLOP of each layer's computation against its measured execution time to obtain the computing performance (FLOPS) of the current GPU; then, for any minibatch size, computing each layer's FLOP and, combined with the GPU performance curve, first obtaining each layer's computation time, next obtaining the execution time of one iteration by summing over all layers of the network, and finally obtaining the execution time of the entire training process by accumulating over multiple iterations;
(4-2) establishing a GPU memory model: according to the global operation sequence, swap-out and swap-in operations are executed in parallel by two separate threads; a swap-out releases GPU memory while a swap-in occupies GPU memory; at any moment and for any minibatch size, the memory occupied by the data of each swap operation is computed, and the total size of the data resident in GPU memory must not exceed the given GPU memory capacity;
(4-3) establishing a PCIe communication model: from the size of each swapped datum and its corresponding transfer time, deriving the effective communication bandwidth of the current PCIe link; then, for any minibatch size, computing the communication volume of each swap operation and, combined with the effective bandwidth, obtaining the communication time required by each swap-in and swap-out operation;
(4-4) according to the three sub-models of steps (4-1) to (4-3), and taking non-blocking of GPU computation as the objective, establishing the constraint relationships among GPU computation, memory and PCIe communication; each datum must be swapped into GPU memory before the computation that uses it is invoked; its swap-in completion time equals the time point at which the swap-in starts plus its PCIe communication time, where the start time depends on the layer-by-layer progress of the GPU computation and on whether the swap-out thread has freed enough GPU memory to accommodate the data currently being swapped in.
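The sub-models above can be instantiated as simple rate equations. The sketch below is a toy version of the constraint in step (4-4): the layer list, FLOPS, and bandwidth numbers are invented, and the prefetch schedule (swap in layer i+1 while layer i computes) is one simple way to realize "swapped in before the computation is invoked", not necessarily the patent's exact schedule.

```python
# Toy instantiation of the sub-models of claim 8; all numbers invented.
GPU_FLOPS = 10e12   # fitted from measured (FLOP, time) pairs (4-1)
PCIE_BW = 12e9      # effective bytes/s fitted from timed copies (4-3)

layers = [          # (flop, input_bytes) per layer
    (4e12, 1 * 2**30),
    (8e12, 2 * 2**30),
    (2e12, 1 * 2**30),
]

def compute_time(flop):
    return flop / GPU_FLOPS            # sub-model (4-1)

def swap_time(nbytes):
    return nbytes / PCIE_BW            # sub-model (4-3)

def swap_in_never_blocks(layers):
    """Constraint of step (4-4): each layer's data must finish swapping
    in before the GPU reaches that layer. Here we prefetch layer i+1
    while layer i computes and check the overlap suffices."""
    t_compute = compute_time(layers[0][0])
    for flop, nbytes in layers[1:]:
        # this layer's swap-in overlaps the previous layer's compute
        if swap_time(nbytes) > t_compute:
            return False
        t_compute = compute_time(flop)
    return True

print(swap_in_never_blocks(layers))  # True for these numbers
```

With these rates each swap-in hides entirely under the preceding layer's computation, so the GPU never stalls; a sufficiently large tensor (or slow link) would violate the constraint and block the computation.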
9. The method according to claim 1, wherein the method of determining the optimal strategy in step (5) is: according to the performance model established in step (4), modeling the training process as an optimization problem with the minibatch size as the optimization variable, the shortest training time as the objective, and the limited GPU memory together with non-blocking of GPU computation as the two constraints; solving this problem yields the optimal minibatch size and the matching swap-in/swap-out strategy for training the current network under the current hardware configuration.
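Because the decision variable is a single minibatch size, the optimization of claim 9 can be sketched as constrained enumeration. Every model function below is an invented stand-in for the fitted sub-models of claim 8; the cost curve, per-sample memory, and bandwidth cutoff are illustrative numbers only.

```python
# Sketch of claim 9: pick the minibatch size minimizing estimated
# training time subject to the memory and non-blocking constraints.
SAMPLES = 50000          # samples per epoch (invented)
GPU_MEM = 8 * 2**30      # GPU memory capacity in bytes (invented)

def iter_time(mb):               # from the GPU computation model (4-1)
    return 0.05 + 0.004 * mb     # fixed overhead + per-sample cost

def peak_memory(mb):             # from the GPU memory model (4-2)
    return mb * 180 * 2**20      # 180 MiB of resident data per sample

def swap_fits_bandwidth(mb):     # from the PCIe model (4-3)/(4-4)
    return mb <= 96              # pretend PCIe keeps up until mb = 96

def best_minibatch(candidates):
    feasible = [mb for mb in candidates
                if peak_memory(mb) <= GPU_MEM and swap_fits_bandwidth(mb)]
    # epoch time = iterations per epoch * time per iteration
    return min(feasible, key=lambda mb: (SAMPLES / mb) * iter_time(mb))

print(best_minibatch([4, 8, 16, 32, 45, 64, 96, 128]))  # 45
```

With these numbers the memory constraint caps the feasible set at 45 even though PCIe could sustain larger batches, and amortizing the fixed per-iteration overhead makes the largest feasible size optimal, matching the patent's motivation that too-small minibatches waste GPU resources.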
10. A GPU memory optimization system for deep learning training tasks, comprising:
a basic swap operation design module, responsible for designing the basic swap-in and swap-out operations, including the data to be swapped and the timing of swap-out and swap-in;
a static data acquisition module, responsible for performing static data acquisition before training starts, used to characterize the basic properties of the network model to be trained;
a dynamic data acquisition module, responsible for training several epochs without any swap strategy after training starts, during which dynamic data acquisition is performed to characterize the performance of the current hardware environment;
a performance model construction module, responsible for establishing the performance model of the swap-in/swap-out strategy from the collected static and dynamic data, and for making explicit the constraint relationships among GPU computation, memory and PCIe communication;
an optimal strategy determination module, responsible for determining the optimal strategy from the performance model, including the optimal minibatch size and its matching swap-in/swap-out strategy;
a continued training module, responsible for continuing the training for the remaining epochs with the optimal minibatch size and its matching swap-in/swap-out strategy until training finishes.
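The data flow among the six modules of claim 10 can be summarized as a plain function pipeline. Every callable here is a trivial placeholder for the corresponding module, with invented names, return shapes, and epoch counts; only the ordering reflects the claim.

```python
# Placeholder implementations: one function per module of claim 10.
def design_swap_ops():            return "swap ops"
def collect_static(model):        return {"op_seq": [], "bytes": {}, "flop": {}}
def collect_dynamic(model):       return {"layer_times": {}, "copy_times": {}}
def build_perf_model(s, d):       return {"static": s, "dynamic": d}
def pick_strategy(perf):          return {"minibatch": 32, "swaps": []}

def train(model, epochs=90, warmup_epochs=4):
    design_swap_ops()                         # module 1
    static = collect_static(model)            # module 2 (before training)
    dynamic = collect_dynamic(model)          # module 3 (first few epochs)
    perf = build_perf_model(static, dynamic)  # module 4
    best = pick_strategy(perf)                # module 5
    remaining = epochs - warmup_epochs        # module 6 trains these
    return best["minibatch"], remaining

print(train("toy-model"))  # (32, 86)
```

The key structural point is that the strategy is fixed once, after the warm-up epochs, and the remaining epochs run unchanged under it.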
CN201910035753.9A 2019-01-15 2019-01-15 GPU memory optimization method and system for deep learning training task Active CN109919310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910035753.9A CN109919310B (en) 2019-01-15 2019-01-15 GPU memory optimization method and system for deep learning training task

Publications (2)

Publication Number Publication Date
CN109919310A true CN109919310A (en) 2019-06-21
CN109919310B CN109919310B (en) 2021-05-18

Family

ID=66960413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910035753.9A Active CN109919310B (en) 2019-01-15 2019-01-15 GPU memory optimization method and system for deep learning training task

Country Status (1)

Country Link
CN (1) CN109919310B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101526934A (en) * 2009-04-21 2009-09-09 浪潮电子信息产业股份有限公司 Construction method of a combined GPU and CPU processor
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device
CN105224502A (en) * 2015-09-28 2016-01-06 浪潮(北京)电子信息产业有限公司 GPU-based deep learning method and system
WO2018024232A1 (en) * 2016-08-05 2018-02-08 上海寒武纪信息科技有限公司 Device and method for executing neural network operation
CN108229645A (en) * 2017-04-28 2018-06-29 北京市商汤科技开发有限公司 Convolution acceleration and computation processing method, apparatus, electronic device and storage medium
CN108460457A (en) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 Multi-machine multi-card hybrid-parallel asynchronous training method for convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHEN MENG ET AL: "Training deeper models by GPU memory optimization on TensorFlow", Proc. of ML Systems Workshop in NIPS *
LINNAN WANG ET AL: "SuperNeurons: dynamic GPU memory management for training deep neural networks", PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming *
MINSOO RHU ET AL: "vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design", 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) *
YU QI: "Research on optimization methods for branch divergence and irregular memory access on GPUs", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490300A (en) * 2019-07-26 2019-11-22 苏州浪潮智能科技有限公司 Deep learning-based operation acceleration method, apparatus and system
CN110490300B (en) * 2019-07-26 2022-03-15 苏州浪潮智能科技有限公司 Deep learning-based operation acceleration method, device and system
CN110751282A (en) * 2019-10-18 2020-02-04 北京百度网讯科技有限公司 Processor memory optimization method and device for deep learning training task
CN111078395A (en) * 2019-11-12 2020-04-28 华中科技大学 Deep learning GPU memory management optimization method and system based on tensor
CN111078395B (en) * 2019-11-12 2023-06-20 华中科技大学 Tensor-based deep learning GPU memory management optimization method and system
CN110942138A (en) * 2019-11-13 2020-03-31 华中科技大学 Deep neural network training method and system in hybrid memory environment
CN110942138B (en) * 2019-11-13 2022-02-15 华中科技大学 Deep neural network training method and system in hybrid memory environment
CN110941494A (en) * 2019-12-02 2020-03-31 哈尔滨工程大学 Deep learning-oriented GPU parallel computing data processing method
CN111310054A (en) * 2020-03-06 2020-06-19 中国科学院信息工程研究所 Recommendation method and device based on adaptive Margin symmetry metric learning
CN111814948B (en) * 2020-06-18 2021-07-13 浙江大华技术股份有限公司 Operation method and operation device of neural network and computer readable storage medium
CN111814948A (en) * 2020-06-18 2020-10-23 浙江大华技术股份有限公司 Operation method and operation device of neural network and computer readable storage medium
CN111882073A (en) * 2020-07-17 2020-11-03 苏州浪潮智能科技有限公司 Method and equipment for modifying distributed computation graph
CN112306697B (en) * 2020-12-31 2021-04-27 之江实验室 Deep learning memory management method and system based on Tensor access
CN112306697A (en) * 2020-12-31 2021-02-02 之江实验室 Deep learning memory management method and system based on Tensor access
CN117687802A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform
CN117687802B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform
CN117892769A (en) * 2024-03-15 2024-04-16 之江实验室 Neural network training method, video memory scheduling method, system, equipment and product
CN117892769B (en) * 2024-03-15 2024-06-11 之江实验室 Neural network training method, video memory scheduling method, system, equipment and product

Also Published As

Publication number Publication date
CN109919310B (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN109919310A GPU memory optimization method and system for deep learning training tasks
CN104657219B Dynamic thread-count adjustment method for application programs on heterogeneous many-core systems
CN105956021B Automated task parallelization method and system for distributed machine learning
CN103617087B MapReduce optimization method suitable for iterative computation
CN105653790B Artificial-neural-network-based cache access performance evaluation method for out-of-order processors
CN106951499B Knowledge graph representation method based on a translation model
CN102981807B GPU program optimization method based on the CUDA parallel environment
CN105184367B Model parameter training method and system for deep neural networks
CN107861606A Heterogeneous multi-core power capping method coordinating DVFS and task mapping
CN104765589B MPI-based grid parallel computing preprocessing method
CN111079921A Efficient neural network training and scheduling method based on a heterogeneous distributed system
CN106339351A SGD (stochastic gradient descent) algorithm optimization system and method
CN109524118A Screening method for gestational diabetes based on machine learning and physical examination data
CN109242099A Training method, apparatus, training device and storage medium for reinforcement learning networks
CN109102498A Method for segmenting clustered nuclei in cervical smear images
Durai et al. Liver disease prediction using machine learning
CN110333933A HPL computation model simulation method
CN111797833A Automated machine learning method and system for remote sensing semantic segmentation
CN106020773A Optimization method for finite difference algorithms on heterogeneous many-core architectures
WO2023236319A1 Convolutional neural network deployment and optimization method for microcontroller
CN110276689A Smart contract implementation method based on dynamic decision-making
CN110287114A Method and apparatus for database script performance testing
CN105094949A Method and system for simulation based on an instruction computation model with feedback compensation
CN103455364B Online cache performance acquisition system and method for concurrent programs in multi-core environments
CN110109811B Tracing method for GPU computing performance problems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant