CN109919310A - GPU memory optimization method and system for deep learning training tasks - Google Patents
GPU memory optimization method and system for deep learning training tasks
- Publication number: CN109919310A (application CN201910035753.9A)
- Authority: CN (China)
- Prior art keywords: swapping, data, gpu, memory, training
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Abstract
The present invention relates to a GPU memory optimization method and system for deep learning training tasks. The method comprises: (1) designing the basic swap-out and swap-in operations; (2) collecting static data before training starts; (3) training the first few epochs without a swapping strategy while collecting dynamic data; (4) building a performance model of the swapping strategy and making explicit the mutual constraints among GPU computation, memory, and PCIe communication; (5) determining the optimal policy from the performance model; (6) training the remaining epochs to completion with the optimal minibatch size and its matching swapping strategy. The invention solves the problem that ultra-deep neural network models either cannot be trained at all or can only be trained with a minibatch size too small to be efficient; it makes full use of GPU resources and improves the training efficiency of ultra-deep neural network models.
Description
Technical field
The invention belongs to the field of deep learning and addresses the problem that ordinary GPU devices cannot train ultra-deep neural network models because their memory is in short supply. It proposes a GPU memory optimization method that, without introducing extra time overhead, allows the training task to run on a single GPU while making maximal use of the GPU's compute and memory resources, thereby fully improving training efficiency.
Background art
In recent years, as deep learning has been successfully applied in computer vision, speech recognition, natural language processing, machine translation, and many other fields, the scale of neural network models has grown steadily. Taking computer vision (CV) as an example, models for the classic image classification task have evolved from the 8-layer AlexNet to the 152-layer ResNet: model depth increased by a factor of 18 while accuracy improved by nearly 13%. Much research shows that deeper networks extract richer, more hierarchical features from the input data, have stronger representational power, and achieve higher accuracy on more complex tasks. However, training an ultra-deep neural network model is often very difficult: the large number of neurons and the accompanying floating-point operations place high demands on hardware compute power, and the memory requirement grows linearly with network depth and minibatch size. The many-core architecture of today's general-purpose GPUs is favored by deep learning tasks for its high parallelism, but GPU memory is relatively scarce, so training ultra-deep neural networks commonly runs into two situations: 1) memory overflows immediately and training is impossible; 2) training is possible only with a very small minibatch size, which fails to utilize the GPU's compute resources and indirectly causes long training times and low efficiency. It is therefore necessary to optimize GPU memory usage in light of the characteristics of ultra-deep neural network models: first solve the problem of being unable to train at all, then further select the optimal minibatch size to exploit GPU compute power and ultimately raise training efficiency.
Existing GPU memory optimization methods fall into two categories: recomputation and swapping. With recomputation, the GPU discards some intermediate results during the forward pass instead of saving them; when the backward pass needs such an intermediate result, the corresponding forward computation is re-executed to obtain it. The drawback of this method is an extra 20-30% of recomputation time, which makes it hard to apply to deep learning tasks that are already very time-consuming. The main idea of swapping is to use host-side CPU memory as a backing store for GPU memory, analogous to the relationship between cache and DRAM in computer architecture: GPU memory holds only the data relevant to the layer currently being computed, while all other data resides in CPU memory. Concretely, during forward propagation, data that has been produced on the GPU and will not be used by the next layer's computation is swapped out from GPU memory to CPU memory; during backward propagation, data previously swapped out to the CPU is swapped back into GPU memory before the next layer's computation calls for it. Swapping is carried out over the PCIe bus.
A swapping strategy can migrate most of the data generated during training into CPU memory, reducing the GPU-side memory requirement from the scale of the whole network to that of a single layer, which solves the training-infeasibility problem for most ultra-deep neural networks. However, current swapping strategies are mostly designed around heuristic rules fixed in advance (which data to swap and when to execute the swap operations); they cannot make explicit the runtime constraints among computation, memory, and communication, so communication frequently fails to finish in time, blocking GPU computation and introducing extra time overhead. Moreover, current techniques only address whether training is possible at all; they do not consider the performance, i.e. the training efficiency, once training becomes possible.
Summary of the invention
In view of the problems and deficiencies of the prior art described above, the technical problem to be solved by the present invention is to provide a GPU memory optimization method and system for deep learning training tasks; specifically, a dynamic swapping strategy between CPU and GPU memory that not only solves the problem that ultra-deep neural network models cannot be trained, but, once training is feasible, further searches for the optimal training configuration, providing an efficient solution for training ultra-deep neural network models.
To solve the above problems, the present invention adopts the following technical solution:
A GPU memory optimization method for deep learning training tasks, comprising the following steps:
(1) Design the basic swap-out and swap-in operations, including: which data to swap, and when to swap it out and in.
(2) Before training starts, collect static data that characterizes the network model to be trained, including: the global operation sequence (GOS), the byte size of each operand in the sequence, and the number of floating-point operations (FLOPs) required by each layer's computation.
(3) Start training without any swapping strategy for the first few epochs (an epoch is one training pass in which every sample in the dataset is processed exactly once), choosing several small minibatch sizes. During this period collect dynamic data that characterizes the performance of the current hardware environment, including: for different minibatch sizes, the time required by each layer's computation within one iteration and the time required by each data transfer operation.
(4) From the collected static and dynamic information, build a performance model of the swapping strategy and make explicit the mutual constraints among GPU computation, memory, and PCIe communication.
(5) Determine the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy.
(6) Train the remaining epochs to completion with the optimal minibatch size and its matching swapping strategy.
The design of the basic swap operations in step (1) proceeds as follows:
(1-1) Design the swap-out operation. The data that can be swapped out mainly comprises the parameters and their gradients, and the intermediate results and their gradients, produced during neural network training. After the GPU finishes computing the current layer, any data associated with the current layer that will not be used by the next layer's computation is immediately swapped out to CPU memory; otherwise it is swapped out after the next layer's computation. Swap-out is performed by a separate thread, independent of swap-in.
(1-2) Design the swap-in operation. The data that can be swapped in is the same as that swapped out. A piece of data is swapped in as soon as the GPU has free memory for it; otherwise swap-in waits until the swap-out thread has released enough space. Swap-in is likewise performed by a separate thread, independent of swap-out.
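The pair of independent swap threads described in (1-1) and (1-2) can be sketched as follows. This is an illustrative CPU-only simulation, not the patent's implementation: a plain dict stands in for GPU memory, byte counts stand in for tensors, and all names (`SwapManager`, `gpu_mem`, etc.) are hypothetical.

```python
import threading
import queue

class SwapManager:
    """Toy two-thread swap engine: one dict stands in for GPU memory,
    another for host (CPU) memory; capacity is counted in bytes."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu_mem = {}           # name -> size resident on the "GPU"
        self.cpu_mem = {}           # name -> size swapped out to the "CPU"
        self.used = 0
        self.lock = threading.Lock()
        self.space_freed = threading.Condition(self.lock)
        self.out_q = queue.Queue()  # swap-out requests
        self.in_q = queue.Queue()   # swap-in requests
        threading.Thread(target=self._swap_out_loop, daemon=True).start()
        threading.Thread(target=self._swap_in_loop, daemon=True).start()

    def _swap_out_loop(self):
        while True:
            name = self.out_q.get()
            with self.lock:
                size = self.gpu_mem.pop(name)
                self.cpu_mem[name] = size
                self.used -= size
                self.space_freed.notify_all()  # wake a waiting swap-in
            self.out_q.task_done()

    def _swap_in_loop(self):
        while True:
            name = self.in_q.get()
            with self.lock:
                size = self.cpu_mem[name]
                # swap in only when the GPU has free space; otherwise wait
                # until the swap-out thread releases enough memory
                while self.used + size > self.gpu_capacity:
                    self.space_freed.wait()
                del self.cpu_mem[name]
                self.gpu_mem[name] = size
                self.used += size
            self.in_q.task_done()

    def allocate(self, name, size):
        with self.lock:
            self.gpu_mem[name] = size
            self.used += size

mgr = SwapManager(gpu_capacity=100)
mgr.allocate("layer1_act", 60)
mgr.out_q.put("layer1_act")   # forward pass: swap out to "CPU"
mgr.out_q.join()
mgr.in_q.put("layer1_act")    # backward pass: swap back in
mgr.in_q.join()
```

After the demo runs, `layer1_act` is resident on the "GPU" again and host memory is empty, mirroring the forward swap-out / backward swap-in cycle described above.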
The static data collection in step (2) proceeds as follows:
(2-1) Analyze the dependencies among the data in the network model to be trained. Because a neural network executes its computation layer by layer, the order in which each piece of data is accessed and released during one training iteration can be determined; each piece of data has corresponding swap-out and swap-in operations, from which the global operation sequence (GOS) is constructed.
(2-2) Data in a neural network is usually represented as floating-point tensors, so the byte size of each piece of data can be obtained directly from its dimensions multiplied by the byte width of the floating-point type. This byte size equals both the space the data occupies in memory and the communication volume of its swap operations.
(2-3) Computation in a neural network is essentially tensor arithmetic, similar to matrix operations, with clear dimensional relationships; the floating-point operation count (FLOPs) of each layer can therefore be obtained by analyzing the corresponding tensor operations.
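The static quantities of steps (2-2) and (2-3) can be illustrated with a minimal sketch, assuming float32 data and taking a fully connected layer as the example tensor operation (the function names are hypothetical):

```python
FLOAT_BYTES = 4  # float32

def tensor_bytes(shape, dtype_bytes=FLOAT_BYTES):
    """Byte size of a tensor = product of its dimensions x bytes per
    element; per step (2-2) this also equals its swap communication
    volume and its memory footprint."""
    n = 1
    for d in shape:
        n *= d
    return n * dtype_bytes

def dense_layer_flops(batch, in_features, out_features):
    """FLOP count of a fully connected layer's forward pass, read off the
    tensor-operation dimensions (step 2-3): one multiply and one add per
    element of the (batch x in) @ (in x out) matrix product."""
    return 2 * batch * in_features * out_features

# example: a 32x4096 activation tensor and a 4096 -> 4096 dense layer
print(tensor_bytes((32, 4096)))           # 524288 bytes
print(dense_layer_flops(32, 4096, 4096))  # 1073741824 FLOPs
```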
The dynamic data collection in step (3) proceeds as follows:
(3-1) Using Nvidia's profiling tool nvprof, detect and time the kernel-function calls to obtain the execution time of each layer of the network.
(3-2) Similarly, time the data copies between CPU and GPU memory by detecting the cudaMemcpyAsync calls with nvprof, obtaining the PCIe communication time of each data item's swap operations.
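On real hardware these timings come from nvprof as described; the collection logic itself can be sketched host-side with Python's wall-clock timer, using pure-Python callables as stand-ins for GPU layers (all names here are hypothetical):

```python
import time

def time_layers(layers, inputs, repeats=5):
    """Host-side stand-in for nvprof-style per-layer timing: run each
    layer callable several times and keep the minimum wall-clock time,
    yielding one time measurement per layer as in step (3-1)."""
    timings = {}
    for name, fn in layers.items():
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn(inputs)
            best = min(best, time.perf_counter() - t0)
        timings[name] = best
    return timings

# toy "layers": pure-Python stand-ins for kernel launches
layers = {
    "conv1": lambda x: [v * 2.0 for v in x],
    "fc1":   lambda x: sum(v * v for v in x),
}
print(time_layers(layers, list(range(10000))))
```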
The construction of the swapping-strategy performance model in step (4) proceeds as follows:
(4-1) Build the GPU computation model, using mainly the information collected in steps (2-3) and (3-1). By fitting each layer's FLOP count against its measured execution time, the effective compute performance (FLOPS) of the current GPU is obtained. Further, for any minibatch size, each layer's FLOP count is obtained from step (2-3); combining it with the GPU performance curve first yields each layer's computation time, summing over all layers of the network then yields the execution time of one iteration, and accumulating over the iterations finally yields the execution time of the whole training process.
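A minimal sketch of step (4-1), assuming the linear model time = FLOP / FLOPS fitted by least squares through the origin (the function names are hypothetical):

```python
def fit_gpu_flops(layer_flops, layer_times):
    """Fit measured per-layer times against FLOP counts: for the model
    t = flop / FLOPS, least squares through the origin gives
    1/FLOPS = sum(f*t) / sum(f*f)."""
    num = sum(f * t for f, t in zip(layer_flops, layer_times))
    den = sum(f * f for f in layer_flops)
    return den / num  # effective FLOPS of the current GPU

def predict_training_time(layer_flops, gpu_flops, iterations):
    """Per-layer times from the performance curve, summed over layers for
    one iteration, then accumulated over all iterations."""
    iter_time = sum(f / gpu_flops for f in layer_flops)
    return iterations * iter_time

flops = [1e9, 2e9, 4e9]
times = [0.2, 0.4, 0.8]  # seconds, perfectly linear at 5 GFLOPS
rate = fit_gpu_flops(flops, times)
print(round(rate))                                            # 5000000000
print(round(predict_training_time(flops, rate, 10), 6))       # 14.0
```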
(4-2) Build the GPU memory model, using mainly the information collected in steps (2-1) and (2-2). Following the global operation sequence, swap-out and swap-in are executed in parallel by two separate threads; swap-out releases GPU memory while swap-in occupies it. At every moment and for any minibatch size, the memory footprint of each swap operation is obtained from step (2-2), and the total size of the data resident in GPU memory (swapped in minus swapped out) must not exceed the given GPU memory capacity.
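The memory constraint of step (4-2) can be sketched as a walk over the global operation sequence (a toy event list; the names are hypothetical):

```python
def check_memory_feasible(events, capacity):
    """Walk the global operation sequence: each event is ('in'|'out', bytes);
    swap-in occupies GPU memory, swap-out releases it. The resident total
    must never exceed the GPU memory capacity."""
    resident = 0
    for op, size in events:
        resident += size if op == "in" else -size
        if resident > capacity:
            return False
    return True

gos = [("in", 40), ("in", 50), ("out", 40), ("in", 30)]
print(check_memory_feasible(gos, capacity=100))  # True (peak residency is 90)
print(check_memory_feasible(gos, capacity=80))   # False
```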
(4-3) Build the PCIe communication model, using mainly the information collected in steps (2-2) and (3-2). From the size of each swapped data item and its corresponding measured transfer time, the effective PCIe bandwidth is obtained; for any minibatch size, the communication volume of each swap operation from step (2-2), combined with the effective bandwidth, yields the communication time of each swap operation.
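A minimal sketch of step (4-3), assuming the effective bandwidth is estimated as total bytes moved over total measured transfer time (the function names are hypothetical):

```python
def effective_bandwidth(sizes_bytes, transfer_times):
    """Effective PCIe bandwidth from the measured swaps of step (3-2):
    total bytes moved / total transfer time, in bytes per second."""
    return sum(sizes_bytes) / sum(transfer_times)

def swap_times(comm_volumes, bandwidth):
    """Communication time of each swap operation = its communication
    volume (step 2-2) / effective bandwidth."""
    return [v / bandwidth for v in comm_volumes]

bw = effective_bandwidth([8e9, 4e9], [1.0, 0.5])  # 12 GB in 1.5 s
print(bw)                   # 8000000000.0 (8 GB/s effective)
print(swap_times([2e9, 1e9], bw))  # [0.25, 0.125]
```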
(4-4) Make explicit the mutual constraints among GPU computation, memory, and PCIe communication, using mainly the three submodels of steps (4-1) to (4-3). With the goal that GPU computation is never blocked, every data item must be guaranteed to be swapped into GPU memory before its computation calls for it. The time at which a swap-in completes equals the time at which the swap-in starts plus its PCIe communication time; the time at which the swap-in can start depends on the progress of the layer-by-layer computation on the GPU and on whether the swap-out thread has freed enough GPU memory to hold the data being swapped in.
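The non-blocking constraint of step (4-4) can be sketched as follows, under the timing relations just stated (the function names and example times are hypothetical):

```python
def swap_in_finish(start_ready, space_ready, comm_time):
    """A swap-in can start only when both the computation schedule reaches
    it (start_ready) and the swap-out thread has freed enough GPU memory
    (space_ready); it finishes after its PCIe communication time."""
    return max(start_ready, space_ready) + comm_time

def not_blocked(layer_need_time, start_ready, space_ready, comm_time):
    """GPU computation is not blocked iff the swap-in finishes no later
    than the moment the layer's computation needs the data."""
    return swap_in_finish(start_ready, space_ready, comm_time) <= layer_need_time

# a layer's backward pass needs its data at t = 3.0 s; the swap-in is
# schedulable at 2.4 s, memory is free at 2.6 s, PCIe takes 0.3 s
print(not_blocked(3.0, 2.4, 2.6, 0.3))  # True  (finishes at 2.9 s)
print(not_blocked(3.0, 2.4, 2.9, 0.3))  # False (finishes at 3.2 s)
```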
The determination of the optimal policy in step (5) proceeds as follows:
According to the performance model of step (4), the training process is modeled as an optimization problem with minibatch size as the decision variable: the objective is to minimize training time, subject to the two constraints that GPU memory is limited and GPU computation is never blocked. Solving it yields, for the current hardware environment and network, the optimal minibatch size and its matching swapping strategy.
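Step (5) can be sketched as an exhaustive search over candidate minibatch sizes, with the step (4) submodels abstracted as callbacks (a toy cost model; all names are hypothetical):

```python
def choose_minibatch(candidates, predict_time, feasible):
    """Among candidate minibatch sizes, keep those satisfying both
    constraints (memory fits, compute never blocked) and pick the one
    with the shortest predicted training time. predict_time and feasible
    stand in for the performance-model callbacks built in step (4)."""
    best, best_t = None, float("inf")
    for mb in candidates:
        if not feasible(mb):
            continue
        t = predict_time(mb)
        if t < best_t:
            best, best_t = mb, t
    return best, best_t

# toy model: per-epoch time shrinks with batch size but swap overhead
# grows; the memory/non-blocking constraints cap the batch at 128
predict = lambda mb: 1000.0 / mb + 2.0 * mb ** 0.5
fits = lambda mb: mb <= 128
mb, t = choose_minibatch([16, 32, 64, 128, 256], predict, fits)
print(mb)  # 128
```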
Corresponding to the above method, the present invention also provides a GPU memory optimization system for deep learning training tasks, comprising:
a basic swap-operation design module, responsible for designing the basic swap-out and swap-in operations, including which data to swap and when to swap it;
a static data collection module, responsible for collecting, before training starts, the static data that characterizes the network model to be trained;
a dynamic data collection module, responsible for training the first few epochs without a swapping strategy and collecting, during this period, the dynamic data that characterizes the performance of the current hardware environment;
a performance model construction module, responsible for building the performance model of the swapping strategy from the collected static and dynamic data, and for making explicit the mutual constraints among GPU computation, memory, and PCIe communication;
an optimal policy determination module, responsible for determining the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy;
a continued training module, responsible for training the remaining epochs to completion with the optimal minibatch size and its matching swapping strategy.
The beneficial effects of the present invention are as follows. Using a swap-based GPU memory optimization strategy, the invention first solves the problem that an ultra-deep neural network model either cannot be trained or can only be trained with a minibatch size too small to be efficient. Further, compared with the prior art, the present invention designs a dynamic swapping strategy: precise modeling replaces heuristic design, and optimization finds the best training performance under the current hardware environment, making full use of GPU resources to improve the training efficiency of ultra-deep neural network models.
Description of the drawings
Fig. 1 is a flow chart of the GPU memory optimization method for deep learning training tasks of the present invention.
Fig. 2 is a flow chart of solving, in step (5) of the embodiment, for the optimal minibatch size and its matching swapping strategy.
Specific embodiment
The present invention is described further below with reference to the drawings and specific embodiments.
Referring to Fig. 1, the GPU memory optimization method for deep learning training tasks of the present invention comprises the following steps:
(1) Design the basic swap-out and swap-in operations, including which data to swap and when to swap it, as follows:
(1-1) Design the swap-out operation. The data that can be swapped out mainly comprises the parameters and their gradients, and the intermediate results and their gradients, produced during neural network training. After the GPU finishes computing the current layer, any data associated with the current layer that will not be used by the next layer's computation is immediately swapped out to CPU memory; otherwise it is swapped out after the next layer's computation. Swap-out is performed by a separate thread, independent of swap-in.
(1-2) Design the swap-in operation. The data that can be swapped in is the same as that swapped out. A piece of data is swapped in as soon as the GPU has free memory for it; otherwise swap-in waits until the swap-out thread has released enough space. Swap-in is likewise performed by a separate thread, independent of swap-out.
(2) Before training starts, collect static data that characterizes the network model to be trained, including: the global operation sequence (GOS), the byte size of each operand in the sequence, and the number of floating-point operations (FLOPs) required by each layer's computation, as follows:
(2-1) Analyze the dependencies among the data in the network model to be trained. Because a neural network executes its computation layer by layer, the order in which each piece of data is accessed and released during one training iteration can be determined; each piece of data has corresponding swap-out and swap-in operations, from which the global operation sequence GOS is constructed.
(2-2) Data in a neural network is usually represented as floating-point tensors, so the byte size of each piece of data can be obtained directly from its dimensions multiplied by the byte width of the floating-point type. This byte size equals both the space the data occupies in memory and the communication volume of its swap operations.
(2-3) Computation in a neural network is essentially tensor arithmetic, similar to matrix operations, with clear dimensional relationships; the floating-point operation count (FLOPs) of each layer can therefore be obtained by analyzing the corresponding tensor operations.
(3) Start training without any swapping strategy. Train the first 4 epochs with the minibatch size successively set to 4, 8, 16, and 32, and during this period collect dynamic data that characterizes the performance of the current hardware environment, including: for different minibatch sizes, the time required by each layer's computation within one iteration and the time required by each data transfer operation, as follows:
(3-1) Using Nvidia's profiling tool nvprof, detect and time the kernel-function calls to obtain the execution time of each layer of the network.
(3-2) Similarly, time the data copies between CPU and GPU memory by detecting the cudaMemcpyAsync calls with nvprof, obtaining the PCIe communication time of each data item's swap operations.
(4) From the collected static and dynamic information, build the performance model of the swapping strategy and make explicit the mutual constraints among GPU computation, memory, and PCIe communication, as follows:
(4-1) Build the GPU computation model, using mainly the information collected in steps (2-3) and (3-1). By fitting each layer's FLOP count against its measured execution time, the effective compute performance (FLOPS) of the current GPU is obtained. Further, for any minibatch size, each layer's FLOP count is obtained from step (2-3); combining it with the GPU performance curve first yields each layer's computation time, summing over all layers of the network then yields the execution time of one iteration, and accumulating over the iterations finally yields the execution time of the whole training process.
(4-2) Build the GPU memory model, using mainly the information collected in steps (2-1) and (2-2). Following the global operation sequence, swap-out and swap-in are executed in parallel by two separate threads; swap-out releases GPU memory while swap-in occupies it. At every moment and for any minibatch size, the memory footprint of each swap operation is obtained from step (2-2), and the total size of the data resident in GPU memory (swapped in minus swapped out) must not exceed the given GPU memory capacity.
(4-3) Build the PCIe communication model, using mainly the information collected in steps (2-2) and (3-2). From the size of each swapped data item and its corresponding measured transfer time, the effective PCIe bandwidth is obtained; for any minibatch size, the communication volume of each swap operation from step (2-2), combined with the effective bandwidth, yields the communication time of each swap operation.
(4-4) Make explicit the mutual constraints among GPU computation, memory, and PCIe communication, using mainly the three submodels of steps (4-1) to (4-3). With the goal that GPU computation is never blocked, every data item must be guaranteed to be swapped into GPU memory before its computation calls for it. The time at which a swap-in completes equals the time at which the swap-in starts plus its PCIe communication time; the time at which the swap-in can start depends on the progress of the layer-by-layer computation on the GPU and on whether the swap-out thread has freed enough GPU memory to hold the data being swapped in.
(5) Determine the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy, as follows: according to the performance model of step (4), the training process is modeled as an optimization problem with minibatch size as the decision variable; the objective is to minimize training time, subject to the two constraints that GPU memory is limited and GPU computation is never blocked. The solution procedure is shown in Fig. 2, finally yielding, for the current hardware environment and network, the optimal minibatch size and its matching swapping strategy.
(6) Train the remaining epochs to completion with the optimal minibatch size and its matching swapping strategy.
Another embodiment of the present invention provides a GPU memory optimization system for deep learning training tasks, comprising:
a basic swap-operation design module, responsible for designing the basic swap-out and swap-in operations, including which data to swap and when to swap it;
a static data collection module, responsible for collecting, before training starts, the static data that characterizes the network model to be trained;
a dynamic data collection module, responsible for training the first few epochs without a swapping strategy and collecting, during this period, the dynamic data that characterizes the performance of the current hardware environment;
a performance model construction module, responsible for building the performance model of the swapping strategy from the collected static and dynamic data, and for making explicit the mutual constraints among GPU computation, memory, and PCIe communication;
an optimal policy determination module, responsible for determining the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy;
a continued training module, responsible for training the remaining epochs to completion with the optimal minibatch size and its matching swapping strategy.
For the specific implementation of each of the above modules, see the description of the method of the present invention above.
Experimental data: the experimental environment is one NVIDIA Tesla M40 GPU card, and the ImageNet dataset is used. Alexnet (256) denotes training AlexNet with minibatch size set to 256; the other network configurations follow the same convention. The results are shown in Table 1.
Table 1. Experimental results
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from its principle and scope; the protection scope of the present invention shall be defined by the claims.
Claims (10)
1. A GPU memory optimization method for deep learning training tasks, characterized by comprising the following steps:
(1) designing basic swap-out and swap-in operations, including which data to swap and when to swap it;
(2) before training starts, collecting static data that characterizes the network model to be trained;
(3) starting training without a swapping strategy for the first few epochs, and collecting during this period dynamic data that characterizes the performance of the current hardware environment;
(4) building, from the collected static and dynamic data, a performance model of the swapping strategy, and making explicit the mutual constraints among GPU computation, memory, and PCIe communication;
(5) determining the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy;
(6) training the remaining epochs to completion with the optimal minibatch size and its matching swapping strategy.
2. The method according to claim 1, characterized in that step (1) comprises:
(1-1) designing the swap-out operation: the data swapped out mainly comprises the parameters and their gradients, and the intermediate results and their gradients, produced during neural network training; after the GPU finishes computing the current layer, any data associated with the current layer that will not be used by the next layer's computation is immediately swapped out to CPU memory, and otherwise it is swapped out after the next layer's computation; swap-out is performed by a separate thread, independent of swap-in;
(1-2) designing the swap-in operation: the data swapped in is the same as that swapped out; a piece of data is swapped in as soon as the current GPU has free memory for it, and otherwise swap-in waits until the swap-out thread has released enough space; swap-in is likewise performed by a separate thread, independent of swap-out.
3. The method according to claim 1, characterized in that the static data collected in step (2) comprises: the global operation sequence, the byte size of each operand in the sequence, and the number of floating-point operations required by each layer's computation.
4. The method according to claim 3, characterized in that step (2) comprises:
(2-1) analyzing the dependencies among the data in the network model to be trained and, according to the layer-by-layer execution of a neural network's computation, determining the order in which each piece of data is accessed and released during one training iteration, each piece of data having corresponding swap-out and swap-in operations, thereby constructing the global operation sequence;
(2-2) representing the data in the neural network as floating-point tensors, the byte size of each piece of data being obtained directly from its dimensions multiplied by the byte width of the floating-point type, this byte size equaling both the space the data occupies in memory and the communication volume of its swap operations;
(2-3) obtaining the floating-point operation count FLOP of each layer's computation in the neural network by analyzing the corresponding tensor operations.
5. The method according to claim 1 or 4, characterized in that when training starts in step (3), no swapping strategy is taken; the first 4 epochs are trained with the minibatch size successively set to 4, 8, 16, and 32, during which the dynamic data is collected.
6. The method according to claim 1, characterized in that the dynamic data collected in step (3) comprises: for different minibatch sizes, the time required by each layer's computation within one iteration and the time required by each data transfer operation.
7. The method according to claim 6, characterized in that step (3) comprises:
(3-1) obtaining the execution time of each layer of the network by using Nvidia's profiling tool nvprof to detect and time the kernel-function calls;
(3-2) obtaining the PCIe communication time of each data item's swap operations by using nvprof to detect and time the cudaMemcpyAsync calls that copy data between CPU and GPU memory.
8. The method according to claim 1, wherein step (4) comprises:
(4-1) establishing a GPU computation model: by fitting the FLOP of each layer's computation against its measured execution time, the effective computational performance (FLOP/s) of the current GPU is obtained; then, for any minibatch size, each layer's FLOP is computed and, combined with the GPU performance curve, first yields each layer's computation time; summing over all layers in the network yields the execution time of one iteration, and accumulating over multiple iterations yields the execution time of the entire training process;
(4-2) establishing a GPU memory model: according to the global operation order, swap-out and swap-in operations are executed in parallel by two separate threads; a swap-out releases GPU memory while a swap-in occupies it; at any moment, for any minibatch size, the memory footprint of each swap-in/swap-out operation is computed, and the total size of the data resident in GPU memory must not exceed the given GPU memory capacity;
(4-3) establishing a PCIe communication model: from the data size and corresponding transfer time of each swap-in/swap-out operation, the effective communication bandwidth of the current PCIe link is obtained; for any minibatch size, the data volume of each swap-in/swap-out operation is computed and, combined with the effective bandwidth, yields the communication time each operation requires;
(4-4) based on the three submodels of steps (4-1) to (4-3), and taking unblocked GPU computation as the goal, establishing the constraint relations among GPU computation, memory, and PCIe communication: each piece of data must have been swapped into GPU memory before the computation that uses it is invoked; the time at which a swap-in completes equals the time at which it starts plus its PCIe communication time; and the swap-in start time depends on the layer-by-layer progress of the GPU computation and on whether the swap-out thread has freed enough GPU memory to accommodate the data currently being swapped in.
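The three submodels of (4-1) to (4-3) can be sketched as simple linear fits. All numbers below (per-layer FLOP, measured times, the transfer used to calibrate bandwidth, the memory events) are invented illustrative values, not figures from the patent:

```python
# (4-1) GPU computation model: fit effective FLOP/s from measured layers,
# then predict per-layer time for any minibatch size (FLOP is assumed to
# scale linearly with minibatch size, as it does for most layers).
measured_flop = [2.0e9, 4.0e9, 1.0e9]            # per-layer FLOP at minibatch 4
measured_time = [0.002, 0.004, 0.001]            # measured seconds per layer
flops = sum(measured_flop) / sum(measured_time)  # effective FLOP/s of this GPU

def iter_time(minibatch, base_mb=4):
    """Predicted time of one iteration: sum of all layers' scaled times."""
    scale = minibatch / base_mb
    return sum(f * scale / flops for f in measured_flop)

# (4-3) PCIe communication model: effective bandwidth from one measured
# transfer, then communication time for a swap operation's data volume.
bandwidth = 256e6 / 0.020                        # bytes/s from a 256 MB, 20 ms copy

def swap_time(nbytes):
    return nbytes / bandwidth

# (4-2) GPU memory model: resident data after each swap event must fit in
# GPU memory (a swap-in adds bytes, a swap-out subtracts them).
def fits(swap_events, capacity):
    resident = 0
    for delta in swap_events:    # +bytes for swap-in, -bytes for swap-out
        resident += delta
        if resident > capacity:
            return False
    return True
```

The total training-time estimate of (4-1) is then just `iter_time(mb)` times the number of iterations; constraint (4-4) couples the three functions by requiring each `swap_time` interval to finish before the layer that needs the data starts.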
9. The method according to claim 1, wherein the method for determining the optimal strategy in step (5) is: based on the performance model established in step (4), the training process is modeled as an optimization problem with the minibatch size as the optimization variable; with the shortest training time as the objective, and with limited GPU memory and unblocked GPU computation as the two constraints, solving the problem yields the optimal minibatch size, and its matching swap-in/swap-out strategy, for training the current network under the current hardware environment.
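Because the candidate minibatch sizes form a small discrete set, the optimization of claim 9 reduces to a feasibility-filtered search: keep only sizes whose swap schedule fits memory and never stalls the GPU, then take the one with the shortest predicted epoch time. A toy enumeration with invented cost and constraint functions:

```python
def choose_minibatch(candidates, n_samples, iter_time, fits_memory, compute_unblocked):
    """Pick the minibatch size minimizing predicted epoch time, subject to
    the two constraints of claim 9: memory fits and compute is not blocked."""
    best = None
    for mb in candidates:
        if not (fits_memory(mb) and compute_unblocked(mb)):
            continue
        iters = -(-n_samples // mb)            # ceiling division: iterations/epoch
        epoch_time = iters * iter_time(mb)
        if best is None or epoch_time < best[1]:
            best = (mb, epoch_time)
    return best

# Toy cost model: a larger minibatch amortizes per-iteration overhead but
# eventually exceeds GPU memory (all numbers invented for illustration).
best = choose_minibatch(
    candidates=[4, 8, 16, 32, 64],
    n_samples=1024,
    iter_time=lambda mb: 0.001 * mb + 0.005,   # per-iteration time model
    fits_memory=lambda mb: mb <= 32,           # memory constraint
    compute_unblocked=lambda mb: True,         # PCIe never stalls compute here
)
```

With these toy functions the search discards 64 as infeasible and selects 32, the largest feasible size, because the fixed 5 ms per-iteration overhead dominates at small minibatch sizes.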
10. A GPU memory optimization system for deep learning training tasks, comprising:
a basic swap operation design module, responsible for designing the basic swap-in/swap-out operations, including which data to swap in and out and when to swap out;
a static data collection module, responsible for performing static data collection before training starts, to characterize the fundamental properties of the network model to be trained;
a dynamic data collection module, responsible for training several epochs after training starts without applying any swap-in/swap-out strategy, during which dynamic data are collected to characterize the performance of the current hardware environment;
a performance model construction module, responsible for establishing the performance model of the swap-in/swap-out strategy from the collected static and dynamic data, and for making explicit the constraint relations among GPU computation, memory, and PCIe communication;
an optimal strategy determination module, responsible for determining the optimal strategy, including the optimal minibatch size and its matching swap-in/swap-out strategy, according to the performance model;
a continued training module, responsible for continuing training in the remaining epochs with the optimal minibatch size and its matching swap-in/swap-out strategy until training ends.
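The six modules of claim 10 form a linear pipeline. A minimal skeleton (class and method names are my own invention, not from any released implementation; the returned policy is a placeholder) might look like:

```python
class GpuMemoryOptimizer:
    """Pipeline sketch of the system in claim 10. Each stage is a stub;
    a real system would hook into the training framework and nvprof."""

    def __init__(self, profile_epochs=4):
        self.profile_epochs = profile_epochs   # epochs trained without swapping
        self.static_data = None
        self.dynamic_data = None
        self.policy = None

    def collect_static(self, model):
        # Before training: per-layer tensor sizes, dependencies, FLOP counts.
        self.static_data = {"layers": getattr(model, "layers", [])}

    def collect_dynamic(self, train_step):
        # First few epochs without swapping: per-layer and per-copy timings.
        self.dynamic_data = [train_step(epoch) for epoch in range(self.profile_epochs)]

    def build_model_and_solve(self):
        # Performance model + constrained optimization (claims 8 and 9).
        self.policy = {"minibatch": 32, "swap_plan": []}   # placeholder result
        return self.policy

    def train_remaining(self, train_step, epochs):
        # Continue training with the chosen minibatch size and swap plan.
        return [train_step(e) for e in range(self.profile_epochs, epochs)]
```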
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910035753.9A CN109919310B (en) | 2019-01-15 | 2019-01-15 | GPU memory optimization method and system for deep learning training task |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109919310A true CN109919310A (en) | 2019-06-21 |
CN109919310B CN109919310B (en) | 2021-05-18 |
Family
ID=66960413
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910035753.9A Active CN109919310B (en) | 2019-01-15 | 2019-01-15 | GPU memory optimization method and system for deep learning training task |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109919310B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101526934A (en) * | 2009-04-21 | 2009-09-09 | 浪潮电子信息产业股份有限公司 | Construction method of GPU and CPU combined processor |
CN104035751A (en) * | 2014-06-20 | 2014-09-10 | 深圳市腾讯计算机系统有限公司 | Graphics processing unit based parallel data processing method and device |
CN105224502A (en) * | 2015-09-28 | 2016-01-06 | 浪潮(北京)电子信息产业有限公司 | A kind of degree of depth learning method based on GPU and system |
WO2018024232A1 (en) * | 2016-08-05 | 2018-02-08 | 上海寒武纪信息科技有限公司 | Device and method for executing neural network operation |
CN108229645A (en) * | 2017-04-28 | 2018-06-29 | 北京市商汤科技开发有限公司 | Convolution accelerates and computation processing method, device, electronic equipment and storage medium |
CN108460457A (en) * | 2018-03-30 | 2018-08-28 | 苏州纳智天地智能科技有限公司 | A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks |
Non-Patent Citations (4)
Title |
---|
CHEN MENG ET AL: "Training deeper models by gpu memory optimization on tensorflow", 《IN PROC. OF ML SYSTEMS WORKSHOP IN NIPS》 * |
LINNAN WANG ET AL: "Superneurons: dynamic GPU memory management for training deep neural networks", 《PPOPP '18: PROCEEDINGS OF THE 23RD ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING》 * |
MINSOO RHU ET AL: "vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design", 《2016 49TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO)》 * |
于齐: "分支与不规则访存在GPU上的优化方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110490300A (en) * | 2019-07-26 | 2019-11-22 | 苏州浪潮智能科技有限公司 | A kind of operation accelerated method, apparatus and system based on deep learning |
CN110490300B (en) * | 2019-07-26 | 2022-03-15 | 苏州浪潮智能科技有限公司 | Deep learning-based operation acceleration method, device and system |
CN110751282A (en) * | 2019-10-18 | 2020-02-04 | 北京百度网讯科技有限公司 | Processor memory optimization method and device for deep learning training task |
CN111078395A (en) * | 2019-11-12 | 2020-04-28 | 华中科技大学 | Deep learning GPU memory management optimization method and system based on tensor |
CN111078395B (en) * | 2019-11-12 | 2023-06-20 | 华中科技大学 | Tensor-based deep learning GPU memory management optimization method and system |
CN110942138A (en) * | 2019-11-13 | 2020-03-31 | 华中科技大学 | Deep neural network training method and system in hybrid memory environment |
CN110942138B (en) * | 2019-11-13 | 2022-02-15 | 华中科技大学 | Deep neural network training method and system in hybrid memory environment |
CN110941494A (en) * | 2019-12-02 | 2020-03-31 | 哈尔滨工程大学 | Deep learning-oriented GPU parallel computing data processing method |
CN111310054A (en) * | 2020-03-06 | 2020-06-19 | 中国科学院信息工程研究所 | Recommendation method and device based on adaptive Margin symmetry metric learning |
CN111814948B (en) * | 2020-06-18 | 2021-07-13 | 浙江大华技术股份有限公司 | Operation method and operation device of neural network and computer readable storage medium |
CN111814948A (en) * | 2020-06-18 | 2020-10-23 | 浙江大华技术股份有限公司 | Operation method and operation device of neural network and computer readable storage medium |
CN111882073A (en) * | 2020-07-17 | 2020-11-03 | 苏州浪潮智能科技有限公司 | Method and equipment for modifying distributed computation graph |
CN112306697B (en) * | 2020-12-31 | 2021-04-27 | 之江实验室 | Deep learning memory management method and system based on Tensor access |
CN112306697A (en) * | 2020-12-31 | 2021-02-02 | 之江实验室 | Deep learning memory management method and system based on Tensor access |
CN117687802A (en) * | 2024-02-02 | 2024-03-12 | 湖南马栏山视频先进技术研究院有限公司 | Deep learning parallel scheduling method and device based on cloud platform and cloud platform |
CN117687802B (en) * | 2024-02-02 | 2024-04-30 | 湖南马栏山视频先进技术研究院有限公司 | Deep learning parallel scheduling method and device based on cloud platform and cloud platform |
CN117892769A (en) * | 2024-03-15 | 2024-04-16 | 之江实验室 | Neural network training method, video memory scheduling method, system, equipment and product |
CN117892769B (en) * | 2024-03-15 | 2024-06-11 | 之江实验室 | Neural network training method, video memory scheduling method, system, equipment and product |
Also Published As
Publication number | Publication date |
---|---|
CN109919310B (en) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109919310A (en) | A kind of GPU Memory Optimize Method and system towards deep learning training mission | |
CN104657219B (en) | A kind of application program threads number dynamic adjusting method being used under isomery many-core system | |
CN105956021B (en) | A kind of automation task suitable for distributed machines study parallel method and its system | |
CN103617087B (en) | MapReduce optimizing method suitable for iterative computations | |
CN105653790B (en) | A kind of out-of order processor Cache memory access performance estimating method based on artificial neural network | |
CN106951499B (en) | A kind of knowledge mapping representation method based on translation model | |
CN102981807B (en) | Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment | |
CN105184367B (en) | The model parameter training method and system of deep neural network | |
CN107861606A (en) | A kind of heterogeneous polynuclear power cap method by coordinating DVFS and duty mapping | |
CN104765589B (en) | Grid parallel computation preprocess method based on MPI | |
CN111079921A (en) | Efficient neural network training and scheduling method based on heterogeneous distributed system | |
CN106339351A (en) | SGD (Stochastic Gradient Descent) algorithm optimization system and method | |
CN109524118A (en) | A kind of screen method for gestational diabetes based on machine learning and physical examination data | |
CN109242099A (en) | Training method, device, training equipment and the storage medium of intensified learning network | |
CN109102498A (en) | A kind of method of cluster type nucleus segmentation in cervical smear image | |
Durai et al. | Liver disease prediction using machine learning | |
CN110333933A (en) | A kind of HPL computation model emulation mode | |
CN111797833A (en) | Automatic machine learning method and system oriented to remote sensing semantic segmentation | |
CN106020773A (en) | Method for optimizing finite difference algorithm in heterogeneous many-core framework | |
WO2023236319A1 (en) | Convolutional neural network deployment and optimization method for microcontroller | |
CN110276689A (en) | Intelligent contract implementation method based on dynamic decision | |
CN110287114A (en) | A kind of method and device of database script performance test | |
CN105094949A (en) | Method and system for simulation based on instruction calculation model and feedback compensation | |
CN103455364B (en) | A kind of multi-core environment concurrent program Cache performance online obtains system and method | |
CN110109811B (en) | A kind of source tracing method towards GPU calculated performance problem |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |