CN109919310A - GPU memory optimization method and system for deep learning training tasks - Google Patents

GPU memory optimization method and system for deep learning training tasks Download PDF

Info

Publication number
CN109919310A
CN109919310A
Authority
CN
China
Prior art keywords
swapping
data
gpu
memory
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910035753.9A
Other languages
Chinese (zh)
Other versions
CN109919310B (en)
Inventor
刘万涛
郭锦荣
虎嵩林
韩冀中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201910035753.9A priority Critical patent/CN109919310B/en
Publication of CN109919310A publication Critical patent/CN109919310A/en
Application granted granted Critical
Publication of CN109919310B publication Critical patent/CN109919310B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention relates to a GPU memory optimization method and system for deep learning training tasks. The method comprises: (1) designing the basic swap-out and swap-in operations; (2) performing static data acquisition before training starts; (3) training several epochs first without the swapping strategy, during which dynamic data acquisition is performed; (4) establishing a performance model of the swapping strategy and making explicit the constraint relationships among GPU computation, GPU memory, and PCIe communication; (5) determining the optimal policy from the performance model; (6) training the remaining epochs to completion with the optimal minibatch size and its matching swapping strategy. The invention solves the problem that ultra-deep neural network models either cannot be trained at all or can only be trained with a minibatch size too small for efficient training; it fully utilizes GPU resources and improves the training efficiency of ultra-deep neural network models.

Description

GPU memory optimization method and system for deep learning training tasks
Technical field
The invention belongs to the field of deep learning and addresses the problem that ordinary GPU devices cannot train ultra-deep neural network models because of their limited memory. A GPU memory optimization method is proposed that, without introducing extra time overhead, makes the training task executable on a single GPU while making maximal use of GPU compute and memory resources, thereby substantially improving training efficiency.
Background technique
In recent years, with the successful application of deep learning in computer vision, speech recognition, natural language processing, machine translation, and many other fields, the scale of neural network models has grown steadily. Taking computer vision (CV) as an example, models for the classic image-classification task have evolved from the 8-layer AlexNet to the 152-layer ResNet: model depth increased 18-fold and accuracy improved by nearly 13%. Many studies show that deeper networks help the model extract richer, multi-level features from the input data, have stronger representational capacity, and achieve higher accuracy on more complex tasks. However, training ultra-deep neural network models is often very difficult: the large number of neurons and the accompanying floating-point operations place high demands on the compute power of the hardware, and the memory requirement grows linearly with both network depth and minibatch size. The many-core architecture of today's general-purpose GPUs is favored by deep learning tasks for its high parallelism, but GPU memory is relatively scarce. When training ultra-deep neural networks, two situations commonly arise: 1) memory overflows and training is impossible, or 2) training is possible but only with a very small minibatch size, so GPU compute resources are underutilized and training is slow and inefficient. It is therefore necessary to optimize GPU memory usage by exploiting the characteristics of ultra-deep neural network models: first solve the cannot-train problem, then select the optimal minibatch size to exploit GPU compute power and ultimately improve training efficiency.
Existing GPU memory optimization methods fall into two categories: recomputation and swapping. Recomputation discards some intermediate results during the forward pass instead of saving them; when the backward pass needs such a result, the corresponding forward computation is re-executed to recover it. The drawback is an extra 20-30% recomputation time, which is hard to accept for deep learning tasks that are already very time-consuming. The main idea of swapping is to use host-side CPU memory as backing store for GPU memory, analogous to the cache/DRAM relationship in computer architecture: GPU memory holds only the data relevant to the layer currently being computed, while all other data reside in CPU memory. Concretely, during forward propagation, data that have been produced on the GPU and will not be used by the next layer's computation are swapped out from GPU memory to CPU memory; during backward propagation, data previously swapped out to the CPU are swapped back into GPU memory before the next layer's computation needs them. Swapping is carried out over the PCIe bus.
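The swap schedule described above (swap each layer's data out during the forward pass once the next layer no longer needs it, and swap it back in just before its backward computation) can be sketched as follows. The layer names and the list encoding are purely illustrative, not part of the patent:

```python
# Build an illustrative operation schedule for a 4-layer network.
forward_order = ['L1', 'L2', 'L3', 'L4']

schedule = []
# Forward pass: compute each layer, then swap its data out to CPU memory,
# except the last layer, whose output is immediately needed by backward.
for i, layer in enumerate(forward_order):
    schedule.append(('compute_fwd', layer))
    if i + 1 < len(forward_order):
        schedule.append(('swap_out', layer))
# Backward pass: swap each layer's data back in just before its backward step.
for layer in reversed(forward_order):
    if layer != forward_order[-1]:
        schedule.append(('swap_in', layer))
    schedule.append(('compute_bwd', layer))
```

The resulting schedule keeps only one layer's working set resident on the GPU at a time, which is the source of the memory saving.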
A swapping strategy can migrate most of the data produced during training into CPU memory, reducing the GPU-side memory requirement from the scale of the whole network to the scale of a single layer, and can thus make most ultra-deep neural networks trainable. However, current swapping strategies are mostly designed around heuristic rules fixed in advance (which data to swap and when to perform the swap operations); they cannot make explicit the constraint relationships among computation, memory, and communication at run time, so late communication frequently blocks GPU computation and introduces extra time overhead. Moreover, existing techniques address only the cannot-train problem and ignore the performance question, i.e., training efficiency once training becomes possible.
Summary of the invention
In view of the problems and shortcomings of the prior art described above, the technical problem to be solved by the present invention is to provide a GPU memory optimization method and system for deep learning training tasks; specifically, a dynamic swapping strategy between CPU and GPU memory that not only solves the problem that ultra-deep neural network models cannot be trained, but, once training is possible, further searches for the optimal training configuration, providing an efficient solution for training ultra-deep neural network models.
To solve the above problems, the present invention adopts the following technical solution:
A GPU memory optimization method for deep learning training tasks, with the following specific steps:
(1) Design the basic swap-out and swap-in operations, including which data to swap and when to swap them.
(2) Before training starts, first perform static data acquisition to characterize the network model to be trained, including: the global operation sequence (GOS), the byte size of each operand in the sequence, and the floating-point operations (FLOP) required by each layer's computation.
(3) Start training without the swapping strategy and first train several epochs (an epoch is one training pass in which every sample in the dataset is used exactly once), with the minibatch size set to small typical values. During this phase perform dynamic data acquisition to characterize the performance of the current hardware environment, including: the time required by each layer's computation within one iteration under different minibatch sizes, and the time required by each data-swap operation.
(4) From the collected static and dynamic information, establish the performance model of the swapping strategy and make explicit the constraint relationships among GPU computation, GPU memory, and PCIe communication.
(5) Determine the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy.
(6) Train the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
The design of the basic swap-out and swap-in operations in step (1) proceeds as follows:
(1-1) Design the swap-out operation. The data that can be swapped out during neural network training mainly comprise the parameters and their gradients, and the intermediate results and their gradients. After the GPU finishes computing the current layer, if none of the data associated with the current layer will be used by the next layer to be executed, the swap-out to CPU memory is performed immediately; otherwise it is deferred until after the next layer's computation. Swap-out is performed by a dedicated thread, independent of the swap-in operation;
(1-2) Design the swap-in operation. The data that can be swapped in are the same as those swapped out. As long as the GPU currently has free memory space, the swap-in is performed immediately; otherwise it waits until the swap-out thread has released sufficient space. Swap-in is likewise performed by a dedicated thread, independent of the swap-out operation.
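A minimal sketch of the two independent swap threads and the "swap in only when space is free" rule of (1-1)/(1-2). All names here are hypothetical; a real implementation would issue asynchronous `cudaMemcpyAsync` transfers on dedicated CUDA streams rather than tracking bytes in Python:

```python
import threading

class SwapManager:
    """Illustrative bookkeeping for the swap-out / swap-in threads."""

    def __init__(self, gpu_capacity_bytes):
        self.capacity = gpu_capacity_bytes
        self.used = 0
        self.lock = threading.Lock()
        self.space_freed = threading.Condition(self.lock)

    def swap_out(self, tensor_bytes):
        # Runs on its own thread: releasing GPU memory, then waking any
        # swap-in thread waiting for space.
        with self.space_freed:
            self.used -= tensor_bytes
            self.space_freed.notify_all()

    def swap_in(self, tensor_bytes):
        # Runs on its own thread: performed immediately if free memory
        # exists, otherwise blocks until the swap-out thread frees space.
        with self.space_freed:
            while self.used + tensor_bytes > self.capacity:
                self.space_freed.wait()
            self.used += tensor_bytes
```

The two operations share only the memory counter, matching the patent's requirement that the threads be mutually independent.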
The static data acquisition in step (2) proceeds as follows:
(2-1) Analyze the dependency relationships of each datum in the network model to be trained. Because a neural network executes its computation layer by layer, the order in which each datum is accessed and released within one training iteration can be determined; each datum has corresponding swap-in and swap-out operations, from which the global operation sequence (GOS) is constructed;
(2-2) Data in a neural network are usually represented as floating-point tensors, so the byte size of each datum is obtained directly from its dimension information multiplied by the number of bytes per floating-point number. This byte size equals both the space the datum occupies in memory and the traffic volume of its swap-in/swap-out communication;
(2-3) Computation in a neural network is essentially tensor operations, similar to matrix operations, with clearly defined dimension relationships; the floating-point operation count (FLOP) of each layer's computation is obtained by analyzing the corresponding tensor-operation relationship.
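As a worked instance of (2-2) and (2-3), consider a hypothetical fully connected layer computing y = x @ W with float32 data; all dimensions are made up for illustration:

```python
# Assumed shapes for one dense layer: x is (batch, d_in), W is (d_in, d_out).
batch, d_in, d_out = 32, 1024, 4096
dtype_bytes = 4  # float32

# (2-2) byte size of each operand = product of its dimensions times the
# bytes per float; this is both its GPU-memory footprint and the traffic
# volume of its swap-in/swap-out transfer.
x_bytes = batch * d_in * dtype_bytes
w_bytes = d_in * d_out * dtype_bytes
y_bytes = batch * d_out * dtype_bytes

# (2-3) FLOP of the layer from its tensor-operation shape: a
# (batch x d_in) @ (d_in x d_out) matmul costs 2 * batch * d_in * d_out.
layer_flop = 2 * batch * d_in * d_out
```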
The dynamic data acquisition in step (3) proceeds as follows:
(3-1) Using Nvidia's profiling tool nvprof, detect and time the kernel-function calls to obtain the execution time required by each layer of the network;
(3-2) Similarly, use nvprof to detect and time the command that copies data between CPU and GPU memory, cudaMemcpyAsync, to obtain the PCIe communication time required when each datum performs a swap-in or swap-out operation.
The establishment of the performance model of the swapping strategy in step (4) proceeds as follows:
(4-1) Establish the GPU computation model, mainly using the information collected in steps (2-3) and (3-1). By fitting each layer's FLOP against its measured computation time, the computation performance (FLOP/s) of the current GPU is obtained. Further, for any minibatch size, each layer's FLOP is obtained from step (2-3); combined with the GPU performance curve, first the computation time of each layer is obtained, then the execution time of one iteration is obtained by summing over all layers of the network, and finally the execution time of the whole training process is obtained by summing over the iterations;
(4-2) Establish the GPU memory model, mainly using the information collected in steps (2-1) and (2-2). Following the global operation sequence, swap-in and swap-out operations are executed in parallel by two independent threads: swap-out releases GPU memory and swap-in occupies it. At any moment, for any minibatch size, the memory size occupied by each swap operation's data is obtained from step (2-2), and the size of the data resident in GPU memory (swapped in minus swapped out) must be less than or equal to the given GPU memory capacity;
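The memory constraint of (4-2) amounts to replaying the global operation sequence and checking that resident data never exceed GPU capacity. The GOS encoding and names below are assumptions for illustration only:

```python
def memory_feasible(gos, sizes, gpu_capacity):
    """Replay a global operation sequence and check the memory constraint.

    gos: list of ('swap_in' | 'swap_out', tensor_name) in execution order.
    sizes: byte size of each tensor, from static analysis (2-2).
    """
    resident = 0
    for op, name in gos:
        if op == 'swap_in':
            resident += sizes[name]       # swap-in occupies GPU memory
            if resident > gpu_capacity:   # constraint of (4-2) violated
                return False
        else:
            resident -= sizes[name]       # swap-out releases GPU memory
    return True
```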
(4-3) Establish the PCIe communication model, mainly using the information collected in steps (2-2) and (3-2). From the size of each swapped datum and its corresponding transfer time, the effective communication bandwidth of the current PCIe link is obtained. For any minibatch size, the traffic volume of each swap operation is obtained from step (2-2); combined with the effective bandwidth, the communication time required by each swap-in or swap-out operation is obtained;
(4-4) Model the constraint relationships among GPU computation, memory, and PCIe communication, mainly using the three sub-models of steps (4-1) to (4-3). Taking non-blocking GPU computation as the goal, each datum must be guaranteed to have been swapped into GPU memory before its computation calls for it: the time at which a swap-in completes equals the time at which it starts plus its PCIe communication time, and the time at which it can start depends on the progress of layer-by-layer computation and on whether the swap-out thread has vacated enough GPU memory to accommodate the datum being swapped in.
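A minimal sketch of sub-models (4-1) and (4-3) and the non-blocking condition of (4-4). The profiling numbers are invented stand-ins for real nvprof measurements, and a simple ratio fit replaces whatever fitting procedure the patent actually uses:

```python
# (4-1) compute model: effective FLOP/s fitted from profiled layers.
layer_flops   = [1e9, 2e9, 4e9, 8e9]                  # static analysis (2-3)
layer_times_s = [0.11e-3, 0.21e-3, 0.41e-3, 0.80e-3]  # nvprof timings (3-1)
effective_flops = sum(layer_flops) / sum(layer_times_s)

def predict_compute_time(flop):
    return flop / effective_flops

# (4-3) communication model: effective PCIe bandwidth from profiled copies.
xfer_bytes   = [64e6, 128e6]      # swap sizes from (2-2)
xfer_times_s = [5.4e-3, 10.7e-3]  # cudaMemcpyAsync timings (3-2)
bandwidth = sum(xfer_bytes) / sum(xfer_times_s)

def predict_comm_time(nbytes):
    return nbytes / bandwidth

# (4-4) non-blocking condition for one datum: its swap-in, started at
# swap_start_s, must finish before compute reaches its layer at layer_start_s.
def compute_not_blocked(swap_start_s, nbytes, layer_start_s):
    return swap_start_s + predict_comm_time(nbytes) <= layer_start_s
```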
The determination of the optimal policy in step (5) proceeds as follows:
Based on the performance model proposed in step (4), the training process is modeled as an optimization problem with the minibatch size as the optimization variable: the objective is the shortest training time, subject to two constraints, limited GPU memory and non-blocking GPU computation. Solving it yields the optimal minibatch size, and its matching swapping strategy, for training the current network under the current hardware configuration.
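Step (5) can be sketched as a search over candidate minibatch sizes, where `feasible()` stands in for the combined memory and non-blocking constraints of (4-2)/(4-4) and `epoch_time_model()` for the computation model of (4-1); both callables are placeholders, not the patent's actual solver:

```python
def choose_minibatch(candidates, epoch_time_model, feasible):
    """Return (best minibatch size, predicted time), or None if infeasible."""
    best = None
    for mb in candidates:
        if not feasible(mb):          # violates memory or blocking constraint
            continue
        t = epoch_time_model(mb)      # predicted epoch time from the model
        if best is None or t < best[1]:
            best = (mb, t)
    return best
```

Usage under toy assumptions: with a model where larger batches amortize fixed cost (`1000/mb + 0.01*mb`) and memory capping feasibility at 256, the search picks 256.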
Corresponding to the above method, the present invention also provides a GPU memory optimization system for deep learning training tasks, comprising:
a basic swap-operation design module, responsible for designing the basic swap-out and swap-in operations, including which data to swap and when to swap them;
a static data acquisition module, responsible for performing static data acquisition before training starts, to characterize the network model to be trained;
a dynamic data acquisition module, responsible for training several epochs without the swapping strategy after training starts, during which dynamic data acquisition is performed to characterize the performance of the current hardware environment;
a performance-model construction module, responsible for establishing the performance model of the swapping strategy from the collected static and dynamic data, and making explicit the constraint relationships among GPU computation, memory, and PCIe communication;
an optimal-policy determination module, responsible for determining the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy;
a continued-training module, responsible for training the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
The beneficial effects of the present invention are: using the swapping-based GPU memory optimization strategy, it first solves the problem that ultra-deep neural network models either cannot be trained or can only be trained with a minibatch size too small for efficient training; further, compared with the prior art, the present invention designs a dynamic swapping strategy, replacing heuristic design with precise modeling, and obtains by optimization the best training performance under the current hardware environment, fully utilizing GPU resources to improve the training efficiency of ultra-deep neural network models.
Detailed description of the invention
Fig. 1 is a flowchart of the GPU memory optimization method for deep learning training tasks of the present invention.
Fig. 2 is a flowchart of solving for the optimal minibatch size and its matching swapping strategy in step (5) of the embodiment.
Specific embodiment
The present invention is described further below with reference to the drawings and a specific embodiment.
Referring to Fig. 1, a GPU memory optimization method for deep learning training tasks of the present invention comprises the following steps:
(1) Design the basic swap-out and swap-in operations, including which data to swap and when to swap them. The specific steps are as follows:
(1-1) Design the swap-out operation. The data that can be swapped out during neural network training mainly comprise the parameters and their gradients, and the intermediate results and their gradients. After the GPU finishes computing the current layer, if none of the data associated with the current layer will be used by the next layer to be executed, the swap-out to CPU memory is performed immediately; otherwise it is deferred until after the next layer's computation. Swap-out is performed by a dedicated thread, independent of the swap-in operation;
(1-2) Design the swap-in operation. The data that can be swapped in are the same as those swapped out. As long as the GPU currently has free memory space, the swap-in is performed immediately; otherwise it waits until the swap-out thread has released sufficient space. Swap-in is likewise performed by a dedicated thread, independent of the swap-out operation.
(2) Before training starts, first perform static data acquisition to characterize the network model to be trained, including: the global operation sequence (GOS), the byte size of each operand in the sequence, and the floating-point operations (FLOP) required by each layer's computation. The specific steps are as follows:
(2-1) Analyze the dependency relationships of each datum in the network model to be trained. Because a neural network executes its computation layer by layer, the order in which each datum is accessed and released within one training iteration can be determined; each datum has corresponding swap-in and swap-out operations, from which the global operation sequence (GOS) is constructed;
(2-2) Data in a neural network are usually represented as floating-point tensors, so the byte size of each datum is obtained directly from its dimension information multiplied by the number of bytes per floating-point number. This byte size equals both the space the datum occupies in memory and the traffic volume of its swap-in/swap-out communication;
(2-3) Computation in a neural network is essentially tensor operations, similar to matrix operations, with clearly defined dimension relationships; the floating-point operation count (FLOP) of each layer's computation is obtained by analyzing the corresponding tensor-operation relationship.
(3) Start training without the swapping strategy and first train 4 epochs, with the minibatch size set successively to 4, 8, 16, and 32. During this phase perform dynamic data acquisition to characterize the performance of the current hardware environment, including: the time required by each layer's computation within one iteration under different minibatch sizes, and the time required by each data-swap operation. The specific steps are as follows:
(3-1) Using Nvidia's profiling tool nvprof, detect and time the kernel-function calls to obtain the execution time required by each layer of the network;
(3-2) Similarly, use nvprof to detect and time the command that copies data between CPU and GPU memory, cudaMemcpyAsync, to obtain the PCIe communication time required when each datum performs a swap-in or swap-out operation.
(4) From the collected static and dynamic information, establish the performance model of the swapping strategy and make explicit the constraint relationships among GPU computation, memory, and PCIe communication. The specific steps are as follows:
(4-1) Establish the GPU computation model, mainly using the information collected in steps (2-3) and (3-1). By fitting each layer's FLOP against its measured computation time, the computation performance (FLOP/s) of the current GPU is obtained. Further, for any minibatch size, each layer's FLOP is obtained from step (2-3); combined with the GPU performance curve, first the computation time of each layer is obtained, then the execution time of one iteration is obtained by summing over all layers of the network, and finally the execution time of the whole training process is obtained by summing over the iterations;
(4-2) Establish the GPU memory model, mainly using the information collected in steps (2-1) and (2-2). Following the global operation sequence, swap-in and swap-out operations are executed in parallel by two independent threads: swap-out releases GPU memory and swap-in occupies it. At any moment, for any minibatch size, the memory size occupied by each swap operation's data is obtained from step (2-2), and the size of the data resident in GPU memory (swapped in minus swapped out) must be less than or equal to the given GPU memory capacity;
(4-3) Establish the PCIe communication model, mainly using the information collected in steps (2-2) and (3-2). From the size of each swapped datum and its corresponding transfer time, the effective communication bandwidth of the current PCIe link is obtained. For any minibatch size, the traffic volume of each swap operation is obtained from step (2-2); combined with the effective bandwidth, the communication time required by each swap-in or swap-out operation is obtained;
(4-4) Model the constraint relationships among GPU computation, memory, and PCIe communication, mainly using the three sub-models of steps (4-1) to (4-3). Taking non-blocking GPU computation as the goal, each datum must be guaranteed to have been swapped into GPU memory before its computation calls for it: the time at which a swap-in completes equals the time at which it starts plus its PCIe communication time, and the time at which it can start depends on the progress of layer-by-layer computation and on whether the swap-out thread has vacated enough GPU memory to accommodate the datum being swapped in.
(5) Determine the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy. The specific steps are as follows:
Based on the performance model proposed in step (4), the training process is modeled as an optimization problem with the minibatch size as the optimization variable: the objective is the shortest training time, subject to two constraints, limited GPU memory and non-blocking GPU computation. The solution procedure is shown in Fig. 2; it finally yields the optimal minibatch size, and its matching swapping strategy, for training the current network under the current hardware configuration.
(6) Train the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
Another embodiment of the present invention provides a GPU memory optimization system for deep learning training tasks, comprising:
a basic swap-operation design module, responsible for designing the basic swap-out and swap-in operations, including which data to swap and when to swap them;
a static data acquisition module, responsible for performing static data acquisition before training starts, to characterize the network model to be trained;
a dynamic data acquisition module, responsible for training several epochs without the swapping strategy after training starts, during which dynamic data acquisition is performed to characterize the performance of the current hardware environment;
a performance-model construction module, responsible for establishing the performance model of the swapping strategy from the collected static and dynamic data, and making explicit the constraint relationships among GPU computation, memory, and PCIe communication;
an optimal-policy determination module, responsible for determining the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy;
a continued-training module, responsible for training the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
The specific implementation of each module above is described in the explanation of the method of the present invention.
Experimental data: the experimental environment is one NVIDIA Tesla M40 GPU card using the ImageNet dataset. AlexNet (256) denotes training AlexNet with minibatch size set to 256; the remaining network configurations follow the same convention. The results are shown in Table 1.
Table 1. Experimental results
The above embodiment is intended only to illustrate, not to limit, the technical solution of the present invention. Those of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from its principle and scope; the protection scope of the present invention shall be defined by the claims.

Claims (10)

1. A GPU memory optimization method for deep learning training tasks, characterized by comprising the following steps:
(1) designing the basic swap-out and swap-in operations, including which data to swap and when to swap them;
(2) before training starts, first performing static data acquisition to characterize the network model to be trained;
(3) starting training without the swapping strategy and first training several epochs, during which dynamic data acquisition is performed to characterize the performance of the current hardware environment;
(4) from the collected static and dynamic data, establishing the performance model of the swapping strategy and making explicit the constraint relationships among GPU computation, memory, and PCIe communication;
(5) determining the optimal policy from the performance model, including the optimal minibatch size and its matching swapping strategy;
(6) training the remaining epochs to completion using the optimal minibatch size and its matching swapping strategy.
2. The method according to claim 1, characterized in that step (1) comprises:
(1-1) designing the swap-out operation: the data to be swapped out mainly comprise the parameters and their gradients, and the intermediate results and their gradients; after the GPU finishes computing the current layer, if none of the data associated with the current layer will be used by the next layer to be executed, the swap-out to CPU memory is performed immediately, otherwise it is deferred until after the next layer's computation; swap-out is performed by a dedicated thread, independent of the swap-in operation;
(1-2) designing the swap-in operation: the data to be swapped in are the same as those swapped out; as long as the GPU currently has free memory space, the swap-in is performed immediately, otherwise it waits until the swap-out thread has released sufficient space; swap-in is likewise performed by a dedicated thread, independent of the swap-out operation.
3. The method according to claim 1, characterized in that the data collected by the static data acquisition of step (2) comprise: the global operation sequence, the byte size of each operand in the sequence, and the floating-point operations required by each layer's computation.
4. The method according to claim 3, characterized in that step (2) comprises:
(2-1) analyzing the dependency relationships of each datum in the network model to be trained; since a neural network executes its computation layer by layer, determining the order in which each datum is accessed and released within one training iteration, each datum having corresponding swap-in and swap-out operations, and thereby constructing the global operation sequence;
(2-2) data in the neural network being represented as floating-point tensors, obtaining the byte size of each datum directly from its dimension information multiplied by the number of bytes per floating-point number, this byte size equaling both the space the datum occupies in memory and the traffic volume of its swap communication;
(2-3) obtaining the floating-point operation count (FLOP) of each layer's computation in the neural network by analyzing the corresponding tensor-operation relationship.
5. method according to claim 1 or 4, which is characterized in that when step (3) starts to train, do not take swapping in and out Strategy first trains 4 epoches, minibatch size to be successively set as 4,8,16,32, carry out dynamic data during this period Acquisition.
6. the method according to claim 1, wherein step (3) described Dynamic Data Acquiring, the data of acquisition It include: to calculate required time, each data exchange operation for each layer in the different next iteration of minibatch size The required time.
7. The method according to claim 6, wherein step (3) comprises:
(3-1) using Nvidia's profiling tool nvprof to detect and time the kernel function calls, thereby obtaining the execution time required by each layer of the network;
(3-2) using nvprof to detect and time the cudaMemcpyAsync calls that copy data between CPU and GPU memory, thereby obtaining the PCIe communication time required by each datum's swap-in and swap-out operations.
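Step (3-2) amounts to extracting host-to-device and device-to-host copy durations from a profiler trace. The sketch below parses a made-up trace in the spirit of `nvprof --print-gpu-trace --csv` output; the sample rows are fabricated for illustration and are not real profiler output.

```python
import csv
import io

# Illustrative trace; the rows below are invented, not real nvprof output.
trace = """Start,Duration,Size,Name
1.000ms,0.820ms,64.000MB,[CUDA memcpy HtoD]
2.100ms,0.050ms,,volta_sgemm_128x64_nn
2.900ms,0.790ms,64.000MB,[CUDA memcpy DtoH]
"""

def memcpy_times_ms(trace_csv):
    """Sum the duration of every host<->device copy, i.e. the PCIe time
    of swap-in (HtoD) and swap-out (DtoH) operations."""
    times = {"HtoD": 0.0, "DtoH": 0.0}
    for row in csv.DictReader(io.StringIO(trace_csv)):
        for direction in times:
            if direction in row["Name"]:
                times[direction] += float(row["Duration"].rstrip("ms"))
    return times

print(memcpy_times_ms(trace))  # {'HtoD': 0.82, 'DtoH': 0.79}
```

Kernel rows (here the sgemm line) are skipped, so the same trace also yields the per-layer execution times of step (3-1) by filtering on kernel names instead.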
8. The method according to claim 1, wherein step (4) comprises:
(4-1) establishing a GPU computation model: fitting the FLOP of each layer's computation against its measured execution time to obtain the computing performance (FLOPS) of the current GPU; then, for any minibatch size, computing each layer's FLOP and, combined with the GPU performance curve, first obtaining each layer's computation time, next obtaining the execution time of one iteration by summing over all layers of the network, and finally obtaining the execution time of the entire training process by accumulating over multiple iterations;
(4-2) establishing a GPU memory model: according to the global operation sequence, swap-out and swap-in operations are executed in parallel by two separate threads; a swap-out releases GPU memory while a swap-in occupies GPU memory; at any moment and for any minibatch size, the memory occupied by the data of each swap operation is computed, and the total size of the data resident in GPU memory must not exceed the given GPU memory capacity;
(4-3) establishing a PCIe communication model: from the size of each swapped datum and its corresponding transfer time, deriving the effective communication bandwidth of the current PCIe link; then, for any minibatch size, computing the communication volume of each swap operation and, combined with the effective bandwidth, obtaining the communication time required by each swap-in and swap-out operation;
(4-4) according to the three sub-models of steps (4-1) to (4-3), and taking non-blocking of GPU computation as the objective, establishing the constraint relationships among GPU computation, memory and PCIe communication; each datum must be swapped into GPU memory before the computation that uses it is invoked; its swap-in completion time equals the time point at which the swap-in starts plus its PCIe communication time, where the start time depends on the layer-by-layer progress of the GPU computation and on whether the swap-out thread has freed enough GPU memory to accommodate the data currently being swapped in.
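The sub-models above can be instantiated as simple rate equations. The sketch below is a toy version of the constraint in step (4-4): the layer list, FLOPS, and bandwidth numbers are invented, and the prefetch schedule (swap in layer i+1 while layer i computes) is one simple way to realize "swapped in before the computation is invoked", not necessarily the patent's exact schedule.

```python
# Toy instantiation of the sub-models of claim 8; all numbers invented.
GPU_FLOPS = 10e12   # fitted from measured (FLOP, time) pairs (4-1)
PCIE_BW = 12e9      # effective bytes/s fitted from timed copies (4-3)

layers = [          # (flop, input_bytes) per layer
    (4e12, 1 * 2**30),
    (8e12, 2 * 2**30),
    (2e12, 1 * 2**30),
]

def compute_time(flop):
    return flop / GPU_FLOPS            # sub-model (4-1)

def swap_time(nbytes):
    return nbytes / PCIE_BW            # sub-model (4-3)

def swap_in_never_blocks(layers):
    """Constraint of step (4-4): each layer's data must finish swapping
    in before the GPU reaches that layer. Here we prefetch layer i+1
    while layer i computes and check the overlap suffices."""
    t_compute = compute_time(layers[0][0])
    for flop, nbytes in layers[1:]:
        # this layer's swap-in overlaps the previous layer's compute
        if swap_time(nbytes) > t_compute:
            return False
        t_compute = compute_time(flop)
    return True

print(swap_in_never_blocks(layers))  # True for these numbers
```

With these rates each swap-in hides entirely under the preceding layer's computation, so the GPU never stalls; a sufficiently large tensor (or slow link) would violate the constraint and block the computation.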
9. The method according to claim 1, wherein the method of determining the optimal strategy in step (5) is: according to the performance model established in step (4), modeling the training process as an optimization problem with the minibatch size as the optimization variable, the shortest training time as the objective, and the limited GPU memory together with non-blocking of GPU computation as the two constraints; solving this problem yields the optimal minibatch size and the matching swap-in/swap-out strategy for training the current network under the current hardware configuration.
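Because the decision variable is a single minibatch size, the optimization of claim 9 can be sketched as constrained enumeration. Every model function below is an invented stand-in for the fitted sub-models of claim 8; the cost curve, per-sample memory, and bandwidth cutoff are illustrative numbers only.

```python
# Sketch of claim 9: pick the minibatch size minimizing estimated
# training time subject to the memory and non-blocking constraints.
SAMPLES = 50000          # samples per epoch (invented)
GPU_MEM = 8 * 2**30      # GPU memory capacity in bytes (invented)

def iter_time(mb):               # from the GPU computation model (4-1)
    return 0.05 + 0.004 * mb     # fixed overhead + per-sample cost

def peak_memory(mb):             # from the GPU memory model (4-2)
    return mb * 180 * 2**20      # 180 MiB of resident data per sample

def swap_fits_bandwidth(mb):     # from the PCIe model (4-3)/(4-4)
    return mb <= 96              # pretend PCIe keeps up until mb = 96

def best_minibatch(candidates):
    feasible = [mb for mb in candidates
                if peak_memory(mb) <= GPU_MEM and swap_fits_bandwidth(mb)]
    # epoch time = iterations per epoch * time per iteration
    return min(feasible, key=lambda mb: (SAMPLES / mb) * iter_time(mb))

print(best_minibatch([4, 8, 16, 32, 45, 64, 96, 128]))  # 45
```

With these numbers the memory constraint caps the feasible set at 45 even though PCIe could sustain larger batches, and amortizing the fixed per-iteration overhead makes the largest feasible size optimal, matching the patent's motivation that too-small minibatches waste GPU resources.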
10. A GPU memory optimization system for deep learning training tasks, comprising:
a basic swap operation design module, responsible for designing the basic swap-in and swap-out operations, including the data to be swapped and the timing of swap-out and swap-in;
a static data acquisition module, responsible for performing static data acquisition before training starts, used to characterize the basic properties of the network model to be trained;
a dynamic data acquisition module, responsible for training several epochs without any swap strategy after training starts, during which dynamic data acquisition is performed to characterize the performance of the current hardware environment;
a performance model construction module, responsible for establishing the performance model of the swap-in/swap-out strategy from the collected static and dynamic data, and for making explicit the constraint relationships among GPU computation, memory and PCIe communication;
an optimal strategy determination module, responsible for determining the optimal strategy from the performance model, including the optimal minibatch size and its matching swap-in/swap-out strategy;
a continued training module, responsible for continuing the training for the remaining epochs with the optimal minibatch size and its matching swap-in/swap-out strategy until training finishes.
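The data flow among the six modules of claim 10 can be summarized as a plain function pipeline. Every callable here is a trivial placeholder for the corresponding module, with invented names, return shapes, and epoch counts; only the ordering reflects the claim.

```python
# Placeholder implementations: one function per module of claim 10.
def design_swap_ops():            return "swap ops"
def collect_static(model):        return {"op_seq": [], "bytes": {}, "flop": {}}
def collect_dynamic(model):       return {"layer_times": {}, "copy_times": {}}
def build_perf_model(s, d):       return {"static": s, "dynamic": d}
def pick_strategy(perf):          return {"minibatch": 32, "swaps": []}

def train(model, epochs=90, warmup_epochs=4):
    design_swap_ops()                         # module 1
    static = collect_static(model)            # module 2 (before training)
    dynamic = collect_dynamic(model)          # module 3 (first few epochs)
    perf = build_perf_model(static, dynamic)  # module 4
    best = pick_strategy(perf)                # module 5
    remaining = epochs - warmup_epochs        # module 6 trains these
    return best["minibatch"], remaining

print(train("toy-model"))  # (32, 86)
```

The key structural point is that the strategy is fixed once, after the warm-up epochs, and the remaining epochs run unchanged under it.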
CN201910035753.9A 2019-01-15 2019-01-15 GPU memory optimization method and system for deep learning training task Active CN109919310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910035753.9A CN109919310B (en) 2019-01-15 2019-01-15 GPU memory optimization method and system for deep learning training task

Publications (2)

Publication Number Publication Date
CN109919310A true CN109919310A (en) 2019-06-21
CN109919310B CN109919310B (en) 2021-05-18

Family

ID=66960413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910035753.9A Active CN109919310B (en) 2019-01-15 2019-01-15 GPU memory optimization method and system for deep learning training task

Country Status (1)

Country Link
CN (1) CN109919310B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101526934A (en) * 2009-04-21 2009-09-09 浪潮电子信息产业股份有限公司 Construction method of a combined GPU and CPU processor
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device
CN105224502A (en) * 2015-09-28 2016-01-06 浪潮(北京)电子信息产业有限公司 GPU-based deep learning method and system
WO2018024232A1 (en) * 2016-08-05 2018-02-08 上海寒武纪信息科技有限公司 Device and method for executing neural network operation
CN108229645A (en) * 2017-04-28 2018-06-29 北京市商汤科技开发有限公司 Convolution acceleration and computation processing method, apparatus, electronic device and storage medium
CN108460457A (en) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 Multi-machine multi-card hybrid-parallel asynchronous training method for convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHEN MENG ET AL: "Training deeper models by GPU memory optimization on TensorFlow", Proc. of ML Systems Workshop in NIPS *
LINNAN WANG ET AL: "SuperNeurons: dynamic GPU memory management for training deep neural networks", PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming *
MINSOO RHU ET AL: "vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design", 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) *
YU QI: "Research on optimization methods for branch divergence and irregular memory access on GPUs", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490300A (en) * 2019-07-26 2019-11-22 苏州浪潮智能科技有限公司 Deep learning-based operation acceleration method, apparatus and system
CN110490300B (en) * 2019-07-26 2022-03-15 苏州浪潮智能科技有限公司 Deep learning-based operation acceleration method, device and system
CN110751282A (en) * 2019-10-18 2020-02-04 北京百度网讯科技有限公司 Processor memory optimization method and device for deep learning training task
CN111078395A (en) * 2019-11-12 2020-04-28 华中科技大学 Deep learning GPU memory management optimization method and system based on tensor
CN111078395B (en) * 2019-11-12 2023-06-20 华中科技大学 Tensor-based deep learning GPU memory management optimization method and system
CN110942138A (en) * 2019-11-13 2020-03-31 华中科技大学 Deep neural network training method and system in hybrid memory environment
CN110942138B (en) * 2019-11-13 2022-02-15 华中科技大学 Deep neural network training method and system in hybrid memory environment
CN110941494A (en) * 2019-12-02 2020-03-31 哈尔滨工程大学 Deep learning-oriented GPU parallel computing data processing method
CN111310054A (en) * 2020-03-06 2020-06-19 中国科学院信息工程研究所 Recommendation method and device based on adaptive Margin symmetry metric learning
CN111814948B (en) * 2020-06-18 2021-07-13 浙江大华技术股份有限公司 Operation method and operation device of neural network and computer readable storage medium
CN111814948A (en) * 2020-06-18 2020-10-23 浙江大华技术股份有限公司 Operation method and operation device of neural network and computer readable storage medium
CN111882073A (en) * 2020-07-17 2020-11-03 苏州浪潮智能科技有限公司 Method and equipment for modifying distributed computation graph
CN112306697B (en) * 2020-12-31 2021-04-27 之江实验室 Deep learning memory management method and system based on Tensor access
CN112306697A (en) * 2020-12-31 2021-02-02 之江实验室 Deep learning memory management method and system based on Tensor access
CN117687802A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform
CN117687802B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform
CN117892769A (en) * 2024-03-15 2024-04-16 之江实验室 Neural network training method, video memory scheduling method, system, equipment and product
CN117892769B (en) * 2024-03-15 2024-06-11 之江实验室 Neural network training method, video memory scheduling method, system, equipment and product

Also Published As

Publication number Publication date
CN109919310B (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN109919310A GPU memory optimization method and system for deep learning training tasks
CN104657219B Dynamic thread-count adjustment method for application programs on heterogeneous many-core systems
CN105956021B Automated task parallelization method and system for distributed machine learning
CN103617087B MapReduce optimization method suitable for iterative computation
CN105653790B Artificial-neural-network-based cache access performance evaluation method for out-of-order processors
CN106951499B Knowledge graph representation method based on a translation model
CN102981807B GPU program optimization method based on the CUDA parallel environment
CN105184367B Model parameter training method and system for deep neural networks
CN107861606A Heterogeneous multi-core power capping method coordinating DVFS and task mapping
CN104765589B MPI-based grid parallel computing preprocessing method
CN111079921A Efficient neural network training and scheduling method based on a heterogeneous distributed system
CN106339351A SGD (stochastic gradient descent) algorithm optimization system and method
CN109524118A Screening method for gestational diabetes based on machine learning and physical examination data
CN109242099A Training method, apparatus, training device and storage medium for reinforcement learning networks
CN109102498A Method for segmenting clustered nuclei in cervical smear images
Durai et al. Liver disease prediction using machine learning
CN110333933A HPL computation model simulation method
CN111797833A Automated machine learning method and system for remote sensing semantic segmentation
CN106020773A Optimization method for finite difference algorithms on heterogeneous many-core architectures
WO2023236319A1 Convolutional neural network deployment and optimization method for microcontroller
CN110276689A Smart contract implementation method based on dynamic decision-making
CN110287114A Method and apparatus for database script performance testing
CN105094949A Method and system for simulation based on an instruction computation model with feedback compensation
CN103455364B Online cache performance acquisition system and method for concurrent programs in multi-core environments
CN110109811B Tracing method for GPU computing performance problems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant