CN112882830A - Video memory management method, video memory management device, model training device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112882830A
CN112882830A (application number CN202110150321.XA)
Authority
CN
China
Prior art keywords
video memory
training
tensor
target
value
Prior art date
Legal status
Pending
Application number
CN202110150321.XA
Other languages
Chinese (zh)
Inventor
邓哲也
章玄润
高华佐
Current Assignee
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN202110150321.XA priority Critical patent/CN112882830A/en
Publication of CN112882830A publication Critical patent/CN112882830A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022Mechanisms to release resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management

Abstract

The invention discloses a video memory management method and device, a model training method and device, an electronic device, and a storage medium. The video memory management method comprises the following steps: acquiring the video memory threshold corresponding to the current round of model training; determining a target tensor that satisfies a tensor selection rule when the video memory occupancy value of the electronic device is greater than the video memory threshold; and releasing the video memory occupied by the target tensor. By releasing tensors whose removal does not affect normal model training, the method reduces GPU video memory occupancy without obtaining global information about the whole computational graph in advance. It is therefore fully dynamic and realizes video memory management under deep learning frameworks with a dynamic graph mechanism.

Description

Video memory management method, video memory management device, model training device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of deep learning, and in particular to a video memory management method and device, a model training method and device, an electronic device, and a storage medium.
Background
In the field of deep learning, growing training data has greatly increased model size and complexity. During model training, a frequently encountered difficulty is that limited GPU (Graphics Processing Unit) video memory resources cannot meet the demands of training with a large batch size. This poses a new challenge for deep learning frameworks: whether limited computing and storage resources can be used effectively during model training, and in particular whether GPU video memory occupancy can be reduced, is an important index for evaluating the performance of a deep learning framework.
In the related art, deep learning frameworks reduce video memory occupancy in several ways: through suitable gradient definitions, so that the backward gradient computation of operators such as ReLU and Sigmoid does not depend on the forward output as input, allowing that part of the video memory to be released once forward computation completes; by computing the life cycle of each operator, so that operators with non-overlapping life cycles can share video memory; by trading extra data transfer for memory, for example swapping temporarily unused data from the GPU to the CPU (Central Processing Unit) and swapping it back when needed; or by trading extra computation for memory, for example the sub-linear video memory optimization method that recomputes intermediate results using gradient checkpointing.
However, all of the above methods need global information about the computation graph in advance, which requires the computation graph of the deep learning framework to be static. They are therefore unavailable to deep learning frameworks with a dynamic graph mechanism, so providing a video memory management method for such frameworks is a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The embodiments of the invention provide a video memory management method, a model training method, a video memory management device, a model training device, an electronic device, and a storage medium, aiming to solve the technical problem that existing video memory management methods are unavailable on deep learning frameworks with a dynamic graph mechanism.
According to a first aspect of the present invention, a video memory management method is disclosed, the method comprising:
acquiring a video memory threshold corresponding to the current round of training of the model;
determining a target tensor that satisfies a tensor selection rule when the video memory occupancy value of the electronic device is greater than the video memory threshold;
and releasing the video memory occupied by the target tensor.
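Taken together, the three steps above amount to a release loop. The following Python sketch is illustrative only; the helper names and the dict-based tensor representation are assumptions, not identifiers from the patent:

```python
def manage_video_memory(tensors, occupied_mb, threshold_mb, evaluate):
    """Release tensors until video memory occupancy drops to the threshold.

    tensors: list of dicts with a 'size_mb' entry; evaluate scores each
    tensor, and the highest-scoring one is released first (standing in
    for the "tensor selection rule"). All names here are illustrative.
    """
    released = []
    pool = list(tensors)
    while occupied_mb > threshold_mb and pool:
        target = max(pool, key=evaluate)   # tensor meeting the selection rule
        pool.remove(target)
        occupied_mb -= target['size_mb']   # release its video memory
        released.append(target)
    return occupied_mb, released
```

With the selection rule realized as "largest evaluation value first", the loop keeps releasing until the occupancy value no longer exceeds the threshold.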
Optionally, as some embodiments, the determining the target tensor satisfying the tensor selection rule includes:
calculating, according to a target evaluation function, an evaluation function value for each tensor in the video memory;
and determining the tensor with the largest evaluation function value as the target tensor.
Optionally, as some embodiments, the calculating, according to the target evaluation function, an evaluation function value for each tensor in the video memory includes:
calculating, according to the target evaluation function, an evaluation function value for each unlocked tensor in the video memory;
and the determining the tensor with the largest evaluation function value as the target tensor includes:
determining the unlocked tensor with the largest evaluation function value as the target tensor.
Optionally, as some embodiments, the method further comprises:
and determining the target evaluation function according to two or more of: the size of the video memory occupied by the tensor, the duration for which the tensor has occupied the video memory, the computation cost of the tensor, and the recomputation count of the tensor.
Optionally, as some embodiments, the target evaluation function is:
(formula published as an image in the original document: f(t), a function of M(t), L(t), C(t) and R(t) parameterized by α, β, γ and δ)
where t is a tensor, M(t) is the size of the video memory occupied by t, L(t) is the duration for which t has occupied the video memory, C(t) is the computation cost of t, R(t) is the recomputation count of t, and α, β, γ and δ are hyper-parameters of the target evaluation function.
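The formula itself appears only as an image in the published document, so its exact form is not recoverable here. The sketch below assumes a DTR-style combination of the four listed quantities that is consistent with the selection rule (the tensor with the largest value is released, so large, long-idle, cheap, rarely-recomputed tensors score high); the exponent structure is a labeled guess, not the patent's verbatim definition:

```python
def evaluation_value(m_t, l_t, c_t, r_t,
                     alpha=1.0, beta=1.0, gamma=1.0, delta=0.5):
    """Assumed form of the target evaluation function f(t).

    m_t: M(t), video memory occupied by t (MB); l_t: L(t), duration t
    has occupied the memory; c_t: C(t), computation cost (ms); r_t:
    R(t), recomputation count. The combination below is illustrative.
    """
    return (m_t ** alpha) * (l_t ** beta) / ((c_t ** gamma) * ((r_t + 1) ** delta))
```

The defaults match the first-round values the patent states (α, β, γ equal to 1 and δ equal to 1/2).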
Optionally, as some embodiments, when the current round of training of the model is the first round, the video memory threshold is half of the video memory capacity, α = β = γ = 1, and δ = 1/2.
Optionally, as some embodiments, the method further comprises:
acquiring the number of times the video memory occupancy value exceeded the video memory threshold during the N-th round of training;
and adjusting the video memory threshold corresponding to the (N+1)-th round of training according to that number, where N is an integer greater than or equal to 1.
Optionally, as some embodiments, the adjusting the video memory threshold corresponding to the (N+1)-th round of training according to the number of times the video memory occupancy value exceeded the video memory threshold includes:
increasing the video memory threshold corresponding to the (N+1)-th round of training when that number is greater than a first count threshold;
and decreasing the video memory threshold corresponding to the (N+1)-th round of training when that number is not greater than the first count threshold.
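A minimal sketch of this per-round threshold update; the patent does not state adjustment amounts, so the step size here is an assumption:

```python
def adjust_threshold(threshold_mb, exceed_count, first_count_threshold,
                     step_mb=256):
    """Raise the threshold for round N+1 when round N exceeded it more
    than first_count_threshold times; otherwise lower it. step_mb is
    an illustrative assumption, not a value from the patent."""
    if exceed_count > first_count_threshold:
        return threshold_mb + step_mb
    return max(threshold_mb - step_mb, 0)
```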
Optionally, as some embodiments, the method further comprises:
acquiring the number of times the requested space during the N-th round of training exceeded the largest free video memory fragment, and/or the percentage of the total training time spent on recomputation;
and adjusting the values of the hyper-parameters in the target evaluation function corresponding to the (N+1)-th round of training according to that number and/or that percentage.
Optionally, as some embodiments, the adjusting of the hyper-parameter values in the target evaluation function corresponding to the (N+1)-th round of training includes at least one of the following steps:
increasing the value of α corresponding to the (N+1)-th round of training when the number of times the requested space exceeded the largest free video memory fragment is greater than a second count threshold;
and decreasing the value of γ corresponding to the (N+1)-th round of training when the percentage of training time spent on recomputation increased compared with the previous round of model training.
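Both adjustment rules can be sketched as one update function; the step size and the comparison thresholds are illustrative assumptions:

```python
def adjust_hyperparams(alpha, gamma,
                       frag_fail_count, second_count_threshold,
                       recompute_pct, prev_recompute_pct, step=0.1):
    """Per-round hyper-parameter update for the evaluation function.

    Raise alpha when requested allocations exceeded the largest free
    fragment too often; lower gamma when the share of training time
    spent recomputing grew versus the previous round. step is an
    illustrative assumption."""
    if frag_fail_count > second_count_threshold:
        alpha += step
    if recompute_pct > prev_recompute_pct:
        gamma -= step
    return alpha, gamma
```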
Optionally, as some embodiments, the method further comprises:
calculating the time-cost parameter of the N-th round of training, namely recomputation time divided by original computation time;
and running a simulated annealing algorithm based on the time-cost parameter to adjust the video memory threshold corresponding to the (N+1)-th round of training and/or the values of the hyper-parameters in the target evaluation function.
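The patent does not give the annealing schedule or the cost function, so the following pairs the stated time-cost parameter with a generic simulated-annealing acceptance step over the tunable values, purely as an illustration:

```python
import math
import random

def time_cost_param(recompute_ms, original_ms):
    """The round-N time-cost parameter: recomputation time / original time."""
    return recompute_ms / original_ms

def anneal_step(params, cost_fn, temperature, rng):
    """One simulated-annealing step: perturb the tunables (threshold and
    hyper-parameters), accept if the cost drops, otherwise accept with
    probability exp(-delta / temperature). The perturbation range and
    cost function are illustrative assumptions."""
    candidate = {k: v * (1 + rng.uniform(-0.1, 0.1)) for k, v in params.items()}
    delta = cost_fn(candidate) - cost_fn(params)
    if delta < 0 or rng.random() < math.exp(-delta / temperature):
        return candidate
    return params
```

In practice cost_fn would be built from the time-cost parameter, so that configurations causing less recomputation overhead score lower.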
Optionally, as some embodiments, the acquisition of the computation cost of the tensor includes:
reading, from a cache, the historical computation cost of a historical tensor that has the same operator and/or the same input shape as the tensor;
and determining the read historical computation cost as the computation cost of the tensor.
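A sketch of this cost cache, keyed by operator name and input shape (the key structure and the measurement fallback are assumptions):

```python
# Hypothetical cost cache keyed by (operator name, input shape); the
# patent reuses the historical cost of a tensor whose operator and/or
# input shape matches the current tensor's.
_cost_cache = {}

def compute_cost(op_name, input_shape, measure):
    """Return the cached historical cost if present; otherwise call
    measure() once (standing in for actually timing the operator) and
    cache the result."""
    key = (op_name, tuple(input_shape))
    if key not in _cost_cache:
        _cost_cache[key] = measure()
    return _cost_cache[key]
```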
Optionally, as some embodiments, the acquisition of the duration for which the tensor has occupied the video memory includes:
acquiring the number of operators being executed in the current round of training and the time at which the tensor entered the video memory;
and determining the duration for which the tensor has occupied the video memory according to the number of operators being executed and the time at which the tensor entered the video memory.
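One way to realize this is to use the count of executed operators as a logical clock, so that a tensor's occupancy duration is the number of operators executed since it entered the video memory. This interpretation is an assumption:

```python
class Clock:
    """Logical clock driven by operator executions. A tensor's entry
    'time' is the operator count when it was produced; its occupancy
    duration L(t) is the operators executed since then (an assumed
    realization, not the patent's exact mechanism)."""
    def __init__(self):
        self.ops_executed = 0

    def tick(self):
        """Call once per operator execution."""
        self.ops_executed += 1

    def duration(self, entry_count):
        """L(t) for a tensor that entered at entry_count."""
        return self.ops_executed - entry_count
```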
Optionally, as some embodiments, the method further comprises:
determining whether the current round of training contains a target operator execution sequence, and if so, determining the tensors corresponding to the target operator execution sequence as target tensors;
wherein the target operator execution sequence is an operator execution sequence whose corresponding tensors have release records from the historical training process.
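A sketch of this lookup, matching the current round's operator sequence against sequences whose tensors were released in earlier rounds (contiguous-subsequence matching is an illustrative choice):

```python
def find_released_sequence(current_ops, released_sequences):
    """Return the first historically released operator sequence that
    occurs as a contiguous subsequence of the current round's operator
    list, or None. Matching granularity is an assumption."""
    for seq in released_sequences:
        n = len(seq)
        for i in range(len(current_ops) - n + 1):
            if current_ops[i:i + n] == seq:
                return seq
    return None
```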
Optionally, as some embodiments, after the step of releasing the video memory occupied by the target tensor, the method further includes:
and storing tensors newly generated in the current round of training in the released video memory.
According to a second aspect of the invention, a method of model training is disclosed, the method comprising:
acquiring a training sample set, wherein the training sample set comprises training data used for model training;
and performing model training based on the training sample set and the initial model, and in the model training process, managing tensors generated by training based on the video memory management method in the first aspect until a target model is obtained by training.
According to a third aspect of the present invention, a video memory management apparatus is disclosed, the apparatus comprising:
the first acquisition module is used for acquiring a video memory threshold corresponding to the current round of training of the model;
the first determining module is used for determining a target tensor meeting a tensor selection rule under the condition that the video memory occupation value of the electronic equipment is larger than the video memory threshold value;
and the releasing module is used for releasing the video memory occupied by the target tensor.
Optionally, as some embodiments, the first determining module includes:
the calculation submodule is used for calculating, according to the target evaluation function, an evaluation function value for each tensor in the video memory;
and the determining submodule is used for determining the tensor with the largest evaluation function value as the target tensor.
Optionally, as some embodiments, the computation submodule includes:
the calculation unit is used for calculating an evaluation function value corresponding to the unlocked tensor in the video memory according to the target evaluation function;
the determination sub-module includes:
and a determining unit configured to determine an unlocked tensor having a maximum evaluation function value as the target tensor.
Optionally, as some embodiments, the apparatus further comprises:
and the second determining module is used for determining the target evaluation function according to two or more of: the size of the video memory occupied by the tensor, the duration for which the tensor has occupied the video memory, the computation cost of the tensor, and the recomputation count of the tensor.
Optionally, as some embodiments, the target evaluation function is:
(formula published as an image in the original document: f(t), a function of M(t), L(t), C(t) and R(t) parameterized by α, β, γ and δ)
where t is a tensor, M(t) is the size of the video memory occupied by t, L(t) is the duration for which t has occupied the video memory, C(t) is the computation cost of t, R(t) is the recomputation count of t, and α, β, γ and δ are hyper-parameters of the target evaluation function.
Optionally, as some embodiments, when the current round of training of the model is the first round, the video memory threshold is half of the video memory capacity, α = β = γ = 1, and δ = 1/2.
Optionally, as some embodiments, the apparatus further comprises:
the second acquisition module is used for acquiring the number of times the video memory occupancy value exceeded the video memory threshold during the N-th round of training;
and the first adjusting module is used for adjusting the video memory threshold corresponding to the (N+1)-th round of training according to that number, where N is an integer greater than or equal to 1.
Optionally, as some embodiments, the first adjusting module includes:
the first adjusting submodule is used for increasing the video memory threshold corresponding to the (N+1)-th round of training when the number of times the video memory occupancy value exceeded the video memory threshold is greater than a first count threshold;
and the second adjusting submodule is used for decreasing the video memory threshold corresponding to the (N+1)-th round of training when that number is not greater than the first count threshold.
Optionally, as some embodiments, the apparatus further comprises:
the third acquisition module is used for acquiring the number of times the requested space during the N-th round of training exceeded the largest free video memory fragment, and/or the percentage of the total training time spent on recomputation;
and the second adjusting module is used for adjusting the values of the hyper-parameters in the target evaluation function corresponding to the (N+1)-th round of training according to that number and/or that percentage.
Optionally, as some embodiments, the second adjusting module includes at least one of the following sub-modules:
a third adjusting submodule, configured to increase the value of α corresponding to the (N+1)-th round of training when the number of times the requested space exceeded the largest free video memory fragment is greater than a second count threshold;
and a fourth adjusting submodule, configured to decrease the value of γ corresponding to the (N+1)-th round of training when the percentage of training time spent on recomputation increased compared with the previous round of model training.
Optionally, as some embodiments, the apparatus further comprises:
the calculating module is used for calculating the time-cost parameter of the N-th round of training, namely recomputation time divided by original computation time;
and the third adjusting module is used for running a simulated annealing algorithm based on the time-cost parameter and adjusting the video memory threshold corresponding to the (N+1)-th round of training and/or the values of the hyper-parameters in the target evaluation function.
Optionally, as some embodiments, the apparatus further comprises:
the reading module is used for reading the historical calculation cost of the historical tensor which has the same operator and/or the same input shape with the tensor from the cache;
and the third determining module is used for determining the read historical calculation cost as the calculation cost corresponding to the tensor.
Optionally, as some embodiments, the apparatus further comprises:
the fourth acquisition module is used for acquiring the number of operators being executed in the current round of training and the time of the tensor entering the video memory;
and the fourth determining module is used for determining the time length of the tensor occupying the video memory according to the number of the operators being executed and the time of the tensor entering the video memory.
Optionally, as some embodiments, the apparatus further comprises:
a fifth determining module, configured to determine whether the current round of training contains a target operator execution sequence, and if so, determine the tensors corresponding to the target operator execution sequence as target tensors;
wherein the target operator execution sequence is an operator execution sequence whose corresponding tensors have release records from the historical training process.
Optionally, as some embodiments, the apparatus further comprises:
and the storage module is used for storing tensors newly generated in the current round of training in the released video memory.
According to a fourth aspect of the present invention, there is disclosed a model training apparatus, the apparatus comprising:
the second acquisition module is used for acquiring a training sample set, wherein the training sample set comprises training data used for model training;
and the training module is used for carrying out model training based on the training sample set and the initial model, and managing tensors generated by training based on the video memory management device in the third aspect in the model training process until a target model is obtained by training.
According to a fifth aspect of the present invention, an electronic apparatus is disclosed, comprising: a memory, a processor, and a program stored on the memory and executable on the processor, wherein the program, when executed by the processor, performs the steps of the video memory management method of the first aspect or of the model training method of the second aspect.
According to a sixth aspect of the present invention, a computer readable storage medium is disclosed, having stored thereon a program which, when executed by a processor, carries out the steps of the video memory management method of the first aspect or of the model training method of the second aspect.
In the embodiments of the invention, the video memory threshold corresponding to the current round of model training can be obtained; when the video memory occupancy value of the electronic device is greater than that threshold, a target tensor satisfying the tensor selection rule is determined and the video memory it occupies is released. By releasing tensors whose removal does not affect normal model training, GPU video memory occupancy is reduced without obtaining global information about the whole computational graph in advance; the approach is fully dynamic and realizes video memory management under deep learning frameworks with a dynamic graph mechanism. Moreover, under such a framework, a user can train a model twice as large as before, on a machine with unchanged video memory capacity and without modifying the original training code, in a constant multiple of the original training time, so model training efficiency is higher.
Drawings
FIG. 1 is a flow diagram of a video memory management method of some embodiments of the invention;
FIG. 2 is an exemplary diagram of the correspondence between tensors in Python and tensors in C++ according to some embodiments of the present invention;
FIG. 3 is an exemplary diagram of tensor changes in the video memory when the video memory can hold only three tensors, according to some embodiments of the present invention;
FIG. 4 is a flow chart of a model training method of some embodiments of the present invention;
fig. 5 is a schematic structural diagram of a video memory management apparatus according to some embodiments of the present invention;
FIG. 6 is a schematic diagram of a model training apparatus according to some embodiments of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that not every act involved is necessarily required by the present invention.
In the field of deep learning, growing training data has greatly increased model size and complexity, and a dilemma frequently encountered is that limited GPU video memory resources cannot meet the demands of training a model with a large batch size. This poses a new challenge for deep learning frameworks: whether limited computing and storage resources can be used effectively during model training, and in particular whether GPU video memory occupancy can be reduced, is an important index for evaluating the performance of a deep learning framework.
In the related art, given fixed computing and storage resources, deep learning frameworks have several methods for reducing video memory occupancy, specifically as follows:
Method 1: through suitable gradient definitions, the backward gradient computation of operators such as ReLU and Sigmoid does not depend on the forward output as input, so that part of the video memory can be released once forward computation completes.
Method 2: computing the life cycle of each operator; operators with non-overlapping life cycles can share video memory.
Method 3: reducing video memory occupancy through extra data transfer, for example swapping temporarily unused data from the GPU to the CPU and swapping it back when needed.
Method 4: reducing video memory occupancy through extra computation, for example the sub-linear video memory optimization method that recomputes intermediate results using gradient checkpointing.
The drawback of the above methods is that global information about the computation graph must be obtained in advance, which requires the computation graph to be static. The benefit of static graphs is that the structure of the neural network is generated when the program is compiled, which lets the compiler optimize to the maximum extent, for example computing the life cycle of each operator to allocate memory better, or computing optimal gradient checkpoints with search algorithms. However, this also means there is a large gap between what the compiler actually executes and the program the user expects to execute, so errors in the code are harder to find. By contrast, a program under a dynamic graph executes exactly in the order in which the user wrote the commands; it is easy to debug and more user-friendly. Thus, for convenience, users increasingly choose flexible frameworks with dynamic graph mechanisms, such as PyTorch and MegEngine.
It is precisely because of the flexibility of dynamic graphs that the above techniques do not migrate well from static graphs to dynamic graphs. For example, Method 2 is infeasible on a dynamic graph: the future computation order cannot be obtained, so the life cycle of each operator cannot be computed. Method 4 is likewise infeasible: a dynamic graph cannot provide information about the whole computation graph from the start, so optimal gradient checkpoints cannot be computed. The above video memory management methods are therefore not applicable to deep learning frameworks with a dynamic graph mechanism.
In order to solve the above technical problems, embodiments of the present invention provide a method and an apparatus for video memory management and model training, an electronic device, and a storage medium.
First, a video memory management method according to an embodiment of the present invention is described below.
It should be noted that the video memory management method provided in the embodiments of the present invention is applicable to an electronic device; in practical applications, the electronic device may be, for example, a server.
Fig. 1 is a flowchart of a video memory management method according to some embodiments of the present invention. As shown in fig. 1, the method may include steps 101, 102 and 103:
in step 101, a video memory threshold corresponding to the current round of training of the model is obtained.
For the convenience of understanding, the deep learning framework and the related contents of model training based on the deep learning framework involved in the embodiments of the present invention are described with reference to an example.
Before actual model training, a user deploys a deep learning framework supporting dynamic graphs on the machine that performs the training (mainly a server, which is used as the example below) and writes the corresponding model code in Python; for example, a user who wants to train an object classification model writes the Python code for that model. The written model code is then submitted to the deep learning framework on the server for execution. The framework provides an interpreter, which splits the model code into many small tasks and sends them to the server's GPU for model training, for example by passing kernel functions in the model code to the GPU. During training, data related to the training occupies GPU video memory; for example, tensors generated during training are stored in the video memory. The object of the embodiments of the present invention is to optimize this video memory occupancy during model training.
In the embodiment of the invention, one function supported by the interpreter is to bind the tensors in the model code one-to-one to tensor types in the C++ code, so that when some of the video memory occupied on the GPU needs to be released, this can be achieved simply by operating on the tensor structure in the C++ code.
As shown in fig. 2, which illustrates the correspondence between a tensor in the Python code and a tensor in C++, and taking a tensor t as an example, some auxiliary information may be recorded in the C++ tensor type: computation history and attribute values. The computation history includes the operator that computed t and all input tensors of that operator. The attribute values include: M(t), the size of the video memory occupied by t (unit: MB); L(t), the duration for which t has occupied the video memory; C(t), the computation cost of t (unit: ms); and R(t), the number of times t has been recomputed.
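The auxiliary information above can be sketched as a small bookkeeping structure. The following Python dataclass mirrors what the C++ tensor type is described as recording; the field names are illustrative, not taken from the patent:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class TrackedTensor:
    # Attribute values recorded per tensor, mirroring the description above.
    size_mb: float                    # M(t): video memory occupied by t, in MB
    entry_stamp: int = 0              # when t entered the video memory (used for L(t))
    compute_cost_ms: float = 1.0      # C(t): cost of computing t, in ms
    recompute_count: int = 0          # R(t): number of times t has been recomputed
    # Computation history: the operator that produced t and all of its inputs.
    op: Optional[Callable] = None
    inputs: List["TrackedTensor"] = field(default_factory=list)
    in_memory: bool = True            # False once t's video memory is released
    locked: bool = False              # True while an operator consuming t is running
```

Only the computation history (operator plus inputs) is needed to rebuild a released tensor; the attribute values exist purely to guide the selection rule.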
In the embodiment of the present invention, each round of model training may be taken as a processing unit. Considering that a basic model (also referred to as an "initial model" or "backbone network") needs to be trained for multiple rounds to obtain the target model, and that the video memory occupancy of each round of training usually differs, the video memory occupancy of each round may be optimized with the same or with different video memory optimization parameters, according to the model, the video memory occupancy, and so on.
In the embodiment of the invention, during model training, the current training round number of the model is obtained each time a round of training starts, that is, which round the model has currently been trained to.
To prevent the video memory from being exhausted, in the embodiment of the invention a video memory threshold is set for each round of training. When the video memory occupancy value of the electronic device exceeds the video memory threshold, a target tensor that does not affect the subsequent training process of the model is searched for, according to the tensor selection rule, among the tensors stored in the video memory, and that target tensor is released from the video memory; the search and release operations are repeated until the video memory occupancy value of the electronic device falls below the video memory threshold.
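The search-and-release loop just described can be sketched as follows. Here `pick_target` stands in for whatever tensor selection rule is in force (a hypothetical helper), and tensors are plain objects with `size_mb` and `in_memory` fields:

```python
from types import SimpleNamespace

def enforce_threshold(tensors, occupancy_mb, threshold_mb, pick_target):
    # Repeat the search and release operations until occupancy falls
    # below the per-round video memory threshold.
    while occupancy_mb > threshold_mb:
        victim = pick_target([t for t in tensors if t.in_memory])
        if victim is None:
            break                       # nothing evictable remains
        victim.in_memory = False        # release its video memory; the
        occupancy_mb -= victim.size_mb  # computation path is kept for recovery
    return occupancy_mb

# Example selection rule: always evict the largest resident tensor.
pool = [SimpleNamespace(size_mb=s, in_memory=True) for s in (4.0, 2.0, 1.0)]
largest = lambda ts: max(ts, key=lambda t: t.size_mb) if ts else None
remaining = enforce_threshold(pool, occupancy_mb=7.0, threshold_mb=3.0,
                              pick_target=largest)
```

One eviction of the 4 MB tensor is enough here to bring occupancy down to the 3 MB threshold, so the loop stops after a single pass.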
In the embodiment of the present invention, for a released tensor, if the user needs to access it later, or it is needed during backward derivation, its original value may be restored from its computation path (stored in CPU memory), i.e., from its operator and all of its inputs. That is, the framework only needs to record the computation path of each tensor; the details of tensor search, video memory release, and recovery by recomputation are implemented without user involvement, so the whole optimization process is imperceptible to the user and does not affect the user's programming experience.
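Recovery of a released tensor from its recorded computation path can be sketched recursively; the field names are hypothetical, and an input that was itself released is restored first:

```python
from types import SimpleNamespace

def restore(t):
    # Recompute a released tensor from its operator and all of its inputs.
    if not t.in_memory:
        args = [restore(u) for u in t.inputs]   # recover inputs first if needed
        t.value = t.op(*args)
        t.in_memory = True
        t.recompute_count += 1
    return t.value

# Example: c = a + b was released; accessing c triggers recomputation.
a = SimpleNamespace(value=2, in_memory=True, inputs=[], op=None, recompute_count=0)
b = SimpleNamespace(value=3, in_memory=True, inputs=[], op=None, recompute_count=0)
c = SimpleNamespace(value=None, in_memory=False, inputs=[a, b],
                    op=lambda x, y: x + y, recompute_count=0)
```

A second access finds the tensor resident again and returns it directly, which is why the recomputation count R(t) matters to the selection rule.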
In the embodiment of the invention, during model training, to ensure the optimization effect, the video memory threshold and the tensor selection rule are dynamically adjusted to a certain extent as training progresses; that is, within the training process of one model, different training rounds usually have different video memory thresholds and tensor selection rules. Therefore, when optimizing video memory occupancy in the current round of training, the video memory threshold and the tensor selection rule corresponding to the current round need to be obtained.
In step 102, a target tensor meeting the tensor selection rule is determined under the condition that the video memory occupancy value of the electronic device is greater than the video memory threshold value.
In some embodiments provided by the present invention, the tensor selection rule may be implemented based on an evaluation function of the tensor, in this case, the step 102 may specifically include the following steps (not shown in the figure): step 1021 and step 1022, wherein,
in step 1021, under the condition that the video memory occupancy value of the electronic device is greater than the video memory threshold value, calculating an evaluation function value corresponding to a tensor in the video memory according to the target evaluation function;
in step 1022, the tensor having the largest evaluation function value is determined as the target tensor.
In the embodiment of the invention, the target evaluation function may be determined according to at least two of: the size of the video memory occupied by the tensor, the duration for which the tensor has occupied the video memory, the computation cost of the tensor, and the number of recomputations of the tensor.
In the embodiment of the present invention, when the target evaluation function is determined according to the size of the video memory occupied by the tensor, the duration for which the tensor has occupied the video memory, the computation cost of the tensor, and the number of recomputations of the tensor, the target evaluation function may be:

f(t) = (M(t)^α · L(t)^β) / (C(t)^γ · (R(t) + 1)^δ)

wherein t is a tensor, M(t) is the size of the video memory occupied by t, L(t) is the duration for which t has occupied the video memory, C(t) is the computation cost of t, R(t) is the number of recomputations of t, and α, β, γ and δ are hyper-parameters of the target evaluation function.
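As a sketch, the evaluation can be written directly from the variable definitions. The exponent placement below (memory size and residency in the numerator, computation cost and recomputation count in the denominator, with R(t)+1 so the value stays defined when R(t)=0) is one form consistent with releasing the tensor of largest value; it is an assumption, not a quotation of the published formula:

```python
from types import SimpleNamespace

def eval_value(t, alpha=1.0, beta=1.0, gamma=1.0, delta=0.5):
    # f(t) is larger for tensors that occupy much memory, have sat in
    # memory long, are cheap to recompute, and have rarely been recomputed.
    return ((t.size_mb ** alpha) * (t.staleness ** beta)
            / ((t.compute_cost_ms ** gamma) * ((t.recompute_count + 1) ** delta)))

def pick_target(tensors):
    # The tensor with the largest evaluation function value is the target.
    return max(tensors, key=eval_value) if tensors else None

pool = [
    SimpleNamespace(size_mb=8.0, staleness=10, compute_cost_ms=2.0, recompute_count=0),
    SimpleNamespace(size_mb=1.0, staleness=2, compute_cost_ms=4.0, recompute_count=3),
]
```

The defaults correspond to the initial strategy described below (α = β = γ = 1, δ = 1/2); the large, long-resident, cheap-to-recompute first tensor wins here.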
In the embodiment of the present invention, when the current round of training of the model is the first round, the video memory threshold is the video memory capacity/2, α = β = γ = 1, and δ = 1/2. That is, at the beginning of the first round of training, the initial strategy is: set the video memory threshold to the video memory capacity/2, set the hyper-parameters α, β and γ to 1, and set δ to 1/2. Of course, the embodiment of the invention does not limit the initial strategy at the start of the first round of training; other initial strategies may be set according to actual application requirements.
In the embodiment of the invention, the video memory threshold and the hyper-parameters of the target evaluation function corresponding to each round of training can be dynamically fine-tuned according to the historical training information of the model.
Based on the above dynamic fine-tuning idea, the video memory management method provided by the embodiment of the present invention may further include the following steps:
acquiring the number of times the video memory occupancy value exceeded the video memory threshold during the Nth round of training;
and adjusting the video memory threshold corresponding to the (N+1)th round of training according to the number of times the video memory occupancy value exceeded the video memory threshold, wherein N is an integer greater than or equal to 1.
In the embodiment of the present invention, when the video memory threshold corresponding to the (N+1)th round of training is adjusted according to the number of times the video memory occupancy value exceeded the video memory threshold, the following strategies may be adopted:
when the number of times the video memory occupancy value exceeded the video memory threshold is greater than a first count threshold, increasing the video memory threshold corresponding to the (N+1)th round of training;
and when the number of times the video memory occupancy value exceeded the video memory threshold is not greater than the first count threshold, reducing the video memory threshold corresponding to the (N+1)th round of training.
In one example, when the number of times the video memory occupancy value exceeded the video memory threshold is greater than the first count threshold, the setting is: video memory threshold for round N+1 = video memory threshold for round N x (1 + 10%); when the number of times is not greater than the first count threshold, the setting is: video memory threshold for round N+1 = video memory threshold for round N x (1 - 10%).
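The ±10% adjustment strategy above can be sketched as a one-line rule; the first count threshold itself is a tunable constant:

```python
def next_memory_threshold(threshold_mb, overflow_count, first_count_threshold):
    # Raise the threshold for round N+1 when round N overflowed it often;
    # tighten it otherwise.
    if overflow_count > first_count_threshold:
        return threshold_mb * (1 + 0.10)
    return threshold_mb * (1 - 0.10)
```

A frequently exceeded threshold means the eviction loop is running too often, so giving the round more headroom trades a little memory for less search overhead.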
Based on the above dynamic fine-tuning idea, the video memory management method provided by the embodiment of the present invention may further include the following steps:
acquiring, for the Nth round of training, the number of times the requested space was larger than the largest video memory fragment and/or the percentage of the total training duration spent on recomputation;
and adjusting the values of the hyper-parameters in the target evaluation function corresponding to the (N+1)th round of training according to the number of times the requested space was larger than the largest video memory fragment and/or the percentage of the total training duration spent on recomputation.
In the embodiment of the present invention, when the values of the hyper-parameters in the target evaluation function corresponding to the (N+1)th round of training are adjusted according to the number of times the requested space was larger than the largest video memory fragment and/or the percentage of the total training duration spent on recomputation, the following strategies may be adopted:
when the number of times the requested space was larger than the largest video memory fragment is greater than a second count threshold, increasing the value of α corresponding to the (N+1)th round of training;
and when the percentage of the total training duration spent on recomputation increased compared with the previous round of model training, reducing the value of γ corresponding to the (N+1)th round of training.
In one example, when the number of times the requested space was larger than the largest video memory fragment is greater than the second count threshold, this indicates that releasing tensors that occupy more video memory should be emphasized, and the setting is: α for round N+1 = α for round N x (1 + 10%). When the percentage of the total training duration spent on recomputation increased compared with the previous round of model training, this indicates that releasing tensors with lower recomputation cost should be emphasized, and the setting is: γ for round N+1 = γ for round N x (1 - 10%).
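The two fine-tuning rules can be sketched together; the update directions follow the text above, and all constants and names are illustrative:

```python
def adjust_hypers(alpha, gamma, frag_overflow_count, second_count_threshold,
                  recompute_share, prev_recompute_share):
    # Frequent allocation failures against the largest free fragment:
    # weight memory size more (raise alpha), so large tensors are
    # preferred for release.
    if frag_overflow_count > second_count_threshold:
        alpha *= (1 + 0.10)
    # Recomputation's share of training time grew versus the previous
    # round: lower gamma for round N+1, per the strategy described above.
    if recompute_share > prev_recompute_share:
        gamma *= (1 - 0.10)
    return alpha, gamma
```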
In addition to the foregoing adjustment strategies, considering that the time consumed by each round of training mainly consists of two parts, the original computation time and the recomputation time, and in order to train the model successfully without exceeding the video memory limit, a time-consumption parameter P = recomputation time / original computation time is introduced, and P is made as small as possible. In some embodiments, the hyper-parameters may be adjusted based on a simulated annealing algorithm, which is not limited by the embodiments of the present invention. In this case, the video memory management method provided in the embodiment of the present invention may further include the following steps:
calculating the time-consumption parameter of the Nth round of training, i.e., P = recomputation time / original computation time;
and running a simulated annealing algorithm based on the time-consumption parameter, and adjusting the video memory threshold corresponding to the (N+1)th round of training and/or the values of the hyper-parameters in the target evaluation function.
Since the simulated annealing algorithm is an existing algorithm, it is not described here again.
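A minimal simulated-annealing step over the tunable parameters might look like the following; `measure_p` is a hypothetical callback that runs (or simulates) a round of training with the candidate parameters and returns P = recomputation time / original computation time:

```python
import math
import random

def anneal_step(params, measure_p, temperature, rng):
    # Perturb every parameter by up to +/-10%, keep the candidate if P
    # improves, and otherwise keep it with probability exp(-dP / T).
    current_p = measure_p(params)
    candidate = {k: v * (1 + rng.uniform(-0.1, 0.1)) for k, v in params.items()}
    candidate_p = measure_p(candidate)
    accept = (candidate_p < current_p
              or rng.random() < math.exp(-(candidate_p - current_p) / temperature))
    return (candidate, candidate_p) if accept else (params, current_p)

# Toy objective: pretend P shrinks with gamma (purely illustrative).
params = {"memory_threshold_mb": 5000.0, "alpha": 1.0, "beta": 1.0,
          "gamma": 1.0, "delta": 0.5}
best, p = anneal_step(params, lambda q: q["gamma"], temperature=1.0,
                      rng=random.Random(0))
```

Repeating the step while lowering the temperature converges toward parameters with a small P, while occasional uphill acceptances keep the search out of local minima.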
In the embodiment of the present invention, when measuring the computation cost C(t) of a tensor, synchronization between devices is introduced, so excessive measurements, which would cause excessive extra overhead, should be avoided as much as possible. For this situation, cache (Cache) optimization may be introduced: for tensors with the same operator and/or the same input shape, the time consumption is measured only once and stored in the cache, and thereafter, when a tensor with the same operator and the same input shape is encountered, the value is read directly from the cache.
Correspondingly, based on the above idea, the step of obtaining the computation cost of a tensor may include the following steps: reading from the cache the historical computation cost of a historical tensor that has the same operator and/or the same input shape as the tensor; and determining the read historical computation cost as the computation cost corresponding to the tensor.
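A sketch of this cache optimization, keyed by (operator, input shape); `measure` is a hypothetical callback that times one execution of the operator in milliseconds:

```python
def make_cost_cache():
    cache = {}
    def cost_ms(op_name, input_shape, measure):
        # Measure only once per (operator, input shape); afterwards the
        # stored time consumption is read directly from the cache.
        key = (op_name, tuple(input_shape))
        if key not in cache:
            cache[key] = measure()
        return cache[key]
    return cost_ms

cost_ms = make_cost_cache()
calls = []
measure = lambda: calls.append(1) or 3.5   # pretend the kernel took 3.5 ms
first = cost_ms("conv2d", (32, 3, 224, 224), measure)
second = cost_ms("conv2d", (32, 3, 224, 224), measure)
```

The second lookup hits the cache, so the expensive timed (and device-synchronized) measurement runs exactly once per key.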
In the embodiment of the invention, when computing the duration L(t) for which a tensor has occupied the video memory, in order to avoid a large number of calls to functions that obtain the current time, the number of operators executed so far is used as the timestamp, and L(t) is approximated as the current timestamp minus the timestamp at which the tensor entered the video memory, thereby reducing the overhead of calling time functions.
Correspondingly, based on the above idea, the step of obtaining the duration for which a tensor has occupied the video memory may include the following steps: acquiring the number of operators executed in the current round of training and the time at which the tensor entered the video memory; and determining the duration for which the tensor has occupied the video memory according to the number of executed operators and the time at which the tensor entered the video memory.
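The operator-count timestamp can be sketched as a tiny clock (names illustrative):

```python
class OpClock:
    # Use the number of operators executed so far as an approximate clock,
    # avoiding a system-time call per tensor.
    def __init__(self):
        self.ops_executed = 0

    def tick(self):
        # Called once each time an operator finishes executing.
        self.ops_executed += 1

    def staleness(self, entry_stamp):
        # L(t) ~= current timestamp - timestamp when t entered video memory.
        return self.ops_executed - entry_stamp

clock = OpClock()
entry = clock.ops_executed          # a tensor enters video memory at stamp 0
for _ in range(5):
    clock.tick()                    # five operators execute afterwards
```

Because the evaluation function only needs L(t) for ranking tensors against each other, a monotonically increasing counter is as good as wall-clock time here.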
In the embodiment of the invention, before the kernel function of an operator is launched, the input tensors the operator depends on need to be locked to prevent them from being determined as target tensors; otherwise a tensor might no longer be in the video memory while the device is computing the operator. The locks on the input tensors may be released after the operator's computation finishes.
In view of the above situation, the step 1021 may specifically include the following steps:
calculating, according to the target evaluation function, the evaluation function values corresponding to the unlocked tensors in the video memory;
correspondingly, the step 1022 may specifically include the following steps:
and determining the tensor which is not locked and has the largest evaluation function value as the target tensor.
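Pinning an operator's inputs for the duration of its kernel can be sketched with a context manager; the `locked` flag is the hypothetical field the selection rule skips:

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def lock_inputs(inputs):
    # Lock the operator's input tensors before its kernel is launched so the
    # eviction pass cannot select them; unlock once the computation finishes.
    for t in inputs:
        t.locked = True
    try:
        yield
    finally:
        for t in inputs:
            t.locked = False

x = SimpleNamespace(locked=False)
y = SimpleNamespace(locked=False)
seen = []
with lock_inputs([x, y]):
    seen.append((x.locked, y.locked))   # both pinned while the kernel runs
```

The `finally` block guarantees the locks are dropped even if the kernel launch raises, so a failed operator cannot permanently pin its inputs.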
In the embodiment of the invention, when a user deletes a tensor t in the model code, the interpreter needs to judge whether all tensors that depend on t as operator input have been deleted; as long as one tensor u still depends on t, t cannot be deleted inside the interpreter. That is, the interpreter may delete t only after all tensors that depend on t have been deleted by the user.
In step 103, the video memory occupied by the target tensor is released.
In the embodiment of the invention, after the video memory occupied by the target tensor is released, the newly generated tensor in the current round of training can be stored in the released video memory, so that the model training can be normally carried out.
In one example, as shown in fig. 3, fig. 3 shows how the tensors in the video memory change when the video memory capacity is three tensors. During model training, three tensors A, B and C are stored in the video memory; at this time, the video memory occupancy value of the electronic device is 100%. If the video memory threshold is 70%, then since the occupancy value of 100% is greater than the threshold of 70%, the target tensor in the video memory needs to be found; if, for example, C is the target tensor, the video memory occupied by C is released, thereby achieving the goal of optimizing the video memory.
As can be seen from the above embodiment, in this embodiment, the video memory threshold corresponding to the current round of training of the model can be obtained; when the video memory occupancy value of the electronic device is greater than the video memory threshold, the target tensor meeting the tensor selection rule is determined, and the video memory occupied by the target tensor is released. In this way, in the embodiment of the invention, GPU video memory occupancy can be reduced by releasing tensors that do not affect normal model training, without obtaining global information about the whole computation graph in advance; the approach is fully dynamic, realizing video memory management under a deep learning framework with a dynamic graph mechanism. In addition, under such a framework, a user can, on a machine with unchanged video memory capacity and without modifying the original training code, train a model twice as large as the original in a constant multiple of the original training time. For example, on a 2080Ti card (11 GB of video memory), the ResNet50 training batch size can be doubled to about 250, and model training efficiency is high.
In some embodiments provided by the present invention, the video memory management method may further add the following step on the basis of the embodiment shown in fig. 1: determining whether the current round of training includes a target operator execution sequence, and if so, determining the tensors corresponding to the target operator execution sequence as target tensors;
wherein the target operator execution sequence is a plurality of operators executed in a specific order, and the tensors corresponding to the target operator execution sequence have release records in the historical training process.
In one example, if the tensors corresponding to an operator sequence of length 7 of the form "convolution - batch normalization - ReLU - pooling - convolution - batch normalization - ReLU" were all released at some later moment, then when these 7 operators are encountered again in subsequent training, the video memory occupied by the corresponding tensors can be released proactively, avoiding a passive search for video memory when the occupancy later exceeds the threshold, and thereby reducing search overhead.
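Matching a previously recorded release pattern against the tail of the operator stream can be sketched as (names and data layout are illustrative):

```python
def proactive_targets(recent_ops, release_patterns):
    # release_patterns maps an operator sequence (executed in a specific
    # order) to the tensors that were all released in earlier training.
    for pattern, tensors in release_patterns.items():
        if tuple(recent_ops[-len(pattern):]) == pattern:
            return tensors            # release these without waiting for
    return None                       # the threshold to be exceeded

pattern = ("conv", "bn", "relu", "pool", "conv", "bn", "relu")
patterns = {pattern: ["t3", "t4", "t5"]}
hit = proactive_targets(["add"] + list(pattern), patterns)
miss = proactive_targets(["conv", "bn", "relu"], patterns)
```

A hit triggers release ahead of any threshold check; a non-matching or too-short tail leaves the normal passive search in charge.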
FIG. 4 is a flow chart of a model training method of some embodiments of the present invention, which, as shown in FIG. 4, may include the steps of: step 401 and step 402, wherein,
in step 401, a training sample set is obtained, wherein the training sample set includes training data for model training.
In step 402, model training is performed based on the training sample set and the initial model, and in the process of model training, tensors generated by training are managed based on a preset video memory management method until a target model is obtained by training.
In an embodiment of the present invention, the preset video memory management method is a video memory management method in any one of the above embodiments.
In the embodiment of the present invention, the target model includes, but is not limited to, models for the following purposes: a model for determining a class to which an image to be processed belongs, a model for recognizing a face of a person in the image to be processed, a model for detecting a specific object in the image to be processed, a model for segmenting a specific object in the image to be processed, and a model for generating a new image having a similar feature to that of the image to be processed, and the like.
In the embodiment of the application, a proper training sample set can be selected according to the purpose of the target model.
As can be seen from the foregoing embodiment, in this embodiment, the tensors generated in the model training process can be managed based on the video memory management method, so that a user can, on a machine with unchanged video memory capacity and without modifying the original training code, train a model twice as large as the original in a constant multiple of the original training time. For example, on a 2080Ti card (11 GB of video memory), the ResNet50 training batch size can be doubled to about 250, and model training efficiency is high.
Fig. 5 is a schematic structural diagram of a video memory management apparatus according to some embodiments of the present invention, and as shown in fig. 5, the video memory management apparatus 500 may include: a first acquisition module 501, a first determination module 502, and a release module 503, wherein,
a first obtaining module 501, configured to obtain a video memory threshold corresponding to a current round of training of a model;
a first determining module 502, configured to determine a target tensor meeting a tensor selection rule when a video memory occupancy value of the electronic device is greater than the video memory threshold;
a releasing module 503, configured to release the video memory occupied by the target tensor.
As can be seen from the above embodiment, in this embodiment, the video memory threshold corresponding to the current round of training of the model can be obtained; when the video memory occupancy value of the electronic device is greater than the video memory threshold, the target tensor meeting the tensor selection rule is determined, and the video memory occupied by the target tensor is released. In this way, GPU video memory occupancy can be reduced by releasing tensors that do not affect normal model training, without obtaining global information about the whole computation graph in advance; the approach is fully dynamic, realizing video memory management under a deep learning framework with a dynamic graph mechanism. In addition, under such a framework, a user can, on a machine with unchanged video memory capacity and without modifying the original training code, train a model twice as large as the original in a constant multiple of the original training time, so model training efficiency is higher.
Optionally, as some embodiments, the first determining module 502 may include:
the calculation submodule is used for calculating evaluation function values corresponding to tensors in the video memory according to the target evaluation function;
and the determining submodule is used for determining the tensor with the largest evaluation function value as the target tensor.
Optionally, as some embodiments, the computation submodule may include:
the calculation unit is used for calculating an evaluation function value corresponding to the unlocked tensor in the video memory according to the target evaluation function;
the determining sub-module may include:
and a determining unit configured to determine an unlocked tensor having a maximum evaluation function value as the target tensor.
Optionally, as some embodiments, the video memory management apparatus 500 may further include:
and the second determining module is used for determining the target evaluation function according to at least two of: the size of the video memory occupied by the tensor, the duration for which the tensor has occupied the video memory, the computation cost of the tensor, and the number of recomputations of the tensor.
Optionally, as some embodiments, the target evaluation function is:

f(t) = (M(t)^α · L(t)^β) / (C(t)^γ · (R(t) + 1)^δ)

wherein t is a tensor, M(t) is the size of the video memory occupied by t, L(t) is the duration for which t has occupied the video memory, C(t) is the computation cost of t, R(t) is the number of recomputations of t, and α, β, γ and δ are hyper-parameters of the target evaluation function.
Optionally, as some embodiments, when the current round of training of the model is the first round, the video memory threshold is the video memory capacity/2, α = β = γ = 1, and δ = 1/2.
Optionally, as some embodiments, the video memory management apparatus 500 may further include:
the second acquisition module is used for acquiring the number of times the video memory occupancy value exceeded the video memory threshold during the Nth round of training;
and the first adjusting module is used for adjusting the video memory threshold corresponding to the (N+1)th round of training according to the number of times the video memory occupancy value exceeded the video memory threshold, wherein N is an integer greater than or equal to 1.
Optionally, as some embodiments, the first adjusting module may include:
the first adjusting submodule is used for increasing the video memory threshold corresponding to the (N+1)th round of training when the number of times the video memory occupancy value exceeded the video memory threshold is greater than the first count threshold;
and the second adjusting submodule is used for reducing the video memory threshold corresponding to the (N+1)th round of training when the number of times the video memory occupancy value exceeded the video memory threshold is not greater than the first count threshold.
Optionally, as some embodiments, the video memory management apparatus 500 may further include:
the third acquisition module is used for acquiring, for the Nth round of training, the number of times the requested space was larger than the largest video memory fragment and/or the percentage of the total training duration spent on recomputation;
and the second adjusting module is used for adjusting the values of the hyper-parameters in the target evaluation function corresponding to the (N+1)th round of training according to the number of times the requested space was larger than the largest video memory fragment and/or the percentage of the total training duration spent on recomputation.
Optionally, as some embodiments, the second adjusting module may include at least one of the following sub-modules:
a third adjusting submodule, configured to increase the value of α corresponding to the (N+1)th round of training when the number of times the requested space was larger than the largest video memory fragment is greater than the second count threshold;
and a fourth adjusting submodule, configured to reduce the value of γ corresponding to the (N+1)th round of training when the percentage of the total training duration spent on recomputation increased compared with the previous round of model training.
Optionally, as some embodiments, the video memory management apparatus 500 may further include:
the calculating module is used for calculating the time-consumption parameter of the Nth round of training, i.e., P = recomputation time / original computation time;
and the third adjusting module is used for running a simulated annealing algorithm based on the time-consumption parameter, and adjusting the video memory threshold corresponding to the (N+1)th round of training and/or the values of the hyper-parameters in the target evaluation function.
Optionally, as some embodiments, the video memory management apparatus 500 may further include:
the reading module is used for reading from the cache the historical computation cost of a historical tensor that has the same operator and/or the same input shape as the tensor;
and the third determining module is used for determining the read historical computation cost as the computation cost corresponding to the tensor.
Optionally, as some embodiments, the video memory management apparatus 500 may further include:
the fourth acquisition module is used for acquiring the number of operators executed in the current round of training and the time at which the tensor entered the video memory;
and the fourth determining module is used for determining the duration for which the tensor has occupied the video memory according to the number of executed operators and the time at which the tensor entered the video memory.
Optionally, as some embodiments, the video memory management apparatus 500 may further include:
a fifth determining module, configured to determine whether the current round of training includes a target operator execution sequence, and if so, determine the tensors corresponding to the target operator execution sequence as the target tensors;
wherein the target operator execution sequence is a plurality of operators executed in a specific order, and the tensors corresponding to the target operator execution sequence have release records in the historical training process.
Optionally, as some embodiments, the video memory management apparatus 500 may further include:
and the storage module is used for storing the newly generated tensor in the current round of training into the released video memory.
Fig. 6 is a schematic structural diagram of a model training apparatus according to some embodiments of the present invention, and as shown in fig. 6, the model training apparatus 600 may include: a second acquisition module 601 and a training module 602, wherein,
a second obtaining module 601, configured to obtain a training sample set, where the training sample set includes training data used for model training;
and the training module 602 is configured to perform model training based on the training sample set and the initial model, and manage tensors generated by the training based on any one of the video memory management devices in a model training process until a target model is obtained by the training.
As can be seen from the foregoing embodiment, in this embodiment, the tensors generated in the model training process can be managed based on the video memory management method, so that a user can, on a machine with unchanged video memory capacity and without modifying the original training code, train a model twice as large as the original in a constant multiple of the original training time. For example, on a 2080Ti card (11 GB of video memory), the ResNet50 training batch size can be doubled to about 250, and model training efficiency is high.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The present invention also provides, according to some embodiments thereof, an electronic device comprising: a memory, a processor and a program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps in the video memory management method according to any of the embodiments or implementing the steps in the model training method according to some of the embodiments.
According to some embodiments of the present invention, the present invention further provides a computer readable storage medium, on which a program is stored, the program, when executed by a processor, implementing the steps in the video memory management method according to any of the embodiments or implementing the steps in the model training method according to some of the embodiments.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The video memory management method, model training method, devices, electronic equipment, and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help understand the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (20)

1. A video memory management method is applied to electronic equipment and is characterized by comprising the following steps:
acquiring a video memory threshold corresponding to the current round of training of the model;
determining a target tensor meeting a tensor selection rule under the condition that the video memory occupancy value of the electronic equipment is larger than the video memory threshold;
and releasing the video memory occupied by the target tensor.
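The three steps of claim 1 can be sketched as follows. This is a minimal illustration with hypothetical names (`VideoMemoryManager`, `track`), not the patented implementation; the selection rule here is a size-based placeholder for the evaluation function introduced in the later claims.

```python
# Minimal sketch of claim 1 (hypothetical names): per training round, fetch a
# video-memory threshold, and whenever tracked occupancy exceeds it, pick a
# target tensor by a selection rule and release the memory it occupies.
class VideoMemoryManager:
    def __init__(self, round_thresholds):
        self.round_thresholds = round_thresholds   # threshold per training round
        self.live = []                             # tensors currently in memory

    def occupancy(self):
        return sum(t["size"] for t in self.live)

    def track(self, tensor, round_idx):
        self.live.append(tensor)
        threshold = self.round_thresholds[round_idx]          # step 1: get threshold
        while self.occupancy() > threshold and len(self.live) > 1:
            victim = max(self.live, key=lambda t: t["size"])  # placeholder rule
            self.live.remove(victim)                          # step 3: release memory

mgr = VideoMemoryManager(round_thresholds=[100])
for size in (40, 40, 40):          # three activations produced by forward ops
    mgr.track({"size": size}, round_idx=0)
print(mgr.occupancy())             # 80: occupancy is kept within the 100-unit budget
```

In a real system the released tensor would be recomputed on demand during the backward pass; here only the bookkeeping is shown.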
2. The method of claim 1, wherein determining the target tensor that satisfies the tensor selection rule comprises:
calculating an evaluation function value corresponding to the tensor in the video memory according to the target evaluation function;
and determining the tensor with the largest evaluation function value as the target tensor.
3. The method of claim 2, wherein the calculating the evaluation function value corresponding to the tensor in the video memory according to the target evaluation function comprises:
calculating an evaluation function value corresponding to each unlocked tensor in the video memory according to the target evaluation function;
the determining the tensor with the largest evaluation function value as the target tensor comprises the following steps:
and determining the tensor which is not locked and has the largest evaluation function value as the target tensor.
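Claims 2-3 amount to a filter-then-argmax over live tensors. The sketch below uses hypothetical names and a toy scoring function; the patent's actual evaluation function is the one of claim 5.

```python
# Hypothetical sketch of claims 2-3: score every unlocked tensor with an
# evaluation function and pick the highest-scoring one as the eviction target.
def select_target(tensors, evaluate):
    candidates = [t for t in tensors if not t["locked"]]  # claim 3: skip locked tensors
    if not candidates:
        return None
    return max(candidates, key=evaluate)                  # claim 2: largest value wins

tensors = [
    {"name": "a", "size": 64,  "locked": True},
    {"name": "b", "size": 32,  "locked": False},
    {"name": "c", "size": 128, "locked": False},
]
target = select_target(tensors, evaluate=lambda t: t["size"])
print(target["name"])   # "c": the largest unlocked tensor under this toy scorer
```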
4. A method according to claim 2 or 3, characterized in that the method further comprises:
and determining the target evaluation function according to at least two of: the size of the video memory occupied by the tensor, the time length for which the tensor has occupied the video memory, the computation cost of the tensor, and the number of recomputations of the tensor.
5. The method of claim 4, wherein the target evaluation function is:
f(t) = [formula published as image FDA0002932121720000011; not recoverable from this text]
wherein t is a tensor, m(t) is the size of the video memory occupied by t, l(t) is the time length for which t has occupied the video memory, c(t) is the computation cost of t, r(t) is the number of recomputations of t, and α, β, γ and δ are hyper-parameters of the target evaluation function.
6. The method according to claim 5, wherein, when the current round of training of the model is the first round of training, the video memory threshold is 2, and α = β = γ = 1, δ = 1/2.
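Claims 5-6 describe an evaluation function over four tensor quantities with hyper-parameters α, β, γ, δ. Since the actual formula is published only as an image, the block below shows one *assumed* multiplicative form consistent with claim 6's defaults (α = β = γ = 1, δ = 1/2) and with claim 2's "largest value is released" rule; it is not the patented formula.

```python
# Assumed form only: the patent's formula is an unrendered image. This sketch
# combines the four quantities of claim 5 so that large, long-resident,
# cheap-to-recompute, rarely-recomputed tensors score highest, i.e. they are
# the preferred eviction targets under the "largest value" rule of claim 2.
def evaluate(m, l, c, r, alpha=1.0, beta=1.0, gamma=1.0, delta=0.5):
    # m: memory size, l: residency time, c: compute cost, r: recompute count
    return (m ** alpha) * (l ** beta) / ((c ** gamma) * ((1 + r) ** delta))

# A big, stale, cheap tensor outscores a small, fresh, expensive one:
print(evaluate(m=256, l=10, c=2, r=0) > evaluate(m=4, l=1, c=50, r=3))  # True
```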
7. The method according to any one of claims 1-6, further comprising:
acquiring the number of times that the video memory occupancy value exceeds the video memory threshold in the Nth round of training;
and adjusting the video memory threshold corresponding to the (N + 1)th round of training according to the number of times that the video memory occupancy value exceeds the video memory threshold, wherein N is an integer greater than or equal to 1.
8. The method according to claim 7, wherein the adjusting the video memory threshold corresponding to the (N + 1)th round of training according to the number of times that the video memory occupancy value exceeds the video memory threshold comprises:
increasing the video memory threshold corresponding to the (N + 1)th round of training under the condition that the number of times the video memory occupancy value exceeds the video memory threshold is greater than a first count threshold;
and reducing the video memory threshold corresponding to the (N + 1)th round of training under the condition that the number of times the video memory occupancy value exceeds the video memory threshold is not greater than the first count threshold.
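The feedback rule of claims 7-8 can be sketched as a one-line adjustment; the 10% step size and all names are assumptions, not from the patent.

```python
# Sketch of claims 7-8 (hypothetical names and step size): after round N, count
# how often occupancy exceeded the threshold, then nudge round N+1's threshold
# up (too many overruns) or down (budget has slack) by an assumed 10% step.
def adjust_threshold(threshold, exceed_count, first_count_threshold, step=0.1):
    if exceed_count > first_count_threshold:
        return threshold * (1 + step)   # frequent overruns: raise the budget
    return threshold * (1 - step)       # few overruns: tighten the budget

print(adjust_threshold(1000, exceed_count=8, first_count_threshold=5) > 1000)  # True
print(adjust_threshold(1000, exceed_count=3, first_count_threshold=5) < 1000)  # True
```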
9. The method according to any one of claims 5-8, further comprising:
acquiring the number of times that the space requested in the Nth round of training is larger than the largest video memory fragment, and/or the percentage of recomputation time in the total training time;
and adjusting the values of the hyper-parameters in the target evaluation function corresponding to the (N + 1)th round of training according to the number of times that the requested space is larger than the largest video memory fragment and/or the percentage of recomputation time in the total training time.
10. The method according to claim 9, wherein the adjusting the values of the hyper-parameters in the target evaluation function corresponding to the (N + 1)th round of training according to the number of times that the requested space is larger than the largest video memory fragment and/or the percentage of recomputation time in the total training time comprises at least one of the following steps:
increasing the value of α corresponding to the (N + 1)th round of training under the condition that the number of times the requested space is larger than the largest video memory fragment is greater than a second count threshold;
and reducing the value of γ corresponding to the (N + 1)th round of training under the condition that the percentage of recomputation time in the total training time has increased compared with the previous round of model training.
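The two hyper-parameter adjustments of claims 9-10 can be sketched together; the function name, dictionary keys, and 0.1 step size are assumptions for illustration.

```python
# Sketch of claims 9-10 (assumed names and step size): raise alpha when
# allocation requests keep exceeding the largest free fragment (biasing
# eviction toward big tensors to relieve fragmentation), and lower gamma when
# recomputation consumes a growing share of the round's training time.
def adjust_hyperparams(params, frag_exceed_count, second_count_threshold,
                       recompute_pct, prev_recompute_pct, step=0.1):
    params = dict(params)                # do not mutate the caller's settings
    if frag_exceed_count > second_count_threshold:
        params["alpha"] += step          # fragmentation pressure: favour big tensors
    if recompute_pct > prev_recompute_pct:
        params["gamma"] -= step          # recompute share grew: weigh cost less
    return params

p = adjust_hyperparams({"alpha": 1.0, "gamma": 1.0},
                       frag_exceed_count=4, second_count_threshold=2,
                       recompute_pct=0.30, prev_recompute_pct=0.20)
print(p["alpha"] > 1.0, p["gamma"] < 1.0)   # True True
```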
11. The method according to any one of claims 5-10, further comprising:
calculating a time-cost parameter of the Nth round of training, namely recomputation time / original computation time;
and running a simulated annealing algorithm based on the time-cost parameter to adjust the video memory threshold corresponding to the (N + 1)th round of training and/or the values of the hyper-parameters in the target evaluation function.
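Claim 11 only names simulated annealing; the patent does not disclose its schedule. The following is a generic annealing loop over a single threshold, with `simulate_cost` standing in for actually running a training round at the candidate setting. Every detail (cooling schedule, perturbation width, step count) is an assumption.

```python
# Generic simulated-annealing sketch (not the patented procedure): search for a
# setting that minimises the time-cost parameter returned by simulate_cost.
import math
import random

def anneal(simulate_cost, x0, steps=200, t0=1.0, seed=0):
    rng = random.Random(seed)
    x, cost = x0, simulate_cost(x0)
    best_x, best_cost = x, cost
    for i in range(steps):
        temp = t0 * (1 - i / steps)            # linear cooling schedule
        cand = x + rng.uniform(-0.5, 0.5)      # perturb the current setting
        c = simulate_cost(cand)
        # accept improvements always; accept worse moves with Metropolis probability
        if c < cost or rng.random() < math.exp(-(c - cost) / max(temp, 1e-9)):
            x, cost = cand, c
            if c < best_cost:
                best_x, best_cost = cand, c
    return best_x, best_cost

# Toy cost landscape with its minimum at a threshold of 3:
best_x, best_cost = anneal(lambda x: (x - 3.0) ** 2, x0=0.0)
print(best_x, best_cost)   # best_cost can only improve on the starting cost of 9.0
```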
12. The method according to any of claims 4-11, wherein the step of obtaining the computed cost of the tensor comprises:
reading, from a cache, the historical computation cost of a historical tensor having the same operator and/or the same input shape as the tensor;
and determining the read historical computation cost as the computation cost of the tensor.
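Claim 12's cache can be sketched as a dictionary keyed by (operator, input shape); the key structure and names are hypothetical.

```python
# Sketch of claim 12 (hypothetical structure): cache an operator's measured
# compute cost keyed by (operator name, input shape), so a tensor produced by
# the same op on the same shape reuses the historical cost instead of being
# timed again.
cost_cache = {}

def get_compute_cost(op, input_shape, measure):
    key = (op, tuple(input_shape))
    if key not in cost_cache:
        cost_cache[key] = measure()     # first sight: actually time the op
    return cost_cache[key]

calls = []
measure = lambda: calls.append(1) or 3.5   # pretend timing returned 3.5 ms

a = get_compute_cost("conv2d", (1, 64, 56, 56), measure)
b = get_compute_cost("conv2d", (1, 64, 56, 56), measure)
print(a, b, len(calls))   # 3.5 3.5 1 -- the second lookup hits the cache
```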
13. The method according to any one of claims 4-12, wherein the obtaining of the duration of the video memory occupied by the tensor comprises:
acquiring the number of operators being executed in the current round of training and the time of the tensor entering the video memory;
and determining the time length of the tensor occupying the video memory according to the number of the operators being executed and the time of the tensor entering the video memory.
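One cheap reading of claim 13 (assumed semantics) is to measure residency in "operator ticks" rather than wall-clock time: record the operator count when the tensor enters video memory and subtract it from the current count.

```python
# Sketch of claim 13 (assumed semantics): the time length a tensor has occupied
# video memory, expressed as the number of operators executed since it entered.
def residency(current_op_count, entry_op_count):
    return current_op_count - entry_op_count

# Tensor entered memory as the 12th operator ran; 52 operators have run so far:
print(residency(current_op_count=52, entry_op_count=12))   # 40
```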
14. The method according to any one of claims 1-13, further comprising:
determining whether the current round of training comprises a target operator execution sequence, and if so, determining a tensor corresponding to the target operator execution sequence as the target tensor;
wherein the target operator execution sequence is an operator execution sequence whose corresponding tensor has a release record in the historical training process.
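Claim 14's history-based shortcut can be sketched as a set of previously seen sequences; the bookkeeping shown here (a set of operator-name tuples) is a hypothetical representation.

```python
# Sketch of claim 14 (hypothetical bookkeeping): remember operator execution
# sequences whose output tensors were released in earlier rounds, and mark the
# matching tensor as the eviction target when the same sequence recurs.
released_sequences = {("conv2d", "bn", "relu")}   # released in a historical round

def proactive_target(op_sequence, tensor):
    if tuple(op_sequence) in released_sequences:
        return tensor     # same sequence as before: release its tensor again
    return None

print(proactive_target(["conv2d", "bn", "relu"], "t1"))   # t1
print(proactive_target(["conv2d", "relu"], "t2"))         # None
```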
15. The method according to any of claims 1-14, further comprising, after the step of releasing the video memory occupied by the target tensor:
and storing the newly generated tensor in the current round of training into the released video memory.
16. A method of model training, the method comprising:
acquiring a training sample set, wherein the training sample set comprises training data used for model training;
and performing model training based on the training sample set and the initial model, and during the model training process, managing tensors generated by training based on the video memory management method of any one of claims 1 to 15 until a target model is obtained by training.
17. A video memory management apparatus, comprising:
the first acquisition module is used for acquiring a video memory threshold corresponding to the current round of training of the model;
the first determining module is used for determining a target tensor meeting a tensor selection rule under the condition that the video memory occupation value of the electronic equipment is larger than the video memory threshold value;
and the releasing module is used for releasing the video memory occupied by the target tensor.
18. A model training apparatus, the apparatus comprising:
the second acquisition module is used for acquiring a training sample set, wherein the training sample set comprises training data used for model training;
a training module, configured to perform model training based on the training sample set and the initial model, and in a model training process, manage tensors generated by training based on the video memory management apparatus of claim 17 until a target model is obtained by training.
19. An electronic device, comprising: memory, processor and program stored on the memory and executable on the processor, which when executed by the processor performs the steps of the video memory management method according to any one of claims 1 to 15 or the steps of the model training method according to claim 16.
20. A computer readable storage medium having stored thereon a program which, when being executed by a processor, carries out the steps of the video memory management method according to any one of claims 1 to 15 or the steps of the model training method according to claim 16.
CN202110150321.XA 2021-02-03 2021-02-03 Video memory management method, video memory management device, model training device, electronic equipment and storage medium Pending CN112882830A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110150321.XA CN112882830A (en) 2021-02-03 2021-02-03 Video memory management method, video memory management device, model training device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110150321.XA CN112882830A (en) 2021-02-03 2021-02-03 Video memory management method, video memory management device, model training device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112882830A true CN112882830A (en) 2021-06-01

Family

ID=76057032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110150321.XA Pending CN112882830A (en) 2021-02-03 2021-02-03 Video memory management method, video memory management device, model training device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112882830A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003306A (en) * 2021-10-27 2022-02-01 上海商汤科技开发有限公司 Video memory optimization method, device, equipment and storage medium
CN114003306B (en) * 2021-10-27 2024-03-15 上海商汤科技开发有限公司 Video memory optimization method, device, equipment and storage medium
CN114692829A (en) * 2022-03-24 2022-07-01 西安交通大学 DNN model-based checkpoint selection method, equipment and storage medium
CN114692829B (en) * 2022-03-24 2024-04-02 西安交通大学 DNN model-based checkpoint selection method, device and storage medium
CN116432778A (en) * 2023-06-12 2023-07-14 摩尔线程智能科技(北京)有限责任公司 Data processing method and device, storage medium and electronic equipment
CN116432778B (en) * 2023-06-12 2023-09-19 摩尔线程智能科技(北京)有限责任公司 Data processing method and device, storage medium and electronic equipment
CN117032954A (en) * 2023-07-17 2023-11-10 北京泛睿科技合伙企业(有限合伙) Memory optimization method, system, equipment and medium for terminal training model
CN117032954B (en) * 2023-07-17 2024-04-26 北京泛睿科技合伙企业(有限合伙) Memory optimization method, system, equipment and medium for terminal training model
CN117130693A (en) * 2023-10-26 2023-11-28 之江实验室 Tensor unloading method, tensor unloading device, computer equipment and storage medium
CN117130693B (en) * 2023-10-26 2024-02-13 之江实验室 Tensor unloading method, tensor unloading device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112882830A (en) Video memory management method, video memory management device, model training device, electronic equipment and storage medium
CN112199190B (en) Memory allocation method and device, storage medium and electronic equipment
JP2017228086A (en) Machine learning management program, machine learning management method, and machine learning management device
CN108153587B (en) Slow task reason detection method for big data platform
US10860892B1 (en) Systems and methods of synthetic data generation for data stream
JPWO2015015574A1 (en) Processing program, processing system, and processing method
CN114692829B (en) DNN model-based checkpoint selection method, device and storage medium
CN114936085A (en) ETL scheduling method and device based on deep learning algorithm
CN116401232B (en) Database parameter configuration optimization method and device, electronic equipment and storage medium
EP4170549A1 (en) Machine learning program, method for machine learning, and information processing apparatus
US20150081263A1 (en) Production simulation apparatus and production simulation method
CN108334935B (en) Deep learning neural network method and device for simplifying input and robot system
KR20210111677A (en) Method for clipping neural networks, method for calculating convolution of neural networks and apparatus for performing the methods
JP2019185121A (en) Learning device, learning method and program
KR101145278B1 (en) Method, apparatus and computer-readable recording medium for choosing representative images among similar images
CN116185568A (en) Container expansion method and device, electronic equipment and storage medium
JP2017224038A (en) Cache miss estimation program, cache miss estimation method and information processing device
KR102441442B1 (en) Method and apparatus for learning graph convolutional network
US20220101187A1 (en) Identifying and quantifying confounding bias based on expert knowledge
CN111898080B (en) Data sequence denoising method and device, electronic equipment and computer storage medium
CN112906728B (en) Feature comparison method, device and equipment
CN113626650A (en) Service processing method and device and electronic equipment
KR102523803B1 (en) Data processing apparatus for classification of machine learning data and the operating method thereof
CN112861951B (en) Image neural network parameter determining method and electronic equipment
US20230418468A1 (en) Optimizing storage-related costs with compression in a multi-tiered storage device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination