CN116610456B - Memory optimization method based on eager memory reuse algorithm

Memory optimization method based on eager memory reuse algorithm

Info

Publication number: CN116610456B
Authority: CN (China)
Prior art keywords: tensor, layer, memory, node, nodes
Legal status: Active
Application number: CN202310883730.XA
Other languages: Chinese (zh)
Other versions: CN116610456A
Inventors: 徐远超, 曹博钧, 钱入意, 史钦文
Current Assignee: Capital Normal University
Original Assignee: Capital Normal University
Application filed by Capital Normal University
Priority to CN202310883730.XA
Publication of application CN116610456A; application granted; publication of grant CN116610456B

Classifications

    • G06F9/5016: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being the memory
    • G06F9/5022: Allocation of resources to service a request; mechanisms to release resources
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a memory optimization method based on an eager memory reuse algorithm, which comprises: constructing a computation graph of the GPU at run time, and obtaining the life cycles of all tensors and all three-layer tensor structures in the computation graph; determining the execution order of the three-layer tensor structures; executing the Eager Reuse algorithm based on the execution order, and allocating memory for each tensor; and, according to the allocation result of the Eager Reuse algorithm, requesting memory at the starting point of each tensor's life cycle and releasing the memory at the end point of the life cycle. The method reduces memory consumption, eases the conflict between limited memory capacity and high demand, and allows larger deep learning models to be deployed within limited GPU memory.

Description

Memory optimization method based on eager memory reuse algorithm
Technical Field
The invention belongs to the field of artificial intelligence deep learning frameworks, and particularly relates to a memory optimization method based on an eager memory reuse algorithm.
Background
In deep learning, GPUs are typically used to accelerate the training of deep neural networks; however, the limited physical memory of GPUs constrains the further growth of deep neural network models. Deep learning models typically require large amounts of memory to store parameters, states, and intermediate results. The eager reuse algorithm reduces memory consumption through memory reuse, easing the conflict between limited memory capacity and high demand, so that larger deep learning models can be deployed within limited GPU memory.
Memory reuse means repeatedly using the same block of memory space for different tensors, based on an analysis of the tensors' life cycles. Reusing already-allocated memory improves memory utilization and reduces allocation latency. The simplest form of memory reuse is the in-place update, which stores the output directly at the input's physical address. A more flexible method is to allocate a large block of memory in advance as a shared memory pool and realize reuse through this shared memory; this is one of the main memory optimization techniques in deep learning systems. The intermediate tensors in the computation graph occupy most of the memory and are the main targets for memory reuse. Existing computation-graph-based memory reuse algorithms include the large tensor priority algorithm and the short life cycle priority algorithm.
Large tensor priority algorithm: the large tensor priority algorithm allocates memory to tensors in descending order of size. The life cycle of each tensor is first determined from the order in which the nodes are scheduled. Memory is then allocated tensor by tensor, from largest to smallest: for each tensor, the memory space is scanned from low addresses to high addresses, and the first region large enough for the current tensor is assigned to it. If the life cycle of the tensor being allocated does not overlap with that of a previously allocated tensor, the current tensor may reuse that tensor's memory space; an offset records each tensor's relative position. Finally, the peak memory of the whole graph can be computed from the largest offset plus the size of the tensor placed there. In theory, a single block of this peak size, obtained once from the operating system, suffices for all tensors during the execution of the graph.
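To make the scan concrete, the following Python sketch implements a large-tensor-first placement of this kind; the tensor representation (a dict with "size", "start", "end") and all function names are assumptions made for this illustration, not the implementation of any particular framework:

    # Hedged sketch of large-tensor-first allocation. A tensor is a dict
    # with "size" (bytes) and a life cycle ["start", "end"] expressed as
    # inclusive positions in the scheduled node order.

    def lifetimes_overlap(a, b):
        return not (a["end"] < b["start"] or b["end"] < a["start"])

    def place_in_order(ordered):
        placed = []
        for t in ordered:
            offset = 0
            while True:
                # any placed, lifetime-overlapping tensor colliding with this offset?
                clash = next((p for p in placed
                              if lifetimes_overlap(t, p)
                              and offset < p["offset"] + p["size"]
                              and p["offset"] < offset + t["size"]), None)
                if clash is None:
                    break
                offset = clash["offset"] + clash["size"]  # keep scanning upward
            t["offset"] = offset
            placed.append(t)
        # whole-graph peak memory: highest offset plus the size placed there
        peak = max((p["offset"] + p["size"] for p in placed), default=0)
        return peak, placed

    def allocate_large_first(tensors):
        # Descending size order: each tensor takes the first conflict-free
        # offset found while scanning from low to high addresses.
        return place_in_order(sorted(tensors, key=lambda t: t["size"], reverse=True))

A single block of the returned peak size, requested once from the operating system, can then back every computed offset.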
Short life cycle priority algorithm: the short life cycle priority algorithm allocates memory to tensors in ascending order of life cycle length. Intuitively, the longer a tensor stays in the memory pool, the more it interferes with other tensors, because it splits the pool into two disjoint regions for the duration of its life. If a long-lived tensor sits at an unsuitable address, a large contiguous region is divided into two small, non-contiguous parts. When a new tensor arrives, the allocator tends to request contiguous space from the higher free region, so long-lived tensors placed at low addresses cause more memory fragmentation. The key idea of the short life cycle priority algorithm is therefore to consolidate the whole memory layout by placing long-lived tensors at higher addresses than the short-lived ones.
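Under the same assumed representation, the short life cycle priority strategy changes only the ordering key of the sketch above, so long-lived tensors are placed last and tend to settle above the short-lived ones:

    def allocate_short_lifecycle_first(tensors):
        # Ascending life-cycle length: short-lived tensors claim the low
        # addresses first, pushing long-lived tensors toward high addresses
        # and reducing the fragmentation they would otherwise cause.
        return place_in_order(sorted(tensors, key=lambda t: t["end"] - t["start"]))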
In the computation graph, intermediate tensors occupy most of the memory, so this memory is the main optimization target of memory reuse algorithms. Two reuse algorithms currently exist: the large tensor priority algorithm and the short life cycle priority algorithm. Both achieve a degree of memory reuse and reduce the final memory footprint, but they consider only whether tensor life cycles overlap and ignore the more specific relative positions of adjacent tensors' life cycles; as a result, some reuse opportunities are wasted and the reuse result is often suboptimal. Both the large tensor priority algorithm and the short life cycle priority algorithm search for a feasible solution in a complex computation graph based on a single data feature. Since memory reuse in a computation graph is necessarily constrained by multiple factors at once, the more complex the graph, the harder it is to mine further reuse opportunities from it.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent.
Therefore, the invention aims to provide a memory optimization method based on an eager memory reuse algorithm, which reduces memory consumption and eases the conflict between limited memory capacity and high demand.
In order to achieve the above objective, an embodiment of a first aspect of the present invention provides a memory optimization method based on an eager memory reuse algorithm, including:
constructing a computation graph of the GPU at run time, and obtaining the life cycles of all tensors and all three-layer tensor structures in the computation graph;
determining an execution order of the three-layer tensor structures;
executing an Eager Reuse algorithm based on the execution order, and allocating memory for each tensor;
and, according to the allocation result of the Eager Reuse algorithm, requesting memory at the starting point of each tensor's life cycle and releasing the memory at the end point of the life cycle.
In addition, the memory optimization method based on the eager memory reuse algorithm according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the obtaining the life cycle of all tensors in the calculation map includes:
constructing a life cycle information table with time slots as rows, current time slots as executing nodes, tensors generated in the current time slots and tensors ending in the current time slots as columns by traversing the time slots, thereby acquiring the life cycles of all tensors.
Further, in an embodiment of the present invention, the obtaining all three-layer tensor structures in the computation graph includes:
traversing all nodes of the computation graph, treating each node found as a fork node and marking it as a first-layer node node_i;
traversing all adjacent nodes of node_i, treating each node found as a middle node and marking it as a middle-layer node node_j;
traversing all adjacent nodes of node_j, treating each node found as a join node and marking it as a third-layer node node_k;
according to the rule that nodes correspond to tensors one to one, converting each discovered three-layer node structure into a three-layer tensor structure of the form [FT, MT, JT], and storing the discovered three-layer tensor structures in a linked list, where FT denotes the first-layer tensor, MT the middle-layer tensor, and JT the third-layer tensor.
Further, in an embodiment of the present invention, the determining the execution order of the three-layer tensor structures includes:
s401, initializing an index idx=0, where idx indicates that, when the condition is met, the idx-th three-layer tensor structure in the unprocessed three-layer tensor structure linked list is moved to the head of the list; judging whether all tensors in the computation graph have been processed, where marking a three-layer tensor structure as processed also marks its FT, MT, and JT as processed; if all three-layer tensor structures have been processed, resetting the algorithm's modifications to the linked list, the in-degrees, and the flag bits to their state before the algorithm ran, retaining only the move of the idx-th three-layer tensor structure to the head of the unprocessed linked list, and correspondingly judging whether the FT, MT, and JT of the idx-th three-layer tensor structure have been processed; if they have, ending the algorithm; otherwise, going to S402;
s402, decrementing by one the in-degree of every node adjacent to each node of the three-layer tensor structure at the head of the unprocessed three-layer tensor structure linked list;
s403, judging whether the in-degrees of the three nodes of this three-layer tensor structure are now all less than or equal to 0; if so, modifying the flag bit, i.e., marking the three-layer tensor structure as processed and marking its FT, MT, and JT as processed at the same time, and returning to S401; otherwise, going to S404;
s404, setting idx=idx+1, moving the idx-th three-layer tensor structure in the currently unprocessed three-layer tensor structure linked list to the head of that list, and going to S402.
Further, in an embodiment of the present invention, the executing the Eager Reuse algorithm based on the execution order includes:
s501, for an input three-layer tensor structure, defining a flag bit (tag) for each layer's tensor to represent the relative position at which the tensor is placed; its value is one of -1, 0, and 1, where 0 is the initial value, 1 means that the adjacent tensor is placed at a higher memory address than the current tensor, and -1 means that the adjacent tensor is placed at a lower memory address than the current tensor; if the tags of all three layers' tensors are 0, setting the tag of FT to 1; the base address of the FT of the first three-layer tensor structure is 0, and memory is allocated for it; the tags of FT, MT, and JT are then updated according to the principle that the tags of adjacent tensors are opposite numbers; FT, MT, and JT are fed into S502 in turn;
s502, judging the tag value of the previously allocated tensor: if the tag value is 1, going to S503; if the tag value is -1, going to S504;
s503, traversing the currently allocated tensors, finding the tensor whose base address plus size is largest, taking that value as the base address of the current tensor, and ending the memory pre-allocation of the current tensor;
s504, judging whether free space exists at the low-address side of the allocated tensor; if not, going to S505; if so, going to S506;
s505, traversing the currently allocated tensors, taking the minimum base address among them minus the size of the current tensor as the base address of the current tensor, and ending the memory pre-allocation of the current tensor;
s506, taking the base address of the previous tensor minus the size of the current tensor to be allocated as the base address of the current tensor, performing memory pre-allocation, and going to S507;
s507, checking whether the current tensor conflicts with other allocated tensors; if there is no conflict, ending the memory pre-allocation of the current tensor; if there is a conflict, going to S508;
and s508, shifting the base address of the conflicting tensor toward higher addresses until all conflicts are resolved, and ending the memory pre-allocation of the current tensor.
In order to achieve the above objective, an embodiment of a second aspect of the present invention provides a memory optimization device based on an eager memory reuse algorithm, including the following modules:
the acquisition module, used for constructing a computation graph of the GPU at run time and acquiring the life cycles of all tensors and all three-layer tensor structures in the computation graph;
the ordering module, used for determining the execution order of the three-layer tensor structures;
the execution module, used for executing an Eager Reuse algorithm based on the execution order and allocating memory for each tensor;
and the allocation module, used for requesting memory at the starting point of each tensor's life cycle according to the allocation result of the Eager Reuse algorithm, and releasing the memory at the end point of the life cycle.
Further, in an embodiment of the present invention, the acquisition module is further configured to:
traverse the time periods to construct a life cycle information table whose rows are time periods and whose columns are the node executing in the current time period, the tensors generated in the current time period, and the tensors ending in the current time period, thereby acquiring the life cycles of all tensors.
Further, in an embodiment of the present invention, the acquisition module is further configured to:
traverse all nodes of the computation graph, treating each node found as a fork node and marking it as a first-layer node node_i;
traverse all adjacent nodes of node_i, treating each node found as a middle node and marking it as a middle-layer node node_j;
traverse all adjacent nodes of node_j, treating each node found as a join node and marking it as a third-layer node node_k;
and, according to the rule that nodes correspond to tensors one to one, convert each discovered three-layer node structure into a three-layer tensor structure of the form [FT, MT, JT], and store the discovered three-layer tensor structures in a linked list, where FT denotes the first-layer tensor, MT the middle-layer tensor, and JT the third-layer tensor.
To achieve the above object, an embodiment of a third aspect of the present invention provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the memory optimization method based on an eager memory reuse algorithm as described above.
To achieve the above object, an embodiment of a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the memory optimization method based on an eager memory reuse algorithm as described above.
According to the memory optimization method based on the eager memory reuse algorithm provided by the embodiment of the invention, a computation graph of the GPU at run time is constructed, and the life cycles of all tensors and all three-layer tensor structures in the computation graph are obtained; the execution order of the three-layer tensor structures is determined; the eager reuse algorithm is executed based on that order; and, according to its allocation result, memory is requested at the starting point of each tensor's life cycle and released at the end of the life cycle. The eager memory reuse algorithm mines additional intermediate tensors in the neural network model whose memory can be reused, reduces the memory consumption of the whole model execution, and eases the problem of insufficient memory capacity, so that larger deep learning models can be deployed on a GPU of limited capacity.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a flow chart of a memory optimization method based on an eager memory reuse algorithm according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the three-layer tensor structure and timely reuse according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of tensor life cycles provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram of the three-layer tensor structure and its timely reuse result according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of the implementation steps of the eager reuse algorithm according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a memory optimization device based on an eager memory reuse algorithm according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The memory optimization method based on the eager memory reuse algorithm according to the embodiment of the invention is described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a memory optimization method based on an eager memory reuse algorithm according to an embodiment of the present invention.
As shown in fig. 1, the memory optimization method based on the eager memory reuse algorithm includes the following steps:
s101: constructing a calculation graph of the GPU in operation, and obtaining the life cycle of all tensors in the calculation graph and all three-layer tensor structures;
the non-overlapping life cycle is the precondition of reusing the same memory between different tensors. Only when the tensor structure is above two layers, the memory space between tensors can be reused, as shown in fig. 2, and the memory can be reused between t3 and t 1. For this purpose, the invention defines a new minimum structure capable of memory reuse, namely a three-layer tensor structure. In the graph structure, there are many three-layer tensor structures, as shown in fig. 2, the nodes (1), (2), and (4) form one three-layer tensor structure, which are respectively defined as a fork node, a middle node, and a join node, and the tensors generated by the three-layer tensor structures are respectively defined as a Fork_tensor (FT), a Middle_tensor (MT), and a Join_tensor (JT). Depending on the execution order and the topology of the graph, all three layers of tensor structures (FT, MT, JT) can be found in each iteration. The life cycle of the FT is closely connected with the life cycle of the JT, in other words, when the life cycle of the FT is finished, the life cycle of the JT is started immediately, and memory reuse can be performed between the two.
The core of the eager reuse algorithm is to reuse the memory of each three-layer tensor structure in time. With the three-layer tensor structure as the minimum unit, JT can reuse the memory space of FT, so when planning memory for FT, the size allocated for it should be max(sizeof(JT), sizeof(FT)). Taking fig. 2 as an example, the allocation result of the three-layer tensor structure (t1, t2, t3) is max(sizeof(t1), sizeof(t3)) + sizeof(t2).
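Expressed as a tiny helper (hypothetical names, sizes in bytes), the planning rule for a single structure is:

    def three_layer_footprint(ft_size, mt_size, jt_size):
        # JT reuses FT's region, so one region of max(FT, JT) plus MT's
        # region covers the whole three-layer tensor structure.
        return max(ft_size, jt_size) + mt_size

    # Fig. 2 example: three_layer_footprint(sizeof(t1), sizeof(t2), sizeof(t3))
    # equals max(sizeof(t1), sizeof(t3)) + sizeof(t2).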
Further, in one embodiment of the invention, obtaining the life cycles of all intermediate tensors in the computation graph includes:
traversing the time slots to construct a life cycle information table whose rows are time slots and whose columns are the node executing in the current time slot, the tensors generated in the current time slot, and the tensors ending in the current time slot, thereby acquiring the life cycles of all tensors.
The life cycle of a tensor is determined by the node that outputs it and the nodes that use it. The algorithm defines the node that outputs a tensor as the starting node of its life cycle; since the starting node outputs the tensor's data during execution, storage space must be allocated for the tensor in advance. A node that uses the tensor is defined as the ending node of its life cycle; after the ending node has used the tensor, the memory space allocated to the tensor may be freed. When several nodes use the same tensor, the last node to use it in execution order is taken as the ending node of the tensor's life cycle. Part c of fig. 3 is the tensor life cycle table obtained from part a of fig. 3.
When the operators are fully serialized, at most one node executes at any moment, and the execution phase of each node corresponds one to one with a time period (Time) on the time axis; as shown in part b of fig. 3, node 1 corresponds to time I. Hence, in the serial case, the life cycle information of all tensors can be obtained either by traversing nodes or by traversing Time, and the two are equivalent for obtaining tensor life cycles.
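For the serial case, the table can be built in one pass over the scheduled nodes; in this sketch the node fields ("name", "outputs", "inputs") are assumed for illustration:

    def build_lifecycle_table(schedule):
        # schedule: nodes in serial execution order, so node i occupies
        # time slot i. Returns each tensor's (start, end) slots plus the
        # per-slot life cycle information table.
        start, end = {}, {}
        for slot, node in enumerate(schedule):
            for t in node["outputs"]:
                start[t] = slot        # producing node opens the life cycle
            for t in node["inputs"]:
                end[t] = slot          # the last consumer seen closes it
        table = [{"time": slot,
                  "node": node["name"],
                  "generated": list(node["outputs"]),
                  "ending": [t for t in node["inputs"] if end[t] == slot]}
                 for slot, node in enumerate(schedule)]
        return start, end, table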
When operators are allowed to run in parallel, several nodes may execute at the same moment, and one time period on the time axis may correspond to several executing nodes. When inter-operator parallelism is maximized, the execution times of operators differ, and to maximize the utilization of chip resources a scheduling algorithm may dynamically adjust the execution order of operators without violating dependencies; the execution order therefore cannot be determined in advance, so tensor life cycles cannot be determined and memory reuse cannot be planned. For this reason the present invention, like other algorithms, does not consider inter-operator parallelism for now; enabling memory reuse under inter-operator parallelism is under study.
For the acquisition of the three-layer tensor structures in the task graph, the following definitions are given first:
node_num: a list storing all node numbers in the task graph;
adj_num[idx]: a list storing the numbers of all nodes adjacent to the idx-th node.
The corresponding Algorithm 1 is then given as follows:
Input: a computation graph;
Output: all three-layer tensor structures in the computation graph, stored in linked list form.
Further, in one embodiment of the present invention, obtaining all three-layer tensor structures in the computation graph includes:
traversing all nodes of the computation graph, treating each node found as a fork node and marking it as node_i;
traversing all adjacent nodes of node_i, treating each node found as a middle node and marking it as node_j;
traversing all adjacent nodes of node_j, treating each node found as a join node and marking it as node_k;
and, according to the rule that nodes correspond to tensors one to one, converting each discovered three-layer node structure [node_i, node_j, node_k] into [FT, MT, JT], regarding it as one three-layer tensor structure, and appending it to the three-layer tensor structure linked list; a sketch of this traversal is given below.
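In terms of node_num and adj_num, Algorithm 1 amounts to a triple traversal; the sketch below assumes a tensor_of mapping from each node number to the tensor it produces (the one-to-one rule above):

    def find_three_layer_structures(node_num, adj_num, tensor_of):
        # Every fork -> middle -> join path of length two yields one
        # [FT, MT, JT] structure; a list stands in for the linked list.
        structures = []
        for node_i in node_num:                  # fork node, first layer
            for node_j in adj_num[node_i]:       # middle node
                for node_k in adj_num[node_j]:   # join node, third layer
                    structures.append([tensor_of[node_i],
                                       tensor_of[node_j],
                                       tensor_of[node_k]])
        return structures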
S102: determining the execution sequence of the three-layer tensor structure;
The in-degree of each node is recorded at the start, and the three-layer tensor structures found by Algorithm 1 are stored in a linked list; Algorithm 2 below determines the execution order of the three-layer tensor structures.
Input: a linked list of unprocessed three-layer tensor structures;
Output: a linked list of three-layer tensor structures reordered by execution order.
Further, in one embodiment of the present invention, determining the execution order of the three-layer tensor structures includes the following steps (a simplified sketch follows the list):
s401, initializing an index idx=0, where idx indicates that, when the condition is met, the idx-th three-layer tensor structure in the unprocessed three-layer tensor structure linked list is moved to the head of the list; judging whether all tensors in the computation graph have been processed, where marking a three-layer tensor structure as processed also marks its FT, MT, and JT as processed; if all three-layer tensor structures have been processed, resetting the algorithm's modifications to the linked list, the in-degrees, and the flag bits to their state before the algorithm ran, retaining only the move of the idx-th three-layer tensor structure to the head of the unprocessed linked list, and correspondingly judging whether the FT, MT, and JT of the idx-th three-layer tensor structure have been processed; if they have, ending the algorithm; otherwise, going to S402;
s402, decrementing by one the in-degree of every node adjacent to each node of the three-layer tensor structure at the head of the unprocessed three-layer tensor structure linked list;
s403, judging whether the in-degrees of the three nodes of this three-layer tensor structure are now all less than or equal to 0; if so, modifying the flag bit, i.e., marking the three-layer tensor structure as processed and marking its FT, MT, and JT as processed at the same time, and returning to S401; otherwise, going to S404;
s404, setting idx=idx+1, moving the idx-th three-layer tensor structure in the currently unprocessed three-layer tensor structure linked list to the head of that list, and going to S402.
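The linked-list bookkeeping of S401-S404 can be approximated compactly: a three-layer tensor structure becomes executable once all three of its nodes are schedulable. The following sketch is that simplification built on Kahn's topological sort, not the exact resetting procedure above, and its data layout is assumed:

    from collections import deque

    def order_structures(node_num, adj_num, indegree, structures):
        # structures: dicts {"nodes": (fork, middle, join), "tensors": [FT, MT, JT]}.
        # Kahn's topological sort over the graph; a structure is appended to
        # the output as soon as all three of its nodes have been scheduled.
        indeg = dict(indegree)
        ready = deque(n for n in node_num if indeg[n] == 0)
        scheduled, order, pending = set(), [], list(structures)
        while ready:
            n = ready.popleft()
            scheduled.add(n)
            for m in adj_num[n]:
                indeg[m] -= 1
                if indeg[m] == 0:
                    ready.append(m)
            done = [s for s in pending
                    if all(v in scheduled for v in s["nodes"])]
            for s in done:
                order.append(s)
                pending.remove(s)
        return order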
S103: based on the execution sequence, executing an Eager Reuse algorithm, and distributing memory for each tensor;
The key of the eager reuse algorithm is to promptly reuse the reusable part of each three-layer tensor structure; therefore, after the execution order of the three-layer tensor structures has been determined, the algorithm must complete the memory reuse of the structures promptly, in that order.
In a single three-layer tensor structure the memory reuse is obvious, but as the graph becomes more complex the three-layer tensor structures no longer sit on a single level, and the eager reuse algorithm requires a more detailed description, given as Algorithm 3:
Input: a three-layer tensor structure;
Output: the eager reuse result of the current three-layer tensor structure.
Further, in one embodiment of the present invention, performing the eager reuse algorithm based on the execution order includes the following steps (an illustrative sketch follows the list):
s501, for an input three-layer tensor structure, each layer's tensor has a flag bit (tag) initialized to 0; if the tags of the three layers' tensors are all 0, setting the tag of FT to 1; once a tensor's tag has been set to 1 or -1, updating the tags of the three layers by the rule that adjacent tensors carry opposite tags: the tag of MT is the opposite number of the tag of FT, and the tag of JT is the opposite number of the tag of MT; FT, MT, and JT are fed into S502 in turn;
s502, judging the value of the tag: if the tag value is 1, going to S503; if the tag value is -1, going to S504;
s503, traversing the currently allocated tensors, finding the tensor whose base address plus size is largest, taking that value as the base address of the current tensor, and ending the memory pre-allocation of this tensor;
s504, judging whether free space exists at the low-address side of the base address of the tensor the tag value corresponds to; if not, going to S505; if so, going to S506;
s505, traversing the currently allocated tensors, finding the tensor with the minimum base address among them, taking that base address minus the size of the current tensor as the base address of the current tensor, and ending the memory pre-allocation of this tensor;
s506, taking the base address of the previous tensor minus the size of the current tensor to be allocated as the base address of the current tensor, performing memory pre-allocation, and going to S507;
s507, checking whether the current tensor conflicts with other allocated tensors; if there is no conflict, ending the memory pre-allocation of the current tensor; if there is a conflict, going to S508;
and s508, shifting the base address of the conflicting tensor toward higher addresses until all conflicts are resolved, ending the memory pre-allocation of the current tensor.
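The address arithmetic of S502-S508 can be rendered loosely as follows; this sketch folds S504-S506 into a single downward attempt clamped at address 0 and keeps the upward conflict shift of S507-S508, with the tensor representation assumed:

    def conflicts(t, placed):
        # Allocated tensors whose address ranges overlap the current one.
        return [p for p in placed
                if p["base"] < t["base"] + t["size"]
                and t["base"] < p["base"] + p["size"]]

    def eager_place_tensor(t, prev, placed, tag):
        if tag == 1:
            # S503: place above every allocated tensor.
            t["base"] = max((p["base"] + p["size"] for p in placed), default=0)
        else:
            # S504-S506: try the space just below the previous tensor,
            # clamped at address 0.
            t["base"] = max(prev["base"] - t["size"], 0)
            # S507-S508: shift upward past conflicts until the region is free.
            while conflicts(t, placed):
                t["base"] = max(c["base"] + c["size"] for c in conflicts(t, placed))
        placed.append(t)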
FIG. 4 shows a typical three-layer tensor structure and structures derived from it, which can serve as a reference for verifying the algorithm.
S104: according to the allocation result of the eager reuse algorithm, requesting memory promptly at the starting point of each tensor's life cycle, and releasing the memory promptly at the end of the life cycle.
From preprocessing to the end of allocation, the overall algorithm flow is summarized as shown in fig. 5:
1) Preprocessing: obtain the life cycles of all tensors and all three-layer tensor structures in the graph.
2) Determine the execution order of the three-layer tensor structures: determine the order according to Algorithm 2.
3) Execute the Eager Reuse algorithm: based on the determined execution order of the three-layer tensor structures, execute Eager Reuse according to Algorithm 3.
4) Memory allocation: according to the allocation result of the Eager Reuse algorithm, for each tensor, request memory promptly at the starting point of its life cycle and release it promptly at the end of the life cycle; a run-time sketch follows.
At this point the Eager Reuse algorithm is complete, as shown in fig. 5.
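Step 4) then reduces to a run-time loop of the following shape, where the pool allocator with reserve/release methods, the plan mapping, and the node "run" callable are assumptions made for illustration:

    def run_with_plan(schedule, end, plan, pool):
        # plan[t]: the offset and size pre-computed by the Eager Reuse
        # algorithm; pool: an allocator exposing reserve(offset, size)
        # and release(offset, size); end[t]: t's last-use slot.
        for slot, node in enumerate(schedule):
            for t in node["outputs"]:
                pool.reserve(plan[t]["offset"], plan[t]["size"])  # life cycle starts
            node["run"]()                                         # execute the operator
            for t in node["inputs"]:
                if end[t] == slot:                                # last use of t
                    pool.release(plan[t]["offset"], plan[t]["size"])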
Compared with the existing large tensor priority algorithm and short life cycle priority algorithm, the eager memory reuse algorithm achieves a higher memory reuse rate on average and a better memory reuse effect, with overhead of the same order as both.
According to the memory optimization method based on the eager memory reuse algorithm provided by the embodiment of the invention, a computation graph of the GPU at run time is constructed, and the life cycles of all tensors and all three-layer tensor structures in the computation graph are obtained; the execution order of the three-layer tensor structures is determined; the eager reuse algorithm is executed based on that order; and, according to its allocation result, memory is requested at the starting point of each tensor's life cycle and released at the end of the life cycle. Memory reuse thus reduces memory consumption, eases the conflict between limited memory capacity and high demand, and enables larger deep learning models to be deployed within limited GPU memory.
In order to implement the above embodiment, the present invention also provides a memory optimization device based on the eager memory reuse algorithm.
Fig. 6 is a schematic structural diagram of a memory optimization device based on an eager memory reuse algorithm according to an embodiment of the present invention.
As shown in fig. 6, the memory optimization device based on the eager memory reuse algorithm includes: an acquisition module 100, an ordering module 200, an execution module 300, and an allocation module 400, wherein,
the acquisition module is used for constructing a computation graph of the GPU at run time and acquiring the life cycles of all tensors and all three-layer tensor structures in the computation graph;
the ordering module is used for determining the execution order of the three-layer tensor structures;
the execution module is used for executing an Eager Reuse algorithm based on the execution order and allocating memory for each tensor;
and the allocation module is used for requesting memory at the starting point of each tensor's life cycle according to the allocation result of the Eager Reuse algorithm, and releasing the memory at the end point of the life cycle.
Further, in an embodiment of the present invention, the acquisition module is further configured to:
traverse the time periods to construct a life cycle information table whose rows are time periods and whose columns are the node executing in the current time period, the tensors generated in the current time period, and the tensors ending in the current time period, thereby acquiring the life cycles of all tensors.
Further, in an embodiment of the present invention, the acquisition module is further configured to:
traverse all nodes of the computation graph, treating each node found as a fork node and marking it as a first-layer node node_i;
traverse all adjacent nodes of node_i, treating each node found as a middle node and marking it as a middle-layer node node_j;
traverse all adjacent nodes of node_j, treating each node found as a join node and marking it as a third-layer node node_k;
and, according to the rule that nodes correspond to tensors one to one, convert each discovered three-layer node structure into a three-layer tensor structure of the form [FT, MT, JT], and store the discovered three-layer tensor structures in a linked list, where FT denotes the first-layer tensor, MT the middle-layer tensor, and JT the third-layer tensor.
To achieve the above object, an embodiment of a third aspect of the present invention provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the memory optimization method based on the eager memory reuse algorithm as described above.
To achieve the above object, an embodiment of a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the memory optimization method based on the eager memory reuse algorithm as described above.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (8)

1. A memory optimization method based on an eager memory reuse algorithm, characterized by comprising the following steps:
constructing a computation graph of the GPU at run time, and obtaining the life cycles of all tensors and all three-layer tensor structures in the computation graph;
determining an execution order of the three-layer tensor structures;
executing an Eager Reuse algorithm based on the execution order, and allocating memory for each tensor;
requesting memory at the starting point of each tensor's life cycle according to the allocation result of the Eager Reuse algorithm, and releasing the memory at the end point of the life cycle;
wherein the obtaining all three-layer tensor structures in the computation graph comprises: traversing all nodes of the computation graph, treating each node found as a fork node and marking it as a first-layer node node_i;
traversing all adjacent nodes of node_i, treating each node found as a middle node and marking it as a middle-layer node node_j;
traversing all adjacent nodes of node_j, treating each node found as a join node and marking it as a third-layer node node_k;
according to the rule that nodes correspond to tensors one to one, converting each discovered three-layer node structure into a three-layer tensor structure of the form [FT, MT, JT], and storing the discovered three-layer tensor structures in a linked list, where FT denotes the first-layer tensor, MT the middle-layer tensor, and JT the third-layer tensor.
2. The method of claim 1, wherein the obtaining the life cycles of all tensors in the computation graph comprises:
traversing the time slots to construct a life cycle information table whose rows are time slots and whose columns are the node executing in the current time slot, the tensors generated in the current time slot, and the tensors ending in the current time slot, thereby acquiring the life cycles of all tensors.
3. The method of claim 1, wherein determining the execution order of the three-layer tensor structures comprises:
s401, initializing an index idx=0, where idx indicates that, when the condition is met, the idx-th three-layer tensor structure in the unprocessed three-layer tensor structure linked list is moved to the head of the list; judging whether all tensors in the computation graph have been processed, where marking a three-layer tensor structure as processed also marks its FT, MT, and JT as processed; if all three-layer tensor structures have been processed, resetting the algorithm's modifications to the linked list, the in-degrees, and the flag bits to their state before the algorithm ran, retaining only the move of the idx-th three-layer tensor structure to the head of the unprocessed linked list, and correspondingly judging whether the FT, MT, and JT of the idx-th three-layer tensor structure have been processed; if they have, ending the algorithm; otherwise, going to S402;
s402, decrementing by one the in-degree of every node adjacent to each node of the three-layer tensor structure at the head of the unprocessed three-layer tensor structure linked list;
s403, judging whether the in-degrees of the three nodes of this three-layer tensor structure are now all less than or equal to 0; if so, modifying the flag bit, i.e., marking the three-layer tensor structure as processed and marking its FT, MT, and JT as processed at the same time, and returning to S401; otherwise, going to S404;
s404, setting idx=idx+1, moving the idx-th three-layer tensor structure in the currently unprocessed three-layer tensor structure linked list to the head of that list, and going to S402.
4. The method according to claim 3, wherein the executing an Eager Reuse algorithm based on the execution order comprises:
s501, for an input three-layer tensor structure, defining a flag bit (tag) for each layer's tensor to represent the relative position at which the tensor is placed; its value is one of -1, 0, and 1, where 0 is the initial value, 1 means that the adjacent tensor is placed at a higher memory address than the current tensor, and -1 means that the adjacent tensor is placed at a lower memory address than the current tensor; if the tags of all three layers' tensors are 0, setting the tag of FT to 1; the base address of the FT of the first three-layer tensor structure is 0, and memory is allocated for it; the tags of FT, MT, and JT are then updated according to the principle that the tags of adjacent tensors are opposite numbers; FT, MT, and JT are fed into S502 in turn;
s502, judging the tag value of the previously allocated tensor: if the tag value is 1, going to S503; if the tag value is -1, going to S504;
s503, traversing the currently allocated tensors, finding the tensor whose base address plus size is largest, taking that value as the base address of the current tensor, and ending the memory pre-allocation of the current tensor;
s504, judging whether free space exists at the low-address side of the allocated tensor; if not, going to S505; if so, going to S506;
s505, traversing the currently allocated tensors, taking the minimum base address among them minus the size of the current tensor as the base address of the current tensor, and ending the memory pre-allocation of the current tensor;
s506, taking the base address of the previous tensor minus the size of the current tensor to be allocated as the base address of the current tensor, performing memory pre-allocation, and going to S507;
s507, checking whether the current tensor conflicts with other allocated tensors; if there is no conflict, ending the memory pre-allocation of the current tensor; if there is a conflict, going to S508;
and s508, shifting the base address of the conflicting tensor toward higher addresses until all conflicts are resolved, and ending the memory pre-allocation of the current tensor.
5. A memory optimization device based on an eager memory reuse algorithm, characterized by comprising the following modules:
the acquisition module, used for constructing a computation graph of the GPU at run time and acquiring the life cycles of all tensors and all three-layer tensor structures in the computation graph;
the ordering module, used for determining the execution order of the three-layer tensor structures;
the execution module, used for executing an Eager Reuse algorithm based on the execution order and allocating memory for each tensor;
the allocation module, used for requesting memory at the starting point of each tensor's life cycle according to the allocation result of the Eager Reuse algorithm, and releasing the memory at the end point of the life cycle;
wherein the acquisition module is further configured to:
traverse all nodes of the computation graph, treating each node found as a fork node and marking it as a first-layer node node_i;
traverse all adjacent nodes of node_i, treating each node found as a middle node and marking it as a middle-layer node node_j;
traverse all adjacent nodes of node_j, treating each node found as a join node and marking it as a third-layer node node_k;
and, according to the rule that nodes correspond to tensors one to one, convert each discovered three-layer node structure into a three-layer tensor structure of the form [FT, MT, JT], and store the discovered three-layer tensor structures in a linked list, where FT denotes the first-layer tensor, MT the middle-layer tensor, and JT the third-layer tensor.
6. The apparatus of claim 5, wherein the acquisition module is further configured to:
traversing the time periods to construct a life cycle information table whose rows are time periods and whose columns are the node executing in the current time period, the tensors generated in the current time period, and the tensors ending in the current time period, thereby acquiring the life cycles of all tensors.
7. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the memory optimization method based on the eager memory reuse algorithm according to any of claims 1-4.
8. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the memory optimization method based on the eager memory reuse algorithm according to any of claims 1-4.
CN202310883730.XA 2023-07-19 2023-07-19 Memory optimization method based on eager memory reuse algorithm Active CN116610456B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310883730.XA | 2023-07-19 | 2023-07-19 | Memory optimization method based on eager memory reuse algorithm

Publications (2)

Publication Number | Publication Date
CN116610456A (en) | 2023-08-18
CN116610456B (en) | 2023-09-26

Family

ID=87676805

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310883730.XA (Active) | Memory optimization method based on eager memory reuse algorithm | 2023-07-19 | 2023-07-19

Country Status (1)

Country | Link
CN | CN116610456B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941494A (en) * 2019-12-02 2020-03-31 哈尔滨工程大学 Deep learning-oriented GPU parallel computing data processing method
CN112559165A (en) * 2019-09-25 2021-03-26 阿里巴巴集团控股有限公司 Memory management method and device, electronic equipment and computer readable storage medium
CN113806078A (en) * 2021-08-27 2021-12-17 南京中科逆熵科技有限公司 Memory scheduling method for edge ai inference framework
US11556757B1 (en) * 2020-12-10 2023-01-17 Neuralmagic Ltd. System and method of executing deep tensor columns in neural networks
CN116302461A (en) * 2022-08-05 2023-06-23 阿里巴巴(中国)有限公司 Deep learning memory allocation optimization method and system


Also Published As

Publication number Publication date
CN116610456A (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN115248728B (en) Distributed training task scheduling method, system and device for intelligent computing
CN111738434B (en) Method for executing deep neural network on heterogeneous processing unit
CN103970602B (en) Data flow program scheduling method oriented to multi-core processor X86
JPH09171503A (en) Method and apparatus for parallel processing
WO2022002021A1 (en) Memory space pre-allocation system in static network, and method thereof
CN110889497B (en) Learning task compiling method of artificial intelligence processor and related product
CN102937918A (en) Data block balancing method in operation process of HDFS (Hadoop Distributed File System)
CN110766145A (en) Learning task compiling method of artificial intelligence processor and related product
CN113037800B (en) Job scheduling method and job scheduling device
CN110969362A (en) Multi-target task scheduling method and system under cloud computing system
CN113807714B (en) Method, apparatus, device, storage medium and program product for resource allocation
CN111400868A (en) Distributed workshop scheduling optimization method and system with order and robot carrying functions
CN116302461A (en) Deep learning memory allocation optimization method and system
Takeda et al. Sensory uncertainty field for mobile robot navigation
Feljan et al. Task allocation optimization for multicore embedded systems
CN108108242B (en) Storage layer intelligent distribution control method based on big data
CN110766146B (en) Learning task compiling method of artificial intelligence processor and related product
CN116610456B (en) Memory optimization method based on eager memory reuse algorithm
CN104239520B (en) A kind of HDFS data block Placement Strategies based on historical information
CN117234710A (en) Method for realizing memory optimization of AI model training by reinforcement learning
CN106055862A (en) Novel efficient heuristic-type two-stage parallel branch-and-bound method
CN115496373A (en) Task allocation method and device applied to agile management platform
CN111597035A (en) Simulation engine time advancing method and system based on multiple threads
CN110399124A (en) A kind of code generating method, device, equipment and readable storage medium storing program for executing
WO2019086764A1 (en) Graphics engine resource management and allocation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant