CN117032954A - Memory optimization method, system, device, and medium for a terminal training model

Memory optimization method, system, device, and medium for a terminal training model

Info

Publication number
CN117032954A
Authority
CN
China
Prior art keywords
memory
model
terminal
tensor
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310876845.6A
Other languages
Chinese (zh)
Other versions
CN117032954B (en)
Inventor
赵凤英 (Zhao Fengying)
王启鹏 (Wang Qipeng)
陈震鹏 (Chen Zhenpeng)
陆璇 (Lu Xuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Fanrui Technology Partnership LP
Original Assignee
Beijing Fanrui Technology Partnership LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Fanrui Technology Partnership LP filed Critical Beijing Fanrui Technology Partnership LP
Priority to CN202310876845.6A
Publication of CN117032954A
Application granted
Publication of CN117032954B
Status: Active (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022Mechanisms to release resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides a memory optimization method, system, device, and medium for a terminal training model, and belongs to the field of data processing. The method comprises the following steps: the terminal obtains the model structure and the memory access profile of a model deployed on the terminal, and sends the memory access profile and the model structure to the cloud; the cloud generates a first computation graph based on the model structure, the first computation graph representing the topological order of a plurality of operators in the model; the cloud derives a second computation graph from the first computation graph; the cloud generates a memory allocation scheme based on the memory access profile and the second computation graph, the memory allocation scheme being used to allocate the memory the terminal has configured for training the model; the cloud sends the second computation graph and the memory allocation scheme to the terminal; the terminal revises the model based on the second computation graph and, based on the memory allocation scheme, allocates the memory configured for training the model before performing model training. The aim of the application is to reduce the memory occupied by training during model training.

Description

Memory optimization method, system, device, and medium for a terminal training model
Technical Field
The embodiments of the present application relate to the technical field of data processing, and in particular to a memory optimization method, system, device, and medium for a terminal training model.
Background
Deep neural networks have become a core component of intelligent applications on mobile devices, and a very large number of applications, such as voice assistants and augmented reality, are now enabled by deep learning. The computing power of the chips in mobile devices keeps increasing, and a large number of optimization algorithms for deep neural networks have emerged in academia; these advances in technology and hardware allow the inference and prediction stages of deep neural networks to be carried out increasingly on the mobile device itself, avoiding the high overhead of offloading the computation.
Furthermore, terminal (on-device) learning is gradually becoming a new training paradigm: the mobile device itself performs model training, which enables strong model personalization and protects private data well. Meanwhile, some existing deep learning methods, such as federated learning and split learning, use terminal learning as a key component and train on local data with the terminal device. However, the resources of a terminal device, such as computing power and memory, are very limited and cannot meet the resource demands of deep neural network training.
For the problem of limited memory during model training, existing research mostly targets servers and Graphics Processing Units (GPUs), and work targeting terminal devices and Central Processing Units (CPUs) is lacking. The prior art is mainly GPU-oriented and includes model compression, recomputation, data swapping, and the like, but some of these techniques are not applicable to terminal devices. First, data swapping mainly moves data between the GPU and the CPU, that is, between GPU memory and main memory, over a high-speed PCIe or NVLink interface, through which transfer rates can reach tens of gigabytes per second; terminal devices lack such high-speed read/write hardware and can only swap through local storage, whose limited read/write speed cannot meet the needs of model training. Second, as for model compression, this technique typically uses a low-precision model representation, but this usually comes at the cost of model performance; we found through experiments that the performance loss from model compression is very significant in federated learning scenarios.
Disclosure of Invention
The embodiments of the present application provide a memory optimization method, system, device, and medium for a terminal training model, aiming to reduce the memory occupied by training during model training.
In a first aspect, an embodiment of the present application provides a memory optimization method for a terminal training model, wherein the method comprises:
the terminal obtains the model structure and the memory access profile of a model deployed on the terminal;
the memory access profile and the model structure are sent to the cloud;
the cloud generates a first computation graph based on the model structure, the first computation graph representing the topological order of a plurality of operators in the model;
the cloud executes the plurality of operators in the first computation graph in topological order, obtains the memory index corresponding to each tensor output by the operators and accumulates the memory indexes, and releases the tensor with the currently largest memory index when the accumulated memory index exceeds a preset threshold;
when a released tensor is used again while execution of the first computation graph continues, the cloud recomputes the tensor and inserts the operator corresponding to the tensor into the topological order of the first computation graph;
after execution of the first computation graph is completed, the cloud takes the first computation graph with the operators for released tensors inserted as a second computation graph;
the cloud generates a memory allocation scheme based on the memory access profile and the second computation graph, the memory allocation scheme being used to allocate the memory the terminal has configured for training the model;
the cloud sends the second computation graph and the memory allocation scheme to the terminal;
the terminal revises the model based on the second computation graph, and allocates the memory configured for training the model based on the memory allocation scheme, so as to perform model training.
Optionally, obtaining the model structure and the memory access profile of the model on the terminal comprises:
the terminal counts the operator types, the number of operators, and the input and output tensors of the operators in the model;
the terminal records the access and release times of memory and the sizes of the accessed memory during model training.
Optionally, obtaining the memory index corresponding to a tensor output by the plurality of operators comprises:
obtaining the memory saved by releasing the tensor and its recomputation time;
the memory index of the tensor is the quotient of the memory saved and the recomputation time.
Optionally, the cloud generating a memory allocation scheme based on the memory access profile and the second computation graph comprises:
while the second computation graph is executed, whenever an operator produces a tensor, obtaining the memory footprint of that tensor;
whenever more than one tensor produced by operator computation in the second computation graph has been obtained, comparing the memory footprints of the obtained tensors and sorting the tensors in descending order;
based on the descending order of tensor memory footprints, allocating the simulated memory on the cloud to the tensors according to a two-dimensional bin-packing method, the simulated memory being the same size as the memory configured on the terminal for training the model;
generating the memory allocation scheme based on how the simulated memory on the cloud is allocated.
Optionally, allocating the simulated memory on the cloud to the tensors according to the two-dimensional bin-packing method based on the descending order of tensor memory footprints comprises:
when allocating the simulated memory to the plurality of tensors, preferentially assigning the lowest memory address in the simulated memory to the tensor with the largest memory footprint.
Optionally, the terminal revising the model based on the second computation graph and allocating the memory configured by the terminal for training the model based on the memory allocation scheme comprises:
the terminal rearranging the topological order of a plurality of operators in the model based on the second computation graph;
obtaining the memory the terminal has configured for training the model, and, based on the memory allocation scheme, setting a memory address offset for the memory occupied by each tensor produced by executing the second computation graph.
In a second aspect, an embodiment of the present application provides a memory optimization system for a terminal training model, wherein the system comprises a statistics module, a first computation graph generation module, a second computation graph generation module, a memory allocation scheme generation module, and an execution module;
the statistics module is used for obtaining the model structure and the memory access profile of a model deployed on the terminal, and for sending the memory access profile and the model structure to the cloud;
the first computation graph generation module is used for the cloud to generate a first computation graph based on the model structure, the first computation graph representing the topological order of a plurality of operators in the model;
the second computation graph generation module is used for executing the plurality of operators in the first computation graph in topological order, obtaining the memory index corresponding to each tensor output by the operators and accumulating the memory indexes, and releasing the tensor with the currently largest memory index when the accumulated memory index exceeds a preset threshold; for the cloud to recompute a released tensor and insert the operator corresponding to the tensor into the topological order of the first computation graph when the released tensor is used again while execution of the first computation graph continues; and, after the cloud has finished executing the first computation graph, for taking the first computation graph with the operators for released tensors inserted as the second computation graph;
the memory allocation scheme generation module is used for the cloud to generate a memory allocation scheme based on the memory access profile and the second computation graph, the memory allocation scheme being used to allocate the memory the terminal has configured for training the model;
the execution module is used for the cloud to send the second computation graph and the memory allocation scheme to the terminal, and for the terminal to revise the model based on the second computation graph and allocate the memory configured for training the model based on the memory allocation scheme, so as to perform model training.
In a third aspect, an embodiment of the present application provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the memory optimization method for a terminal training model according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the memory optimization method for a terminal training model according to the first aspect.
The beneficial effects are as follows: when a model is to be trained, the cloud can generate a first computation graph from the model structure and the memory access profile, and by executing the first computation graph it obtains the many tensors produced during model training; these tensors are what make training occupy a large amount of memory. Therefore, a preset threshold is set: when the cloud executes the plurality of operators in the first computation graph and the memory occupied by the tensors produced so far exceeds the preset threshold, the tensor with the largest memory index is temporarily released, reducing the memory it occupies; when that tensor is needed again, it is recomputed and the operator corresponding to it is inserted into the first computation graph to form the second computation graph. When the terminal then trains the model, whenever the tensors of the second computation graph exceed the preset threshold, the tensor with the largest memory index is likewise released and later recomputed, again reducing the memory occupied by the tensor with the largest memory index. In addition, the memory allocation scheme generated from the memory access profile and the second computation graph allocates memory for the tensors produced during training, achieving the effect of reducing the memory the model occupies during model training.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for a person skilled in the art, other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart illustrating a memory optimization method for a terminal training model according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a memory optimization method for a terminal training model according to another embodiment of the present application;
FIG. 3 is a flowchart illustrating the steps for generating the second computation graph in a memory optimization method for a terminal training model according to an embodiment of the present application;
fig. 4 is a functional block diagram of a memory optimization system for a terminal training model according to a second embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without inventive effort shall fall within the protection scope of the present application.
Example 1
Referring to fig. 1, a flowchart of the steps of a memory optimization method for a terminal training model in an embodiment of the present application is shown. The method may specifically comprise the following steps:
S101, the terminal obtains the model structure and the memory access profile of a model deployed on the terminal, and sends the memory access profile and the model structure to the cloud;
When a neural network model is trained, the memory optimization needs to be placed on the cloud, because optimizing memory consumes computing resources: relying only on the terminal's own computing resources would stall the terminal and make the optimization slow.
When optimizing for the model, the basic profile of the model can be obtained by counting the model structure of the model deployed on the terminal, after which the memory access profile on the terminal is obtained. Through the model structure and the memory access profile, the model can be reconstructed on the cloud, so that the cloud can optimize the memory generated during model training, achieving the effect of saving memory when the terminal trains the model.
S102, the cloud generates a first computation graph based on the model structure, the first computation graph representing the topological order of a plurality of operators in the model;
Whatever its type, a model comprises a plurality of operators, which are its basic building blocks. During training, the operators are computed in forward or backward propagation, and the final model training result is obtained through these operator computations. The first computation graph represents a topological order over the plurality of operators in the model, and when the model executes, the operators compute in that order. A minimal sketch of deriving such an order follows.
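For illustration only, the sketch below derives a topological order with Kahn's algorithm over a graph whose nodes are operators and whose edges are producer-to-consumer dependencies; this data model and the Python rendering are assumptions made for the example, not the patent's own representation.

    from collections import deque

    def topological_order(num_ops, edges):
        """edges: (producer_op, consumer_op) index pairs."""
        succ = [[] for _ in range(num_ops)]
        indegree = [0] * num_ops
        for u, v in edges:
            succ[u].append(v)
            indegree[v] += 1
        ready = deque(op for op in range(num_ops) if indegree[op] == 0)
        order = []
        while ready:
            u = ready.popleft()           # operator whose inputs are all ready
            order.append(u)
            for v in succ[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
        return order                      # shorter than num_ops only on a cycle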
S103, the cloud executes the plurality of operators in the first computation graph in topological order, obtains the memory index corresponding to each tensor output by the operators and accumulates the memory indexes, and releases the tensor with the currently largest memory index when the accumulated memory index exceeds a preset threshold; when a released tensor is used again while execution of the first computation graph continues, the cloud recomputes the tensor and inserts the operator corresponding to the tensor into the topological order of the first computation graph; after execution of the first computation graph is completed, the cloud takes the first computation graph with the operators for released tensors inserted as the second computation graph;
In this scheme, optimization mainly targets the memory occupied by tensors during training. First, the plurality of operators in the first computation graph are executed in topological order on the cloud, so that all tensors arising during model training can be obtained; when the accumulated memory index exceeds the preset threshold, releasing the tensor with the currently largest memory index frees the memory that tensor occupies, achieving the memory-optimization effect. Because the first computation graph is executed by simulation on the cloud during this tensor-gathering process, no terminal computing resources are used, which further saves the terminal's resources.
S104, the cloud generates a memory allocation scheme based on the memory access profile and the second computation graph, the memory allocation scheme being used to allocate the memory the terminal has configured for training the model;
Based on the memory access profile and the second computation graph, the cloud simulates how the tensors occupy memory during model training, and then assigns memory to the plurality of tensors so as to save memory; this simulation is completed on the cloud and occupies no terminal resources.
S105, the cloud sends the second computation graph and the memory allocation scheme to the terminal; the terminal revises the model based on the second computation graph and allocates the memory configured for training the model based on the memory allocation scheme, so as to perform model training.
After obtaining the second computation graph and the memory allocation scheme by simulation, the cloud sends them to the terminal. Once the terminal rearranges the plurality of operators in the model according to the second computation graph, the topological order of the operators changes, some operators are executed multiple times, and the order in which the second computation graph produces tensors changes accordingly; during model training, some tensors occupying larger memory are released, achieving the effect of optimizing the terminal's memory.
In this embodiment, when a model is to be trained, the cloud can generate a first computation graph from the model structure and the memory access profile, and by executing the first computation graph it obtains the many tensors produced during model training; these tensors are what make training occupy a large amount of memory. Therefore, a preset threshold is set: when the cloud executes the plurality of operators in the first computation graph and the memory occupied by the tensors produced so far exceeds the preset threshold, the tensor with the largest memory index is temporarily released, reducing the memory it occupies; when that tensor is needed again, it is recomputed and the operator corresponding to it is inserted into the first computation graph to form the second computation graph. When the terminal then trains the model, whenever the tensors of the second computation graph exceed the preset threshold, the tensor with the largest memory index is likewise released and later recomputed, again reducing the memory occupied by the tensor with the largest memory index. In addition, the memory allocation scheme generated from the memory access profile and the second computation graph allocates memory for the tensors produced during training, achieving the effect of reducing the memory the model occupies during model training.
Referring to fig. 2, a flowchart of the steps of a memory optimization method for a terminal training model in another embodiment of the present application is shown. The method may specifically comprise the following steps:
S101, the terminal obtains the model structure and the memory access profile of a model deployed on the terminal, and sends the memory access profile and the model structure to the cloud.
By counting the model structure and the memory access profile of the model and sending them to the cloud, the model can be simulated, that is, reconstructed, on the cloud. With the model reconstructed there, memory optimization can be performed directly on the cloud, the resulting memory optimization scheme is sent to the terminal, and the terminal optimizes its own memory during model training according to that scheme. Because obtaining the memory optimization scheme mainly uses cloud computing resources, the terminal's computing resources are saved.
The terminal counts the operator types, the number of operators, and the input and output tensors of the operators in the model;
the terminal records the access and release times of memory and the sizes of the accessed memory during model training.
On the terminal, a measurement training run is first needed: the training model is executed once, all required information is collected by statistics, and the information is transmitted to the cloud. During this first training pass, the original model is used, and code instrumentation is mainly employed to record the operator types, the number of operators, and the tensors within the operators of the model, as well as the access and release times of memory and the sizes of the accessed memory during training. A sketch of such instrumentation follows.
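The hooks below are a minimal sketch of the code instrumentation described above; the patent does not specify an interface, so every name here (MeasurementRun, on_access, and so on) is hypothetical.

    import time
    from collections import Counter

    class MeasurementRun:
        """Hypothetical hooks for the one-off measurement training pass."""
        def __init__(self):
            self.op_types = Counter()    # operator type -> number of occurrences
            self.mem_events = []         # (tensor_id, event, size_bytes, timestamp)

        def on_operator(self, op_type):
            self.op_types[op_type] += 1

        def on_access(self, tensor_id, size_bytes):
            self.mem_events.append((tensor_id, "access", size_bytes, time.monotonic()))

        def on_release(self, tensor_id, size_bytes):
            self.mem_events.append((tensor_id, "release", size_bytes, time.monotonic()))

        def profile(self):
            # the model structure statistics and memory access profile sent to the cloud
            return {"op_types": dict(self.op_types), "mem_events": self.mem_events}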
S102, the cloud generates a first computation graph based on the model structure, the first computation graph representing the topological order of a plurality of operators in the model;
After the terminal completes the statistics, it sends the memory access profile and the model structure to the cloud. The cloud can reconstruct the model from the memory access profile and the model structure and generate the first computation graph, so that the plurality of operators in the model are computed and rearranged through the first computation graph to produce the memory optimization scheme and the second computation graph.
S103, generating a second computation graph based on the first computation graph; referring to fig. 3, the sub-steps of generating the second computation graph are shown, comprising the following sub-steps:
S1031, the cloud executes the plurality of operators in the first computation graph in topological order, obtains the memory index corresponding to each tensor output by the operators and accumulates the memory indexes, and releases the tensor with the currently largest memory index when the accumulated memory index exceeds a preset threshold;
In this embodiment, the memory index specifically adopts a memory-saved-per-second index, which relates the memory freed by releasing a tensor within a period of time to the cost of recomputing that tensor later; the larger its value, the more beneficial releasing the tensor is.
The memory index is obtained through the following steps:
A: obtain the memory saved by releasing the tensor and its recomputation time;
The memory saved by a tensor is the amount of memory space that can be freed after that single tensor is released; the recomputation time is the time needed to recompute the tensor through its operator after the memory has been released.
B: the memory index of the tensor is the quotient of the memory saved and the recomputation time.
When releasing tensors, using the memory-saved-per-second value as the concrete memory index and selecting the tensor with the largest such value for release means that, when the tensor is used again, the trade-off between its recomputation cost and the memory saved by releasing it is optimal, which maximizes the memory-saving benefit brought by releasing a tensor. A sketch of the computation follows.
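As a worked rendering of steps A and B above (the function and variable names are illustrative):

    def memory_index(saved_bytes, recompute_seconds):
        # step B: quotient of the memory freed by releasing the tensor (step A)
        # and the time needed to recompute it later
        return saved_bytes / recompute_seconds

    def choose_release_victim(live_tensors):
        # live_tensors: tensor_id -> (saved_bytes, recompute_seconds);
        # the tensor with the largest memory index is the most beneficial to release
        return max(live_tensors, key=lambda tid: memory_index(*live_tensors[tid]))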
S1032, when a released tensor is used again while the cloud continues executing the first computation graph, the tensor is recomputed and the operator corresponding to the tensor is inserted into the topological order of the first computation graph.
S1033, after execution of the first computation graph is completed, the cloud takes the first computation graph with the operators for released tensors inserted as the second computation graph;
When the first computation graph is executed, its operator topological order is consistent with the operator topological order inside the terminal's model. During execution, the cloud releases a number of tensors and recomputes them when they need to be reused, so the operators corresponding to the recomputed tensors are inserted into the first computation graph; the second computation graph therefore has more operators than the first computation graph. In the second computation graph, the topological order of the plurality of operators is rearranged according to the tensors that were released and re-inserted. A sketch of this whole release-and-recompute pass is given below.
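A minimal sketch of steps S1031 to S1033, under the assumptions that each operator produces exactly one tensor, that every input tensor is produced by an earlier operator in the graph, and that sizes, recomputation times, and the threshold are known from the measurement run; the Op structure and its field names are invented for the example.

    class Op:
        """Illustrative operator record; the output tensor's size and measured
        recomputation time travel with the operator that produces it."""
        def __init__(self, name, inputs, out_id, out_size, recompute_s):
            self.name, self.inputs = name, inputs              # inputs: tensor ids
            self.out_id, self.out_size = out_id, out_size
            self.recompute_s = recompute_s

    def build_second_graph(first_graph, threshold):
        producer = {op.out_id: op for op in first_graph}
        second, live, released, used = [], {}, set(), 0
        for op in first_graph:                  # already in topological order
            for tid in op.inputs:               # S1032: recompute released inputs
                if tid in released:
                    second.append(producer[tid])        # re-inserted operator
                    released.discard(tid)
                    live[tid] = producer[tid]
                    used += producer[tid].out_size
            second.append(op)
            live[op.out_id] = op
            used += op.out_size
            while used > threshold:             # S1031: release the best victim
                candidates = [t for t in live if t != op.out_id]
                if not candidates:
                    break
                victim = max(candidates,
                             key=lambda t: live[t].out_size / live[t].recompute_s)
                used -= live[victim].out_size
                released.add(victim)
                del live[victim]
        return second                           # S1033: the second graph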
S104, the cloud generates a memory allocation scheme based on the memory access profile and the second computation graph, the memory allocation scheme being used to allocate the memory the terminal has configured for training the model. Specifically:
S1041, while the second computation graph is executed, whenever an operator produces a tensor, the memory footprint of that tensor is obtained;
While the second computation graph is executed, each operator produces a tensor; by obtaining the tensor's memory footprint, the cloud can judge how much memory the tensor will need to occupy during model training.
S1042, whenever more than one tensor produced by operator computation in the second computation graph has been obtained, the memory footprints of the obtained tensors are compared and the tensors are sorted in descending order;
S1043, based on the descending order of tensor memory footprints, the simulated memory on the cloud is allocated to the tensors according to a two-dimensional bin-packing method, the simulated memory being the same size as the memory configured on the terminal for training the model.
The two-dimensional bin-packing method is as follows: when allocating the simulated memory to the plurality of tensors, preferentially assign the lowest memory address in the simulated memory to the tensor with the largest memory footprint.
Allocating the simulated memory with the two-dimensional bin-packing method keeps fragmented, unusable gaps in the simulated memory as small as possible, so that the simulated memory is utilized maximally. A sketch of one such packing is given after this paragraph.
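One way to realize the packing described above treats the memory address as one axis and the tensor's lifetime during execution as the other; the record layout and the lifetime fields are assumptions for the example, not the patent's stated data model.

    def assign_offsets(tensors):
        """tensors: dicts with 'id', 'size' (bytes), and a lifetime
        ['start', 'end') in execution steps. The largest tensors are placed
        first, each at the lowest offset where it conflicts with no
        already-placed tensor in both address range and lifetime."""
        placed = []                               # (offset, size, start, end)
        plan = {}
        for t in sorted(tensors, key=lambda t: t["size"], reverse=True):
            offset = 0
            for off, size, s, e in sorted(placed):          # ascending offsets
                lifetimes_overlap = s < t["end"] and t["start"] < e
                addresses_collide = off < offset + t["size"] and offset < off + size
                if lifetimes_overlap and addresses_collide:
                    offset = off + size                     # bump past the block
            placed.append((offset, t["size"], t["start"], t["end"]))
            plan[t["id"]] = (offset, t["size"])             # offset + size recorded
        return plan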
When the model is trained, the terminal first allocates sufficient memory for the training model in a single pass. When the memory optimization scheme is generated on the cloud, simulated memory is likewise allocated; its size is consistent with the memory the terminal allocates for the training model in one pass, and its memory addresses are also consistent. Keeping the simulated memory consistent with the memory on the terminal allows the memory optimization scheme generated on the cloud to be applied directly on the terminal.
When the cloud allocates the simulated memory for each tensor, it records the address and size of the memory the tensor needs to occupy, and sets an offset according to the memory address allocated for each tensor. As an example, the process of setting the offset is as follows: after simulating a one-shot allocation of memory of length n, the simulated memory can be represented as the contiguous address space [0, n); the memory used by each tensor is a contiguous interval within this space, which can be represented by coordinates [a, b), with a being the tensor's offset. When the terminal executes, it requests one address space through the malloc library function and obtains a pointer ptr to the start of that space, so the address of each tensor is ptr + X, where X denotes the tensor's offset.
S1044, the memory allocation scheme is generated based on how the simulated memory on the cloud is allocated.
A memory address and a memory size are allocated for each tensor; then, combined with the position of the starting memory address in the simulated memory, the memory address offset corresponding to the address allocated for each tensor is computed. After the computation is completed, the offset and the memory size of each tensor are recorded.
S105, the cloud sends the second computation graph and the memory allocation scheme to the terminal; the terminal revises the model based on the second computation graph and allocates the memory configured for training the model based on the memory allocation scheme, so as to perform model training.
The terminal rearranges the topological order of the plurality of operators in the model based on the second computation graph; it obtains the memory the terminal has configured for training the model and, based on the memory allocation scheme, sets the memory address offset for the memory occupied by each tensor produced by executing the second computation graph.
After receiving the second computation graph and the memory optimization scheme, the terminal rearranges the plurality of operators in the model according to the second computation graph; this rearrangement mainly inserts the operators corresponding to the recomputed tensors into the first computation graph, so that the topological order of the operators in the model is consistent with the topological order of the operators in the second computation graph. As for the memory allocation scheme, the tensors produced by the computation of the plurality of operators in the second computation graph are placed into memory according to the memory address offsets and the memory the terminal has allocated for the training model, as sketched below.
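A terminal-side sketch of applying the scheme: one block is requested once (the analogue of the malloc call described above) and every tensor becomes a view at base + offset; numpy stands in for the training runtime's real buffers, and the names are illustrative.

    import numpy as np

    def bind_tensors(plan, total_bytes):
        # plan: tensor_id -> (offset, size) taken from the memory allocation scheme
        arena = np.zeros(total_bytes, dtype=np.uint8)      # the single allocation (ptr)
        views = {tid: arena[off:off + size]                # address = ptr + X
                 for tid, (off, size) in plan.items()}
        return arena, views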
In this embodiment, when a model is to be trained, the cloud can generate a first computation graph from the model structure and the memory access profile, and by executing the first computation graph it obtains the many tensors produced during model training; these tensors are what make training occupy a large amount of memory. Therefore, a preset threshold is set: when the cloud executes the plurality of operators in the first computation graph and the memory occupied by the tensors produced so far exceeds the preset threshold, the tensor with the largest memory index is temporarily released, reducing the memory it occupies; when that tensor is needed again, it is recomputed and the operator corresponding to it is inserted into the first computation graph to form the second computation graph. When the terminal then trains the model, whenever the tensors of the second computation graph exceed the preset threshold, the tensor with the largest memory index is likewise released and later recomputed, again reducing the memory occupied by the tensor with the largest memory index. In addition, the memory allocation scheme generated from the memory access profile and the second computation graph allocates memory for the tensors produced during training, achieving the effect of reducing the memory the model occupies during model training.
Example 2
Referring to fig. 4, a memory optimization system for a terminal training model in an embodiment of the present application is shown; the system comprises a statistics module, a first computation graph generation module, a second computation graph generation module, a memory allocation scheme generation module, and an execution module;
the statistics module is used for obtaining the model structure and the memory access profile of a model deployed on the terminal, and for sending the memory access profile and the model structure to the cloud;
the first computation graph generation module is used for the cloud to generate a first computation graph based on the model structure, the first computation graph representing the topological order of a plurality of operators in the model;
the second computation graph generation module is used for executing the plurality of operators in the first computation graph in topological order, obtaining the memory index corresponding to each tensor output by the operators and accumulating the memory indexes, and releasing the tensor with the currently largest memory index when the accumulated memory index exceeds a preset threshold; for the cloud to recompute a released tensor and insert the operator corresponding to the tensor into the topological order of the first computation graph when the released tensor is used again while execution of the first computation graph continues; and, after the cloud has finished executing the first computation graph, for taking the first computation graph with the operators for released tensors inserted as the second computation graph;
the memory allocation scheme generation module is used for the cloud to generate a memory allocation scheme based on the memory access profile and the second computation graph, the memory allocation scheme being used to allocate the memory the terminal has configured for training the model;
the execution module is used for the cloud to send the second computation graph and the memory allocation scheme to the terminal, and for the terminal to revise the model based on the second computation graph and allocate the memory configured for training the model based on the memory allocation scheme, so as to perform model training.
Example 3
An embodiment of the present application provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the memory optimization method for a terminal training model described in Example 1.
Example 4
An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the memory optimization method for a terminal training model described in Example 1.
In this specification, each embodiment is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts of the embodiments, reference may be made to one another.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.
Finally, it should further be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any such actual relationship or order between these entities or actions. Moreover, the terms "comprise", "comprising", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device comprising that element.
The principles and embodiments of the present application have been described herein with reference to specific examples; the description of the above embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, a person skilled in the art may make changes to the specific implementation and the scope of application according to the ideas of the present application. In view of the above, the content of this specification shall not be construed as limiting the present application.

Claims (9)

1. A memory optimization method for a terminal training model, wherein the method comprises:
the terminal obtains the model structure and the memory access profile of a model deployed on the terminal;
the memory access profile and the model structure are sent to the cloud;
the cloud generates a first computation graph based on the model structure, the first computation graph representing the topological order of a plurality of operators in the model;
the cloud executes the plurality of operators in the first computation graph in topological order, obtains the memory index corresponding to each tensor output by the operators and accumulates the memory indexes, and releases the tensor with the currently largest memory index when the accumulated memory index exceeds a preset threshold;
when a released tensor is used again while execution of the first computation graph continues, the cloud recomputes the tensor and inserts the operator corresponding to the tensor into the topological order of the first computation graph;
after execution of the first computation graph is completed, the cloud takes the first computation graph with the operators for released tensors inserted as a second computation graph;
the cloud generates a memory allocation scheme based on the memory access profile and the second computation graph, the memory allocation scheme being used to allocate the memory the terminal has configured for training the model;
the cloud sends the second computation graph and the memory allocation scheme to the terminal;
the terminal revises the model based on the second computation graph, and allocates the memory configured for training the model based on the memory allocation scheme, so as to perform model training.
2. The memory optimization method for a terminal training model according to claim 1, wherein obtaining the model structure and the memory access profile of the model on the terminal comprises:
the terminal counts the operator types, the number of operators, and the input and output tensors of the operators in the model;
the terminal records the access and release times of memory and the sizes of the accessed memory during model training.
3. The memory optimization method for a terminal training model according to claim 2, wherein obtaining the memory index corresponding to a tensor output by the plurality of operators comprises:
obtaining the memory saved by releasing the tensor and its recomputation time;
the memory index of the tensor is the quotient of the memory saved and the recomputation time.
4. The memory optimization method for a terminal training model according to claim 2, wherein the cloud generating a memory allocation scheme based on the memory access profile and the second computation graph comprises:
while the second computation graph is executed, whenever an operator produces a tensor, obtaining the memory footprint of that tensor;
whenever more than one tensor produced by operator computation in the second computation graph has been obtained, comparing the memory footprints of the obtained tensors and sorting the tensors in descending order;
based on the descending order of tensor memory footprints, allocating the simulated memory on the cloud to the tensors according to a two-dimensional bin-packing method, the simulated memory being the same size as the memory configured on the terminal for training the model;
generating the memory allocation scheme based on how the simulated memory on the cloud is allocated.
5. The memory optimization method for a terminal training model according to claim 4, wherein allocating the simulated memory on the cloud to the tensors according to the two-dimensional bin-packing method based on the descending order of tensor memory footprints comprises:
when allocating the simulated memory to the plurality of tensors, preferentially assigning the lowest memory address in the simulated memory to the tensor with the largest memory footprint.
6. The memory optimization method for a terminal training model according to claim 5, wherein the terminal revising the model based on the second computation graph and allocating the memory configured by the terminal for training the model based on the memory allocation scheme comprises:
the terminal rearranging the topological order of a plurality of operators in the model based on the second computation graph;
obtaining the memory the terminal has configured for training the model, and, based on the memory allocation scheme, setting a memory address offset for the memory occupied by each tensor produced by executing the second computation graph.
7. A memory optimization system for a terminal training model, wherein the system comprises a statistics module, a first computation graph generation module, a second computation graph generation module, a memory allocation scheme generation module, and an execution module;
the statistics module is used for obtaining the model structure and the memory access profile of a model deployed on the terminal, and for sending the memory access profile and the model structure to the cloud;
the first computation graph generation module is used for the cloud to generate a first computation graph based on the model structure, the first computation graph representing the topological order of a plurality of operators in the model;
the second computation graph generation module is used for executing the plurality of operators in the first computation graph in topological order, obtaining the memory index corresponding to each tensor output by the operators and accumulating the memory indexes, and releasing the tensor with the currently largest memory index when the accumulated memory index exceeds a preset threshold; for the cloud to recompute a released tensor and insert the operator corresponding to the tensor into the topological order of the first computation graph when the released tensor is used again while execution of the first computation graph continues; and, after the cloud has finished executing the first computation graph, for taking the first computation graph with the operators for released tensors inserted as the second computation graph;
the memory allocation scheme generation module is used for the cloud to generate a memory allocation scheme based on the memory access profile and the second computation graph, the memory allocation scheme being used to allocate the memory the terminal has configured for training the model;
the execution module is used for the cloud to send the second computation graph and the memory allocation scheme to the terminal, and for the terminal to revise the model based on the second computation graph and allocate the memory configured for training the model based on the memory allocation scheme, so as to perform model training.
8. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the memory optimization method for a terminal training model according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the memory optimization method for a terminal training model according to any one of claims 1 to 6.
CN202310876845.6A 2023-07-17 2023-07-17 Memory optimization method, system, equipment and medium for terminal training model Active CN117032954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310876845.6A CN117032954B (en) 2023-07-17 2023-07-17 Memory optimization method, system, equipment and medium for terminal training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310876845.6A CN117032954B (en) 2023-07-17 2023-07-17 Memory optimization method, system, equipment and medium for terminal training model

Publications (2)

Publication Number Publication Date
CN117032954A 2023-11-10
CN117032954B CN117032954B (en) 2024-04-26

Family

ID=88628875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310876845.6A Active CN117032954B (en) 2023-07-17 2023-07-17 Memory optimization method, system, equipment and medium for terminal training model

Country Status (1)

Country Link
CN (1) CN117032954B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112882830A (en) * 2021-02-03 2021-06-01 北京迈格威科技有限公司 Video memory management method, video memory management device, model training device, electronic equipment and storage medium
US20220035544A1 (en) * 2020-07-31 2022-02-03 Sigmastar Technology Ltd. Memory allocation method and device, and electronic apparatus
WO2022068663A1 (en) * 2020-09-29 2022-04-07 华为技术有限公司 Memory allocation method, related device, and computer readable storage medium
CN115185692A (en) * 2022-07-18 2022-10-14 北京一流科技有限公司 Memory allocation and release decision system supporting dynamic recalculation and method thereof
CN115878332A (en) * 2023-02-14 2023-03-31 北京燧原智能科技有限公司 Memory resource allocation method, device, equipment and medium in deep learning network
WO2023082575A1 (en) * 2022-04-27 2023-05-19 之江实验室 Graph execution pipeline parallelism method and apparatus for neural network model computation
CN116204847A (en) * 2021-11-29 2023-06-02 华为技术有限公司 Calculation graph optimization method, device and equipment
CN116302461A (en) * 2022-08-05 2023-06-23 阿里巴巴(中国)有限公司 Deep learning memory allocation optimization method and system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220035544A1 (en) * 2020-07-31 2022-02-03 Sigmastar Technology Ltd. Memory allocation method and device, and electronic apparatus
WO2022068663A1 (en) * 2020-09-29 2022-04-07 华为技术有限公司 Memory allocation method, related device, and computer readable storage medium
CN114327844A (en) * 2020-09-29 2022-04-12 华为技术有限公司 Memory allocation method, related device and computer readable storage medium
CN112882830A (en) * 2021-02-03 2021-06-01 北京迈格威科技有限公司 Video memory management method, video memory management device, model training device, electronic equipment and storage medium
CN116204847A (en) * 2021-11-29 2023-06-02 华为技术有限公司 Calculation graph optimization method, device and equipment
WO2023082575A1 (en) * 2022-04-27 2023-05-19 之江实验室 Graph execution pipeline parallelism method and apparatus for neural network model computation
CN115185692A (en) * 2022-07-18 2022-10-14 北京一流科技有限公司 Memory allocation and release decision system supporting dynamic recalculation and method thereof
CN116302461A (en) * 2022-08-05 2023-06-23 阿里巴巴(中国)有限公司 Deep learning memory allocation optimization method and system
CN115878332A (en) * 2023-02-14 2023-03-31 北京燧原智能科技有限公司 Memory resource allocation method, device, equipment and medium in deep learning network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZAN ZONG 等: "STR: Hybrid Tensor Re-Generation to Break Memory Wall for DNN Training", 《IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS》, vol. 34, no. 8, 10 April 2023 (2023-04-10), pages 2403 - 2418 *
QIAN Xuwei: "Research on Dynamic Neural Networks and Their Deployment on Edge Devices", China Masters' Theses Full-text Database, Information Science and Technology, no. 11, 15 November 2022 (2022-11-15), pages 139-9 *
MA Weiliang et al.: "A Survey of Memory Management Issues in Deep Learning", Big Data, vol. 6, no. 4, 10 July 2020 (2020-07-10), pages 56-68 *

Also Published As

Publication number Publication date
CN117032954B (en) 2024-04-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant