CN115269179A - Static memory allocation method, device, equipment and medium

Static memory allocation method, device, equipment and medium

Info

Publication number: CN115269179A
Application number: CN202210835894.0A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 庄宇
Current Assignee: Zhejiang Dahua Technology Co Ltd
Original Assignee: Zhejiang Dahua Technology Co Ltd
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202210835894.0A
Prior art keywords: memory, rectangles, rectangle, tensor, tensors
Legal status: Pending

Classifications

    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being the memory
    • G06F12/0246 Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory, in block erasable memory, e.g. flash memory
    • G06N3/02 Neural networks
    • G06N5/04 Inference or reasoning models


Abstract

The present application relates to the field of data storage technologies, and in particular to a static memory allocation method, apparatus, device and medium for improving the utilization of memory resources. The method comprises the following steps: acquiring a plurality of tensors of a target model, and determining the time information and space size of the memory occupied by each tensor; sorting the tensors according to a preset index, and allocating a head address offset to each tensor according to a preset rule; if the tail address offset of any tensor is greater than a theoretical threshold, adjusting the ordering of the tensors and reallocating head address offsets according to the preset rule, where the tail address offset of each tensor is the sum of its head address offset and space size, and the theoretical threshold is the maximum, over all moments, of the sum of the space sizes of the memory occupied by the tensors at that moment; and, once the plurality of tensors meet any one of a plurality of tuning targets, determining the memory address of each tensor according to its head address offset and tail address offset.

Description

Static memory allocation method, device, equipment and medium
Technical Field
The present application relates to the field of data storage technologies, and in particular, to a static memory allocation method, apparatus, device, and medium.
Background
Deep learning forward inference mainly involves operators and the tensors that hold the operators' intermediate calculation results. To reduce inference time, memory is usually allocated to the tensors in advance. If memory is allocated to each tensor independently, the memory usage of the inference engine is large, and this allocation mode is unsuitable for devices that are sensitive to memory resources.
In fact, the tensors do not all need to coexist: multiple tensors without dependency relationships may share the same memory. Moreover, in forward inference the information of the tensors can be known in advance, so a static memory allocation method can be used to improve the utilization of memory resources.
Disclosure of Invention
The embodiment of the application provides a static memory allocation method, a device, equipment and a medium, which are used for improving the utilization rate of memory resources.
In a first aspect, the present application provides a static memory allocation method, including:
acquiring a plurality of tensors of a target model, and determining time information and space size of a memory occupied by each tensor;
sorting the tensors according to a preset index, and allocating a head address offset to each tensor according to a preset rule; the head address offset represents the offset between the head address of each tensor and the head address of the memory pool; the preset rule includes that the memory addresses of tensors whose time information overlaps do not overlap, and the time information of tensors whose memory addresses overlap does not overlap; and the memory address of each tensor is determined according to its head address offset and space size;
if the tail address offset of any tensor is larger than the theoretical threshold, adjusting the sequence of the tensors, and allocating the head address offset for the tensors again according to the preset rule; the tail address offset of each tensor is the sum of the head address offset and the space size of each tensor, and the theoretical threshold is the maximum value of the sum of the space sizes of the memory occupied by the corresponding tensor at each moment;
and determining the memory address of each tensor according to the head address offset and the tail address offset of each tensor until the tensors meet any one of a plurality of tuning targets.
In the embodiment of the application, when the tail address offset of any tensor is greater than the theoretical threshold, the ordering of the tensors is adjusted and head address offsets are reallocated according to the preset rule until the tensors meet a tuning target, after which the memory address of each tensor is determined according to its head address offset and tail address offset. By adjusting the ordering and setting tuning targets, an ordering as close to optimal as possible is found, a better memory allocation result for the plurality of tensors is obtained, and the utilization of memory resources is improved.
In a possible embodiment, sorting the tensors according to a preset index, and allocating a head address offset to the tensors according to a preset rule includes:
generating a plurality of memory rectangles corresponding to the memory operations; the memory operations represent operations of the tensors occupying memory, the length of each memory rectangle represents the time length of each memory operation, and the width of each memory rectangle represents the space size of each memory operation;
sorting the memory rectangles according to a preset index, and arranging the memory rectangles in the memory container corresponding to the memory pool according to a preset rule; the head address of the memory pool corresponds to the bottom of the memory container, and the minimum distance between each memory rectangle and the bottom is the head address offset of the corresponding tensor.
In a possible embodiment, if there is a tail address offset of any tensor greater than a theoretical threshold, adjusting the ordering of the tensors, and reassigning a head address offset for the tensors according to the preset rule, includes:
if the width of the target rectangle exceeds a theoretical threshold value, adjusting the sequence of the memory rectangles, and rearranging the memory rectangles in the memory container according to the preset rule; the target rectangle is a minimum rectangle containing a plurality of arranged memory rectangles, and the width of the target rectangle represents the maximum value of tail address offsets of the tensors.
In the embodiment of the application, the static memory allocation problem is converted into the two-dimensional rectangular tape packing problem, the memory operation is geometrically and abstractly expressed, memory rectangles are constructed according to the time information and the space size of the memory occupied by each tensor, the memory is converted into rectangles, a memory pool is converted into a memory container, a plurality of rectangles are arranged in the memory container, and the tuning target is achieved by adjusting the positions of the rectangles arranged in the memory container.
In one possible embodiment, after until the plurality of tensors satisfy any one of a plurality of tuning objectives, the method further comprises:
and determining the head address offset and the tail address offset of each tensor according to the position of each memory rectangle in the memory container.
In one possible embodiment, adjusting the ordering of the plurality of memory rectangles includes:
classifying the memory rectangles into high-risk memory rectangles, medium-risk memory rectangles, low-risk memory rectangles and risk-free memory rectangles. The high-risk memory rectangles include each first memory rectangle corresponding to a first tensor whose tail address offset is greater than the theoretical threshold, together with the second memory rectangles overlapping the first memory rectangle in time. The medium-risk memory rectangles include each third memory rectangle corresponding to a second tensor whose tail address offset is equal to the theoretical threshold, together with the fourth memory rectangles overlapping the third memory rectangle in time. The low-risk memory rectangles include the remaining memory rectangles that overlap the high-risk or medium-risk memory rectangles in time. The risk-free memory rectangles are the memory rectangles other than the high-risk, medium-risk and low-risk memory rectangles;
adjusting the ordering of the high-risk memory rectangles and the ordering of the medium-risk memory rectangles respectively;
and merging the adjusted high-risk memory rectangles, the adjusted medium-risk memory rectangles, the low-risk memory rectangles and the risk-free memory rectangles in sequence.
In the embodiment of the application, the memory rectangles are divided into four categories according to the risk size affecting the tuning target, only the sequence of the medium-high risk memory rectangles is adjusted, and the low-risk memory rectangles and the risk-free memory rectangles are simply combined in a descending order according to the risk level, so that the tuning effect is ensured, and meanwhile, the algorithm complexity is reduced.
In one possible embodiment, adjusting the ordering of the high-risk memory rectangles and the ordering of the medium-risk memory rectangles separately comprises:
according to the time overlap length, dividing the high-risk memory rectangles into a plurality of groups, and dividing the medium-risk memory rectangles into a plurality of groups;
sequencing the memory rectangles of each group according to the preset index, and arranging the memory rectangles in the memory container according to the preset rule;
if the width of the target rectangle exceeds the theoretical threshold, adjusting the sequencing of the memory rectangles of each group, and rearranging the memory rectangles in the memory container according to the preset rule;
and exporting the memory rectangles of each group in turn once the memory rectangles of that group meet any one of the plurality of tuning targets after adjustment and sorting, thereby obtaining the adjusted high-risk memory rectangles and the adjusted medium-risk memory rectangles.
In the embodiment of the application, the high-risk memory rectangles and the medium-risk memory rectangles are further divided into a plurality of groups, and the memory rectangles of each group are sequentially optimized, so that the algorithm complexity is further reduced. And clustering division is carried out according to the time overlapping length, and memory rectangles with high overlapping degree are divided into the same group, so that the effectiveness of layout tuning can be improved.
In a possible embodiment, the preset index includes at least one of a length of the memory rectangle, a width of the memory rectangle, and a length of a time overlap between the memory rectangles.
In the embodiment of the application, the memory rectangles are preprocessed, and are sequenced according to one or more indexes, so that the tuning speed can be increased, and the tuning result can be improved.
In a possible embodiment, the tuning targets include: the tail address offset of every tensor is not greater than the theoretical threshold; the same sorting result occurs again; and the number of adjustments is greater than or equal to a preset number of times.
In a second aspect, the present application provides a static memory allocation apparatus, including:
the acquisition module is used for acquiring a plurality of tensors of the target model and determining the time information and the space size of a memory occupied by each tensor;
the allocation module is used for sorting the tensors according to a preset index and allocating a head address offset to each tensor according to a preset rule; the head address offset represents the offset between the head address of each tensor and the head address of the memory pool; the preset rule includes that the memory addresses of tensors whose time information overlaps do not overlap, and the time information of tensors whose memory addresses overlap does not overlap; and the memory address of each tensor is determined according to its head address offset and space size;
the allocation module is further configured to adjust the ordering of the tensors and allocate head address offsets to the tensors again according to the preset rule if the tail address offset of any tensor is greater than a theoretical threshold; the tail address offset of each tensor is the sum of the head address offset and the space size of each tensor, and the theoretical threshold is the maximum value of the sum of the space sizes of the memory occupied by the corresponding tensor at each moment;
and the determining module is used for determining the memory address of each tensor according to the head address offset and the tail address offset of each tensor until the plurality of tensors meet any one of a plurality of tuning targets.
In a possible embodiment, the allocation module is specifically configured to:
generating a plurality of memory rectangles corresponding to the memory operations; the memory operations represent operations of the tensors occupying memory, the length of each memory rectangle represents the time length of each memory operation, and the width of each memory rectangle represents the space size of each memory operation;
sorting the memory rectangles according to a preset index, and arranging the memory rectangles in the memory container corresponding to the memory pool according to a preset rule; the head address of the memory pool corresponds to the bottom of the memory container, and the minimum distance between each memory rectangle and the bottom is the head address offset of the corresponding tensor.
In a possible embodiment, the allocation module is specifically configured to:
if the width of the target rectangle exceeds a theoretical threshold value, adjusting the sequence of the memory rectangles, and rearranging the memory rectangles in the memory container according to the preset rule; the target rectangle is a minimum rectangle comprising a plurality of memory rectangles which are arranged, and the width of the target rectangle represents the maximum value in tail address offset of the tensors.
In a possible embodiment, the determining module is further configured to:
after the tensors meet any one of the tuning targets, determining a head address offset and a tail address offset of each tensor according to the position of each memory rectangle in the memory container.
In a possible embodiment, the allocation module is specifically configured to:
classifying the memory rectangles into high-risk memory rectangles, medium-risk memory rectangles, low-risk memory rectangles and risk-free memory rectangles. The high-risk memory rectangles include each first memory rectangle corresponding to a first tensor whose tail address offset is greater than the theoretical threshold, together with the second memory rectangles overlapping the first memory rectangle in time. The medium-risk memory rectangles include each third memory rectangle corresponding to a second tensor whose tail address offset is equal to the theoretical threshold, together with the fourth memory rectangles overlapping the third memory rectangle in time. The low-risk memory rectangles include the remaining memory rectangles that overlap the high-risk or medium-risk memory rectangles in time. The risk-free memory rectangles are the memory rectangles other than the high-risk, medium-risk and low-risk memory rectangles;
adjusting the ordering of the high-risk memory rectangles and the ordering of the medium-risk memory rectangles respectively;
and merging the adjusted high-risk memory rectangles, the adjusted medium-risk memory rectangles, the low-risk memory rectangles and the risk-free memory rectangles in sequence.
In a possible embodiment, the allocation module is specifically configured to:
according to the time overlap length, dividing the high-risk memory rectangles into a plurality of groups, and dividing the medium-risk memory rectangles into a plurality of groups;
sorting the memory rectangles of each group according to the preset index, and arranging the memory rectangles in the memory container according to the preset rule;
if the width of the target rectangle exceeds the theoretical threshold, adjusting the sequencing of the memory rectangles of each group, and rearranging the memory rectangles in the memory container according to the preset rule;
and exporting the memory rectangles of each group in turn once the memory rectangles of that group meet any one of the plurality of tuning targets after adjustment and sorting, thereby obtaining the adjusted high-risk memory rectangles and the adjusted medium-risk memory rectangles.
In a possible embodiment, the preset index includes at least one of a length of the memory rectangle, a width of the memory rectangle, and a length of a time overlap between the memory rectangles.
In a possible embodiment, the tuning targets include: the tail address offset of every tensor is not greater than the theoretical threshold; the same sorting result occurs again; and the number of adjustments is greater than or equal to a preset number of times.
In a third aspect, the present application provides an electronic device, comprising:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing the method of any of the first aspect according to the obtained program instructions.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any of the first aspects.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present application or in the related art, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic view of an application scenario of a static memory allocation method according to an embodiment of the present application;
fig. 2 is a first flowchart illustrating a static memory allocation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a target rectangle provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a memory allocation device according to an embodiment of the present application;
fig. 5 is a second flowchart illustrating a static memory allocation method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a layout space tuner according to an embodiment of the present disclosure;
fig. 7 is a flowchart of a medium-high risk memory tuner according to an embodiment of the present application;
fig. 8 is a structural diagram of a static memory allocation apparatus according to an embodiment of the present application;
fig. 9 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application. In the present application, the embodiments and the features of the embodiments may be combined with each other without conflict. Also, although a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in an order different from that shown here.
The terms "first" and "second" in the description, claims and drawings of the present application are used to distinguish different objects, not to describe a particular order. Furthermore, the term "comprises" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
In the embodiments of the present application, "a plurality" may mean at least two, for example, two, three, or more, and the embodiments of the present application are not limited.
Before describing the static memory allocation method provided in the embodiment of the present application, for convenience of understanding, first, a detailed description is given to a background technology of the embodiment of the present application.
The traditional tensor memory multiplexing method mainly sorts a plurality of tensors by memory size and allocates memory to each tensor in turn according to the sorting, adding each allocated memory to a pre-allocation list. For a tensor to be allocated, it detects whether the pre-allocation list contains a memory of suitable size whose life cycle does not overlap; if so, that memory is marked as used by the tensor, and if not, a new memory is allocated to the tensor and added to the pre-allocation list. However, if the memory sizes required by the tensors differ greatly, a tensor with a small memory requirement may reuse a large memory, generating a large amount of idle memory, so the utilization of memory resources is low.
In view of this, embodiments of the present application provide a static memory allocation method, which may be executed by a memory allocation device. The memory allocation device may be implemented by a terminal, such as a mobile terminal, a fixed terminal, or a portable terminal, such as a mobile phone, a multimedia computer, a multimedia tablet, a desktop computer, a notebook computer, a tablet computer, or the like, or a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
Some brief descriptions are given below to application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In a specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
Referring to fig. 1, a schematic view of an application scenario of the static memory allocation method provided in the embodiment of the present application is shown, where the application scenario includes a plurality of tensors 101 to be allocated and a memory allocation device 102.
Specifically, after the memory allocation device 102 acquires a plurality of tensors 101 to be allocated, a memory is allocated to each tensor. For the meaning of the tensors, please refer to the above discussion, and how the memory allocation device 102 allocates the memory for the tensors 101 to be allocated will be described in detail below.
In the following description, the application scenario shown in fig. 1 is used, and the memory allocation device 102 in fig. 1 executing the static memory allocation method is taken as an example. Fig. 2 is a schematic flow chart of a static memory allocation method according to an embodiment of the present application.
S201, obtaining a plurality of tensors of the target model, and determining time information and space size of a memory occupied by each tensor.
The target model mainly comprises operators and a plurality of tensors containing the operators' intermediate calculation results; the target model may be a neural network model, a deep learning forward inference model, or the like. After obtaining the target model, the memory allocation device acquires its tensors in sequence according to the execution order of the operators, and can determine the time information and space size of the memory occupied by each tensor by analyzing the dependency relationships among the tensors. The time information includes the start time and end time of the memory occupied by each tensor, and the space size occupied by each tensor is also called its memory size.
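As an illustration of the information collected in S201, the following minimal Python sketch (all names here are hypothetical, not taken from the patent) shows a record holding each tensor's lifetime and space size, which the later steps operate on; lifetimes are treated as half-open intervals [start, end).

```python
from dataclasses import dataclass

@dataclass
class TensorInfo:
    """Hypothetical record for one tensor's memory requirement."""
    name: str
    start: int            # time step at which the tensor's memory becomes occupied
    end: int              # time step at which the memory is released ([start, end))
    size: int             # space size (memory size) occupied by the tensor
    head_offset: int = -1 # head address offset, assigned in S202/S203

    @property
    def tail_offset(self) -> int:
        # tail address offset = head address offset + space size
        return self.head_offset + self.size
```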
S202, sorting the tensors according to a preset index, and allocating head address offsets to the tensors according to a preset rule.
In one possible embodiment, the preset index includes at least one of a time length of the memory occupied by the tensor, a space size of the memory occupied by the tensor, and a time overlapping length between the tensors. The time length of the memory occupied by each tensor is the difference value between the starting time and the ending time of the memory occupied by each tensor, and the time overlapping length between the tensors is the time length of any two tensors simultaneously occupying the memory.
For example, if the time information of the memory occupied by tensor A is 8:00-8:10 and the time information of the memory occupied by tensor B is 8:05-8:15, then the length of time each of tensor A and tensor B occupies memory is 10 minutes, and since both tensor A and tensor B occupy memory during the period 8:05-8:10, the time overlap length between tensor A and tensor B is 5 minutes.
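The two quantities in this example can be computed as below; this is a sketch under the assumption of half-open lifetimes, reusing the hypothetical TensorInfo record above.

```python
def occupied_length(t: TensorInfo) -> int:
    # length of time the tensor occupies memory (10 minutes for A and B above)
    return t.end - t.start

def time_overlap(a: TensorInfo, b: TensorInfo) -> int:
    # length of time both tensors occupy memory; 0 if their lifetimes are disjoint
    return max(0, min(a.end, b.end) - max(a.start, b.start))

# Using minutes since 8:00 for the example: A is 8:00-8:10, B is 8:05-8:15.
a = TensorInfo("A", 0, 10, 100)
b = TensorInfo("B", 5, 15, 200)
assert time_overlap(a, b) == 5  # overlap during 8:05-8:10
```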
After acquiring the plurality of tensors, the memory allocation device may sort the tensors according to a preset index, in ascending or descending order, and allocate head address offsets to the sorted tensors in sequence according to a preset rule. The head address offset represents the offset between the head address of each tensor and the head address of the memory pool, and the preset rule includes that the memory addresses of tensors whose time information overlaps do not overlap, and the time information of tensors whose memory addresses overlap does not overlap. The memory address of each tensor is determined according to its head address offset and space size.
For example, suppose the ordering of two tensors is: tensor A, tensor B, and both need to occupy memory in the period from 8:05 to 8:10. A head address offset is first allocated to tensor A and then to tensor B, and the head address offset of tensor B is greater than or equal to the tail address offset of tensor A, so that the memory addresses of tensor A and tensor B do not overlap.
In one possible embodiment, in order to save memory as much as possible, the preset rule may further include that the tail address offset of a first tensor is equal to the head address offset of a second tensor, where the first tensor is ordered before the second tensor, and the second tensor is the tensor whose time information overlaps that of the first tensor and whose position in the ordering is closest to the first tensor. The tail address offset of each tensor represents the offset between the tail address of that tensor and the head address of the memory pool, and equals the sum of the tensor's head address offset and space size.
For example, suppose the ordering of four tensors is: tensor A, tensor B, tensor C, tensor D, and the time information of tensor A overlaps with that of tensor C and tensor D. If tensor A is the first tensor, then tensor C is the second tensor; if tensor C is the first tensor, then tensor D is the second tensor. Thus the tail address offset of tensor A is equal to the head address offset of tensor C, and the tail address offset of tensor C is equal to the head address offset of tensor D.
And S203, if the tail address offset of any tensor is larger than a theoretical threshold, adjusting the sequence of the tensors, and allocating the head address offset for the tensors again according to a preset rule.
Considering that the initial memory allocation result of the plurality of tensors is usually not the optimal allocation result, the memory allocation device may determine whether to reallocate the head address offset according to whether the tail address offset of any tensor in the plurality of tensors is greater than a theoretical threshold. For the meaning and calculation of the tail address offset, please refer to the contents discussed above, and the details are not repeated herein. The theoretical threshold is the maximum value of the sum of the space size of the memory occupied by the corresponding tensor at each moment.
Specifically, after allocating the head address offset to the plurality of tensors, the memory allocation device determines the tail address offset of each tensor according to the sum of the head address offset and the space size of each tensor. The memory allocation device may calculate a sum of the space sizes of the memory occupied by the corresponding tensor at each time to obtain a plurality of sums, and a maximum value of the sums is taken as a theoretical threshold. Further, the memory allocation device may detect whether the tail address offset of each tensor is greater than a theoretical threshold, adjust the ordering of the plurality of tensors if the tail address offset of any tensor is greater than the theoretical threshold, and sequentially allocate the head address offset to the plurality of tensors after the adjustment of the ordering again according to a preset rule.
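The theoretical threshold described above is the peak of live memory over time; a minimal sketch, assuming the hypothetical TensorInfo records and half-open lifetimes:

```python
def theoretical_threshold(tensors: list[TensorInfo]) -> int:
    # At each moment, sum the sizes of the tensors whose memory is occupied;
    # the threshold is the maximum of these sums. With half-open lifetimes
    # the sum can only increase at a start time, so those suffice as probes.
    peak = 0
    for m in {t.start for t in tensors}:
        live = sum(t.size for t in tensors if t.start <= m < t.end)
        peak = max(peak, live)
    return peak
```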
And S204, determining the memory address of each tensor according to the head address offset and the tail address offset of each tensor until the plurality of tensors meet any one of a plurality of tuning targets.
After adjusting the ordering of the tensors and reallocating head address offsets according to the preset rule, the memory allocation device determines whether to continue adjusting the ordering according to whether the tensors meet a tuning target.
Specifically, the tuning targets include: the tail address offset of every tensor is smaller than or equal to the theoretical threshold; the same sorting result occurs again (also called reaching a fixed point of the tuning); and the number of adjustments is greater than or equal to a preset number of times. If the plurality of tensors do not meet any of the tuning targets, the ordering of the tensors continues to be adjusted and head address offsets are reallocated according to the preset rule. Once the tensors meet any one of the tuning targets, the memory allocation device may determine the head address of each tensor according to the head address of the memory pool and the tensor's head address offset, determine the tail address of each tensor according to the head address of the memory pool and the tensor's tail address offset, and determine the memory address of each tensor according to its head address and tail address.
For example, consider 3 tensors, tensor X, tensor Y and tensor W, with the preset number of times being 4. The first sorting result is X, Y, W; after head address offsets are allocated to X, Y, W in turn, the tail address offset of some tensor is greater than the theoretical threshold, so the ordering is adjusted. The second sorting result is X, W, Y; after head address offsets are allocated to X, W, Y in turn, the tail address offset of some tensor is still greater than the theoretical threshold; since the second result differs from the first and the preset number of times has not been reached, the ordering continues to be adjusted. The third sorting result is X, Y, W; the tail address offset of some tensor is still greater than the theoretical threshold and the preset number of times has not been reached, but the third result is the same as the first, so the adjustment can stop.
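Putting S202-S204 together, the tuning loop can be sketched as follows; sort_by_preset_index, assign_head_offsets and adjust_ordering are hypothetical placeholders for the sorting, allocation and reordering steps described above.

```python
def tune(tensors: list[TensorInfo], preset_times: int) -> None:
    # Stop on any tuning target: every tail offset within the threshold,
    # a repeated sorting result (a fixed point), or the round budget spent.
    threshold = theoretical_threshold(tensors)
    seen = set()
    order = sort_by_preset_index(tensors)          # hypothetical helper
    for _ in range(preset_times):
        assign_head_offsets(order)                 # hypothetical helper
        if all(t.tail_offset <= threshold for t in order):
            break
        key = tuple(t.name for t in order)
        if key in seen:                            # same sorting result recurs
            break
        seen.add(key)
        order = adjust_ordering(order)             # hypothetical helper
```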
It should be noted that, after the memory allocation device executes S202, if no tensor's tail address offset is greater than the theoretical threshold, the initial allocation result of the plurality of tensors is already the optimal allocation result; S203 and S204 need not be executed, and the memory address of each tensor can be determined directly according to its head address offset and tail address offset. For how to determine the memory address, refer to the contents discussed above, which are not repeated here.
To explain the memory allocation method more intuitively, in a possible embodiment the memory allocation device may represent the memory occupied by each tensor geometrically: each memory is equivalent to a rectangle, the memory pool is equivalent to a container, the process of allocating memory to the tensors is equivalent to placing the rectangles in the container, and the memory address of each tensor is determined according to the final position of its rectangle in the container. The specific steps are as follows.
S1.1, generating a plurality of memory rectangles corresponding to the memory operations, sequencing the memory rectangles according to preset indexes, and arranging the memory rectangles in memory containers corresponding to the memory pool according to preset rules.
The memory operations represent operations of which the tensors occupy the memory, the length of each memory rectangle represents the time length of each memory operation, the width of each memory rectangle represents the space size of each memory operation, and the length of the memory container represents the total time length of the memory operations.
Specifically, the memory allocation device analyzes the start time and the end time of each memory operation according to the time information of the memory occupied by each tensor, and uses the time length between the start time and the end time as the length of each memory rectangle and the space size of the memory occupied by each tensor as the width of each memory rectangle, thereby generating a plurality of memory rectangles corresponding to a plurality of memory operations.
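Continuing the sketch, converting each memory operation into a memory rectangle might look like this; the MemRect fields mirror the geometric reading above (length = time, width = space), and the names are again hypothetical.

```python
@dataclass
class MemRect:
    name: str
    x: int        # start time of the memory operation
    length: int   # time length of the operation (horizontal extent)
    width: int    # space size of the operation (vertical extent)
    y: int = -1   # minimum distance to the container bottom = head offset

def to_rects(tensors: list[TensorInfo]) -> list[MemRect]:
    return [MemRect(t.name, t.start, t.end - t.start, t.size) for t in tensors]
```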
In one possible embodiment, a serialization model may provide the memory operation sequence. The serialization model is a model stored in array form, in which the arrangement order of the nodes is their execution order. The memory operation sequence includes a plurality of memory operations executed in sequence, and each memory operation may carry several parameters, such as a memory name, a memory operation type, a memory size and an address alignment size. There are two memory operation types, namely applying for memory and releasing memory, and the address alignment size is the alignment required of the head address offset.
After the memory allocation device obtains the plurality of memory rectangles, the plurality of memory rectangles may be sorted according to a preset index, and the sorted plurality of memory rectangles are sequentially arranged in the memory container corresponding to the memory pool according to a preset rule.
The head address of the memory pool corresponds to the bottom of the memory container, and the minimum distance between each memory rectangle and the bottom is the head address offset of the corresponding tensor. The preset rule includes that no two of the memory rectangles may overlap, and may further include a bottom alignment rule: for example, establishing a plane rectangular coordinate system for the memory container in which the horizontal axis represents time and the vertical axis represents memory offset, bottom alignment means that each memory rectangle is placed as close as possible toward smaller memory offsets.
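A minimal sketch of the bottom-aligned placement rule: each rectangle, taken in the current order, is pushed toward smaller memory offsets until it no longer collides with any already placed, time-overlapping rectangle. This is one straightforward reading of the rule, not necessarily the patent's exact procedure.

```python
def place_bottom_aligned(rects: list[MemRect]) -> None:
    placed: list[MemRect] = []
    for r in rects:
        y = 0
        lifted = True
        while lifted:           # re-check after every lift until r fits at y
            lifted = False
            for p in placed:
                overlap_t = r.x < p.x + p.length and p.x < r.x + r.length
                overlap_s = y < p.y + p.width and p.y < y + r.width
                if overlap_t and overlap_s:
                    y = p.y + p.width   # lift r just above p
                    lifted = True
        r.y = y                 # minimum feasible distance to the bottom
        placed.append(r)
```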
Regarding the meaning of the preset index, in a possible embodiment the preset index includes at least one of the length of the memory rectangle, the width of the memory rectangle, and the time overlap length between memory rectangles. This is described in the following cases.
In the first case, the preset index includes any one of the length of the memory rectangle, the width of the memory rectangle, and the time overlap length between memory rectangles.
In case 1, the preset index is the length of the memory rectangle or the width of the memory rectangle.
The memory allocation device sorts the plurality of memory rectangles by length or by width. When at least two memory rectangles have the same length or width, the order of those memory rectangles may be arbitrary. For example, if memory rectangle A and memory rectangle B have the same width, they may be arranged in the order A, B or in the order B, A.
In case 2, the preset index is the time overlap length between memory rectangles.
The memory allocation device sorts the memory rectangles according to the sum of the time overlap lengths between each memory rectangle and all the other memory rectangles; if two memory rectangles have no time overlap, the time overlap length between them is 0. When at least two memory rectangles have the same sum of time overlap lengths, their relative order may be arbitrary.
For example, consider 3 memory rectangles: the time overlap length of rectangle A and rectangle B is X1, that of rectangle A and rectangle C is X2, and that of rectangle B and rectangle C is X3. The sum of the time overlap lengths between rectangle A and the other rectangles is X1+X2, that for rectangle B is X1+X3, and that for rectangle C is X2+X3. If X1+X2 > X1+X3 > X2+X3, the order from largest to smallest is A, B, C. If X1+X2 > X1+X3 = X2+X3, the order from largest to smallest may be A, B, C or A, C, B.
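Case 2 (sorting by total time overlap) can be sketched as follows, reusing the MemRect fields; descending order is assumed, matching the "from largest to smallest" example.

```python
def sort_by_overlap_sum(rects: list[MemRect]) -> list[MemRect]:
    def overlap(a: MemRect, b: MemRect) -> int:
        return max(0, min(a.x + a.length, b.x + b.length) - max(a.x, b.x))

    def overlap_sum(r: MemRect) -> int:
        # sum of time overlap lengths between r and every other rectangle
        return sum(overlap(r, o) for o in rects if o is not r)

    return sorted(rects, key=overlap_sum, reverse=True)
```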
In the second case, the preset index includes any two of the length of the memory rectangle, the width of the memory rectangle and the time overlap length between memory rectangles.
Specifically, denote the two indexes as a first index and a second index. The memory allocation device may sort by the sum, product or ratio of the first index and the second index.
For example, if the ratio of the length to the width of memory rectangle A is greater than that of memory rectangle B, the order from largest to smallest is: A, B.
Alternatively, the memory allocation device may sort the memory rectangles according to the first index and, when at least two memory rectangles have the same first index, sort those rectangles according to the second index, where the priority of the first index is higher than that of the second index.
For example, suppose the width has higher priority than the length, and consider 3 memory rectangles: the width of rectangle A is greater than that of rectangle B, the widths of rectangle B and rectangle C are equal, and the length of rectangle C is greater than that of rectangle B; the order from largest to smallest is: A, C, B.
In the third case, the preset index includes all three of the length of the memory rectangle, the width of the memory rectangle and the time overlap length between memory rectangles.
Specifically, denote the three indexes as a first index, a second index and a third index. The memory allocation device may sort by the sum or product of the first, second and third indexes.
Alternatively, the memory allocation device may sort the memory rectangles according to the first index; when several first rectangles have the same first index, those first rectangles are sorted according to the second index; and when several second rectangles among them have the same second index, those second rectangles are sorted according to the third index. The priority of the first index is higher than that of the second index, and the priority of the second index is higher than that of the third index.
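For the second and third cases, a composite sort key expresses the priority ordering directly; the sketch below assumes descending order on all three indexes (length first, then width, then total time overlap), with overlap_sum as in the previous sketch.

```python
def sort_by_priority(rects: list[MemRect], overlap_sum) -> list[MemRect]:
    # Python compares tuples element by element, which realizes
    # "first index, then second index, then third index".
    return sorted(rects,
                  key=lambda r: (r.length, r.width, overlap_sum(r)),
                  reverse=True)
```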
And S1.2, if the width of the target rectangle exceeds a theoretical threshold value, adjusting the sequence of the memory rectangles, and rearranging the memory rectangles in the memory container according to a preset rule.
Considering that the initial placement of the rectangles is usually not the optimal placement, for example the tops of some memory rectangles may exceed the optimal height corresponding to the theoretical threshold, the memory allocation device may determine whether to adjust the positions of the memory rectangles in the memory container based on whether the width of the target rectangle exceeds the theoretical threshold. The target rectangle is the minimum rectangle containing all the arranged memory rectangles, and its width represents the maximum of the tail address offsets of the plurality of tensors.
Referring to fig. 3, which is a schematic diagram of a target rectangle provided in the present embodiment, there are 9 rectangles corresponding to solid line boxes in total, and a rectangle corresponding to a dotted line box is the target rectangle.
Specifically, after the memory allocation device has placed the plurality of memory rectangles, a target rectangle may be determined, the width of the target rectangle is compared with a theoretical threshold, if the width of the target rectangle exceeds the theoretical threshold, the memory allocation device adjusts the ordering of the plurality of memory rectangles, and rearranges the plurality of memory rectangles after the ordering is adjusted in the memory container in sequence according to a preset rule.
To accelerate tuning, the present application provides a peak suppression tuning method, which focuses the tuning processing on the high-risk memory and aims to reduce the height of the peak memory rectangles in the memory container.
Specifically, the memory allocation device may classify the plurality of memory rectangles into high-risk memory rectangles, medium-risk memory rectangles, low-risk memory rectangles and risk-free memory rectangles, adjust the ordering of the high-risk memory rectangles and of the medium-risk memory rectangles respectively, and merge the adjusted high-risk memory rectangles, the adjusted medium-risk memory rectangles, the low-risk memory rectangles and the risk-free memory rectangles in sequence.
The high-risk memory rectangles include each first memory rectangle corresponding to a first tensor whose tail address offset is greater than the theoretical threshold, together with the second memory rectangles overlapping the first memory rectangle in time. The medium-risk memory rectangles include each third memory rectangle corresponding to a second tensor whose tail address offset is equal to the theoretical threshold, together with the fourth memory rectangles overlapping the third memory rectangle in time. The low-risk memory rectangles include the remaining memory rectangles that overlap the high-risk or medium-risk memory rectangles in time, and the risk-free memory rectangles are the memory rectangles other than the high-risk, medium-risk and low-risk memory rectangles.
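One way to realize this four-way classification is sketched below; threshold is the theoretical threshold, tail offsets are read as y + width, and time overlap is tested on the rectangles' horizontal extents. The set construction is mine, though it follows the definitions above.

```python
def classify(rects: list[MemRect], threshold: int):
    def overlaps_any(r: MemRect, group: list[MemRect]) -> bool:
        return any(o is not r and
                   r.x < o.x + o.length and o.x < r.x + r.length
                   for o in group)

    over = [r for r in rects if r.y + r.width > threshold]   # first rectangles
    at   = [r for r in rects if r.y + r.width == threshold]  # third rectangles

    high   = [r for r in rects if r in over or overlaps_any(r, over)]
    medium = [r for r in rects if r not in high
              and (r in at or overlaps_any(r, at))]
    low    = [r for r in rects if r not in high and r not in medium
              and (overlaps_any(r, high) or overlaps_any(r, medium))]
    none   = [r for r in rects
              if r not in high and r not in medium and r not in low]
    return high, medium, low, none
```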
The following description relates to how to adjust the ordering of the high risk memory rectangles and the ordering of the medium risk memory rectangles.
According to the time overlap length, the memory allocation device may divide the high-risk memory rectangles into a plurality of groups and divide the medium-risk memory rectangles into a plurality of groups; sort the memory rectangles of each group according to the preset index and arrange them in the memory container according to the preset rule; if the width of the target rectangle exceeds the theoretical threshold, adjust the ordering of the memory rectangles of that group and rearrange them in the memory container according to the preset rule; and export the memory rectangles of each group in turn once they meet any one of the plurality of tuning targets after adjustment and sorting, obtaining the adjusted high-risk memory rectangles and the adjusted medium-risk memory rectangles.
It should be noted that the results of the arrangement of each set of memory rectangles can be accumulated into a memory container. For example, when the memory container is empty, the first group of memory rectangles are arranged in the empty memory container and tuning is started until the tuning target is met, and the arrangement result of the first group of memory rectangles in the memory container is saved. And then arranging the second group of memory rectangles in a memory container containing the first group of memory rectangles, starting tuning until the tuning target is met, storing the arrangement results of the first group of memory rectangles and the second group of memory rectangles in the memory container, and repeating the third group and the fourth group.
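The grouped, accumulating tuning described in the last two paragraphs can be sketched as follows; cluster_by_overlap and tune_group are hypothetical placeholders for the clustering step and the per-group sort/place/adjust loop.

```python
def tune_in_groups(rects: list[MemRect], container: list[MemRect]) -> None:
    # Heavily time-overlapping rectangles are grouped together; each group
    # is tuned against the rectangles already fixed in the container, and
    # its final arrangement is accumulated before the next group runs.
    for group in cluster_by_overlap(rects):       # hypothetical helper
        tune_group(group, container)              # hypothetical helper
        container.extend(group)                   # arrangement now fixed
```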
The embodiment of the application starts from the bottleneck of the utilization rate of the memory, provides the peak suppression tuning method, preferentially processes the memory which directly influences the utilization rate, performs clustering grouping on the part of the memory, and gives consideration to algorithm complexity and algorithm effect.
And S1.3, determining the head address offset and the tail address offset of each tensor according to the position of each memory rectangle in the memory container until the plurality of memory rectangles meet any one of the plurality of tuning targets.
Specifically, the plurality of tuning targets comprise target rectangles with widths not exceeding a theoretical threshold, the same sorting result appears, and the number of tuning times is larger than or equal to a preset number of times. And if the plurality of memory rectangles do not meet any one of the plurality of tuning targets, continuously adjusting the sequence of the plurality of memory rectangles, and rearranging the memory rectangles in the memory container according to a preset rule. Until the plurality of memory rectangles meet any one of the plurality of tuning targets, the memory allocation device may determine the head address offset of each tensor according to the minimum distance between each memory rectangle and the bottom of the memory container, determine the tail address offset of each tensor according to the maximum distance between each memory rectangle and the bottom of the memory container, and further determine the memory address of each tensor according to the head address of each tensor and the tail address of each tensor.
In a possible embodiment, after the tensors satisfy any one of the tuning targets, the memory allocation device may further visualize the memory rectangles and the memory container, so that a user can view and check the memory allocation result.
In one possible embodiment, after determining the memory address of each tensor, the memory allocation device may detect whether there is an overlapping memory address, ensuring that the memory can be used normally.
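The final safety check (no two tensors may share addresses while their lifetimes overlap) is easy to state directly; a sketch using the hypothetical TensorInfo records:

```python
def check_no_conflicts(tensors: list[TensorInfo]) -> None:
    for i, a in enumerate(tensors):
        for b in tensors[i + 1:]:
            lifetimes_overlap = a.start < b.end and b.start < a.end
            addresses_overlap = (a.head_offset < b.tail_offset and
                                 b.head_offset < a.tail_offset)
            # A correct allocation never has both overlaps at once.
            assert not (lifetimes_overlap and addresses_overlap), \
                f"memory of {a.name} and {b.name} overlaps"
```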
Referring to fig. 4, a schematic structural diagram of a memory allocation apparatus according to an embodiment of the present application is provided. The memory allocation apparatus includes: a memory rectangle converter 401, a memory layout tuner 402 and a memory allocation debugger 403. The memory rectangle converter 401 is configured to convert each memory operation into a memory rectangle. The memory layout tuner 402 is configured to adjust the positions of the memory rectangles in the memory container so that the height of the memory container is as low as possible, and to output memory allocation information after the tuning is completed. The memory allocation debugger 403 is configured to check and visually analyze the memory allocation information. The memory allocation information includes the memory name, time information, memory size, head address offset, and the like.
The memory layout tuner 402 includes a memory rectangle sorter 404, a memory rectangle placer 405, a layout space tuner 406 and a memory allocation information exporter 407. The memory rectangle sorter 404 is configured to sort the plurality of memory rectangles. The memory rectangle placer 405 is configured to arrange the plurality of memory rectangles in the memory container in sequence. The layout space tuner 406 is configured to tune the spatial layout of the memory rectangles. The memory allocation information exporter 407 is configured to export the memory allocation information after the tuning is completed.
Referring to fig. 5, a second flowchart of a static memory allocation method according to an embodiment of the present application is provided, and the static memory allocation method according to the embodiment of the present application is described below with reference to fig. 4 and fig. 5.
S501, obtaining a memory operation sequence.
The memory rectangle converter 401 may receive a memory operation sequence provided by the serialization model, where the memory operation sequence includes a plurality of memory operations executed in sequence. For the meaning of the memory operation, please refer to the above discussion, and the description is omitted here.
S502, converting each memory operation into a memory rectangle.
After the memory rectangle converter 401 obtains the memory operation sequence, the time length and the space size of each memory operation are analyzed according to the memory operation sequence, and a memory rectangle is constructed according to the time length and the space size to obtain a memory rectangle list including a plurality of memory rectangles.
S503, sorting the memory rectangles.
The memory rectangle sorter 404 receives the memory rectangle list output by the memory rectangle converter 401, sorts a plurality of memory rectangles in the memory rectangle list according to a preset index, and outputs the sorted plurality of memory rectangles. For the meaning of the predetermined index, please refer to the above discussion, and the details are not repeated herein.
S504, sequentially placing the sorted memory rectangles in a memory container.
The memory rectangle placer 405 receives the sorted memory rectangle list output by the memory rectangle sorter 404, and places a plurality of memory rectangles in the memory container in sequence according to a preset rule and a preset arrangement order. For the sorting method, the predetermined rule and the meaning of the memory container, please refer to the above discussion, which is not repeated herein.
And S505, whether the tuning target is met or not.
Determine whether the placement result of the memory rectangle placer 405 meets the tuning target. If yes, execute S507; if not, execute S506. For the meaning of the tuning target, please refer to the content discussed above, and the detailed description is omitted here.
S506, optimizing the spatial layout of the memory rectangles.
When the tuning target is not reached, the layout space optimizer 406 receives the sorted memory rectangle list output by the memory rectangle sorter 404 and tunes the spatial layout of the plurality of memory rectangles, adjusting the unreasonable parts of the memory layout so that the result moves closer to the tuning target; S505 is then executed again until the tuning target is met.
S507, checking and visually displaying the memory allocation information.
When the tuning target is reached, the memory allocation debugger 403 may receive the memory allocation information output by the memory allocation information exporter 407, check whether any memory addresses overlap, and visualize the memory rectangles and the memory container. For how the memory address is determined, please refer to the above discussion, and the details are not repeated herein.
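The overlap check itself reduces to a pairwise test: any two rectangles that overlap in time must occupy disjoint address ranges [offset, offset + width). A sketch over the assumed MemRect records:

    def check_memory_layout(placed):
        # Debugger-style check: time overlap must imply disjoint addresses.
        for i, a in enumerate(placed):
            for b in placed[i + 1:]:
                if (overlaps_in_time(a, b)
                        and a.offset < b.offset + b.width
                        and b.offset < a.offset + a.width):
                    raise AssertionError(
                        "memory addresses overlap: %s / %s" % (a.name, b.name))
        return True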
Referring to fig. 6, a schematic structural diagram of a layout space optimizer is provided in the present embodiment, where the layout space optimizer 406 includes a memory rectangle classifier 601, a medium-high-risk memory tuner 602, and a memory rectangle merger 603. The specific workflow of the layout space optimizer 406 is described below in conjunction with these modules.
The memory rectangle classifier 601 receives the sorted memory rectangle list output by the memory rectangle sorter 404, and divides the plurality of memory rectangles in the sorted list into four classes, namely high-risk memory rectangles, medium-risk memory rectangles, low-risk memory rectangles and risk-free memory rectangles. For the meanings of these four classes, reference is made to the content discussed above, and details are not repeated here.
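Using the class definitions given later in this document (tail address offset above or equal to the theoretical threshold, plus time overlap with such rectangles), a sketch of the classifier, reusing the helpers above:

    def classify(placed, threshold):
        # Each rectangle takes the highest risk class it qualifies for; a
        # rectangle always overlaps itself in time, so the "corresponding"
        # rectangles are caught by the same overlap test.
        over = [r for r in placed if r.offset + r.width > threshold]
        at = [r for r in placed if r.offset + r.width == threshold]
        high = [r for r in placed if any(overlaps_in_time(r, o) for o in over)]
        seen = {id(r) for r in high}
        medium = [r for r in placed if id(r) not in seen
                  and any(overlaps_in_time(r, o) for o in at)]
        seen |= {id(r) for r in medium}
        low = [r for r in placed if id(r) not in seen
               and any(overlaps_in_time(r, o) for o in high + medium)]
        seen |= {id(r) for r in low}
        no_risk = [r for r in placed if id(r) not in seen]
        return high, medium, low, no_risk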
The medium-high-risk memory tuner 602 receives the high-risk memory rectangles and the medium-risk memory rectangles output by the memory rectangle classifier 601, and adjusts the ordering of the high-risk memory rectangles and the ordering of the medium-risk memory rectangles respectively.
The memory rectangle merger 603 receives the adjusted high-risk memory rectangles and the adjusted medium-risk memory rectangles output by the medium-high-risk memory tuner 602, together with the low-risk memory rectangles and the risk-free memory rectangles output by the memory rectangle classifier 601, and then combines them in descending order of risk to obtain an updated memory rectangle list.
In one possible embodiment, the medium-high-risk memory tuner 602 further comprises a memory rectangle reorderer for reordering the memory rectangles. Referring to fig. 7, an embodiment of the present application provides a flowchart of the medium-high-risk memory tuner, and the detailed flow of the medium-high-risk memory tuner 602 is described below with reference to fig. 4 and fig. 7.
S701, dividing the high-risk and medium-risk memory rectangles into a plurality of groups.
The medium-high-risk memory tuner 602 divides the high-risk memory rectangles into a plurality of groups and divides the medium-risk memory rectangles into a plurality of groups.
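The grouping criterion is described only as being based on the time-overlap length; one plausible reading, sketched below as an assumption, is to group rectangles whose lifetimes form a connected chain on the time axis:

    def group_by_time_overlap(rects):
        # Sweep in order of start time; a rectangle joins the current group
        # if its lifetime touches the group's combined lifetime, and starts
        # a new group otherwise.
        groups, group_end = [], None
        for r in sorted(rects, key=lambda x: x.t0):
            if groups and r.t0 <= group_end:
                groups[-1].append(r)
                group_end = max(group_end, r.t1)
            else:
                groups.append([r])
                group_end = r.t1
        return groups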
S702, sorting within each group.
The medium-high-risk memory tuner 602 sorts the memory rectangles of each group through the memory rectangle sorter 404.
S703, sequentially placing the memory rectangles of each group in the memory container.
The medium-high-risk memory tuner 602 places the memory rectangles of each group in the memory container in sequence according to the preset rule through the memory rectangle placer 405.
S704, determining whether the tuning target is met.
The medium-high-risk memory tuner 602 determines whether the placement result of the memory rectangles of each group meets the tuning target; if not, S705 is executed, and if so, S706 is executed.
S705, rearranging the memory rectangles of each group.
The medium-high-risk memory tuner 602 rearranges the memory rectangles of each group through the memory rectangle reorderer to obtain a potentially better ordering.
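The reordering strategy is not pinned down in the text; a random swap of two positions, sketched below, is merely one assumed perturbation that can escape a poor ordering without discarding a mostly good one:

    import random

    def reorder(rects, rng=None):
        # Swap two random positions; a small perturbation explores nearby
        # orderings without discarding a mostly good layout.
        rng = rng or random.Random(0)
        out = list(rects)
        if len(out) > 1:
            i, j = rng.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
        return out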
S706, accumulating each group's tuning result into the memory container.
The medium-high-risk memory tuner 602 sequentially accumulates each group's tuning result into the memory container. For the meaning of accumulation, reference is made to the above discussion, and the description is omitted here.
S707, exporting the medium-high-risk memory rectangle list.
When all groups have been processed, the medium-high-risk memory tuner 602 exports the list of medium-risk and high-risk memory rectangles from the memory container.
To sum up, the static memory allocation problem is converted into the two-dimensional rectangular strip packing problem and solved by approximating this well-studied mathematical problem: each memory block to be allocated is treated as a rectangle and the memory pool as a box container, the rectangles are placed in the container, and through adjustments such as ordering, grouping strategies and the setting of tuning targets, the maximum height of the container is made as low as possible, so that a better memory allocation result is found and the utilization of memory resources is improved.
Based on the same inventive concept, the present application further provides a static memory allocation apparatus, which may be disposed in the memory allocation device discussed above. Referring to fig. 8, the apparatus includes:
an obtaining module 801, configured to obtain multiple tensors of a target model, and determine time information and a space size of a memory occupied by each tensor;
the allocating module 802 is configured to sort the plurality of tensors according to a preset index, and allocate head address offsets to the plurality of tensors according to a preset rule; the head address offset represents the offset between the head address of each tensor and the head address of the memory pool, the preset rule includes that the memory addresses of tensors whose time information overlaps do not overlap and that the time information of tensors whose memory addresses overlap does not overlap, and the memory address of each tensor is determined according to its head address offset and space size;
the allocating module 802 is further configured to, if the tail address offset of any tensor is greater than a theoretical threshold, adjust the ordering of the plurality of tensors and reallocate head address offsets to the plurality of tensors according to the preset rule; the tail address offset of each tensor is the sum of its head address offset and space size, and the theoretical threshold is the maximum, over all moments, of the total space size of the memory occupied by the tensors alive at that moment (a computational sketch of this threshold is given after the module list below);
a determining module 803, configured to determine the memory address of each tensor according to the head address offset and the tail address offset of each tensor until the plurality of tensors satisfy any one of the tuning targets.
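For concreteness, the theoretical threshold defined above (the maximum, over all moments, of the total space occupied by live tensors) can be computed as in the following sketch over the assumed MemRect records:

    def theoretical_threshold(rects):
        # At each time step, sum the widths of all rectangles alive at that
        # step; the maximum over time lower-bounds the container height.
        if not rects:
            return 0
        t_end = max(r.t1 for r in rects)
        return max(sum(r.width for r in rects if r.t0 <= t <= r.t1)
                   for t in range(t_end + 1))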
In a possible embodiment, the allocation module 802 is specifically configured to:
generating a plurality of memory rectangles corresponding to the plurality of memory operations; the memory operations represent the operations of the tensors occupying the memory, the length of each memory rectangle represents the time length of each memory operation, and the width of each memory rectangle represents the space size of each memory operation;
sequencing the memory rectangles according to a preset index, and arranging the memory rectangles in a memory container corresponding to the memory pool according to a preset rule; the head address of the memory pool corresponds to the bottom of the memory container, and the minimum distance between each memory rectangle and the bottom is the head address offset of the corresponding tensor.
In a possible embodiment, the allocation module 802 is specifically configured to:
if the width of the target rectangle exceeds the theoretical threshold, adjusting the ordering of the memory rectangles and rearranging the memory rectangles in the memory container according to the preset rule; the target rectangle is the smallest rectangle enclosing the plurality of arranged memory rectangles, and its width represents the maximum tail address offset among the plurality of tensors.
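Under the same assumed records, the width of the target rectangle reduces to the peak tail address offset:

    def target_rect_width(placed):
        # The smallest enclosing rectangle is as wide as the largest tail
        # address offset (head address offset + space size).
        return max((r.offset + r.width for r in placed), default=0)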
In a possible embodiment, the determining module 803 is further configured to:
after the plurality of tensors satisfy any one of the plurality of tuning targets, determining a head address offset and a tail address offset of each tensor according to the position of each memory rectangle in the memory container.
In a possible embodiment, the allocation module 802 is specifically configured to:
classifying the memory rectangles to obtain high-risk memory rectangles, medium-risk memory rectangles, low-risk memory rectangles and risk-free memory rectangles; the high-risk memory rectangles comprise first memory rectangles corresponding to first tensors whose tail address offsets are greater than the theoretical threshold and second memory rectangles overlapping the first memory rectangles in time, the medium-risk memory rectangles comprise third memory rectangles corresponding to second tensors whose tail address offsets are equal to the theoretical threshold and fourth memory rectangles overlapping the third memory rectangles in time, the low-risk memory rectangles comprise memory rectangles overlapping the high-risk memory rectangles and the medium-risk memory rectangles in time, and the risk-free memory rectangles comprise the memory rectangles of the plurality of memory rectangles other than the high-risk, medium-risk and low-risk memory rectangles;
respectively adjusting the ordering of the high-risk memory rectangles and the ordering of the medium-risk memory rectangles;
and combining the adjusted high-risk memory rectangles, the adjusted medium-risk memory rectangles, the low-risk memory rectangles and the risk-free memory rectangles in sequence.
In a possible embodiment, the allocation module 802 is specifically configured to:
dividing the high-risk memory rectangles into a plurality of groups according to the time-overlap length, and dividing the medium-risk memory rectangles into a plurality of groups;
sequencing the memory rectangles of each group according to a preset index, and arranging the memory rectangles in a memory container according to a preset rule;
if the width of the target rectangle exceeds the theoretical threshold, adjusting the sequencing of the memory rectangles of each group, and rearranging the memory rectangles in the memory container according to a preset rule;
and sequentially exporting the adjusted and sorted memory rectangles of each group until they meet any one of the tuning targets, obtaining the adjusted high-risk memory rectangles and the adjusted medium-risk memory rectangles.
In a possible embodiment, the preset index includes at least one of a length of the memory rectangle, a width of the memory rectangle, and a length of a time overlap between the memory rectangles.
In one possible embodiment, the plurality of tuning targets include that the width of the target rectangle does not exceed the theoretical threshold, that the same ordering result appears again, and that the number of tuning rounds is greater than or equal to a preset number of times.
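Stitching the earlier sketches together, a hedged outline of a tuning loop with these three termination conditions (the height reaches the theoretical threshold, an ordering repeats, or the round budget is exhausted):

    import random

    def tune(rects, max_rounds=100):
        # Stop when the container height reaches the theoretical threshold,
        # when an ordering repeats, or when the round budget is exhausted;
        # keep the best layout found along the way.
        threshold = theoretical_threshold(rects)
        rng = random.Random(0)
        order = sort_rects(rects)
        seen, best, best_height = set(), None, float("inf")
        for _ in range(max_rounds):
            placed, height = place(order)
            if height < best_height:
                best = [(r.name, r.offset) for r in placed]
                best_height = height
            key = tuple(id(r) for r in order)
            if height <= threshold or key in seen:
                break
            seen.add(key)
            order = reorder(order, rng)
        return best, best_height

Keeping the best layout seen so far means a late, worse perturbation can never degrade the final result, which matches the intent of bounding the tuning rounds.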
It should be noted that although several modules or sub-modules of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more of the modules described above may be embodied in one module; conversely, the features and functions of one module described above may be further divided into and embodied by a plurality of modules.
It should be noted that the apparatus in fig. 8 may also be used to implement any of the static memory allocation methods discussed above, and details thereof are not repeated here.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device, which is equivalent to the memory allocation device discussed above. Referring to fig. 9, the electronic device includes:
a memory 902 for storing program instructions;
the processor 901 is configured to call the program instructions stored in the memory 902, and execute any one of the static memory allocation methods described in fig. 2 and fig. 5 according to the obtained program instructions. The processor 901 may also implement the functions of the respective modules in the apparatus shown in fig. 8.
In the embodiment of the present application, the specific connection medium between the processor 901 and the memory 902 is not limited; in fig. 9, the processor 901 and the memory 902 are connected by a bus 900, which is shown as a thick line, and the connections between other components are merely illustrative and not limiting. The bus 900 may be divided into an address bus, a data bus, a control bus, and the like; it is drawn as a single thick line in fig. 9 for ease of illustration, but this does not mean that there is only one bus or one type of bus. Alternatively, the processor 901 may also be referred to as a controller; the name is not limited here.
The processor 901 is the control center of the device: it connects the various parts of the whole device through various interfaces and lines, and performs the various functions of the device and processes its data by running or executing the instructions stored in the memory 902 and invoking the data stored in the memory 902, thereby monitoring the device as a whole.
In one possible design, the processor 901 may include one or more processing units, and the processor 901 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 901. In some embodiments, the processor 901 and the memory 902 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.
The processor 901 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the static memory allocation method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
The memory 902, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 902 may include at least one type of storage medium, for example, a flash Memory, a hard disk, a multimedia card, a card-type Memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic Memory, a magnetic disk, an optical disk, and the like. The memory 902 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 902 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
By programming the processor 901, the code corresponding to the static memory allocation method described in the foregoing embodiments may be solidified into a chip, so that the chip can execute the steps of the static memory allocation method described in fig. 2 and fig. 5 when running. How to program the processor 901 is well known to those skilled in the art and is not described here.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the static memory allocation methods as discussed above. Because the principle of solving the problem of the computer-readable storage medium is similar to that of the static memory allocation method, the implementation of the computer-readable storage medium may refer to the implementation of the method, and the repeated parts are not described again.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (11)

1. A static memory allocation method, comprising:
acquiring a plurality of tensors of a target model, and determining time information and space size of a memory occupied by each tensor;
sequencing the tensors according to a preset index, and allocating head address offsets to the tensors according to a preset rule; the head address offset represents an offset between the head address of each tensor and the head address of the memory pool, the preset rule comprises that memory addresses of tensors with overlapped time information are not overlapped and that time information of tensors with overlapped memory addresses is not overlapped, and the memory address of each tensor is determined according to the head address offset and the space size of each tensor;
if the tail address offset of any tensor is larger than the theoretical threshold, adjusting the sequence of the tensors, and allocating the head address offset for the tensors again according to the preset rule; the tail address offset of each tensor is the sum of the head address offset and the space size of each tensor, and the theoretical threshold is the maximum value of the sum of the space sizes of the memory occupied by the corresponding tensor at each moment;
and determining the memory address of each tensor according to the head address offset and the tail address offset of each tensor until the tensors meet any one of a plurality of tuning targets.
2. The method of claim 1, wherein sequencing the tensors according to the preset index and allocating head address offsets to the tensors according to the preset rule comprises:
generating a plurality of memory rectangles corresponding to the memory operations; the memory operations represent operations of the tensors occupying memory, the length of each memory rectangle represents the time length of each memory operation, and the width of each memory rectangle represents the space size of each memory operation;
sequencing the memory rectangles according to the preset index, and arranging the memory rectangles in a memory container corresponding to the memory pool according to the preset rule; the head address of the memory pool corresponds to the bottom of the memory container, and the minimum distance between each memory rectangle and the bottom is the head address offset of the corresponding tensor.
3. The method of claim 2, wherein if there is a tail address offset of any tensor greater than a theoretical threshold, adjusting the ordering of the plurality of tensors, and reassigning a head address offset to the plurality of tensors according to the predetermined rule, comprises:
if the width of the target rectangle exceeds the theoretical threshold, adjusting the sequence of the memory rectangles, and rearranging the memory rectangles in the memory container according to the preset rule; the target rectangle is the smallest rectangle enclosing the plurality of arranged memory rectangles, and the width of the target rectangle represents the maximum value of the tail address offsets of the tensors.
4. The method of claim 3, wherein after the plurality of tensors satisfy any one of the plurality of tuning targets, the method further comprises:
and determining the head address offset and the tail address offset of each tensor according to the position of each memory rectangle in the memory container.
5. The method of claim 3, wherein adjusting the ordering of the plurality of memory rectangles comprises:
classifying the memory rectangles to obtain a high-risk memory rectangle, a medium-risk memory rectangle, a low-risk memory rectangle and a no-risk memory rectangle; wherein the high-risk memory rectangles include a first memory rectangle corresponding to a first tensor whose tail address offset is greater than the theoretical threshold and a second memory rectangle overlapping with the first memory rectangle in time, the medium-risk memory rectangle includes a third memory rectangle corresponding to a second tensor whose tail address offset is equal to the theoretical threshold and a fourth memory rectangle overlapping with the third memory rectangle in time, the low-risk memory rectangle includes memory rectangles overlapping with the high-risk memory rectangle and the medium-risk memory rectangle in time, and the risk-free memory rectangle includes memory rectangles of the plurality of memory rectangles excluding the high-risk memory rectangle, the medium-risk memory rectangle and the low-risk memory rectangle;
respectively adjusting the sequence of the high-risk memory rectangles and the sequence of the medium-risk memory rectangles;
and sequentially combining the adjusted high-risk memory rectangles, the adjusted medium-risk memory rectangles, the low-risk memory rectangles and the risk-free memory rectangles.
6. The method of claim 5, wherein adjusting the ordering of the high-risk memory rectangles and the ordering of the medium-risk memory rectangles separately comprises:
dividing the high-risk memory rectangles into a plurality of groups according to the time-overlap length, and dividing the medium-risk memory rectangles into a plurality of groups;
sequencing the memory rectangles of each group according to the preset index, and arranging the memory rectangles in the memory container according to the preset rule;
if the width of the target rectangle exceeds the theoretical threshold, adjusting the sequencing of the memory rectangles of each group, and rearranging the memory rectangles in the memory container according to the preset rule;
and sequentially exporting the adjusted and sorted memory rectangles of each group until the memory rectangles of each group meet any one of the plurality of tuning targets, obtaining adjusted high-risk memory rectangles and adjusted medium-risk memory rectangles.
7. The method according to any one of claims 2-6, wherein the preset index includes at least one of the length of the memory rectangles, the width of the memory rectangles, and the time-overlap length between the memory rectangles.
8. The method of any one of claims 1-6, wherein the plurality of tuning targets include: no tensor has a tail address offset greater than the theoretical threshold; the same ordering result appears again; and the number of tuning rounds is greater than or equal to a preset number of times.
9. A static memory allocation apparatus, comprising:
the acquisition module is used for acquiring a plurality of tensors of the target model and determining the time information and the space size of a memory occupied by each tensor;
the allocation module is used for sequencing the tensors according to a preset index and allocating head address offsets to the tensors according to a preset rule; the head address offset represents an offset between the head address of each tensor and the head address of the memory pool, the preset rule comprises that memory addresses of tensors with overlapped time information are not overlapped and that time information of tensors with overlapped memory addresses is not overlapped, and the memory address of each tensor is determined according to the head address offset and the space size of each tensor;
the allocation module is further configured to adjust the ordering of the tensors and allocate head address offsets to the tensors again according to the preset rule if the tail address offset of any tensor is greater than a theoretical threshold; the tail address offset of each tensor is the sum of the head address offset and the space size of each tensor, and the theoretical threshold is the maximum value of the sum of the space sizes of the memory occupied by the corresponding tensor at each moment;
and the determining module is used for determining the memory address of each tensor according to the head address offset and the tail address offset of each tensor until the plurality of tensors meet any one of a plurality of tuning targets.
10. An electronic device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory and for executing the method of any one of claims 1 to 8 in accordance with the obtained program instructions.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1-8.
CN202210835894.0A 2022-07-15 2022-07-15 Static memory allocation method, device, equipment and medium Pending CN115269179A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210835894.0A CN115269179A (en) 2022-07-15 2022-07-15 Static memory allocation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN115269179A true CN115269179A (en) 2022-11-01

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093509A (en) * 2023-10-18 2023-11-21 上海为旌科技有限公司 On-chip memory address allocation method and system based on greedy algorithm
CN117093509B (en) * 2023-10-18 2024-01-26 上海为旌科技有限公司 On-chip memory address allocation method and system based on greedy algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination