CN116700996A - Memory allocation method, device, equipment and medium of neural network - Google Patents

Memory allocation method, device, equipment and medium of neural network

Info

Publication number
CN116700996A
CN116700996A
Authority
CN
China
Prior art keywords
node
memory
calculation
current node
computing
Prior art date
Legal status
Granted
Application number
CN202310973717.3A
Other languages
Chinese (zh)
Other versions
CN116700996B (en)
Inventor
刘宝琦
解易
张亚林
Current Assignee
Beijing Suiyuan Intelligent Technology Co ltd
Original Assignee
Beijing Suiyuan Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Suiyuan Intelligent Technology Co ltd filed Critical Beijing Suiyuan Intelligent Technology Co ltd
Priority to CN202310973717.3A priority Critical patent/CN116700996B/en
Publication of CN116700996A publication Critical patent/CN116700996A/en
Application granted granted Critical
Publication of CN116700996B publication Critical patent/CN116700996B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022Mechanisms to release resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a memory allocation method, device, equipment and medium of a neural network, comprising the following steps: sequentially acquiring one node from a plurality of computing nodes corresponding to a neural network computation graph as the current node; applying for a computation memory in the device memory through the computing core matched with the current node, and releasing the computation memory after detecting that the current node has finished computing; taking the memory release space corresponding to the current node as a sub memory pool matched with the current node, and applying for a return value memory corresponding to the current node in the sub memory pool through the computing core; and determining the nearest public descendant node of the child node set corresponding to the current node in the computation graph, and releasing the return value memory according to the execution information of the nearest public descendant node. The technical scheme of the embodiments of the invention can realize full multiplexing of the device memory, avoid communication between the computing core DMA and the memory non-affine area to the greatest extent, and ensure the computing performance of the many-core computing device.

Description

Memory allocation method, device, equipment and medium of neural network
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for allocating memory of a neural network.
Background
With the rapid growth of deep neural networks, deep learning tasks have created a great demand for computing power, and in order to improve learning efficiency, many-core computing devices have been proposed to perform computing tasks. In the process of executing tasks on a many-core computing device, it is particularly important to allocate the device memory reasonably.
In the prior art, in order to ensure the security of the memory allocation algorithm, an independent memory space is allocated to each computing node before the operation of the neural network computation graph.
However, the above method has the following technical drawbacks. First, the limited memory space in the device cannot meet the memory requirements of an indeterminate number of computing nodes; the memory in the device cannot be fully reused, and the memory occupied by each node is only actually in use at the moment that node computes. Second, since the direct memory access (Direct Memory Access, DMA) unit of each computing core in the many-core computing device corresponds to one memory affinity region, if the computing cores continually apply for new memory space, the DMA is forced to communicate with the memory non-affinity region, thereby reducing the communication efficiency of the DMA and affecting the computing performance of the many-core computing device.
Disclosure of Invention
The invention provides a memory allocation method, a device, equipment and a medium of a neural network, which can realize the full multiplexing of the equipment memory, avoid the communication between a computing core DMA and a memory non-affine area to the greatest extent and ensure the computing performance of many-core computing equipment.
According to an aspect of the present invention, there is provided a memory allocation method of a neural network, applied to a many-core computing device, the method including:
sequentially acquiring one node from a plurality of computing nodes corresponding to the neural network computing graph as a current node;
applying for a calculation memory in the equipment memory through a calculation core matched with the current node, and releasing the calculation memory after detecting that the calculation core finishes calculation on the current node;
taking the memory release space corresponding to the current node as a sub memory pool matched with the current node, and applying for a return value memory corresponding to the current node through the computing core in the sub memory pool;
in the calculation graph, determining the nearest public descendant node of the child node set corresponding to the current node, and releasing the return value memory according to the execution information of the nearest public descendant node;
The sub memory pool is also used by the child nodes corresponding to the current node to apply for computation memory.
Optionally, applying for the computing memory in the device memory through the computing core matched with the current node, including:
judging whether an ancestor node corresponding to the current node exists in the calculation graph;
if yes, determining a sub memory pool corresponding to the ancestor node in the equipment memory, and judging whether the sub memory pool corresponding to the ancestor node meets the calculation requirement of the current node or not;
if yes, applying for a calculation memory in a child memory pool corresponding to the ancestor node through a calculation core matched with the current node; if not, determining a calculation memory in the non-applied memory corresponding to the equipment through the calculation core matched with the current node.
Optionally, in the computation graph, determining the nearest common descendant node of the child node set corresponding to the current node includes:
and in the calculation graph, determining a child node set corresponding to the current node, acquiring a subsequent node closest to the child node set, and taking the subsequent node as a nearest public descendant node of the child node set corresponding to the current node.
Optionally, acquiring a successor node closest to the child node set, and taking the successor node as a closest public descendant node of the child node set corresponding to the current node, including:
Acquiring an offspring node set corresponding to each child node in the child node set in the calculation graph;
determining intersections corresponding to a plurality of offspring node sets, and determining a scheduling sub-graph formed by directed edges according to the intersections;
and acquiring a node with zero in-degree in the scheduling subgraph as the nearest public descendant node of the child node set corresponding to the current node.
Optionally, releasing the return value memory according to the execution information of the nearest public offspring node includes:
if the child node set corresponding to the current node has a plurality of nearest public offspring nodes, acquiring the earliest execution time corresponding to the plurality of nearest public offspring nodes;
and releasing the return value memory according to the earliest execution time.
Optionally, before determining the nearest public descendant node of the child node set corresponding to the current node in the computation graph, the method further includes:
acquiring a node with zero out-degree from the calculation graph as a target node, and adding a virtual successor node corresponding to the target node into the calculation graph;
releasing the return value memory according to the execution information of the nearest public offspring node, including:
And if the nearest public descendant node of the child node set corresponding to the current node is a virtual successor node, releasing the return value memory corresponding to the current node after all the nodes in the calculation graph are detected to finish calculation.
According to another aspect of the present invention, there is provided a memory allocation apparatus for a neural network, which is applied to a many-core computing device, the apparatus including:
the node acquisition module is used for sequentially acquiring one node from a plurality of computing nodes corresponding to the neural network computing graph as a current node;
the computing memory processing module is used for applying for a computing memory in the equipment memory through a computing core matched with the current node, and releasing the computing memory after detecting that the computing core matched with the current node has finished computing;
the return value memory application module is used for taking the memory release space corresponding to the current node as a sub memory pool matched with the current node, and applying for the return value memory corresponding to the current node through the calculation core in the sub memory pool;
the return value memory releasing module is used for determining the nearest public descendant node of the child node set corresponding to the current node in the calculation graph, and releasing the return value memory according to the execution information of the nearest public descendant node;
The sub memory pool is also used by the child nodes corresponding to the current node to apply for computation memory.
According to another aspect of the present invention, there is provided a many-core computing device, the device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein each processor includes a DMA unit therein;
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the memory allocation method of the neural network according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement a memory allocation method of a neural network according to any embodiment of the present invention when executed.
According to another aspect of the present invention, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a memory allocation method of a neural network according to any embodiment of the present invention.
According to the technical scheme provided by the embodiment of the invention, one node is sequentially obtained from a plurality of computing nodes corresponding to a neural network computing graph as a current node, a computing memory is applied for in the device memory through a computing core matched with the current node, the computing memory is released after the computing core is detected to have finished computing the current node, a memory release space corresponding to the current node is used as a sub memory pool matched with the current node, a return value memory corresponding to the current node is applied for in the sub memory pool through the computing core, the nearest public offspring node of the child node set corresponding to the current node is determined in the computing graph, and the return value memory is released according to the execution information of the nearest public offspring node, so that the full multiplexing of the device memory can be realized, the communication between a computing core DMA and a memory non-affine area is avoided to the greatest extent, and the computing performance of a many-core computing device is ensured.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a memory allocation method of a neural network according to an embodiment of the present invention;
fig. 2a is a flowchart of another memory allocation method of a neural network according to an embodiment of the present invention;
FIG. 2b is a neural network computational graph provided in accordance with embodiments of the present invention;
FIG. 2c is another neural network computational graph provided in accordance with embodiments of the present invention;
fig. 3a is a flowchart of another memory allocation method of a neural network according to an embodiment of the present invention;
FIG. 3b is another neural network computational graph provided in accordance with embodiments of the present invention;
FIG. 3c is a scheduling sub-graph provided in accordance with an embodiment of the present invention;
FIG. 3d is another neural network computational graph provided in accordance with embodiments of the present invention;
FIG. 3e is another neural network computational graph provided in accordance with embodiments of the present invention;
Fig. 3f is a schematic diagram of a scenario where a memory allocation method of a neural network according to an embodiment of the present invention is applicable;
fig. 4 is a schematic structural diagram of a memory allocation device of a neural network according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a many-core computing device for implementing a memory allocation method of a neural network according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a flowchart of a memory allocation method of a neural network according to an embodiment of the present invention, where the method may be performed by a memory allocation device of the neural network, the memory allocation device of the neural network may be implemented in hardware and/or software, and the memory allocation device of the neural network may be configured in a many-core computing device. As shown in fig. 1, the method includes:
step 110, sequentially acquiring one node from a plurality of computing nodes corresponding to the neural network computation graph as a current node.
In this embodiment, the neural network computation graph may include a plurality of operators (i.e., computation nodes), where the data dependency between the computation nodes is represented by directed edges, and the entire computation graph may form a directed acyclic graph.
In this step, optionally, a node may be sequentially acquired as a current node according to the front-back dependency relationship between the computing nodes in the neural network computation graph.
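As an illustrative aid only (this is not part of the patent disclosure), the following Python sketch shows one way a dependency-respecting traversal of a directed acyclic computation graph can be obtained, here with Kahn's topological sort; the graph encoding, the node identifiers and the example edges are assumptions made for the sketch.

```python
from collections import deque

def nodes_in_dependency_order(successors):
    """successors: dict mapping node id -> list of child node ids (directed edges)."""
    nodes = set(successors)
    for children in successors.values():
        nodes.update(children)
    indegree = {n: 0 for n in nodes}
    for children in successors.values():
        for c in children:
            indegree[c] += 1
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()              # every input of this node is already available
        order.append(node)
        for c in successors.get(node, []):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

# Small DAG loosely modeled on the Fig. 2b description (exact edges assumed):
# node 1 feeds nodes 2 and 5, which both feed node 4.
print(nodes_in_dependency_order({1: [2, 5], 2: [4], 5: [4]}))   # -> [1, 2, 5, 4]
```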
And 120, applying for a calculation memory in the equipment memory through a calculation core matched with the current node, and releasing the calculation memory after detecting that the calculation core finishes calculation of the current node.
In this step, a computing core matching with the current node may be first determined in a many-core computing device, then a computing memory is applied for in a device memory according to the tensor size to be processed by the current node through the computing core, and the computing of the current node is completed according to the computing memory, and after the completion of the computing of the current node is detected, the computing memory is released.
And 130, taking the memory release space corresponding to the current node as a sub memory pool matched with the current node, and applying for the return value memory corresponding to the current node through the computing core in the sub memory pool.
In this embodiment, the computation memory released in the above step may be used as the memory release space corresponding to the current node, the memory release space is used as the sub memory pool matched with the current node, and then the computing core applies for the return value memory in the sub memory pool.
The return value memory is used for storing the calculation result of the current node and transmitting the calculation result to a subsequent node which has a data dependency relationship with the current node.
In one implementation manner of this embodiment, the child memory pool is further used for applying for the computation memory by the child node corresponding to the current node.
The advantage of this arrangement is that the application for the return value memory is delayed as much as possible: the computation memory of the current node is released first, and the return value memory is applied for only after that release is completed. The return value memory is thus applied for as late as possible while the computing performance of the computing node is still guaranteed, which prevents a computing node from occupying memory for a long time and realizes full multiplexing of the device memory;
in practical applications, the larger the tensors processed by the upper-layer (shallow) nodes are, the more memory they apply for; as the number of node layers increases, the tensors processed by the lower-layer (deep) nodes become smaller and the memory they apply for shrinks accordingly. Using the sub memory pool of the current node as the preferred region from which the current node and its child nodes apply for memory therefore matches the characteristic of the neural network computation graph that shallow nodes apply for more memory and deep nodes apply for less, so that the released memory space is fully reused, communication between the computing core DMA and the memory non-affine area is avoided to the greatest extent, and the computing performance of the many-core computing device is guaranteed.
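Purely as an illustration of the lifecycle described above (apply for computation memory, compute, release, treat the released span as the node's sub memory pool, and only then apply for the return value memory inside that pool), a minimal Python sketch is given below. The class and function names, the callback interface and the assumption that the return value memory always fits inside the released span are invented for the sketch and are not taken from the patent.

```python
class SubMemoryPool:
    """Released compute-memory span of one node, reused as that node's sub memory pool."""
    def __init__(self, size):
        self.size = size          # bytes freed when the node's computation memory was released
        self.used = 0             # bytes handed out again from the released span

    def try_alloc(self, nbytes):
        if self.used + nbytes <= self.size:
            self.used += nbytes   # the request fits: the released region is reused
            return True
        return False              # caller must fall back to un-applied device memory

    def free(self, nbytes):
        self.used = max(0, self.used - nbytes)

def run_node(node, compute_bytes, return_bytes, device_alloc, device_free):
    buf = device_alloc(compute_bytes)     # 1) apply for the computation memory
    node.compute(buf)                     # 2) the matched computing core runs the operator
    device_free(buf)                      # 3) release the computation memory after computing
    pool = SubMemoryPool(compute_bytes)   # 4) the released span becomes the sub memory pool
    ok = pool.try_alloc(return_bytes)     # 5) apply for the (smaller) return value memory in it
    assert ok, "return value memory is assumed to fit in the released span"
    node.sub_pool = pool
    return pool
```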
And 140, determining the nearest public descendant node of the child node set corresponding to the current node in the calculation graph, and releasing the return value memory according to the execution information of the nearest public descendant node.
In this embodiment, the most recent common descendant node (Lowest Common Offspring, LCO) of the set of child nodes corresponding to the current node may be determined in the computation graph, and if the most recent common descendant node is detected to start executing, the return value memory is released.
The method has the advantages that the execution time of the nearest public descendant node is the best time for releasing the return value memory of the current node, and the return value memory is released according to the execution information of the nearest public descendant node, so that the normal calculation of the descendant node associated with the current node can be ensured, and the reliability of the output result of the neural network calculation graph is improved.
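A hedged sketch of this release rule follows: each node's return value memory is freed as soon as any of its nearest public descendant (LCO) nodes is observed to start executing, which also covers the case of several LCOs, since the earliest one to start triggers the release. The scheduler hook and the data structures are assumptions made for the illustration.

```python
def make_release_hook(lcos_of, return_buffers, free):
    """lcos_of: node -> set of its nearest public descendant (LCO) nodes;
    return_buffers: node -> handle of that node's return value memory;
    free: callback that actually releases a handle."""
    pending = {node: set(lcos) for node, lcos in lcos_of.items()}

    def on_node_start(started_node):
        # The earliest LCO to start executing triggers the release of the return value memory.
        for node in [n for n, lcos in pending.items() if started_node in lcos]:
            free(return_buffers[node])
            del pending[node]
    return on_node_start

# Usage with the Fig. 2b example: node 4 is the LCO of node 1's child node set {2, 5}.
released = []
hook = make_release_hook({1: {4}}, {1: "ret_buf_node1"}, released.append)
hook(4)              # the scheduler reports that node 4 starts executing
print(released)      # ['ret_buf_node1'] -- node 1's return value memory has been freed
```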
According to the technical scheme provided by the embodiment of the invention, one node is sequentially obtained from a plurality of computing nodes corresponding to a neural network computing graph as a current node, a computing memory is applied for in the device memory through a computing core matched with the current node, the computing memory is released after the computing core is detected to have finished computing the current node, a memory release space corresponding to the current node is used as a sub memory pool matched with the current node, a return value memory corresponding to the current node is applied for in the sub memory pool through the computing core, the nearest public offspring node of the child node set corresponding to the current node is determined in the computing graph, and the return value memory is released according to the execution information of the nearest public offspring node, so that the full multiplexing of the device memory can be realized, the communication between a computing core DMA and a memory non-affine area is avoided to the greatest extent, and the computing performance of a many-core computing device is ensured.
Fig. 2a is a flowchart of a memory allocation method for a neural network according to a second embodiment of the present invention, where the embodiment is further refined. As shown in fig. 2a, the method comprises:
step 210, sequentially acquiring one node from a plurality of computing nodes corresponding to the neural network computation graph as a current node.
Step 220, applying for a calculation memory in the equipment memory through the calculation core matched with the current node, and releasing the calculation memory after detecting that the calculation core finishes calculation on the current node.
In one implementation manner of this embodiment, applying for the computation memory in the device memory through the computation core matched with the current node includes: judging whether an ancestor node corresponding to the current node exists in the calculation graph; if yes, determining a sub memory pool corresponding to the ancestor node in the equipment memory, and judging whether the sub memory pool corresponding to the ancestor node meets the calculation requirement of the current node or not; if yes, applying for a calculation memory in a child memory pool corresponding to the ancestor node through a calculation core matched with the current node; if not, determining a calculation memory in the non-applied memory corresponding to the equipment through the calculation core matched with the current node.
In this embodiment, if an ancestor node corresponding to the current node exists in the computation graph, and the sub memory pool of the ancestor node is greater than or equal to the computation memory required by the current node, the computation memory may be applied for in that sub memory pool by the computing core matched with the current node. Otherwise, if the sub memory pool is smaller than the computation memory required by the current node, the computation memory is determined from the memory of the device that has not yet been applied for, through the computing core matched with the current node.
If the ancestor node corresponding to the current node does not exist in the calculation graph, the calculation memory can be applied to the global memory of the equipment through the calculation core matched with the current node.
The advantage of this arrangement is that the sub memory pool of a node is used as the preferred region from which the node and its child nodes apply for memory, providing a memory allocation strategy that prevents memory fragmentation: memory that has already been applied for is reused as much as possible, communication between the computing core DMA and the memory non-affine area is avoided to the greatest extent, and the computing performance of the many-core computing device is ensured.
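A minimal sketch of this application policy, under the simplifying assumption that each sub memory pool is tracked only by its remaining free bytes, might look as follows; the function and parameter names are illustrative and not taken from the patent.

```python
def apply_compute_memory(node, ancestors_of, pool_free_bytes, need_bytes, device_alloc):
    """Prefer an ancestor's sub memory pool; otherwise fall back to un-applied device memory.
    pool_free_bytes: ancestor node -> bytes still free in that ancestor's sub memory pool."""
    for anc in ancestors_of.get(node, []):           # ancestor nodes of the current node, if any
        if pool_free_bytes.get(anc, 0) >= need_bytes:
            pool_free_bytes[anc] -= need_bytes       # apply inside the ancestor's sub memory pool
            return ("sub_pool", anc)
    return ("device", device_alloc(need_bytes))      # no pool is large enough: fresh device memory

# Tiny usage example with invented sizes: node 2's ancestor (node 1) has 1 MiB free in its pool.
pools = {1: 1 << 20}
print(apply_compute_memory(2, {2: [1]}, pools, 256 << 10, device_alloc=lambda n: f"dev:{n}"))
print(apply_compute_memory(2, {2: [1]}, pools, 2 << 20, device_alloc=lambda n: f"dev:{n}"))
```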
Step 230, using the memory release space corresponding to the current node as a sub memory pool matched with the current node, and applying for the return value memory corresponding to the current node through the computing core in the sub memory pool.
Step 240, determining a sub-node set corresponding to the current node in the computation graph, obtaining a successor node closest to the sub-node set, and taking the successor node as a nearest public descendant node of the sub-node set corresponding to the current node.
In a specific embodiment, taking the calculation diagram in fig. 2b as an example, assume that the current node is node 1 and that the set of child nodes corresponding to node 1 is P = {2, 5}. Since node 4 is the successor node closest to node 2 and node 5, node 4 can be regarded as the closest common descendant node of the set of child nodes corresponding to node 1.
And 250, releasing the return value memory according to the execution information of the nearest public offspring node.
In this embodiment, taking the calculation diagram in fig. 2b as an example, if the current node is node 1 and the most recent common descendant node of the corresponding child node set is node 4, after detecting that node 4 starts to execute, the return value memory of node 1 may be released.
In one implementation manner of this embodiment, releasing the return value memory according to the execution information of the most recent common descendant node includes: if the child node set corresponding to the current node has a plurality of nearest public offspring nodes, acquiring the earliest execution time corresponding to the plurality of nearest public offspring nodes; and releasing the return value memory according to the earliest execution time.
In a specific embodiment, taking the calculation diagram in fig. 2c as an example, assume that the current node is node 1 and that the set of child nodes corresponding to node 1 is P = {2, 4}. From fig. 2c it can be determined that C = {3, 5} is the set of nearest common descendant nodes of the child node set P; since the number of nearest common descendant nodes is greater than 1, the start execution times of node 3 and node 5 are obtained, and the earliest start execution time is taken as the release time of the return value memory of node 1.
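For this walk-through, the choice of the release time can be illustrated in a few lines of Python; the start times used below are invented purely for the example.

```python
start_time = {3: 40, 5: 35}               # hypothetical start times of node 3 and node 5
nearest_common_descendants = {3, 5}       # C = {3, 5} for node 1 in the fig. 2c example
release_time = min(start_time[n] for n in nearest_common_descendants)
print(release_time)                       # 35: node 1's return value memory is freed at that time
```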
According to the technical scheme provided by the embodiment of the invention, one node is sequentially obtained from a plurality of computing nodes corresponding to a neural network computing graph as a current node, a computing memory is applied for in the device memory through a computing core matched with the current node, the computing memory is released after the computing core is detected to have finished computing the current node, a memory release space corresponding to the current node is used as a sub memory pool matched with the current node, a return value memory corresponding to the current node is applied for in the sub memory pool through the computing core, a child node set corresponding to the current node is determined in the computing graph, the successor node closest to the child node set is obtained, the successor node is used as the nearest public descendant node of the child node set corresponding to the current node, and the return value memory is released according to the execution information of the nearest public descendant node, so that the device memory is fully reused, communication between the computing core DMA and the memory non-affine area is avoided to the greatest extent, and the computing performance of the many-core computing device is ensured.
Fig. 3a is a flowchart of another memory allocation method for a neural network according to a third embodiment of the present invention, where the foregoing embodiment is further refined. As shown in fig. 3a, the method comprises:
step 310, sequentially acquiring one node from a plurality of computing nodes corresponding to the neural network computation graph as a current node.
Step 320, applying for a computation memory in the equipment memory through the computation core matched with the current node, and releasing the computation memory after detecting that the computation core completes computation of the current node.
And 330, using the memory release space corresponding to the current node as a sub memory pool matched with the current node, and applying for the return value memory corresponding to the current node in the sub memory pool through the computing core.
Step 340, determining a child node set corresponding to the current node in the computation graph, and obtaining an offspring node set corresponding to each child node in the child node set in the computation graph.
In this embodiment, taking the calculation diagram in fig. 3b as an example, assume that the current node is node 9 and that the corresponding set of child nodes is P = {4, 7}, where the set of descendant nodes of node 4 is Q1 = {1, 0, 2} and the set of descendant nodes of node 7 is Q2 = {6, 1, 2, 3, 0}.
Specifically, the set of descendant nodes may be searched by a Depth-First-Search (DFS) algorithm.
Step 350, determining the intersection corresponding to the plurality of offspring node sets, determining a scheduling sub-graph formed by directed edges according to the intersection, and acquiring the nodes with zero in-degree in the scheduling sub-graph as the nearest public offspring nodes of the child node set corresponding to the current node.
In this step, taking the calculation graph in fig. 3b as an example, the intersection of the offspring node set Q1 and the offspring node set Q2 is M = {0, 1, 2}, and the scheduling sub-graph corresponding to the intersection is shown in fig. 3c. The nodes with zero in-degree in the scheduling sub-graph are node 1 and node 2. Thus, node 1 and node 2 may be considered as the nearest common descendant nodes of the set of child nodes corresponding to node 9.
The method has the advantages that the release time of the node return value memory can be accurately determined, and the device memory can be fully reused.
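The computation walked through above (descendant sets per child node obtained by depth-first search, their intersection, and the zero in-degree nodes of the scheduling sub-graph induced by that intersection) can be sketched as below. The edge list used in the example is an assumed reconstruction that is merely consistent with the sets Q1, Q2 and M quoted for fig. 3b; it is not the actual figure.

```python
def descendants(graph, start):
    """All nodes reachable from `start` along directed edges (iterative DFS, start excluded)."""
    seen, stack = set(), list(graph.get(start, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

def nearest_common_descendants(graph, child_nodes):
    sets = [descendants(graph, c) for c in child_nodes]
    common = set.intersection(*sets) if sets else set()
    # in-degree counted only inside the scheduling sub-graph induced by the intersection
    indeg = {n: 0 for n in common}
    for n in common:
        for m in graph.get(n, []):
            if m in common:
                indeg[m] += 1
    return {n for n, d in indeg.items() if d == 0}

# Assumed reconstruction consistent with the fig. 3b walk-through:
# children of node 9 are {4, 7}; Q1 = {1, 0, 2}; Q2 = {6, 1, 2, 3, 0}; M = {0, 1, 2}.
g = {9: [4, 7], 4: [1, 2], 7: [6, 3], 6: [1], 3: [2], 1: [0], 2: [0], 0: []}
print(nearest_common_descendants(g, [4, 7]))     # -> {1, 2}
```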
And 360, releasing the return value memory according to the execution information of the nearest public offspring node.
In this embodiment, before determining the most recent common descendant node of the set of child nodes corresponding to the current node, the method further includes: acquiring a node with zero out-degree from the calculation graph as a target node, and adding a virtual successor node corresponding to the target node into the calculation graph. Releasing the return value memory according to the execution information of the nearest public offspring node then includes: if the nearest public descendant node of the child node set corresponding to the current node is a virtual successor node, releasing the return value memory corresponding to the current node after all nodes in the calculation graph are detected to have finished calculation.
In a specific embodiment, taking the calculation diagram in fig. 3d as an example, the node with zero out-degree in the calculation diagram is node 7, so a virtual successor node "End" can be added to node 7, as shown in fig. 3e.
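A small sketch of this preprocessing step is shown below: every node of the computation graph with zero out-degree receives a virtual successor labelled "End", so that nodes whose return value memory would otherwise have no nearest public descendant are released only after the whole graph has finished. Apart from node 7 being the sink, as stated for fig. 3d, the example graph is assumed.

```python
def add_virtual_end(graph, end_label="End"):
    """Give every zero out-degree node a virtual successor so that an LCO always exists."""
    sinks = [n for n, children in graph.items() if not children]
    graph[end_label] = []
    for n in sinks:
        graph[n] = [end_label]
    return graph

# Node 7 is the only sink, as stated for fig. 3d (the other edges are assumed here).
g = {5: [6, 7], 6: [7], 7: []}
print(add_virtual_end(g))    # {5: [6, 7], 6: [7], 7: ['End'], 'End': []}
```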
According to the technical scheme provided by this embodiment of the invention, one node is sequentially obtained from a plurality of computing nodes corresponding to a neural network computation graph as the current node, a computation memory is applied for in the device memory through the computing core matched with the current node, the computation memory is released after the computing core is detected to have finished computing the current node, the memory release space corresponding to the current node is used as a sub memory pool matched with the current node, a return value memory corresponding to the current node is applied for in the sub memory pool through the computing core, a child node set corresponding to the current node is determined in the computation graph, an offspring node set corresponding to each child node in the child node set is obtained, the intersection corresponding to the offspring node sets is determined, a scheduling sub-graph formed by directed edges is determined according to the intersection, and a node with zero in-degree in the scheduling sub-graph is taken as the nearest public offspring node of the child node set corresponding to the current node. By this technical means, the release timing of the return value memory can be accurately determined, the device memory is fully reused, communication between the computing core DMA and the memory non-affine area is avoided to the greatest extent, and the computing performance of the many-core computing device is ensured.
On the basis of the above embodiments, taking the calculation graph in fig. 3e as an example, the most recent common descendant node of the child node set corresponding to each node, together with the applied memory and the return value memory of each node, is shown in table 1. For this calculation graph, the present embodiment provides a specific memory allocation policy that prevents memory fragmentation; an applicable scenario is shown in fig. 3f. The allocation policy comprises the following procedure:
node 1 memory allocation: node 1 applies for a computation memory from the un-applied memory of the device, releases the computation memory into its sub memory pool (sub memory pool 1) after the computation is completed, and node 1 then applies for a return value memory from sub memory pool 1;
node 2 memory allocation: node 2 applies for a computation memory from sub memory pool 1, releases the computation memory into sub memory pool 2 after the computation is completed, and applies for a return value memory from sub memory pool 2;
node 3 memory allocation: node 3 applies for a computation memory from the un-applied memory of the device, releases the computation memory into sub memory pool 3 after the computation is completed, and applies for a return value memory from sub memory pool 3;
node 4 memory allocation: node 4 applies for a computation memory from sub memory pool 2, releases the computation memory into sub memory pool 4 after the computation is completed, and applies for a return value memory from sub memory pool 4;
node 5 memory allocation: node 5 applies for a computation memory from sub memory pool 3, releases the computation memory into sub memory pool 5 after the computation is completed, and applies for a return value memory from sub memory pool 5; at this point node 1 releases its return value memory back into sub memory pool 1;
node 6 memory allocation: node 6 applies for a computation memory from sub memory pool 1, releases the computation memory into sub memory pool 6 after the computation is completed, and applies for a return value memory from sub memory pool 6; at this point node 2 releases its return value memory back into sub memory pool 2;
node 7 memory allocation: node 7 applies for a computation memory from sub memory pool 1, releases the computation memory into sub memory pool 7 after the computation is completed, and applies for a return value memory from sub memory pool 7; at this point node 4 releases its return value memory back into sub memory pool 4, and node 5 releases its return value memory back into sub memory pool 5;
finally, node 6 releases its return value memory back into sub memory pool 6, and node 3 releases its return value memory back into sub memory pool 3.
TABLE 1
Fig. 4 is a schematic structural diagram of a memory allocation device of a neural network according to a fourth embodiment of the present invention, where the device is applied to many-core computing equipment. As shown in fig. 4, the apparatus includes: a node acquisition module 410, a compute memory processing module 420, a return value memory application module 430, and a return value memory release module 440.
The node obtaining module 410 is configured to sequentially obtain one node from a plurality of computing nodes corresponding to the neural network computation graph as a current node;
the calculation memory processing module 420 is configured to apply for a calculation memory in the device memory through a calculation core matched with the current node, and release the calculation memory after detecting that the calculation core completes calculation for the current node;
the return value memory application module 430 is configured to use the memory release space corresponding to the current node as a sub memory pool matched with the current node, and apply for a return value memory corresponding to the current node by using the computing core in the sub memory pool; the sub memory pool is also used by the child nodes corresponding to the current node to apply for computation memory;
and the return value memory releasing module 440 is configured to determine, in the computation graph, a nearest public descendant node of the child node set corresponding to the current node, and release the return value memory according to execution information of the nearest public descendant node.
According to the technical scheme provided by the embodiment of the invention, one node is sequentially obtained from a plurality of computing nodes corresponding to a neural network computing graph as a current node, a computing memory is applied for in the device memory through a computing core matched with the current node, the computing memory is released after the computing core is detected to have finished computing the current node, a memory release space corresponding to the current node is used as a sub memory pool matched with the current node, a return value memory corresponding to the current node is applied for in the sub memory pool through the computing core, the nearest public offspring node of the child node set corresponding to the current node is determined in the computing graph, and the return value memory is released according to the execution information of the nearest public offspring node, so that the full multiplexing of the device memory can be realized, the communication between a computing core DMA and a memory non-affine area is avoided to the greatest extent, and the computing performance of a many-core computing device is ensured.
Based on the above embodiment, the computing memory processing module 420 includes:
the node judging unit is used for judging whether ancestor nodes corresponding to the current node exist in the calculation graph or not; if yes, determining a sub memory pool corresponding to the ancestor node in the equipment memory, and judging whether the sub memory pool corresponding to the ancestor node meets the calculation requirement of the current node or not; if yes, applying for a calculation memory in a child memory pool corresponding to the ancestor node through a calculation core matched with the current node; if not, determining a calculation memory in the non-applied memory corresponding to the equipment through the calculation core matched with the current node.
The return value memory release module 440 includes:
a successor node obtaining unit, configured to determine a child node set corresponding to a current node in the computation graph, obtain a successor node closest to the child node set, and use the successor node as a closest public descendant node of the child node set corresponding to the current node;
a node set obtaining unit, configured to obtain, in the computation graph, a offspring node set corresponding to each child node in the child node set;
a scheduling sub-graph determining unit, configured to determine intersections corresponding to a plurality of offspring node sets, and determine a scheduling sub-graph formed by directed edges according to the intersections;
The nearest public offspring node determining unit is used for acquiring a node with zero in-degree in the scheduling subgraph and taking the node as a nearest public offspring node of the child node set corresponding to the current node;
an execution time acquisition unit, configured to acquire the earliest execution time corresponding to a plurality of nearest public descendant nodes if the child node set corresponding to the current node has a plurality of nearest public descendant nodes, and release the return value memory according to the earliest execution time;
the virtual node adding unit is used for acquiring a node with zero out-degree from the calculation graph as a target node and adding a virtual successor node corresponding to the target node into the calculation graph;
and the memory releasing unit is used for releasing the return value memory corresponding to the current node after all the nodes in the calculation graph are detected to finish calculation if the nearest public descendant node of the child node set corresponding to the current node is a virtual successor node.
The device can execute the method provided by all the embodiments of the invention, and has the corresponding functional modules and beneficial effects of executing the method. Technical details not described in detail in the embodiments of the present invention can be found in the methods provided in all the foregoing embodiments of the present invention.
Fig. 5 shows a schematic diagram of a many-core computing device 10 that may be used to implement an embodiment of the invention. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the many-core computing device 10 includes at least one processor 11, and memory, such as read-only memory (ROM) 12 and random access memory (RAM) 13, communicatively coupled to the at least one processor 11. Each processor 11 comprises a DMA unit. The memory stores computer programs executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer programs stored in the read-only memory (ROM) 12 or the computer programs loaded from a storage unit 18 into the random access memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the many-core computing device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in many-core computing device 10 are connected to I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. Communication unit 19 allows many-core computing device 10 to exchange information/data with other devices over a computer network, such as the internet, and/or various telecommunications networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as a memory allocation method for a neural network.
In some embodiments, the memory allocation method of the neural network may be implemented as a computer program, which is tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the many-core computing device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the memory allocation method of the neural network described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the memory allocation method of the neural network in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a many-core computing device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or a trackball) through which a user can provide input to the many-core computing device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A memory allocation method for a neural network, the method being applied to a many-core computing device, the method comprising:
sequentially acquiring one node from a plurality of computing nodes corresponding to the neural network computing graph as a current node;
applying for a calculation memory in the equipment memory through a calculation core matched with the current node, and releasing the calculation memory after detecting that the calculation core finishes calculation on the current node;
Taking the memory release space corresponding to the current node as a sub memory pool matched with the current node, and applying for a return value memory corresponding to the current node through the computing core in the sub memory pool;
in the calculation graph, determining the nearest public descendant node of the child node set corresponding to the current node, and releasing the return value memory according to the execution information of the nearest public descendant node;
the sub memory pool is also used by the child nodes corresponding to the current node to apply for computation memory.
2. The method of claim 1, wherein applying for computational memory in the device memory through the computational core that matches the current node comprises:
judging whether an ancestor node corresponding to the current node exists in the calculation graph;
if yes, determining a sub memory pool corresponding to the ancestor node in the equipment memory, and judging whether the sub memory pool corresponding to the ancestor node meets the calculation requirement of the current node or not;
if yes, applying for a calculation memory in a child memory pool corresponding to the ancestor node through a calculation core matched with the current node; if not, determining a calculation memory in the non-applied memory corresponding to the equipment through the calculation core matched with the current node.
3. The method of claim 1, wherein determining, in the computational graph, a closest common descendant node of the set of child nodes corresponding to the current node comprises:
and in the calculation graph, determining a child node set corresponding to the current node, acquiring a subsequent node closest to the child node set, and taking the subsequent node as a nearest public descendant node of the child node set corresponding to the current node.
4. A method according to claim 3, wherein obtaining the successor node closest to the set of child nodes, and taking the successor node as the closest common descendant node of the set of child nodes corresponding to the current node, comprises:
acquiring a offspring node set corresponding to each child node in the child node set in the calculation graph;
determining intersections corresponding to a plurality of offspring node sets, and determining a scheduling sub-graph formed by directed edges according to the intersections;
acquiring a node with zero in-degree in the scheduling subgraph as the nearest common descendant node of the child node set corresponding to the current node.
5. The method of claim 1, wherein releasing the return-value memory according to the execution information of the nearest common descendant node comprises:
if the child node set corresponding to the current node has a plurality of nearest common descendant nodes, acquiring an earliest execution time among the plurality of nearest common descendant nodes; and
releasing the return-value memory according to the earliest execution time.
6. The method of claim 1, further comprising, before determining, in the computational graph, the nearest common descendant node of the child node set corresponding to the current node:
acquiring a node with zero out-degree in the computational graph as a target node, and adding, to the computational graph, a virtual successor node corresponding to the target node;
wherein releasing the return-value memory according to the execution information of the nearest common descendant node comprises:
if the nearest common descendant node of the child node set corresponding to the current node is a virtual successor node, releasing the return-value memory corresponding to the current node after detecting that all nodes in the computational graph have finished computing.
7. A memory allocation apparatus for a neural network, the apparatus being applied to a many-core computing device and comprising:
a node acquisition module, configured to sequentially acquire, as a current node, one node from a plurality of computing nodes corresponding to a computational graph of the neural network;
a computation memory processing module, configured to apply for a computation memory in a device memory through a compute core matched with the current node, and to release the computation memory after detecting that the compute core matched with the current node has finished computing;
a return-value memory application module, configured to take the memory space released for the current node as a sub-memory pool matched with the current node, and to apply, through the compute core and within the sub-memory pool, for a return-value memory corresponding to the current node; and
a return-value memory release module, configured to determine, in the computational graph, a nearest common descendant node of a child node set corresponding to the current node, and to release the return-value memory according to execution information of the nearest common descendant node;
wherein the sub-memory pool is further usable by a child node corresponding to the current node to apply for a computation memory.
8. A many-core computing device, the device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein each processor comprises a Direct Memory Access (DMA) unit; and
wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the memory allocation method of a neural network according to any one of claims 1-6.
9. A computer-readable storage medium storing computer instructions for causing a processor to perform the memory allocation method of a neural network according to any one of claims 1-6.
10. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the memory allocation method of a neural network according to any one of claims 1-6.
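To make the interplay of claims 1-6 easier to follow, the sketch below performs one topological pass over the computational graph: each node draws its computation memory from an ancestor's sub-memory pool when that pool is large enough and from unallocated device memory otherwise (claim 2); the span released when the node finishes becomes the node's own sub-memory pool and receives its return-value memory (claim 1); and the return-value memory is released once the nearest common descendant of the node's child node set begins executing, taking the earliest such descendant when there are several and waiting for the whole graph when the descendant is virtual (claims 3-6). This is a minimal illustrative sketch rather than the patented implementation: every name (Graph, SubPool, plan, nearest_common_descendants) is hypothetical, memory is modelled only as byte counts and trace events, and attaching a single shared virtual successor to all sink nodes is just one possible reading of claim 6.

```python
from collections import deque

class Graph:
    """Directed acyclic compute graph; an edge u -> v means v consumes u's output."""

    def __init__(self, edges):
        self.succ, self.pred = {}, {}
        for u, v in edges:
            self.succ.setdefault(u, []).append(v)
            self.pred.setdefault(v, []).append(u)
            self.succ.setdefault(v, [])
            self.pred.setdefault(u, [])

    def add_virtual_successor(self, name="__virtual__"):
        # Claim 6 (one possible reading): attach a single shared virtual successor
        # to every sink node so every child node set has a common descendant.
        self.succ[name], self.pred[name] = [], []
        for n in list(self.succ):
            if n != name and not self.succ[n]:
                self.succ[n].append(name)
                self.pred[name].append(n)
        return name

    def descendants(self, node):
        # Proper descendants: everything reachable from `node` via outgoing edges.
        seen, queue = set(), deque(self.succ[node])
        while queue:
            cur = queue.popleft()
            if cur not in seen:
                seen.add(cur)
                queue.extend(self.succ[cur])
        return seen

def nearest_common_descendants(graph, children):
    # Claims 3-4: intersect the children's descendant sets, then keep the nodes
    # of the induced scheduling subgraph that have zero in-degree.
    common = set.intersection(*(graph.descendants(c) for c in children))
    return {n for n in common if not any(p in common for p in graph.pred[n])}

class SubPool:
    """Claim 1: the span released by a finished node, holding its return value and
    reusable by its child nodes for their computation memory."""
    def __init__(self, capacity): self.capacity, self.used = capacity, 0
    def can_fit(self, n): return self.used + n <= self.capacity
    def alloc(self, n): self.used += n
    def free(self, n): self.used -= n

def plan(graph, order, compute_bytes, retval_bytes):
    """Emit the ordered allocation/release events of claims 1-6; `order` must be a
    topological execution order of the compute nodes."""
    virtual = graph.add_virtual_successor()
    step_of = {n: i for i, n in enumerate(order)}
    sub_pools, trace, release_step = {}, [], {}

    for step, node in enumerate(order):
        # Claims 5-6: free every return value whose (earliest) nearest common
        # descendant is the node that is about to execute.
        for owner in [o for o, s in release_step.items() if s == step]:
            sub_pools[owner].free(retval_bytes[owner])
            trace.append(("free_retval", owner))
            del release_step[owner]

        # Claim 2: prefer a large-enough ancestor sub-memory pool for the
        # computation memory, otherwise fall back to unallocated device memory.
        need = compute_bytes[node]
        donor = next((a for a in graph.pred[node]
                      if a in sub_pools and sub_pools[a].can_fit(need)), None)
        if donor is not None:
            sub_pools[donor].alloc(need)
        trace.append(("alloc_compute", node, need,
                      f"sub-pool of {donor}" if donor is not None else "device memory"))

        # ... the compute core matched with `node` would run the operator here ...

        # Claim 1: release the computation memory, keep the released span as this
        # node's sub-memory pool, and place the return value inside that pool.
        if donor is not None:
            sub_pools[donor].free(need)
        trace.append(("free_compute", node))
        sub_pools[node] = SubPool(need)
        sub_pools[node].alloc(retval_bytes[node])
        trace.append(("alloc_retval", node, retval_bytes[node]))

        # Claims 3-5: schedule the return-value release at the earliest nearest
        # common descendant; the virtual successor means "after the whole graph".
        ncds = nearest_common_descendants(graph, graph.succ[node])
        steps = [step_of[d] for d in ncds if d != virtual]
        release_step[node] = min(steps) if steps else len(order)

    for owner in release_step:  # claim 6: freed only when everything has finished
        trace.append(("free_retval", owner))
    return trace

# Tiny hypothetical graph: "a" and "b" both consume "in", "c" joins them, "d" ends.
g = Graph([("in", "a"), ("in", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
events = plan(g, ["in", "a", "b", "c", "d"],
              compute_bytes={"in": 128, "a": 64, "b": 64, "c": 64, "d": 64},
              retval_bytes=dict.fromkeys(["in", "a", "b", "c", "d"], 16))
print(*events, sep="\n")
```

Running this example prints the ordered allocation and release events: the return value of "in" is freed exactly when "c", the nearest common descendant of its children "a" and "b", starts executing; the computation memory of "a" and "b" is served from the sub-memory pool left behind by "in"; and the return values of "c" and "d" are held until the whole graph has finished, which mirrors the virtual-successor case of claim 6.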
CN202310973717.3A 2023-08-04 2023-08-04 Memory allocation method, device, equipment and medium of neural network Active CN116700996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310973717.3A CN116700996B (en) 2023-08-04 2023-08-04 Memory allocation method, device, equipment and medium of neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310973717.3A CN116700996B (en) 2023-08-04 2023-08-04 Memory allocation method, device, equipment and medium of neural network

Publications (2)

Publication Number Publication Date
CN116700996A true CN116700996A (en) 2023-09-05
CN116700996B CN116700996B (en) 2023-11-07

Family

ID=87824298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310973717.3A Active CN116700996B (en) 2023-08-04 2023-08-04 Memory allocation method, device, equipment and medium of neural network

Country Status (1)

Country Link
CN (1) CN116700996B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490313A (en) * 2019-08-14 2019-11-22 Beijing Zhongke Cambricon Technology Co., Ltd. Memory multiplexing method and related product
CN110597616A (en) * 2018-06-13 2019-12-20 Huawei Technologies Co., Ltd. Memory allocation method and device for neural network
CN112669852A (en) * 2020-12-15 2021-04-16 Beijing Baidu Netcom Science and Technology Co., Ltd. Memory allocation method and device, and electronic device
WO2022002021A1 (en) * 2020-06-29 2022-01-06 Beijing Oneflow Technology Co., Ltd. Memory space pre-allocation system in static network, and method thereof
CN114327844A (en) * 2020-09-29 2022-04-12 Huawei Technologies Co., Ltd. Memory allocation method, related device, and computer-readable storage medium
WO2022105187A1 (en) * 2020-11-18 2022-05-27 Huawei Technologies Co., Ltd. Memory management method, device, and system
CN116361203A (en) * 2021-12-24 2023-06-30 Wuxi Lynxi Brain-Inspired Technology Co., Ltd. Memory allocation method and device, electronic device, and computer-readable medium

Also Published As

Publication number Publication date
CN116700996B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN116166405B (en) Neural network task scheduling strategy determination method and device in heterogeneous scene
CN115150471B (en) Data processing method, apparatus, device, storage medium, and program product
CN113032093B (en) Distributed computing method, device and platform
CN117009283A (en) Multi-core multi-chip data processing method, device, chip and storage medium
CN116700996B (en) Memory allocation method, device, equipment and medium of neural network
CN115495248B (en) Memory allocation method and device of reasoning card, electronic equipment and storage medium
CN110908968B (en) Method, device, equipment and storage medium for avoiding frightened groups during file lock unlocking
CN114579187B (en) Instruction distribution method and device, electronic equipment and readable storage medium
CN115495151A (en) Rule engine migration method, device, equipment, storage medium and program product
CN115168509A (en) Processing method and device of wind control data, storage medium and computer equipment
CN114362968B (en) Method, device, equipment and medium for acquiring random number by block chain
CN117271098B (en) AI model calculation core scheduling method, device, equipment and storage medium
CN116579914B (en) Execution method and device of graphic processor engine, electronic equipment and storage medium
CN116055386B (en) Port weight updating method, device, chip and storage medium
CN116610453A (en) Task allocation method and device, electronic equipment and storage medium
CN116069474A (en) Task scheduling method, device, equipment and medium
CN117194018A (en) Processing method and device of system temperature control algorithm in multi-core and multi-chip environment
CN116737605B (en) Data prefetching method, device, equipment and medium based on chip multilevel storage
CN117130970A (en) Multi-chip data transmission method, device, chip and storage medium
CN117076720A (en) Embedded table access method and device, electronic equipment and storage medium
CN115442432A (en) Control method, device, equipment and storage medium
CN116801001A (en) Video stream processing method and device, electronic equipment and storage medium
CN117608660A (en) Instruction scheduling method, device, medium and electronic equipment
CN117591249A (en) Transaction processing method, device, electronic equipment and storage medium
CN117234736A (en) Instruction processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant