CN117149398A - Memory allocation method and device

Memory allocation method and device

Info

Publication number: CN117149398A
Authority: CN (China)
Prior art keywords: operator, memory, neural network, binding, allocated
Legal status: Pending
Application number: CN202210552003.0A
Other languages: Chinese (zh)
Inventors: 邬志影, 刘雷, 王瑞涛, 薛阳
Current Assignee: Beijing Simm Computing Technology Co ltd
Original Assignee: Beijing Simm Computing Technology Co ltd
Application filed by: Beijing Simm Computing Technology Co ltd
Priority to: CN202210552003.0A; PCT/CN2023/081888 (WO2023221626A1)
Publication of: CN117149398A

Classifications

    • G06F9/5016: Allocation of resources to service a request, the resource being the memory
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06N3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/08: Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

An embodiment of the invention discloses a memory allocation method and device. The method comprises: obtaining an intermediate representation of a neural network, the intermediate representation comprising a memory operator of the neural network, a first calculation operator and a second calculation operator, where the first calculation operator is a predecessor operator of the memory operator; finding a binding operator in the intermediate representation, the binding operator being used to bind the input or output of the memory operator; and performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network according to the output of the binding operator and the second calculation operator. By this method, on-chip memory occupation can be reduced and the reuse rate of the on-chip memory improved.

Description

Memory allocation method and device
Technical Field
The invention relates to the technical field of computers, in particular to a memory allocation method and device.
Background
With the evolution of neural network algorithms, the computing power demanded of artificial intelligence (AI) chips keeps growing. Greater computing power means larger data volumes, and therefore more data to transmit and store, which in turn requires more on-chip memory and higher bandwidth; however, enlarging the on-chip memory or increasing the bandwidth raises design difficulty and cost. Mainstream AI chips contain a level-1 cache (L1) and a level-2 cache (L2), which may also be called on-chip memories. During computation, keeping data blocks resident in L1 or L2 inside the chip reduces data migration time, but because of cost constraints L1 and L2 cannot be made too large and cannot hold much data, while high computing power demands large data volumes. For a given neural network model, the total data volume and the required computing power are essentially fixed, so memory allocation must be done effectively to make full use of the very scarce on-chip memory.
In the prior art there are two memory allocation approaches: dynamic memory allocation and static memory allocation. Dynamic memory allocation includes the best-fit, first-fit, worst-fit and next-fit algorithms; for a neural network model, dynamic allocation cannot achieve the minimum memory footprint. Static memory allocation is used for neural network models with a fixed input size: memory is allocated in advance at the compilation stage, i.e. allocated uniformly before the neural network model performs inference, with the size and address offset of each memory block needed during inference determined beforehand, and the memory applied for earlier is released uniformly after the model completes its last inference. Static allocation assigns memory to the output of each operator, but it does not consider the memory layout relationships and lifetime relationships among operators from a global perspective, so it cannot minimize the memory required for the model's computation.
In summary, how to reduce on-chip memory occupation and improve the reuse rate of on-chip memory is a problem that needs to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide a memory allocation method and apparatus, which can reduce on-chip memory occupation and improve the reuse rate of on-chip memory.
In a first aspect, an embodiment of the present invention provides a memory allocation method, the method comprising: obtaining an intermediate representation of a neural network, the intermediate representation comprising a memory operator of the neural network, a first calculation operator and a second calculation operator, where the first calculation operator is a predecessor operator of the memory operator; finding a binding operator in the intermediate representation, the binding operator being used to bind the input or output of the memory operator; and performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network according to the output of the binding operator and the second calculation operator.
Optionally, the method further comprises:
and releasing the memory after the memory corresponding to any input or output of the memory operator is used.
Optionally, the obtaining an intermediate representation of the neural network includes:
converting the neural network into a graph intermediate representation;
and lowering the graph intermediate representation to generate a network processing unit intermediate representation as the intermediate representation of the neural network.
Optionally, the converting the neural network into the graph intermediate representation specifically includes:
converting the neural network into a graph intermediate representation by a deep learning compiler.
Optionally, the lowering of the graph intermediate representation to generate the network processing unit intermediate representation specifically includes:
acquiring a memory operator of the neural network in the graph intermediate representation;
and in response to the memory operator conforming to the binding rule, inserting a binding operator corresponding to the memory operator, and generating a network processing unit intermediate representation.
Optionally, the memory operator includes a slice operator and a concat operator.
Optionally, the memory allocation for the memory operator, the first computation operator and the second computation operator of the neural network according to the output of the binding operator and the second computation operator specifically includes:
traversing the intermediate representation of the neural network, and acquiring an object to be allocated in the intermediate representation, wherein the object to be allocated represents a memory required by the input or output of a memory operator of the neural network and a memory required by the output of a second calculation operator of the neural network;
adding the obtained object to be allocated to an initial array to be allocated;
acquiring a sub-object corresponding to the binding operator from the initial array to be allocated, and adding the sub-object into the binding object of the binding operator; wherein, the sub-object represents an object to be allocated of the memory required by each input or output of the memory operator corresponding to the binding operator;
Deleting the child object from the initial array to be allocated, adding the binding object to the initial array to be allocated, and generating a target array to be allocated;
and performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network according to the target array to be allocated.
Optionally, the memory allocation of the memory operator, the first calculation operator and the second calculation operator of the neural network according to the target array to be allocated specifically includes:
performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network in order of the required memory sizes represented by the objects in the target array to be allocated.
Optionally, the method further comprises:
and updating the head address offset of each sub-object in the binding object according to the memory size required by each input or output of the memory operator.
In a second aspect, an embodiment of the present invention provides a memory allocation device, where the device includes:
an acquisition unit, configured to acquire an intermediate representation of a neural network, the intermediate representation comprising a memory operator of the neural network, a first calculation operator and a second calculation operator, where the first calculation operator is a predecessor operator of the memory operator;
a query unit, configured to find a binding operator in the intermediate representation, the binding operator being used to bind the input or output of the memory operator;
and a distribution unit, configured to perform memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network according to the output of the binding operator and the second calculation operator.
Optionally, the apparatus further comprises: and the release unit is used for releasing the memory after the memory corresponding to any input or output of the memory operator is used.
Optionally, the apparatus further comprises:
a conversion unit for converting the neural network into a graph intermediate representation;
and a generating unit for lowering the graph intermediate representation to generate a network processing unit intermediate representation as the intermediate representation of the neural network.
Optionally, the conversion unit is specifically configured to: the neural network is converted to a graph intermediate representation by a deep learning compiler.
Optionally, the generating unit is specifically configured to:
acquiring a memory operator of the neural network in the graph intermediate representation;
and in response to the memory operator conforming to the binding rule, inserting a binding operator corresponding to the memory operator, and generating a network processing unit intermediate representation.
Optionally, the memory operator includes a slice operator and a concat operator.
Optionally, the distribution unit is specifically configured to:
traversing the intermediate representation of the neural network, and acquiring an object to be allocated in the intermediate representation, wherein the object to be allocated represents a memory required by the input or output of a memory operator of the neural network and a memory required by the output of a second calculation operator of the neural network;
adding the obtained object to be allocated to an initial array to be allocated;
acquiring a sub-object corresponding to the binding operator from the initial array to be allocated, and adding the sub-object into the binding object of the binding operator; wherein, the sub-object represents an object to be allocated of the memory required by each input or output of the memory operator corresponding to the binding operator;
deleting the child object from the initial array to be allocated, adding the binding object to the initial array to be allocated, and generating a target array to be allocated;
and performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network according to the target array to be allocated.
Optionally, the distribution unit is specifically configured to:
performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network in order of the required memory sizes represented by the objects in the target array to be allocated.
Optionally, the apparatus further comprises an updating unit, configured to update the head address offset of each sub-object in the binding object according to the memory size required by each input or output of the memory operator.
In a third aspect, embodiments of the present invention provide computer program instructions which, when executed by a processor, implement a method as in the first aspect or any one of the possibilities of the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, implement the method of the first aspect or any one of the possibilities of the first aspect.
In a fifth aspect, an embodiment of the present invention provides a chip comprising a memory and a processing core, the memory being configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processing core to implement the method of the first aspect or any one of the possibilities of the first aspect.
In a sixth aspect, an embodiment of the present invention provides a board, where the board includes the chip of the fifth aspect.
In a seventh aspect, an embodiment of the present invention provides a server, where the server includes the board card of the sixth aspect.
The method comprises: obtaining an intermediate representation of a neural network, the intermediate representation comprising a memory operator of the neural network, a first calculation operator and a second calculation operator, where the first calculation operator is a predecessor operator of the memory operator; finding a binding operator in the intermediate representation, the binding operator being used to bind the input or output of the memory operator; and performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network according to the output of the binding operator and the second calculation operator. By this method, on-chip memory occupation can be reduced and the reuse rate of the on-chip memory improved.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a prior-art structure involving memory operators;
FIG. 2 is a schematic diagram of another prior-art structure involving memory operators;
FIG. 3 is a flow chart of a method for memory allocation according to an embodiment of the present invention;
FIG. 4 is a flowchart of another method for memory allocation according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an intermediate representation relationship in an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a memory allocation process according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a memory structure according to an embodiment of the invention;
FIG. 8 is a schematic diagram of a bound memory structure in an embodiment of the invention;
FIG. 9 is a diagram illustrating another bound memory structure in accordance with an embodiment of the present invention;
FIG. 10 is a schematic diagram of an apparatus for memory allocation according to an embodiment of the present invention.
Detailed Description
The present disclosure is described below based on examples, but the present disclosure is not limited to these examples. In the following detailed description of the present disclosure, certain specific details are set forth; however, those skilled in the art can fully understand the present disclosure even without these details. Well-known methods, procedures, flows, components and circuits have not been described in detail so as not to obscure the essence of the disclosure.
Moreover, those of ordinary skill in the art will appreciate that the drawings are provided herein for illustrative purposes and that the drawings are not necessarily drawn to scale.
Unless the context clearly requires otherwise, the words "comprise", "comprising" and the like throughout the application are to be construed in an inclusive rather than an exclusive or exhaustive sense; that is, in the sense of "including but not limited to".
In the description of the present disclosure, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present disclosure, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the prior art, there are two memory allocation approaches: dynamic memory allocation and static memory allocation. Dynamic memory allocation includes the best-fit, first-fit, worst-fit and next-fit algorithms; for a neural network model, dynamic allocation cannot achieve the minimum memory footprint. Specifically, taking memory allocation for an arbitrary neural network model as an example, the current operator of the model that needs memory is determined, the memory size required for its computation is determined, and the memory to be allocated is checked for a free memory block that meets the requirement; if one exists, that free block is used as the memory required for the current operator's computation and is removed from the memory to be allocated. The allocated memory blocks are released only after all operators of the neural network model have finished running, so the minimum memory footprint cannot be reached; in addition, memory must be allocated and released frequently during each inference, which affects the execution efficiency of the model's inference. The static memory allocation method suits neural network models with a fixed input size: specifically, the memory of all operators of the model is pre-allocated at the compilation stage according to the memory sizes the operators require, i.e. memory is allocated uniformly before the model runs, and the memory allocated beforehand is released uniformly after the model completes its last run. Static allocation assigns memory to the output of each operator, but it does not consider the memory layout relationships and lifetime relationships among operators from a global perspective, so it cannot minimize the memory required for the model's computation. In summary, how to reduce on-chip memory occupation and improve the reuse rate of on-chip memory is a problem that needs to be solved.
In the actual inference process, besides the computation operators there are also a number of memory operators. During operation, the data in the input memory of a memory operator needs to be copied to its output memory, so memory must be allocated for both the input and the output of the memory operator. Since memory operators only involve operations between memories, opportunities for memory optimization can be sought in the memory layout relationships of memory operators. Assume the memory operators include a slice operator and a concat operator, where the slice operator splits one input into multiple outputs and the concat operator combines multiple inputs into one output. The problems in the prior art are described in detail below through two specific embodiments.
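The following is a minimal sketch, in Python with NumPy, of the semantics of these two memory operators; the shapes and helper names are assumptions for illustration, not part of the embodiments. The slice operator turns one input into several outputs, while the concat operator assembles several inputs into one output.

import numpy as np

def slice_op(x, sizes, axis=0):
    # Split one input into len(sizes) outputs along `axis`.
    split_points = np.cumsum(sizes)[:-1]
    return np.split(x, split_points, axis=axis)

def concat_op(inputs, axis=0):
    # Merge several inputs into a single output along `axis`.
    return np.concatenate(inputs, axis=axis)

x = np.arange(12)
a, b, c = slice_op(x, [4, 4, 4])   # three outputs, each needing its own memory
y = concat_op([a, b, c])           # one output assembled from three inputs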
First embodiment,
As shown in FIG. 1, FIG. 1 includes an op1 operator, a slice operator, and op2, op3, op4, op5, op6 and op7 operators, where the op1 to op7 operators are calculation operators (which may also be called computation-type operators) and the slice operator is a memory operator. Assume the execution order in FIG. 1 is: op1 -> slice -> op2 -> op5 -> op3 -> op6 -> op4 -> op7. In the prior art, one piece of memory is allocated for the output of every calculation operator and memory operator; the slice operator then reads the data from the memory corresponding to op1's output, splits it, and copies it into the memory corresponding to the slice operator's output, i.e. into sub-memory 0, sub-memory 1 and sub-memory 2. This approach not only introduces an extra copy operation, but also makes the input and output of the slice operator occupy two pieces of memory.
Second embodiment,
As shown in FIG. 2, FIG. 2 includes an op1 operator, an op2 operator, an op3 operator, an op4 operator, an op5 operator, an op6 operator and a concat operator, where the op1 to op6 operators are calculation operators and the concat operator is a memory operator. Assume the execution order in FIG. 2 is: op1 -> op4 -> op2 -> op5 -> op3 -> op6 -> concat. One piece of memory is allocated for the output of each calculation operator, and the allocated addresses are random, so the memories allocated to op4, op5 and op6 are not contiguous. When the concat operator runs, one piece of memory is also allocated for its output, and the memories corresponding to the outputs of op4, op5 and op6 are then read, merged and copied into it. This approach therefore introduces an extra copy operation and occupies more memory.
On this basis, in order to reduce on-chip memory occupation and improve the reuse rate of on-chip memory, an embodiment of the present invention provides a memory allocation method, as shown in FIG. 3. FIG. 3 is a flowchart of a memory allocation method according to an embodiment of the present invention, which specifically includes:
Step S300, obtaining an intermediate representation of the neural network.
The intermediate representation comprises a memory operator of the neural network, a first calculation operator and a second calculation operator, where the first calculation operator is a predecessor operator of the memory operator, and the second calculation operators are the calculation operators in the neural network other than the first calculation operator.
Specifically, a predecessor operator of the memory operator is an operator whose output data is used directly as the input data of the memory operator when the memory operator is executed in the neural network. For example, as shown in FIG. 1, the memory operator is a slice operator, and the predecessor operator of the slice operator is op1 in FIG. 1.
Step S301, finding a binding operator in the intermediate representation, where the binding operator is used to bind the input or output of the memory operator. After the intermediate representation is obtained, the binding operator that has been inserted into it can be found.
Here, the inputs of the binding operator are the inputs or outputs of the memory operator, and the output of the binding operator represents the inputs or outputs of the memory operator bound together as one object. For example, as shown in FIG. 1, if the binding operator binds the 3 outputs of the slice operator (0, 1 and 2 in FIG. 1), then 0, 1 and 2 in FIG. 1 are the inputs of the binding operator, and the output of the binding operator is the object obtained after 0, 1 and 2 in FIG. 1 are bound.
Step S302, performing memory allocation on the memory operator, the first computation operator and the second computation operator of the neural network according to the output of the binding operator and the second computation operator. The memory operator and the first computation operator share one whole memory, and the whole memory comprises a plurality of sub-memories corresponding to the multiple inputs or multiple outputs of the memory operator. Further, the number of inputs or outputs of the memory operator is at least two, and the specific number is not limited.
Specifically, performing memory allocation on the memory operator, the first computation operator and the second computation operator of the neural network means determining the write positions and read positions of the input data or output data of all operators of the neural network during its operation.
In this embodiment, because the binding operator binds the input or output of the memory operator, one piece of memory is allocated to the memory operator and the first computation operator according to the output of the binding operator, so that the output data of the memory operator and of the first computation operator share the same memory. This not only reduces copy operations during the operation of the neural network, but also reduces memory occupation.
For example, the neural network shown in FIG. 1 includes a slice operator, and the binding operator binds the 3 outputs of the slice operator, i.e. the output of the binding operator is the object obtained after the 3 outputs of the slice operator are bound. When memory is allocated, one whole memory is allocated to the slice operator and its predecessor operator op1 according to the output of the binding operator; the output data of op1 and the multiple output data of the slice operator share the allocated whole memory, which comprises 3 sub-memories corresponding to the 3 outputs of the slice operator. Specifically, when the neural network runs, op1 writes its output data into the whole memory, and op2, op3 and op4 can read data from the 3 sub-memories of the whole memory respectively. The extra copy operation of the slice operator is eliminated, and the input data and output data of the slice operator share one memory, so memory occupation is reduced.
For example, the neural network shown in FIG. 2 includes a concat operator, and the binding operator binds the 3 inputs of the concat operator, i.e. the output of the binding operator is the object obtained after the 3 inputs of the concat operator are bound. When memory is allocated, one whole memory is allocated to the concat operator and its predecessor operators op4, op5 and op6 according to the output of the binding operator, so that the output data of the concat operator and the output data of op4, op5 and op6 share the whole memory, which comprises 3 sub-memories corresponding to the 3 inputs of the concat operator. Specifically, when the neural network runs, op4, op5 and op6 write their output data into the 3 sub-memories of the whole memory respectively, and the calculation operator after the concat operator (not shown in FIG. 2) reads data from the whole memory. The extra copy operation of the concat operator is eliminated, and the input data and output data of the concat operator share one memory, so memory occupation is reduced.
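As an illustration of the sharing just described, here is a minimal sketch in Python/NumPy; the sizes and the operator bodies are assumptions for illustration. op1 writes its output once into a single bound block, and the slice operator's three outputs are simply offset views into that block, so no copy is performed.

import numpy as np

whole = np.empty(12, dtype=np.float32)                   # one whole memory for op1's output
sub0, sub1, sub2 = whole[0:4], whole[4:8], whole[8:12]   # the slice operator's sub-memories

whole[:] = np.arange(12, dtype=np.float32)               # op1 writes its output data once
out2 = sub0 * 2.0                                        # op2 reads sub-memory 0
out3 = sub1 + 1.0                                        # op3 reads sub-memory 1
out4 = sub2 - 1.0                                        # op4 reads sub-memory 2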
An embodiment of the present invention also provides another memory allocation method, as shown in FIG. 4. FIG. 4 is a flowchart of a memory allocation method in an embodiment of the present invention, which specifically includes:
step S400, obtaining a neural network.
Step S401, converting the neural network into a graph intermediate representation.
Specifically, the neural network is converted into a graph intermediate representation (Graph Intermediate Representation, Graph IR) by a deep learning compiler, where the Graph IR comprises the operators of the neural network, including memory operators and calculation operators. As noted above, the embodiment of the invention mainly optimizes the memory corresponding to the memory operators.
Step S402, lowering the graph intermediate representation to generate a network processing unit intermediate representation.
The graph intermediate representation comprises a memory operator of the neural network, a first calculation operator and a second calculation operator, where the first calculation operator is a predecessor operator of the memory operator.
Specifically, the Graph IR is lowered, and this process of generating the network processing unit intermediate representation (NPU IR) is also called bundling optimization. During lowering, a memory operator of the neural network is obtained from the Graph IR; in response to the memory operator conforming to the binding rule, a binding operator corresponding to the memory operator is inserted, and the Network Processing Unit (NPU) IR is generated, where the first calculation operator and the memory operator corresponding to the binding operator share the same memory. The binding rule means that the memory operator has multiple inputs or multiple outputs, and that the lifetimes of the memories required by those inputs or outputs are not all identical, i.e. the memories required by at least two inputs or two outputs have different lifetimes.
For example, assume the memory operator is a slice operator: the slice operator has multiple outputs whose required memories do not all have the same lifetime, so the operator conforms to the binding rule, and all outputs of the slice operator are bound together, i.e. a binding operator corresponding to the slice operator is inserted. Assume the memory operator is a concat operator: the concat operator has multiple inputs whose required memories do not all have the same lifetime, so the operator conforms to the binding rule, and all inputs of the concat operator are bound together, i.e. a binding operator corresponding to the concat operator is inserted. If the memory operator is a conventional memory operator that does not conform to the binding rule, no corresponding binding operator needs to be inserted; for example, a reshape operator (a tensor-reconstruction operator that adjusts the shape of a tensor) has one input and one output, so it is a conventional memory operator that does not conform to the binding rule. As another example, when the memories required by the 3 inputs of the concat operator shown in FIG. 2 have identical lifetimes, the concat operator shown in FIG. 2 is a conventional memory operator that does not conform to the binding rule.
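The rule check can be sketched as follows; the operator fields (kind, first_use, last_use) and the step numbers are assumptions for illustration, not part of the embodiments.

def conforms_to_binding_rule(op):
    # Bind only memory operators that have several inputs or several outputs
    # whose required memories do not all share the same lifetime.
    if op["kind"] == "slice":
        values = op["outputs"]       # one input split into several outputs
    elif op["kind"] == "concat":
        values = op["inputs"]        # several inputs merged into one output
    else:
        return False                 # e.g. reshape: one input, one output
    if len(values) < 2:
        return False
    lifetimes = {(v["first_use"], v["last_use"]) for v in values}
    return len(lifetimes) > 1        # at least two lifetimes differ

# Example: a slice operator like the one in FIG. 1, with three outputs of different lifetimes.
slice_node = {"kind": "slice",
              "outputs": [{"first_use": 1, "last_use": 2},
                          {"first_use": 1, "last_use": 4},
                          {"first_use": 1, "last_use": 6}]}
print(conforms_to_binding_rule(slice_node))   # True, so a bundle op would be inserted for it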
In one possible implementation, a binding operator (bundle op) is inserted at a suitable location in the neural network's intermediate representation. Assuming the memory operator has 3 inputs or 3 outputs and the memory required by each input or output is represented by a created object %3, %2, %0, the binding operator can be formally represented as follows:
%4=bundle(%3,%2,%0)
where %3, %2 and %0 are the bound sub-objects (each representing the sub-memory required by one input or output, i.e. they are the inputs of the binding operator), and %4 is the output of the binding operator.
In the embodiment of the present invention, the relationship among the neural network model, the graph intermediate representation (Graph IR) and the network processing unit intermediate representation (NPU IR) during the processing of steps S400 to S402 is shown in FIG. 5.
Step S403, searching a binding operator in the intermediate representation, where the binding operator is used to bind the input or output of the memory operator.
Step S404, performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network according to the output of the binding operator and the second calculation operator.
Specifically, as shown in FIG. 6, the process of performing memory allocation according to the binding operator and the second calculation operator specifically includes:
Step S600, traversing the NPU IR and obtaining the objects to be allocated in the intermediate representation, where an object to be allocated represents memory required by an input or output of a memory operator of the neural network, or memory required by an output of a second calculation operator of the neural network. An object to be allocated is an object created to represent the required memory size and the lifetime of that memory; once an object to be allocated has been acquired, it is known in which period memory of that size is used.
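A minimal sketch of what such an object might record is given below; the class and field names are assumptions for illustration, not part of the embodiments.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Buffer:               # an object to be allocated
    name: str
    size: int               # required memory size
    start: int              # first step in which the memory is used
    end: int                # last step in which the memory is used
    offset: int = -1        # head address offset, filled in during allocation

@dataclass
class BundleBuffer:         # a binding object grouping several sub-objects
    name: str
    children: List[Buffer] = field(default_factory=list)

    @property
    def size(self) -> int:  # the bound block must hold all sub-memories contiguously
        return sum(c.size for c in self.children)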
For example, assume that the NPU IR of the neural network shown in FIG. 1 is:
%1 = alloc();
%2 = alloc();
%3 = alloc();
%4 = bundle(%1, %2, %3);
op1(%4);   // write
op2(%1);   // read
op5(…);
dealloc(%1);
op3(%2);   // read
op6(…);
dealloc(%2);
op4(%3);   // read
op7(…);
dealloc(%3);
The object to be allocated %1 represents sub-memory 0 required by the output of the slice operator, the object to be allocated %2 represents sub-memory 1 required by the output of the slice operator, and the object to be allocated %3 represents sub-memory 2 required by the output of the slice operator. That is, %1, %2 and %3 are sub-objects.
In the above example, the NPU IR is traversed to obtain the objects to be allocated in the intermediate representation, where the objects to be allocated represent the memory required by the input or output of the slice operator of the neural network and the memory required by the outputs of the second calculation operators op2 to op7 of the neural network (the objects to be allocated of the second calculation operators op2 to op7 in FIG. 1 are not shown in the NPU IR). Through this step the objects to be allocated %1, %2 and %3, representing the memory required by the 3 outputs of the slice operator, can be obtained; the other objects to be allocated that need to be obtained are not described here.
Step S601, adding the obtained object to be allocated to an initial array to be allocated.
The initial array to be allocated is a representation of all the objects to be allocated.
The initial to-be-allocated array may be denoted as unassigned Buffers.
Step S602, a sub-object corresponding to the binding operator is obtained from the initial array to be allocated, and the sub-object is added into the binding object of the binding operator.
Here, a sub-object represents the object to be allocated of the memory required by each input or output of the memory operator corresponding to the binding operator, and the binding object represents the whole memory required by all the inputs or outputs of the memory operator. The NPU IR is traversed once more to find all bundle ops in the NPU IR, and the sub-objects corresponding to each binding operator are then looked up in the initial array to be allocated.
Specifically, taking the slice operator as an example, according to the output of the bundle op, the sub-objects (sub-buffers) %1, %2 and %3 corresponding to the binding operator of the slice operator are found in the unassigned Buffers, and these sub-objects are added to the binding object (bundle buffer).
Step S603, deleting the child object from the initial array to be allocated, and adding the binding object to the initial array to be allocated, so as to generate a target array to be allocated.
As shown in FIG. 7, the objects included in the target array to be allocated are Buffer1, Bundle Buffer1, …, Bundle Buffer2. The sub-objects included in Bundle Buffer1 are Buffer2 and Buffer3; the sub-objects included in Bundle Buffer2 are Buffer4 and Bundle Buffer3; and the sub-objects included in Bundle Buffer3 are Buffer5, Buffer6 and Buffer7. Buffer1, Bundle Buffer1, …, Bundle Buffer2 can be used as root nodes.
Specifically, all of the sub-buffers are deleted from the unassigned Buffers, and the bundle buffer is added to the unassigned Buffers, giving the final target array to be allocated.
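A minimal sketch of steps S602 and S603 using plain dictionaries follows; the field names are assumptions for illustration, not part of the embodiments. The sub-objects belonging to each bundle op are removed from the initial array and replaced by a single binding object.

def build_target_array(unassigned_buffers, bundle_ops):
    target = list(unassigned_buffers)
    for bundle in bundle_ops:
        # Step S602: collect the sub-objects bound by this bundle op.
        children = [b for b in target if b["name"] in bundle["child_names"]]
        # Step S603: delete the sub-objects and add the binding object instead.
        for child in children:
            target.remove(child)
        target.append({"name": bundle["name"],
                       "children": children,
                       "size": sum(c["size"] for c in children),
                       "start": min(c["start"] for c in children),
                       "end": max(c["end"] for c in children)})
    return target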
Step S604, performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network according to the target array to be allocated.
Specifically, the objects in the target array to be allocated only include the binding objects and the objects to be allocated of the second calculation operators. Therefore, through step S604, memory allocation can be carried out according to the outputs of the binding operator and the second calculation operator, with the output data of the memory operator and of the first calculation operator sharing one memory, which reduces memory occupation.
Specifically, memory allocation is performed on the memory operator, the first calculation operator and the second calculation operator of the neural network in order of the required memory sizes represented by the objects in the target array to be allocated. Optionally, the objects in the target array to be allocated are sorted from largest to smallest required memory and memory is allocated in that order, i.e. the operator that needs more memory is allocated memory first and the operator that needs less memory afterwards. This avoids the situation where, after a smaller allocation has been made first, the remaining contiguous memory space is no longer large enough for an operator that needs more memory, and so makes more effective use of the memory resources.
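The following is a minimal allocation sketch consistent with this ordering: a simple first-fit placement by descending size, with the strategy and the field names being assumptions for illustration. Larger objects are placed first, and the sub-memories of each binding object are then laid out contiguously inside the block it received.

def allocate(target_array):
    placed = []   # (offset, size, start, end) of the blocks already placed

    def find_offset(size, start, end):
        offset = 0
        for p_off, p_size, p_start, p_end in sorted(placed):
            lifetime_overlap = not (end < p_start or start > p_end)
            space_overlap = offset < p_off + p_size and p_off < offset + size
            if lifetime_overlap and space_overlap:
                offset = p_off + p_size          # slide past the conflicting block
        return offset

    # Allocate in order of decreasing required memory size.
    for obj in sorted(target_array, key=lambda o: o["size"], reverse=True):
        obj["offset"] = find_offset(obj["size"], obj["start"], obj["end"])
        placed.append((obj["offset"], obj["size"], obj["start"], obj["end"]))
        cursor = obj["offset"]
        for child in obj.get("children", []):    # sub-memories stay contiguous
            child["offset"] = cursor
            cursor += child["size"]
    return target_array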
As shown in FIG. 7, Buffer1 may be an object to be allocated corresponding to a conventional memory operator, or an object to be allocated corresponding to a calculation operator. Bundle Buffer1 is a binding object and also has subordinate nodes, namely its sub-objects Buffer2 and Buffer3. Bundle Buffer2 is a binding object and also has subordinate nodes, namely its sub-objects Buffer4 and Bundle Buffer3; that is, the subordinate nodes of a binding operator may themselves include binding objects, and Bundle Buffer3 in turn has subordinate nodes, namely its sub-objects Buffer5, Buffer6 and Buffer7. Memory allocation is performed according to the required memory sizes represented by the root nodes (such as Buffer1, Bundle Buffer1, …, Bundle Buffer2 shown in FIG. 7), and after a binding object has been allocated, its sub-memories are assigned to its child nodes in sequence, which further guarantees the contiguity of the binding operator's sub-memories.
In one possible implementation, after the memory allocation, address backfilling is performed, i.e. the subsequent operators are told from which addresses to obtain their data.
In one possible implementation, the method further includes: updating the head address offset of each sub-object in the binding object according to the memory size required by each input or output of the memory operator.
In one possible implementation, while the head address offset is updated into each sub-object included in the binding object, the lifetime of the sub-memory required by the input or output represented by each sub-object is also recorded, and it is the sub-objects, rather than the binding object, that are added to the allocated-buffer array.
For example, assume that one binding object includes three sub-objects and that the 3 sub-memories represented by these sub-objects are contiguous. As shown in FIG. 8, the vertical axis is the size and the horizontal axis is the lifetime (t): the first address offset of the sub-memory represented by the first sub-object is Offset0 and its size is 20; the first address offset of the sub-memory represented by the second sub-object is Offset20 and its size is 15; the first address offset of the sub-memory represented by the third sub-object is Offset35 and its size is 20; the length of each of the three sub-memories along the horizontal axis is its lifetime.
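The offset update can be sketched as follows, reproducing the numbers of FIG. 8; the helper name is an assumption for illustration.

def update_child_offsets(base_offset, child_sizes):
    # Sub-memories are laid out contiguously: each head address offset is the
    # base offset plus the sizes of all preceding sub-memories.
    offsets, cursor = [], base_offset
    for size in child_sizes:
        offsets.append(cursor)
        cursor += size
    return offsets

print(update_child_offsets(0, [20, 15, 20]))   # [0, 20, 35], as in FIG. 8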
In one possible implementation, the memory is released after the memory corresponding to any input or output of the memory operator has finished being used. Specifically, according to the output of the binding operator, the memory operator and its first calculation operator are allocated one whole memory, which comprises a plurality of sub-memories corresponding to the memory operator's multiple inputs or multiple outputs; after the sub-memory corresponding to any input or output of the memory operator within the whole memory has finished being used, that sub-memory is released.
In one possible implementation, the sub-memory represented by any sub-object in the binding object (the whole memory required by all the inputs or outputs of the memory operator) can be released independently once its use is complete; there is no need to wait for all the sub-memories of the memory operator to be released together, which improves the reuse rate of the memory.
For example, as shown in FIG. 9, assume that the memory allocated to one memory operator includes three sub-memories, the vertical axis is the size and the horizontal axis is the lifetime (t): the first address offset of the first sub-memory is Offset0 and its size is 20; the first address offset of the second sub-memory is Offset20 and its size is 15; the first address offset of the third sub-memory is Offset35 and its size is 20; the length of each of the three sub-memories along the horizontal axis is its lifetime. Once the first sub-memory has finished being used it can be released; the released memory is then in an idle state and can be reallocated to other operators whose lifetimes do not overlap that sub-memory's lifetime.
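A minimal sketch of this independent release follows; the structures, the step counter and the end-of-lifetime values are assumptions for illustration. Each sub-memory records its own last use, so it is returned to the free list as soon as that step has executed, without waiting for the other sub-memories of the same memory operator.

def release_finished(live_sub_memories, current_step, free_list):
    for sub in list(live_sub_memories):
        if sub["end"] <= current_step:        # its last reader has already run
            free_list.append((sub["offset"], sub["size"]))
            live_sub_memories.remove(sub)     # reusable by non-overlapping lifetimes
    return free_list

# Example with the three sub-memories of FIG. 9 (assumed last-use steps 2, 4 and 6).
subs = [{"offset": 0,  "size": 20, "end": 2},
        {"offset": 20, "size": 15, "end": 4},
        {"offset": 35, "size": 20, "end": 6}]
print(release_finished(subs, 2, []))   # [(0, 20)]: only the first sub-memory is freed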
In the embodiment of the invention, the memory allocation method based on the binding mechanism binds together sub-memories that are related in memory but have different lifetimes, which solves the memory reuse problem at the allocation stage. For a slice operator, for example, the lifetime of the whole memory is long (namely the longest lifetime among its sub-memories), but the lifetime of each sub-memory differs, and once a sub-memory has been released it can be fully reused by operators whose lifetimes do not overlap it. The binding mechanism also links the memories together in advance, as for a concat operator, which avoids copying each sub-memory into a new large memory; this saves memory and also improves the overall network performance to a certain extent. The saved memory can be used by other operators to load more data to meet the computing-power requirement, and at the same time the problem of data overflowing to the next level of storage because of insufficient on-chip memory can be reduced.
FIG. 10 is a schematic diagram of an apparatus for memory allocation according to an embodiment of the present invention. As shown in FIG. 10, the apparatus of the present embodiment includes the acquisition unit 1001, the query unit 1002 and the allocation unit 1003.
The acquisition unit 1001 is configured to acquire an intermediate representation of a neural network, where the intermediate representation includes a memory operator of the neural network, a first calculation operator and a second calculation operator, and the first calculation operator is a predecessor operator of the memory operator; the query unit 1002 is configured to find a binding operator in the intermediate representation, where the binding operator is used to bind the input or output of the memory operator; the allocation unit 1003 is configured to allocate memory to the memory operator, the first calculation operator and the second calculation operator of the neural network according to the outputs of the binding operator and the second calculation operator.
Optionally, the apparatus further comprises: and the release unit is used for releasing the memory after the memory corresponding to any input or output of the memory operator is used.
Optionally, the apparatus further comprises:
a conversion unit for converting the neural network into a graph intermediate representation;
and a generating unit for lowering the graph intermediate representation to generate a network processing unit intermediate representation as the intermediate representation of the neural network.
Optionally, the conversion unit is specifically configured to: the neural network is converted to a graph intermediate representation by a deep learning compiler.
Optionally, the generating unit is specifically configured to:
acquiring a memory operator of the neural network in the graph intermediate representation;
and in response to the memory operator conforming to the binding rule, inserting a binding operator corresponding to the memory operator, and generating a network processing unit intermediate representation.
Optionally, the memory operator includes a slice operator and a concat operator.
Optionally, the distribution unit is specifically configured to:
traversing the intermediate representation of the neural network, and acquiring an object to be allocated in the intermediate representation, wherein the object to be allocated represents a memory required by the input or output of a memory operator of the neural network and a memory required by the output of a second calculation operator of the neural network;
adding the obtained object to be allocated to an initial array to be allocated;
acquiring a sub-object corresponding to the binding operator from the initial array to be allocated, and adding the sub-object into the binding object of the binding operator; wherein, the sub-object represents an object to be allocated of the memory required by each input or output of the memory operator corresponding to the binding operator;
deleting the child object from the initial array to be allocated, adding the binding object to the initial array to be allocated, and generating a target array to be allocated;
And performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network according to the target array to be allocated.
Optionally, the distribution unit is specifically configured to:
performing memory allocation on the memory operator, the first calculation operator and the second calculation operator of the neural network in order of the required memory sizes represented by the objects in the target array to be allocated.
Optionally, the apparatus further comprises an updating unit, configured to update the head address offset of each sub-object in the binding object according to the memory size required by each input or output of the memory operator.
In an embodiment of the present invention, there is also provided computer program instructions which, when executed by a processor, implement the method of any of the above embodiments.
In an embodiment of the present invention, there is also provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any of the above embodiments.
An embodiment of the present invention provides a chip including a memory for storing one or more computer program instructions, and a processing core, where the one or more computer program instructions are executed by the processing core to implement the method of any of the above embodiments.
An embodiment of the invention provides a board card, where the board card comprises the above chip.
An embodiment of the invention provides a server, where the server comprises the above board card.
As will be appreciated by one skilled in the art, aspects of embodiments of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of embodiments of the invention may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," module "or" system. Furthermore, aspects of embodiments of the invention may take the form of: a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.
Any combination of one or more computer readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of embodiments of the present invention, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of embodiments of the present invention may be written in any combination of one or more programming languages, including: object oriented programming languages such as Java, smalltalk, C ++, etc.; and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package; executing partly on the user computer and partly on the remote computer; or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Aspects of embodiments of the invention are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A method of memory allocation, the method comprising:
obtaining an intermediate representation of a neural network, wherein the intermediate representation comprises a memory operator, a first computation operator and a second computation operator of the neural network, and the first computation operator is a predecessor operator of the memory operator;
searching for a binding operator in the intermediate representation, wherein the binding operator is used for binding the input or output of the memory operator;
and performing memory allocation on the memory operator, the first computation operator and the second computation operator of the neural network according to the output of the binding operator and the second computation operator.
2. The method of claim 1, wherein the method further comprises:
releasing the memory corresponding to any input or output of the memory operator after the memory has been used.
3. The method of claim 1, wherein the obtaining an intermediate representation of a neural network comprises:
converting the neural network into a graph intermediate representation;
and lowering the graph intermediate representation to generate a network processing unit intermediate representation as the intermediate representation of the neural network.
4. The method of claim 3, wherein the converting the neural network into a graph intermediate representation specifically comprises:
converting the neural network into the graph intermediate representation by a deep learning compiler.
5. The method of claim 3, wherein the lowering the graph intermediate representation to generate a network processing unit intermediate representation specifically comprises:
acquiring a memory operator of the neural network in the graph intermediate representation;
and in response to the memory operator conforming to a binding rule, inserting a binding operator corresponding to the memory operator, and generating the network processing unit intermediate representation.
6. The method of any of claims 1-5, wherein the memory operators comprise a slice operator and a concat operator.
7. The method of any one of claims 1-6, wherein the performing memory allocation on the memory operator, the first computation operator and the second computation operator of the neural network according to the output of the binding operator and the second computation operator specifically comprises:
traversing the intermediate representation of the neural network, and acquiring an object to be allocated in the intermediate representation, wherein the object to be allocated represents the memory required by an input or output of the memory operator of the neural network and the memory required by the output of the second computation operator of the neural network;
adding the obtained object to be allocated to an initial array to be allocated;
acquiring a sub-object corresponding to the binding operator from the initial array to be allocated, and adding the sub-object into the binding object of the binding operator, wherein the sub-object represents the object to be allocated for the memory required by each input or output of the memory operator corresponding to the binding operator;
deleting the sub-object from the initial array to be allocated, adding the binding object to the initial array to be allocated, and generating a target array to be allocated;
and performing memory allocation on the memory operator, the first computation operator and the second computation operator of the neural network according to the target array to be allocated.
8. The method of claim 7, wherein the performing memory allocation on the memory operator, the first computation operator and the second computation operator of the neural network according to the target array to be allocated specifically comprises:
performing memory allocation on the memory operator, the first computation operator and the second computation operator of the neural network in the order of the required memory sizes represented by the objects in the target array to be allocated.
9. The method of claim 7, wherein the method further comprises:
updating the head address offset of each sub-object in the binding object according to the memory size required by each input or output of the memory operator.
10. An apparatus for memory allocation, the apparatus comprising:
an acquisition unit, configured to acquire an intermediate representation of a neural network, wherein the intermediate representation comprises a memory operator, a first computation operator and a second computation operator of the neural network, and the first computation operator is a predecessor operator of the memory operator;
a query unit, configured to search for a binding operator in the intermediate representation, wherein the binding operator is used for binding the input or output of the memory operator;
and an allocation unit, configured to perform memory allocation on the memory operator, the first computation operator and the second computation operator of the neural network according to the output of the binding operator and the second computation operator.
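To illustrate the lowering recited in claims 3 to 6, the following sketch shows one plausible compiler pass that inserts a binding operator for memory operators such as slice and concat. It is a minimal sketch under assumed names: Op, MEMORY_OPS and lower_to_npu_ir are hypothetical, and the binding rule used here (bind the inputs of concat, bind the outputs of slice) is an assumption for illustration, not the rule defined in the application.

```python
# Minimal sketch of inserting binding operators during lowering (claims 3-6).
# Op, MEMORY_OPS and lower_to_npu_ir are hypothetical names; the binding rule
# shown here is only an assumption.
from dataclasses import dataclass
from typing import List

@dataclass
class Op:
    name: str           # operator type, e.g. "slice", "concat", "conv2d"
    inputs: List[str]   # names of tensors the operator reads
    outputs: List[str]  # names of tensors the operator writes

MEMORY_OPS = {"slice", "concat"}  # memory operators as in claim 6

def lower_to_npu_ir(graph_ir: List[Op]) -> List[Op]:
    """Lower a graph IR to a network-processing-unit IR, inserting a binding
    operator after each memory operator that conforms to the binding rule."""
    npu_ir: List[Op] = []
    for op in graph_ir:
        npu_ir.append(op)
        if op.name in MEMORY_OPS:
            # A concat's inputs (or a slice's outputs) can live side by side
            # inside one parent buffer, so bind them together.
            bound = op.inputs if op.name == "concat" else op.outputs
            npu_ir.append(Op(name="bind", inputs=list(bound), outputs=[]))
    return npu_ir
```

Under this assumption the binding operator carries nothing but the list of tensors whose memory must be laid out together; a later allocation pass consumes that list.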
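Claims 7 to 9 recite building a target array to be allocated, folding the sub-objects referenced by each binding operator into a binding object, allocating in order of required memory size, and updating the head address offset of each sub-object. The sketch below shows one way such a flow could look; AllocObject, BindingObject, build_target_array and allocate are assumed names, the descending-size order is only one possible reading of claim 8, and the release of memory after use (claim 2) is not modeled.

```python
# Minimal sketch of the allocation flow in claims 7-9, under assumed names.
# No lifetime-based reuse or release of memory (claim 2) is modeled here.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class AllocObject:
    tensor: str        # name of the input/output whose memory is needed
    size: int          # required memory size in bytes
    offset: int = -1   # head address offset assigned during allocation

@dataclass
class BindingObject:
    subs: List[AllocObject] = field(default_factory=list)

    @property
    def size(self) -> int:
        # One contiguous region covering all sub-objects laid out back to back.
        return sum(s.size for s in self.subs)

def build_target_array(objects: List[AllocObject],
                       bindings: List[List[str]]) -> List[Union[AllocObject, BindingObject]]:
    """Replace the sub-objects named by each binding operator with a single
    binding object, yielding the target array to be allocated (claim 7)."""
    array: List[Union[AllocObject, BindingObject]] = list(objects)
    for bound_names in bindings:
        binding = BindingObject()
        for name in bound_names:
            sub = next(o for o in array
                       if isinstance(o, AllocObject) and o.tensor == name)
            binding.subs.append(sub)   # add the sub-object to the binding object
            array.remove(sub)          # and delete it from the initial array
        array.append(binding)
    return array

def allocate(target_array: List[Union[AllocObject, BindingObject]]) -> int:
    """Assign offsets in descending order of required size (one reading of
    claim 8) and update each sub-object's head address offset (claim 9)."""
    cursor = 0
    for obj in sorted(target_array, key=lambda o: o.size, reverse=True):
        if isinstance(obj, BindingObject):
            inner = cursor
            for sub in obj.subs:
                sub.offset = inner     # sub-objects are packed consecutively
                inner += sub.size
        else:
            obj.offset = cursor
        cursor += obj.size
    return cursor  # total memory footprint of this simple scheme
```

For example, under these assumptions the pieces produced by a slice operator end up as adjacent sub-objects of one binding object, so the predecessor computation operator can write its output directly into that shared region without an extra copy.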
CN202210552003.0A 2022-05-20 2022-05-20 Memory allocation method and device Pending CN117149398A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210552003.0A CN117149398A (en) 2022-05-20 2022-05-20 Memory allocation method and device
PCT/CN2023/081888 WO2023221626A1 (en) 2022-05-20 2023-03-16 Memory allocation method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210552003.0A CN117149398A (en) 2022-05-20 2022-05-20 Memory allocation method and device

Publications (1)

Publication Number Publication Date
CN117149398A true CN117149398A (en) 2023-12-01

Family

ID=88834531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210552003.0A Pending CN117149398A (en) 2022-05-20 2022-05-20 Memory allocation method and device

Country Status (2)

Country Link
CN (1) CN117149398A (en)
WO (1) WO2023221626A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118012631B (en) * 2024-04-07 2024-07-05 北京壁仞科技开发有限公司 Operator execution method, processing device, storage medium and program product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210182036A1 (en) * 2019-12-12 2021-06-17 Huawei Technologies Co., Ltd. Hardware platform specific operator fusion in machine learning
CN111984400B (en) * 2020-07-17 2024-04-02 深圳云天励飞技术有限公司 Memory allocation method and device for neural network
CN114327844A (en) * 2020-09-29 2022-04-12 华为技术有限公司 Memory allocation method, related device and computer readable storage medium
CN112465108B (en) * 2020-11-11 2022-07-22 上海交通大学 Neural network compiling method for storage and calculation integrated platform

Also Published As

Publication number Publication date
WO2023221626A1 (en) 2023-11-23

Similar Documents

Publication Publication Date Title
JP5934094B2 (en) Mapping across multiple processors of processing logic with data parallel threads
US9996394B2 (en) Scheduling accelerator tasks on accelerators using graphs
EP2656208B1 (en) Agile communication operator
CN113568599B (en) Method, electronic device and computer program product for processing a computing job
CN111831287A (en) Method, apparatus and program product for determining resources required to execute a code segment
WO2011080041A1 (en) Work item distribution in multi-core data processing system
JPH08185325A (en) Code generation method in compiler and compiler
WO2010116431A1 (en) Cache control device and cache control method
JP2012003476A (en) Information processing system, management device, processing request device and program
JP6432333B2 (en) Information processing apparatus, data processing method, and data processing program
WO2023221626A1 (en) Memory allocation method and apparatus
US10599638B2 (en) System and method for identifying maximal independent sets in parallel
US8510529B2 (en) Method for generating program and method for operating system
KR20150117522A (en) Graphics state manage apparatus and method
US10048953B2 (en) Compiler program, compiling method, and compiling device
JPH0816871B2 (en) Program translation device and program translation method
WO2024109312A1 (en) Task scheduling execution method, and generation method and apparatus for task scheduling execution instruction
CN111061485A (en) Task processing method, compiler, scheduling server, and medium
US11144207B2 (en) Accelerating memory compression of a physically scattered buffer
JP2016192152A (en) Juxtaposed compilation method, juxtaposed compiler, and on-vehicle device
Murai et al. Preliminary performance evaluation of Coarray-based implementation of fiber Miniapp suite using XcalableMP PGAS language
Meng et al. FPGA acceleration of deep reinforcement learning using on-chip replay management
Su et al. Efficient DOACROSS execution on distributed shared-memory multiprocessors
CN115374232A (en) Tensor allocation method, medium, electronic device, and program product
US10062135B2 (en) Graphics processing unit management system for computed tomography

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Country or region after: China
Address after: Room 201, No. 6 Fengtong Heng Street, Huangpu District, Guangzhou City, Guangdong Province
Applicant after: Guangzhou Ximu Semiconductor Technology Co.,Ltd.
Address before: Building 202-24, No. 6, Courtyard 1, Gaolizhang Road, Haidian District, Beijing
Applicant before: Beijing SIMM Computing Technology Co.,Ltd.
Country or region before: China