WO2022068663A1 - Memory allocation method, related device, and computer-readable storage medium - Google Patents

Memory allocation method, related device, and computer-readable storage medium

Info

Publication number
WO2022068663A1
WO2022068663A1 (PCT/CN2021/119829, CN2021119829W)
Authority
WO
WIPO (PCT)
Prior art keywords
tensor data
memory space
node
tensor
allocated
Prior art date
Application number
PCT/CN2021/119829
Other languages
English (en)
French (fr)
Inventor
张臻
兰布鲁·艾奥尼斯
德·胡安·哈维尔
刘畅
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP21874324.3A priority Critical patent/EP4209902A4/en
Publication of WO2022068663A1 publication Critical patent/WO2022068663A1/zh
Priority to US18/127,300 priority patent/US20230236888A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5033 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering data affinity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/509 Offload

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a memory allocation method, a related device, and a computer-readable storage medium.
  • One existing approach is to run the entire neural network and allocate memory for it in the order in which the entire neural network runs.
  • the neural network needs to occupy 100M memory space, 10M memory space and 50M memory space in turn during the running process.
  • a 100M memory space can be allocated for the neural network.
  • When the neural network then applies for 10M of memory space, it is judged whether the previously allocated 100M memory space can be reused. If it can be reused, no new memory space is allocated for the requested 10M; instead, the above-mentioned 100M memory space is reused.
  • When the neural network applies for 50M of memory space, it first determines whether the requested 50M can reuse the allocated 100M memory space. If it can be reused, no new memory space is allocated for the requested 50M.
  • the present application provides a memory allocation method, a related device and a computer-readable storage medium, which can avoid unreasonable memory allocation.
  • the unreasonable memory allocation can be reflected in the fact that the entire neural network occupies a large amount of memory.
  • In a first aspect, a memory allocation method may include the following steps: first, obtain a computation graph corresponding to a neural network, where the computation graph includes N nodes and directed edges connecting different nodes, a node is used to indicate a computing logic in the neural network, and a directed edge is used to indicate the flow of tensor data through the computing logic; the directed edges of the computation graph carry tensor data, and the computation graph includes M tensor data, where M is an integer greater than or equal to 1; secondly, based on the sorting result of the M tensor data, allocate memory space to the M tensor data in turn, where, if one tensor data of the M tensor data can reuse at least a part of the allocated memory space, at least a part of the memory space reusable by that tensor data is allocated to it; the allocated memory space is the memory space that has been allocated to the M tensor data before that tensor data, and the sorting result indicates the order in which memory space is allocated for the M tensor data.
  • The constraint relationship indicates the relationship between the available memory space of one tensor data in the M tensor data and the available memory space of each of the other tensor data in the M tensor data.
  • Here, the allocated memory space treats the M tensor data as a whole and describes the memory space already allocated to that whole; some tensor data in this whole may not yet have been allocated memory space. Specifically, it refers to the memory space of the one or more tensor data among the M tensor data that have already been allocated memory space before memory space is allocated to the tensor data described in the above method.
  • For example, if memory space is to be allocated to the m-th tensor data in the sorting result, then the allocated memory space is the memory space already allocated to the first m-1 tensor data, where m is greater than 1 and less than M.
  • the node to which the tensor data flows is the consumer node.
  • the node from which the tensor data flows out is the production node.
  • one tensor data can be carried on different directed edges, and one tensor data can also be carried on one directed edge.
  • the memory allocation device sequentially allocates a memory space of a corresponding size to each tensor data based on the sorting result of the M tensor data.
  • Allocating and reusing memory space in this way can avoid unreasonable memory allocation, thereby saving the memory that the entire neural network needs to occupy and optimizing the memory allocation of the neural network, as illustrated by the sketch below.
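  • As an illustration of this idea only (not the patented implementation), the following Python sketch allocates offsets to tensor data in a given sorting order with a simple first-fit search, reusing already-allocated space whenever a hypothetical `can_reuse` predicate permits it:

```python
# A minimal sketch of allocating memory offsets to tensors in a given sorting
# order, reusing already-allocated space when the constraint relationship
# allows it. All names (size, can_reuse, ...) are illustrative assumptions.

def allocate(sorted_tensors, size, can_reuse):
    """sorted_tensors: tensor ids in allocation order.
    size[t]: bytes needed by tensor t.
    can_reuse(a, b): True if tensor a may share memory with tensor b."""
    placements = {}          # tensor id -> (offset, size)
    for t in sorted_tensors:
        # Address ranges of already-placed tensors that t must not overlap.
        blocked = sorted(
            (off, off + sz) for other, (off, sz) in placements.items()
            if not can_reuse(t, other)
        )
        # First fit: find the lowest offset where size[t] bytes fit.
        offset = 0
        for lo, hi in blocked:
            if offset + size[t] <= lo:
                break            # the gap before this blocked range is large enough
            offset = max(offset, hi)
        placements[t] = (offset, size[t])
    return placements
```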
  • The method may further include the following step: if the tensor data cannot reuse the allocated memory space, allocate other memory space for the tensor data, where the other memory space is different from the allocated memory space.
  • the constraint relationship indicates at least one of the following relationships: the relationship between the available memory space of one tensor data and the available memory space of another tensor data is reusable, the available memory of one tensor data The relationship between the space and the available memory space of another tensor data is not reusable, and the relationship between the available memory space of one tensor data and the available memory space of another tensor data is not reusable and continuous.
  • the above constraints have different priorities.
  • The constraint relationship is carried in a constraint relationship table, and the constraint relationship table includes identifiers of the M tensor data.
  • The first value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is reusable.
  • the second value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is not reusable.
  • The third value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is non-reusable and contiguous.
  • The first value, the second value and the third value may be numerical values that can be distinguished from each other; for example, the first value may be "0", the second value may be "1", and the third value may be "2".
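  • For illustration, a constraint relationship table with the three values above could be encoded as a simple matrix; the helper below is a hypothetical sketch, with `not_reusable_pairs` and `contiguous_pairs` as assumed inputs:

```python
# Possible encoding (not taken verbatim from the patent text) of the constraint
# relationship table: table[i][j] holds 0 (reusable), 1 (not reusable) or
# 2 (not reusable and contiguous) for tensor data i with respect to j.

REUSABLE, NOT_REUSABLE, NOT_REUSABLE_CONTIGUOUS = 0, 1, 2

def build_constraint_table(num_tensors, not_reusable_pairs, contiguous_pairs):
    """not_reusable_pairs / contiguous_pairs: iterables of (i, j) tensor-id pairs."""
    table = [[REUSABLE] * num_tensors for _ in range(num_tensors)]
    for i, j in not_reusable_pairs:
        table[i][j] = NOT_REUSABLE
    for i, j in contiguous_pairs:
        # "Non-reusable and contiguous" has higher priority than plain
        # "non-reusable", so it overwrites that mark for the same pair.
        table[i][j] = NOT_REUSABLE_CONTIGUOUS
    return table
```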
  • In the case that all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data, or in the case that all the consuming nodes of the second tensor data are downstream nodes of the production node of the first tensor data, the first tensor data can reuse the memory space allocated for the second tensor data. If not all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data, or if not all the consuming nodes of the second tensor data are downstream nodes of the production node of the first tensor data, the first tensor data cannot reuse the memory space allocated for the second tensor data. The first tensor data and the second tensor data are any two of the M tensor data.
  • The consumption node is the node to which the tensor data flows, and the production node is the node from which the tensor data flows out.
  • Node A being an upstream node of node B means that, in the computation graph, node A can reach node B along one or more directed edges; node A and node B may be connected by a single directed edge or by multiple directed edges and the nodes those edges pass through.
  • Through this implementation, the constraint relationship of each tensor data can be determined, which provides a basis for subsequently obtaining the sorting result of the M tensor data.
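  • A minimal sketch of this reuse test, assuming we already know each tensor's production node, its consuming nodes, and an `is_upstream(a, b)` reachability predicate (all names illustrative):

```python
# Sketch of the reuse test described above; producer[t] is the producing node
# of tensor t, consumers[t] the nodes t flows to, and is_upstream(a, b) tells
# whether node a can reach node b along directed edges.

def can_reuse(first, second, producer, consumers, is_upstream):
    """Return True if `first` may reuse the memory allocated for `second`."""
    all_consumers_upstream = all(
        is_upstream(c, producer[second]) for c in consumers[first]
    )
    all_consumers_downstream = all(
        is_upstream(producer[first], c) for c in consumers[second]
    )
    # Either condition from the description above is sufficient; otherwise the
    # two tensor data cannot reuse the same memory space.
    return all_consumers_upstream or all_consumers_downstream
```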
  • the computing graph includes multiple computing subtasks
  • the computing subtasks indicate a computing function through a set of nodes and edges related to the set of nodes
  • The execution relationship between the multiple computing subtasks is parallel execution. The method may further include the following steps: within a computing subtask, if there is no directed edge between two adjacent nodes, add a directed edge between the two adjacent nodes to update the computation graph, where each added directed edge carries corresponding tensor data, and the two adjacent nodes are nodes that are adjacent in the execution order of the computing subtask (here, the execution order refers to an order with a temporal relationship); then, based on the updated computation graph, obtain the information of each tensor data.
  • When the execution relationships between the computing subtasks are all parallel, then within a computing subtask in the computation graph, if there is no directed edge between two adjacent nodes, a directed edge is added between those two nodes to update the computation graph; this provides a basis for subsequently analyzing the ancestral relationship (e.g., upstream nodes) of each node based on the computation graph and determining the constraint relationship corresponding to each node.
  • The execution relationship between multiple computing subtasks being parallel means that the time periods required to execute the multiple computing subtasks overlap under the same time reference; it does not mean that the computing subtasks must begin and/or end at the same time. In practical applications, computing subtasks with a parallel execution relationship can be executed in parallel by different processor cores.
  • In a possible implementation, the computation graph further includes a first computation subtask and a second computation subtask whose execution relationship is serial, and the execution order of the first computation subtask precedes that of the second computation subtask; the process of updating the computation graph may further include the following step: if there is no directed edge between the last node of the first computation subtask and the first node of the second computation subtask, add a directed edge between the last node of the first computation subtask and the first node of the second computation subtask.
  • Through this implementation, the computation graph can be updated, which provides a basis for subsequently analyzing the ancestral relationship of each node and determining the constraint relationship corresponding to each node based on the computation graph; a combined sketch of both graph updates is given below.
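  • The two graph-update rules above (within a parallel computing subtask, and between serial computing subtasks) could be sketched as follows, assuming the graph is stored as a set of directed edges and each subtask lists its nodes in execution order (illustrative names only):

```python
# Sketch of the computation-graph update described above.

def update_graph(edges, subtasks, serial_pairs):
    """edges: set of (src, dst); subtasks: {task_id: [nodes in execution order]};
    serial_pairs: iterable of (first_task, second_task) with serial execution."""
    # Within a subtask, connect adjacent nodes that are not yet connected.
    for nodes in subtasks.values():
        for a, b in zip(nodes, nodes[1:]):
            if (a, b) not in edges:
                edges.add((a, b))   # the added edge carries corresponding tensor data
    # Between serial subtasks, connect the last node of the first subtask to
    # the first node of the second subtask if no such edge exists.
    for t1, t2 in serial_pairs:
        a, b = subtasks[t1][-1], subtasks[t2][0]
        if (a, b) not in edges:
            edges.add((a, b))
    return edges
```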
  • The execution relationship between computation subtask 1 and computation subtask 2 being serial means that computation subtask 2 is executed only after the processor finishes executing computation subtask 1.
  • The execution relationship between computation subtask 1 and computation subtask 2 being serial, while the execution relationship between computation subtask 2 and computation subtask 3 is parallel, means that computation subtask 1 and computation subtask 2 can be run by processor core 1 while computation subtask 3 is run by processor core 2.
  • the time periods required for the processor core 1 and the processor core 2 to execute the above-mentioned calculation subtasks overlap when the time base is the same.
  • The identifier of the production node of the tensor data is smaller than the identifier of the consumption node of the tensor data; the production node and the consumption node of the tensor data are two adjacent nodes.
  • the corresponding identifier of each node can be determined, which provides a basis for subsequent analysis of the ancestor relationship of each node and determination of the corresponding constraint relationship of each node based on the corresponding identifier of each node.
  • the identifier of each node in the computation graph is used to determine the information of each tensor data in the M tensor data.
  • Through this implementation, the ancestral relationship of each node can be analyzed according to the identifier of each node (the ancestral relationship can reflect which nodes are production nodes and which nodes are consumption nodes), and the constraint relationship of each tensor data can then be obtained by combining the ancestral relationships.
  • the information of each tensor data indicates a constraint relationship corresponding to each tensor data
  • The method may further include the following steps: obtaining the constraint amount corresponding to each of the M tensor data according to the constraint relationship corresponding to each tensor data, where the constraint amount of a tensor data is the number of other tensor data that cannot reuse the same memory space as that tensor data; then sorting the M tensor data according to their respective constraint amounts to obtain the sorting result of the M tensor data.
  • In a possible implementation, the information of each tensor data indicates the number of nodes to which each tensor data flows; the M tensor data are sorted according to the number of consuming nodes corresponding to each tensor data to obtain the sorting result of the M tensor data.
  • the M tensor data can also be sorted in descending order according to at least two kinds of information of each tensor data, so as to obtain the sorting result of the M tensor data .
  • For example, the M tensor data are sorted in descending order according to their respective constraint amounts and the memory space size corresponding to each tensor data, so as to obtain the sorting result of the M tensor data (see the sketch below).
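  • A possible sketch of such a descending sort, using the constraint amount as the primary key and the required memory size as a tie-breaker (one of several combinations the text allows; names are illustrative):

```python
# Sort the M tensor data in descending order, first by constraint amount and
# then by required memory size.

def sort_tensors(tensor_ids, constraint_amount, size):
    return sorted(
        tensor_ids,
        key=lambda t: (constraint_amount[t], size[t]),
        reverse=True,
    )
```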
  • The method may further include the following step: using a heuristic algorithm to sort the M tensor data based on the information of each tensor data, so as to obtain the sorting result of the M tensor data within a preset time period.
  • The sorting result is an optimized sorting result, wherein the maximum memory size that the neural network needs to occupy according to the optimized sorting result is smaller than the maximum memory size that the neural network needs to occupy according to the sorting result before optimization.
  • In a second aspect, an embodiment of the present application also provides a memory allocation method, which may include the following steps: first, obtain a computation graph corresponding to the neural network, where the computation graph includes N nodes and directed edges connecting different nodes, a node is used to indicate a computing logic in the neural network, and a directed edge is used to indicate the flow of tensor data through the computing logic; the directed edges of the computation graph carry tensor data, and the computation graph includes M tensor data, where M is an integer greater than 1; secondly, based on the constraint relationship corresponding to each tensor data, allocate memory space to the M tensor data in turn according to the execution order of the M tensor data in the neural network, where, if one tensor data of the M tensor data can reuse at least a part of the allocated memory space, at least a part of the memory space reusable by that tensor data is allocated to it, and the allocated memory space is the memory space that has been allocated to the M tensor data before that tensor data.
  • Through this implementation, the terminal device can sequentially allocate memory space to the M tensor data according to their execution order based on the constraint relationship corresponding to each tensor data, thereby avoiding, in parallel scenarios, incorrect operator operation results caused by operators reusing the same memory space, which ensures the accuracy of the calculation result of the neural network.
  • The method may further include the following step: if the tensor data cannot reuse the allocated memory space, allocate other memory space for the tensor data, where the other memory space is different from the allocated memory space.
  • the method proposed in this application can solve the problem of unreasonable memory allocation.
  • Solving the unreasonable memory allocation can be reflected in: avoiding allocating too much memory for the neural network, and avoiding, in parallel scenarios, incorrect operator operation results caused by operators in different execution flows reusing the same memory space, thereby ensuring the accuracy of the calculation result of the neural network.
  • In a third aspect, an embodiment of the present application provides a memory allocation device. The device may include: a computation graph obtaining unit, configured to obtain a computation graph corresponding to a neural network, where the computation graph includes N nodes and directed edges connecting different nodes, the directed edges of the computation graph carry tensor data, and the computation graph includes M tensor data, where M is an integer greater than 1; and an allocation unit, configured to allocate memory space to the M tensor data in turn based on the sorting result of the M tensor data, where, if one tensor data of the M tensor data can reuse at least a part of the allocated memory space, at least a part of the memory space reusable by that tensor data is allocated to it.
  • The allocated memory space is the memory space that has been allocated to the M tensor data before that tensor data.
  • The sorting result indicates the order in which memory space is allocated for the M tensor data.
  • The sorting result is related to the information of each tensor data in the M tensor data, and the information of each tensor data indicates at least one of the following: the constraint relationship corresponding to each tensor data and the number of nodes to which each tensor data flows.
  • The constraint relationship indicates the relationship between the available memory space of one tensor data in the M tensor data and the available memory space of each of the other tensor data in the M tensor data.
  • The allocation unit is further configured to: if the tensor data cannot reuse the allocated memory space, allocate other memory space for the tensor data, where the other memory space is different from the allocated memory space.
  • the constraint relationship indicates at least one of the following relationships: the relationship between the available memory space of one tensor data and the available memory space of another tensor data is reusable, the available memory of one tensor data The relationship between the space and the available memory space of another tensor data is not reusable, and the relationship between the available memory space of one tensor data and the available memory space of another tensor data is not reusable and continuous.
  • The constraint relationship is carried in a constraint relationship table, and the constraint relationship table includes identifiers of the M tensor data.
  • The first value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is reusable.
  • the second value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is not reusable.
  • The third value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is non-reusable and contiguous.
  • In the case that all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data, the first tensor data may reuse the memory space allocated for the second tensor data; when not all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data, or when not all the consuming nodes of the second tensor data are downstream nodes of the production node of the first tensor data, the first tensor data cannot reuse the memory space allocated for the second tensor data. The first tensor data and the second tensor data are any two of the M tensor data; the consuming node is the node to which the tensor data flows, and the production node is the node from which the tensor data flows out.
  • the computing graph includes multiple computing subtasks, the computing subtasks indicate a computing function through a set of nodes and edges related to the set of nodes, and the execution relationship between the multiple computing subtasks is: parallel execution;
  • The device further includes: a computation graph updating unit, configured to, within a computing subtask, add a directed edge between two adjacent nodes if there is no directed edge between them, so as to update the computation graph, where each added directed edge carries corresponding tensor data and the two adjacent nodes are nodes that are adjacent in the execution order of the computing subtask; and an information obtaining unit, configured to obtain the information of each tensor data based on the updated computation graph.
  • In a possible implementation, the computation graph further includes a first computation subtask and a second computation subtask whose execution relationship is serial, and the execution order of the first computation subtask precedes that of the second computation subtask; the computation graph updating unit is further configured to: if there is no directed edge between the last node of the first computation subtask and the first node of the second computation subtask, add a directed edge between the last node of the first computation subtask and the first node of the second computation subtask.
  • The identifier of the production node of the tensor data is smaller than the identifier of the consumption node of the tensor data; the production node and the consumption node of the tensor data are two adjacent nodes.
  • the identifier of each node in the computation graph is used to determine the information of each tensor data in the M tensor data.
  • the information of each tensor data indicates a constraint relationship corresponding to each tensor data
  • The apparatus further includes: a first sorting unit, configured to obtain the constraint amount corresponding to each of the M tensor data according to the constraint relationship corresponding to each tensor data, where the constraint amount of a tensor data is the number of other tensor data that cannot reuse the same memory space as that tensor data, and to sort the M tensor data according to their respective constraint amounts, so as to obtain the sorting result of the M tensor data.
  • In a possible implementation, the information of each tensor data indicates the number of nodes to which each tensor data flows, and the apparatus further includes: a second sorting unit, configured to sort the M tensor data according to the number of consuming nodes corresponding to each tensor data, so as to obtain the sorting result of the M tensor data.
  • the apparatus further includes:
  • a third sorting unit configured to use a heuristic algorithm to sort the M tensor data based on the information of each tensor data, so as to obtain the sorting result of the M tensor data within a preset time period .
  • In a possible implementation, the sorting result is an optimized sorting result, wherein the maximum memory size that the neural network needs to occupy according to the optimized sorting result is smaller than the maximum memory size that the neural network needs to occupy according to the sorting result before optimization.
  • In a fourth aspect, an embodiment of the present application further provides a memory allocation device. The device may include: a computation graph obtaining unit, configured to obtain a computation graph corresponding to a neural network, where the computation graph includes N nodes and directed edges connecting different nodes, the directed edges of the computation graph carry tensor data, and the computation graph includes M tensor data, where M is an integer greater than 1; and an allocation unit, configured to, based on the constraint relationship corresponding to each tensor data, allocate memory space to the M tensor data in turn according to the execution order of the M tensor data in the neural network, where, if one tensor data of the M tensor data can reuse at least a part of the allocated memory space, at least a part of the memory space reusable by that tensor data is allocated to it; the allocated memory space is the memory space that has been allocated to the M tensor data before that tensor data, and the constraint relationship indicates the relationship between the available memory space of one tensor data in the M tensor data and the available memory space of each of the other tensor data in the M tensor data.
  • The allocation unit is further configured to: if the tensor data cannot reuse the allocated memory space, allocate another memory space for the tensor data, where the other memory space is different from the allocated memory space.
  • an embodiment of the present application further provides a memory allocation device, the memory allocation device may include a memory and a processor, and the memory is used to store a computer program that supports the memory allocation device to perform the above method, and the computer program includes a program Instructions, the processor is configured to invoke the program instructions to execute the memory allocation method of any one of the above-mentioned first aspect or any one of the second aspect.
  • In a sixth aspect, embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer storage medium; the computer program includes program instructions which, when executed by a processor, cause the processor to execute the memory allocation method of any one of the above-mentioned first aspect or second aspect.
  • In a seventh aspect, the embodiments of the present application further provide a computer program, where the computer program includes computer software instructions which, when executed by a computer, cause the computer to execute the memory allocation method of any one of the above-mentioned first aspect or second aspect.
  • FIG. 1a is a schematic structural diagram of a computation graph of a neural network provided by an embodiment of the present application.
  • FIG. 1b is a schematic structural diagram of a computation graph of another neural network provided by an embodiment of the present application.
  • FIG. 1c is a schematic diagram of the execution sequence of each operator in a computation graph provided by an embodiment of the present application.
  • FIG. 1d is a schematic diagram of the execution sequence of each operator in a computation graph provided by an embodiment of the present application.
  • FIG. 2a is a schematic diagram of a computation graph of a neural network and the execution sequence of each operator in the computation graph provided by an embodiment of the present application.
  • FIG. 2b is a schematic diagram of allocating memory space for tensor data according to an embodiment of the present application.
  • FIG. 2c is a schematic diagram of a computation graph of a neural network in a parallel scenario and the execution sequence of each operator in the computation graph provided by an embodiment of the present application.
  • FIG. 3a is a schematic structural diagram of a memory allocation device provided by an embodiment of the present application.
  • FIG. 3b is a schematic diagram of a server or terminal device side architecture provided by an embodiment of the present application.
  • FIG. 3c is a schematic diagram of a network architecture provided by an embodiment of the present application.
  • FIG. 3d is a schematic diagram of a directed acyclic graph (DAG) provided by an embodiment of the present application.
  • FIG. 4a is a schematic flowchart of a memory allocation method provided by an embodiment of the present application.
  • FIG. 4b is a schematic structural diagram of a convolutional neural network 400 provided by an embodiment of the present application.
  • FIG. 4c is a schematic diagram of determining a sorting result according to an embodiment of the present application.
  • FIG. 4d is a schematic diagram of the memory space in an allocated set provided by an embodiment of the present application.
  • FIG. 4e is a schematic diagram of an updated computation graph provided by an embodiment of the present application.
  • FIG. 4f is a schematic diagram of the memory space in an allocated set provided by an embodiment of the present application.
  • FIG. 4g is a schematic diagram of the memory space in an allocated set provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of another memory allocation method provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a memory allocation device 60 according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a memory allocation device 70 according to an embodiment of the present application.
  • any embodiment or design approach described in the embodiments of the present application as “exemplarily” or “such as” should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as “exemplarily” or “such as” is intended to present the related concepts in a specific manner.
  • "A and/or B" has two meanings: A and B, or A or B.
  • "A, and/or B, and/or C" means any one of A, B, and C; alternatively, any two of A, B, and C; alternatively, A and B and C.
  • A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes x_s and an intercept b as inputs, and the output of the operation unit can be f(∑_s W_s·x_s + b).
  • W_s is the weight of x_s.
  • b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN) is also known as a multi-layer neural network.
  • The layers inside a DNN can be divided into three categories: input layer, hidden layers, and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the middle layers are all hidden layers.
  • Layers can be fully connected or not fully connected. When the layers are fully connected, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated.
  • The coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W_jk^L. It should be noted that the input layer does not have a W parameter.
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors w of many layers).
  • a computational graph is a way of describing the computational process of a neural network using a graph structure. If the computation has obvious modularity, and there are obvious temporal and logical dependencies between modules, it can usually be described using a directed graph structure.
  • The neural network model can be abstracted into a directed graph structure composed of tensor data and operators; the nodes in the graph are also known as operators.
  • a directed edge refers to an edge with a direction, which describes the pointing direction between operators and is used to characterize the dependencies between operators.
  • a node is used as an example for description.
  • the computation graph contains 8 operators and 9 tensor data.
  • the 8 operators are operator a, operator b, operator c, operator d, operator e, operator f, operator g and operator h
  • The 9 tensor data are tensor data t0, tensor data t1, tensor data t2, tensor data t3, tensor data t4, tensor data t5, tensor data t6, tensor data t7, and tensor data t8.
  • The directed edge between operator a and operator b indicates that the tensor data t0 generated by operator a is the input of operator b; at the same time, there is a directed edge between operator a and operator c, and the direction of that directed edge indicates that tensor data generated by operator a is the input of operator c.
  • the way of describing the neural network model using the computational graph is conducive to the overall grasp of the entire neural network computational task.
  • the expression of the computational graph is also convenient for scheduling and parallel execution of computational tasks.
  • operators in the computation graph can be assigned to multiple computation subtasks, and different computation subtasks can be parallelized or serialized.
  • the running order of operators in the same calculation subtask is serial.
  • operator a and operator c are assigned to calculation subtask 0
  • operator b, operator d, operator e and operator f are assigned to calculation subtask 1
  • operator g and operator h are assigned to the calculation subtask 2.
  • multiple computing subtasks can be obtained by splitting the neural network computing task, so that multiple computing subgraphs can be obtained, and one computing subgraph is one computing subtask.
  • The execution order of the operators among different computation subtasks depends on the directions of the directed edges between the operators. For example, in calculation subtask 0, the execution order between operators is operator a-operator c; in calculation subtask 1, the execution order between operators is operator b-operator d/operator e-operator f; in calculation subtask 2, the execution order between operators is operator g-operator h.
  • operator c and operator g are executed in parallel in two calculation subtasks.
  • The execution order between operator e and operator g is not limited. Specifically, the execution sequence of each operator in the computation graph may be as shown in FIG. 1c. It should be noted that, in FIG. 1c, the execution order between operator d and operator e is not limited, and is therefore expressed as operator d/operator e.
  • the execution order of operators among different computing subtasks depends on the operators The directions of the directed edges.
  • In calculation subtask 0, the execution order of the operators is operator a-operator c; in calculation subtask 1, the execution order of the operators is operator b-operator d/operator e-operator f.
  • the operator g in the calculation subtask 2 will be executed after the operation of the operator f in the calculation subtask 1 is completed.
  • the execution sequence of each operator in the calculation graph may be as shown in FIG. 1d.
  • operator A depends on operator B, which means that operator A must wait for the kernel function corresponding to operator B to finish executing before starting its own computing task.
  • a tensor is only a feature description of a piece of stored data, and a tensor records information such as the shape and type of the data.
  • a tensor should be understood as tensor data, which may include input tensor data, output tensor data, and feature tensor data in a neural network model.
  • Rank, shape and dimension number are generally used to describe the dimensions of a tensor.
  • the relationship can be expressed as the following table:
  • For example, the tensor A = 4 represents a number.
  • For example, the tensor A = [6, 2] represents a two-dimensional matrix; specifically, a matrix with 6 rows and 2 columns.
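  • For illustration only, the same rank/shape examples expressed with NumPy:

```python
# A rank-0 tensor is a single number, while shape (6, 2) describes a matrix
# with 6 rows and 2 columns.
import numpy as np

a_scalar = np.array(4)          # rank 0, shape ()
a_matrix = np.zeros((6, 2))     # rank 2, shape (6, 2)
print(a_scalar.ndim, a_scalar.shape)   # 0 ()
print(a_matrix.ndim, a_matrix.shape)   # 2 (6, 2)
```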
  • the allocated set refers to the set of allocated memory space information in the process of storing the tensor data of the neural network.
  • the above allocated set may also be called a shared memory queue, which is not specifically limited in this application.
  • The first strategy is called the In-Place strategy; it means that the input and output of each node in the neural network share a memory space.
  • The second strategy is called the Co-share strategy; it means that a certain memory space can be used by multiple nodes in the neural network, and when all of these nodes have been executed, the life cycle of the memory space ends.
  • the memory space is available to other nodes in the neural network.
  • the life cycle of memory space A can be preset as (1, 2, 3), which means that memory space A can be used by node 1, node 2 and node 3.
  • When node 1, node 2 and node 3 have all been executed, the life cycle of memory space A ends. At this time, memory space A can be placed in the free linked list for use by other nodes in the neural network.
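  • A small sketch of the Co-share life-cycle idea, with illustrative names; a block records the nodes allowed to use it and returns itself to a free list once they have all executed:

```python
# Sketch of a Co-share memory block whose life cycle is a preset set of nodes.

class SharedBlock:
    def __init__(self, size, life_cycle):
        self.size = size
        self.pending = set(life_cycle)   # e.g. {1, 2, 3} for nodes 1, 2 and 3

    def node_finished(self, node, free_list):
        self.pending.discard(node)
        if not self.pending:             # life cycle ended
            free_list.append(self)       # now available to other nodes

free_list = []
block_a = SharedBlock(100, life_cycle=(1, 2, 3))
for node in (1, 2, 3):
    block_a.node_finished(node, free_list)
assert free_list == [block_a]
```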
  • the specific method of memory allocation is: allocate and reuse memory space according to the order of execution of nodes in the neural network, and the effect of memory allocation is poor.
  • the neural network needs to occupy 100M memory space, 10M memory space and 50M memory space in turn during the running process.
  • a 100M memory space can be allocated for the neural network.
  • When the neural network then applies for 10M of memory space, it is judged whether the previously allocated 100M memory space can be reused. If it can be reused, no new memory space is allocated for the requested 10M; instead, the above-mentioned 100M memory space is reused.
  • When the neural network applies for 50M of memory space, it first determines whether the requested 50M can reuse the allocated 100M memory space. If it can be reused, no new memory space is allocated for the requested 50M.
  • Although both the requested 10M memory space and the requested 50M memory space could reuse the allocated 100M memory space, the requested 10M memory space reuses the allocated 100M memory space, and an additional 50M memory space is allocated for the neural network. As a result, the entire neural network needs to occupy 150M of memory space, which means the entire neural network occupies a large amount of memory and the memory allocation is unreasonable.
  • the present application provides a memory allocation method.
  • The main principle of the method is to obtain the sorting result of the multiple tensor data in the entire neural network according to the information of each tensor data, where the information of each tensor data may include at least one of: the size of the memory space that the tensor data needs to occupy, the constraint relationship corresponding to the tensor data, and the consumption operators of the tensor data.
  • In addition, the information of each tensor data includes its corresponding identifier; memory space is then allocated for the tensor data in sequence according to the sorting result.
  • the situation of unreasonable memory planning space in the prior art can be avoided, thereby saving the memory that the entire neural network needs to occupy, and optimizing the memory allocation of the neural network.
  • the method can also solve the problem of incorrect operator operation results caused by the reuse of the same memory space for operators in different computing subtasks in a parallel scenario.
  • the entire neural network includes 8 nodes, and the indices are respectively a to h according to the running order.
  • the implementation process of pre-allocating memory for tensor data can be shown in Figure 2b.
  • A first memory space is allocated for tensor data t0, and a second memory space is allocated for tensor data t1. Before the operation logic of operator b is simulated and executed, a third memory space is allocated for tensor data t2, and a fourth memory space is allocated for tensor data t3.
  • a fifth memory space is allocated for the tensor data t4. At the same time, the first memory space is released, and the first memory space can be reused for the next tensor data.
  • the above-mentioned reusable first memory space is allocated to the tensor data t5.
  • the third memory space is released, and the third memory space can be reused for the next tensor data.
  • the above-mentioned reusable third memory space is allocated to the tensor data t6.
  • the fourth memory space is released, and the fourth memory space can be reused for the next tensor data.
  • the above-mentioned fourth memory space that can be reused is allocated to the tensor data t7.
  • the first memory space and the third memory space are released, and for the next tensor data, the above-mentioned first memory space and the third memory space can be reused.
  • the above-mentioned reusable first memory space is allocated to the tensor data t8.
  • the fifth memory space is released.
  • After the operation logic of operator h is simulated and executed, the second memory space, the fourth memory space and the first memory space are released.
  • the operation results of the neural network are stored in the set memory space.
  • the size of the memory planning space determined for the entire neural network is the sum of the sizes of the above five memory spaces.
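  • The walk-through above can be summarized as a simple simulation: before an operator runs, blocks are allocated (or reused from released blocks) for its output tensor data, and after it runs, blocks whose tensor data are no longer needed are released. The sketch below is illustrative only; `outputs`, `last_use` and `size` are assumed inputs:

```python
# Sketch of pre-allocating memory by simulating the operators in execution
# order; returns the total size of the memory planning space.

def simulate(exec_order, outputs, last_use, size):
    """exec_order: operators in run order; outputs[op]: tensors it produces;
    last_use[t]: last operator that consumes tensor t; size[t]: bytes of t."""
    free, owner, total = [], {}, 0
    for op in exec_order:
        for t in outputs.get(op, ()):            # allocate before the operator runs
            reusable = next((b for b in free if b >= size[t]), None)
            if reusable is not None:
                free.remove(reusable)
                owner[t] = reusable              # reuse a released memory space
            else:
                owner[t] = size[t]
                total += size[t]                 # a brand-new memory space
        for t, blk in list(owner.items()):       # release after the operator runs
            if last_use.get(t) == op:
                free.append(blk)
                del owner[t]
    return total
```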
  • the computation graph shown in Fig. 1b includes three computation subtasks, and the three computation subtasks respectively represent different neural network computation subtasks.
  • the relationship between these three computing subtasks can be serial or parallel.
  • the execution order of the operators is serial.
  • As shown in FIG. 2c, when the relationship between calculation subtask 0 and calculation subtask 1 is parallel (for example, the above two calculation subtasks can be run by different processor cores), the execution order between operator c and operator d is a parallel relationship.
  • the present application provides another memory allocation method.
  • The main principle of the method is: in a parallel scenario, determine the constraint relationship corresponding to each tensor data in the computation graph, and then allocate memory space for the tensor data based on the constraint relationship corresponding to each tensor data.
  • the method described in this application can be applied to online training/inference of neural network, and can also be applied to offline training/inference of neural network.
  • the processor CPU communicates with the artificial intelligence processor through the I/O bus to allocate memory for the running neural network.
  • the general-purpose processor obtains the neural network offline file stored in the hard disk, and allocates memory for the neural network offline file when calling the neural network offline file.
  • the memory allocation device may specifically be a server or a terminal device.
  • the server or terminal device side may include deep learning algorithms, deep learning frameworks, computing resources, and memory resources.
  • the deep learning algorithm can call computing resources and memory resources through the deep learning framework.
  • Caffe can support various types of deep learning tasks such as image classification and image segmentation, as well as network designs for object detection such as convolutional neural networks (CNN), region-based convolutional neural networks (Region-CNN, RCNN), long short-term memory networks (Long Short-Term Memory, LSTM), and fully connected neural networks.
  • the deep learning algorithm may include a network model
  • The deep learning framework may include NET classes, layers, blobs, task management, and memory management modules (syncmem), wherein the memory management module may be provided with a MemModel module which, based on the original logic of the blob and memory management modules, can realize memory optimization.
  • the above-mentioned network model may specifically be a network model of a neural network.
  • the NET class can store the Directed acyclic graph (DAG) corresponding to the neural network.
  • DAG Directed acyclic graph
  • The neural network includes nodes A, B, C, E, F and G.
  • the output parameters of node A (for example, tensor data) are used as input parameters of node B, and the output parameters of node B are used as input parameters of node C and node F respectively.
  • the output parameters of node C are used as input parameters of node E, and the output parameters of nodes E and F are used as input parameters of node G.
  • a layer is used to store the information of the nodes included in the neural network, and a node can also be called a layer.
  • Blob is used to store the information of the memory space occupied by the corresponding input parameters, output parameters and intermediate parameters of each node of the neural network during the operation process.
  • the memory management module is used to manage and allocate the information of the memory space occupied by the neural network.
  • FIG. 4a is a schematic flowchart of a memory allocation method for a neural network according to an embodiment of the present application.
  • the execution body of the method may be a server running the neural network, or may be a terminal device running the neural network.
  • the execution subject is a terminal device running a neural network as an example for description.
  • the method may include but is not limited to the following steps:
  • Step S401 obtaining a computation graph corresponding to the neural network; wherein, the computation graph includes N nodes and directed edges connecting different nodes, the directed edges of the computation graph carry tensor data, and the computation graph includes M tensors data, M is an integer greater than 1.
  • the node is used to indicate a kind of computing logic in the neural network, that is, a function that implements a certain function.
  • OP can be used to represent nodes and tensor to represent tensor data.
  • The specific structure of the convolutional neural network may be as shown in FIG. 4b: the convolutional neural network (CNN) 400 may include an input layer 410, a convolutional layer/pooling layer 420 (where the pooling layer is optional), a fully connected layer 430, and an output layer 440.
  • the fully connected layer 430 refers to a fully connected network structure.
  • the fully connected feature can be represented by the product of the input data of the hidden layer 1 and the weight tensor corresponding to the hidden layer 1.
  • The fully connected feature can be quantified as αx, where α represents the weight tensor corresponding to hidden layer 1, and x represents the input data of hidden layer 1.
  • the convolution layer 420 is used to extract the features of the input data.
  • the convolution layer 420 is used to extract the features of the input image to reduce the parameters brought by the input image;
  • the fully connected layer 430 is used to integrate local information with class discrimination in the convolutional layer 420 (or the pooling layer).
  • the fully connected layer 430 can connect the features extracted by the convolutional layer 420.
  • the excitation function of each neuron in the fully connected layer 430 generally adopts the ReLU function.
  • the output value of the last fully connected layer 430 is passed to an output, for example, softmax logistic regression can be used for classification, so that the processing result can be obtained.
  • the processing result can be the recognition probability of the image, so the processing result can be output through the output layer 440 .
  • the terminal device can obtain the computation graph corresponding to the above-mentioned convolutional neural network.
  • the computation graph includes convolution nodes, fully connected nodes (FC), activation nodes (Relu), pooling nodes (Pooling), classifier nodes (softmax), and the like.
  • the directed edge may be used to represent the connection relationship between nodes, the directed edge carries tensor data, and the direction of the directed edge is used to reflect the flow direction of the tensor data.
  • Step S402, based on the sorting result of the M tensor data, allocate memory space to the M tensor data in turn, where, if one tensor data of the M tensor data can reuse at least a part of the allocated memory space, at least a part of the memory space reusable by that tensor data is allocated to the tensor data; the allocated memory space is the memory space that has been allocated to the M tensor data before that tensor data.
  • the sorting result of the M pieces of tensor data indicates the execution order when the memory space is allocated for the M pieces of tensor data, and the sorting result is related to the information of each tensor data in the M pieces of tensor data,
  • the information of each tensor data indicates at least one of the following information: a constraint relationship corresponding to each tensor data and the number of nodes to which each tensor data flows.
  • tensor data may include input tensor data, output tensor data, and intermediate tensor data.
  • a consuming node refers to a node that consumes tensor data in a computation graph, that is, a node to which tensor data flows.
  • consumption refers to the use and consumption of substances (eg, tensor data) during node operations.
  • a production node refers to a node that generates tensor data in a computation graph, that is, a node from which tensor data flows out.
  • production is the inverse process of "consumption”, which means the output of the node operation process.
  • node A is an upstream node of node B means that there is at least one path in the computation graph that can go from node A to node B.
  • the upstream node corresponding to node B can be obtained by reverse traversal (that is, along the opposite direction of the directed edge).
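  • A minimal sketch of that reverse traversal (breadth-first along predecessor edges), with illustrative names:

```python
# Collect all upstream nodes of b by walking directed edges backwards.
from collections import defaultdict, deque

def upstream_nodes(edges, b):
    """edges: iterable of (src, dst); returns the set of nodes that can reach b."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    seen, queue = set(), deque([b])
    while queue:
        node = queue.popleft()
        for p in preds[node]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen
```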
  • the constraint relationship may be carried in a constraint relationship table.
  • the first value may indicate that each tensor data can be multiplexed with other tensor data in the same memory space
  • the second value indicates that Each tensor data cannot be multiplexed with other tensor data in the same memory space
  • the third value indicates that each tensor data can be continuously stored in the same memory space with other tensor data.
  • the first value, the second value and the third value may be numerical values that can be distinguished from each other.
  • the first value may be "0”
  • the second value may be "1”
  • the third value may be "2" .
  • the above constraint relationships have different priorities.
  • If the relationship between the available memory spaces of two tensor data is non-reusable and continuous, it is necessarily also non-reusable; in this case, the relationship between the available memory spaces of the two tensor data is indicated as non-reusable and continuous in the constraint relationship. That is, the priority of non-reusable and continuous is higher than that of non-reusable.
  • the constraint relationship may not be limited to the representation form of the constraint relationship table, and may also be presented by other data structures.
  • the implementation process of determining the constraint relationship corresponding to each tensor data in the calculation graph may include: judging whether all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data, If so, it is determined that the first tensor data can be reused as the memory space allocated for the second tensor data; if not, it is determined that the first tensor data cannot be reused as the memory space allocated for the second tensor data.
  • the implementation process of determining the constraint relationship corresponding to each tensor data in the calculation graph may include: judging whether all consuming nodes of the second tensor data are downstream of the production nodes of the first tensor data Node, if yes, it is determined that the first tensor data can be reused as the memory space allocated for the second tensor data; if not, it is determined that the first tensor data cannot be reused as the memory space allocated for the second tensor data.
  • Tensor data A being able to reuse at least part of the memory space allocated for tensor data B means that tensor data A can completely reuse the memory space allocated for tensor data B, or that tensor data A can reuse part of the memory space allocated for tensor data B.
  • node A is an upstream node of node B means that there is at least one path in the computation graph that can go from node A to node B.
  • the upstream node corresponding to node B can be obtained by reverse traversal (that is, along the opposite direction of the directed edge).
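  • A minimal sketch of the reuse check described above, assuming the computation graph is given as reversed adjacency lists and that each tensor data records its production node and consuming nodes (all names are illustrative assumptions):

      from collections import deque

      def upstream_nodes(reverse_edges, node):
          # Reverse traversal: walk the directed edges backwards from `node`
          # and collect every node from which `node` can be reached.
          seen, queue = set(), deque([node])
          while queue:
              current = queue.popleft()
              for pred in reverse_edges.get(current, ()):
                  if pred not in seen:
                      seen.add(pred)
                      queue.append(pred)
          return seen

      def can_reuse(first, second, reverse_edges):
          # The first tensor data may reuse the memory space allocated for the
          # second tensor data when every consuming node of the first tensor
          # data is an upstream node of the second tensor data's production node.
          ups = upstream_nodes(reverse_edges, second.production_node)
          return all(consumer in ups for consumer in first.consuming_nodes)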
  • the implementation process of determining the size of the memory space occupied by each tensor data in the computation graph may include: running the neural network on the terminal device, recording the size of the memory space occupied by each tensor data in the neural network, and determining, from the recorded sizes, the size of the memory space each tensor data occupies when the neural network runs; this provides the basis for subsequently allocating the corresponding memory space to the tensor data.
  • the entire neural network includes node 1 and node 2.
  • the terminal device runs the neural network on an artificial intelligence processor, and the following can be recorded during the operation of the neural network:
  • the size of the memory space occupied by tensor data 1 is 1000Kb, and the size of the memory space occupied by tensor data 2 is 500Kb;
  • in this way, the size of the memory space each tensor data needs to occupy when the neural network runs can be determined from the recorded sizes.
  • each tensor data has its own corresponding identifier by default.
  • the calculation graph shown in Figure 1b includes 8 operators and 9 tensor data, where the 9 tensor data can be represented as tensor data t0, tensor data t1, tensor data t2, tensor data t3, tensor data t4, tensor data t5, tensor data t6, tensor data t7 and tensor data t8. It can be understood that the identifier corresponding to each tensor data is unique.
  • the above identifiers may be a series of sequential numbers, so that the order of the tensor data may be determined according to the respective identifiers of each tensor data.
  • the constraint amount corresponding to each of the M tensor data can be obtained from the constraint relationship corresponding to each tensor data; the constraint amount is the number of other tensor data that cannot reuse the same memory space as that tensor data. After that, the M tensor data are sorted in descending order of their constraint amounts to obtain the sorting result of the M tensor data.
  • the M pieces of tensor data may be sorted in descending order according to the size of the number of consuming nodes corresponding to each of the M pieces of tensor data, to obtain a sorting result of the M pieces of tensor data.
  • the M tensor data can also be sorted according to at least two kinds of information of each tensor data in descending order to obtain a sorting result of the M tensor data.
  • the calculation graph includes two tensor data, namely tensor data 1 and tensor data 2.
  • the memory space occupied by tensor data 1 is 1000Kb, and the constraint relationship of tensor data 1 is that tensor data 1 cannot reuse the same memory space as tensor data 2, so its constraint amount is 1.
  • the memory space occupied by tensor data 2 is 500Kb, and the constraint relationship of tensor data 2 is that tensor data 2 cannot reuse the same memory space as tensor data 1, so its constraint amount is 1. Sorting the above two tensor data in descending order gives the sorting result: tensor data 1, tensor data 2.
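  • A short sketch of this descending sort, assuming each tensor data object carries its constraint amount and memory size (field names are assumptions for illustration):

      def sort_tensors(tensors):
          # Descending sort: primarily by constraint amount, then by the size
          # of the memory space the tensor data needs to occupy.
          return sorted(tensors,
                        key=lambda t: (t.constraint_count, t.size),
                        reverse=True)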
  • the M pieces of tensor data may be sorted through a heuristic algorithm, so as to obtain a sorting result of the M pieces of tensor data within a preset time period.
  • a heuristic algorithm refers to an algorithm constructed based on intuition or experience, which gives a feasible solution for each instance of the combinatorial optimization problem to be solved at an acceptable cost (in computing time and space); the degree to which that feasible solution deviates from the optimal solution cannot generally be predicted.
  • each tensor data includes the corresponding identifier of each tensor data, and may also include one or more of the following: the size of the memory space that each tensor data needs to occupy, the size of each tensor data The corresponding constraints of tensor data, and the number of consumer nodes to which each tensor data flows.
  • when the terminal device sorts the M tensor data through the heuristic algorithm, it needs to consider the ordering (priority) of the pieces of information contained in each tensor data, and each such ordering is then treated as an independent candidate for sorting.
  • the mixed orderings of these 4 pieces of information can include 632 kinds.
  • the calculation graph includes 5 tensor data, namely tensor data 1, tensor data 2, tensor data 3, tensor data 4 and tensor data 5.
  • the terminal device uses the above heuristic algorithm to sort the 5 tensor data, and takes the ordering determined by the heuristic algorithm within the preset time period (for example: tensor data 2, tensor data 3, tensor data 4, tensor data 1 and tensor data 5) as the sorting result of the 5 tensor data. The terminal device can then allocate memory space for the tensor data according to the determined sorting result.
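  • One way such a heuristic search could look, assuming a `peak_memory(order)` evaluation function that simulates the allocation for a candidate ordering; this is purely illustrative, the patent does not fix a particular heuristic:

      import random
      import time

      def heuristic_sort(tensors, peak_memory, time_budget_s=1.0):
          # Randomised local search: swap two positions at a time and keep the
          # best ordering (lowest peak memory) found before the preset time
          # period ends.
          best = list(tensors)
          best_cost = peak_memory(best)
          deadline = time.monotonic() + time_budget_s
          while time.monotonic() < deadline and len(best) > 1:
              candidate = best[:]
              i, j = random.sample(range(len(candidate)), 2)
              candidate[i], candidate[j] = candidate[j], candidate[i]
              cost = peak_memory(candidate)
              if cost < best_cost:
                  best, best_cost = candidate, cost
          return best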
  • the memory allocation device may call a heuristic algorithm (for example, the heuristic algorithm includes a deterministic algorithm and a random algorithm) through a constraint programming solver CPsolver (Constraint Programming solver) to sort the M tensor data.
  • the above sorting results may be sorting results that need to be optimized, or may be sorting results that do not require optimization.
  • the above sorting result is an optimized sorting result, wherein the maximum memory size that the neural network corresponding to the optimized sorting result needs to occupy is smaller than the maximum memory size that the neural network needs to occupy according to the sorting result before optimization.
  • for example, the calculation graph includes 5 tensor data, namely tensor data 1, tensor data 2, tensor data 3, tensor data 4 and tensor data 5, among which tensor data 1, tensor data 2, tensor data 3 and tensor data 4 are ordered as: tensor data 1, tensor data 2, tensor data 3, tensor data 4.
  • the memory allocation device determines the position of tensor data 5 according to different judgment conditions, where the judgment conditions may include but are not limited to: for a given tensor data, the first address of the memory space allocated for the tensor data is the smallest or the largest; for a given tensor data, the difference between the size of the memory space at a determined possible position and the size of the memory space occupied by the tensor data satisfies a threshold, for example, the threshold can be 0 or another value.
  • when the position of tensor data 5 in the sorting result is possible position 1, the terminal device allocates memory space for the tensor data according to the sorting result, and the allocated memory space determines that the maximum memory size required to run the entire neural network is 4500Kb; when the position of tensor data 5 in the sorting result is possible position 2, the maximum memory size so determined is 3500Kb; when the position of tensor data 5 in the sorting result is possible position 3, the maximum memory size so determined is 5000Kb; when the position of tensor data 5 in the sorting result is possible position 4, the maximum memory size so determined is 4000Kb.
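  • A sketch of how the judgment conditions above might be applied when choosing among candidate positions for one tensor data (the helper name and the threshold value of 0 are assumptions):

      def choose_position(candidates, tensor_size, threshold=0):
          # Each candidate is a (first_address, gap_size) pair.  Prefer gaps
          # whose size differs from the tensor size by no more than `threshold`;
          # among the remaining candidates take the smallest first address.
          fitting = [c for c in candidates
                     if c[1] >= tensor_size and c[1] - tensor_size <= threshold]
          pool = fitting or [c for c in candidates if c[1] >= tensor_size]
          return min(pool, key=lambda c: c[0]) if pool else None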
  • computation subtask 0 includes node a and node c, and its execution order is node a-node c;
  • computation subtask 1 includes node b, node d, node e and node f, and its execution order is node b - node d/node e - node f;
  • the calculation subtask 2 includes node g and node h, and its execution order is node g-node h.
  • the execution relationship among the calculation subtask 0, the calculation subtask 1, and the calculation subtask 2 is parallel. Since in each calculation subtask, there are directed edges between two adjacent nodes, at this time, there is no need to adjust the calculation graph.
  • for each node, the upstream node corresponding to the node, and the output tensor data and input tensor data corresponding to the node, are determined.
  • taking node A being an upstream node of node B as an example, it means that there is at least one path from node A to node B in the computation graph.
  • the upstream node corresponding to each node, the output tensor data and input tensor data corresponding to each node can be shown in Table 1:
  • node a is the starting node, and there is no corresponding upstream node.
  • the output tensor data t0 and the output tensor data t1 can be obtained.
  • node a is the upstream node of node b, which means that there is a path from node a to node b in the calculation graph.
  • its input tensor data is t0
  • the output tensor data t2 and the output tensor data t3 can be obtained.
  • the implementation process of determining the upstream nodes corresponding to other nodes, outputting tensor data, and inputting tensor data will not be repeated here.
  • for each tensor data, determine whether all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data; if so, it is determined that the first tensor data can reuse the memory space allocated for the second tensor data; if not, it is determined that the first tensor data cannot reuse the memory space allocated for the second tensor data.
  • the above constraint relationship may be carried in a constraint relationship table.
  • the first value may indicate that each tensor data can be multiplexed with other tensor data in the same memory space
  • the second value may indicate that the tensor data cannot reuse the same memory space with other tensor data
  • the third value indicates that each tensor data can be stored contiguously in the same memory space with other tensor data.
  • the first value "0" indicates that the tensor data can be multiplexed with other tensor data except itself in the same memory space;
  • the second value "1" indicates that the tensor data cannot reuse the same memory space with tensor data other than itself;
  • the third value "2" indicates that the tensor data can be stored contiguously in the same memory space with other tensor data except itself.
  • node b is the control selection node, and this node has two branches, one branch is: node b-node d-node f; one branch is: node b - Node e - Node f. In one operation of the neural network, only one branch is active.
  • the constraint relationship between tensor data t2 and tensor data t3 is that tensor data t2 and tensor data t3 do not require two independent memory spaces, that is, they can reuse the same memory space; likewise, for tensor data t5 and tensor data t6, the constraint relationship between them is that tensor data t5 and tensor data t6 do not need two independent memory spaces, that is, they can reuse the same memory space.
  • the M tensor data are sorted from large to small, and the sorting result of the M tensor data is obtained.
  • multiple tensor data that are constrained to be stored contiguously in the same memory space can be sorted as an independent whole.
  • the corresponding constraint amount of each tensor data can be obtained.
  • tensor data t0 cannot reuse the same memory space with the other tensor data (t1, t2, t3, t4, t5, t6, t7, t8), so its constraint amount is 8;
  • tensor data t1 cannot reuse the same memory space with the other tensor data (t0, t2, t3, t4, t5, t6, t7, t8), so its constraint amount is 8;
  • tensor data t2 cannot reuse the same memory space with tensor data (t0, t1, t3, t4, t5, t8), so its constraint amount is 6; for tensor data t3, the constraint amount is 5; tensor data t8 cannot reuse the same memory space with tensor data (t0, t2, t3, t4, t5, t6, t7), so its constraint amount is 7.
  • the size of the memory space occupied by tensor data t0 is 500Kb; tensor data t1, 500Kb; tensor data t2, 500Kb; tensor data t3, 500Kb; tensor data t4, 500Kb; tensor data t5, 1000Kb; tensor data t6, 1000Kb; tensor data t7, 1000Kb; tensor data t8, 1000Kb.
  • each tensor data can also be sorted as a separate whole.
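  • A minimal sketch of treating tensor data that must be stored contiguously as one unit when sorting (for example, t1, t7 and t8 above); the mapping and field names are assumptions:

      def group_for_sorting(tensors, contiguous_with):
          # `contiguous_with` maps a tensor id to the ids it must share one
          # contiguous memory block with.  Each group is treated as an
          # independent whole and the groups are sorted by total size.
          groups, assigned = [], set()
          for t in tensors:
              if t.id in assigned:
                  continue
              members = [t] + [u for u in tensors
                               if u.id in contiguous_with.get(t.id, set())
                               and u.id not in assigned and u.id != t.id]
              assigned.update(m.id for m in members)
              groups.append(members)
          return sorted(groups, key=lambda g: sum(m.size for m in g), reverse=True)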
  • the allocated memory space includes the first memory space, where the first memory space includes memory space a, memory space b and memory space c; memory space a, memory space b and memory space c are contiguous memory spaces, where memory space a is used to store tensor data t1 (that is, the size of memory space a is equal to the size of tensor data t1), memory space b is used to store tensor data t7 (that is, the size of memory space b is equal to the size of tensor data t7), and memory space c is used to store tensor data t8 (that is, the size of memory space c is equal to the size of tensor data t8); after that, memory space is allocated for tensor data t5, and the implementation process may include: combining the constraint relationship between tensor data t5 and the other tensor data in Table 2 to determine whether tensor data t5 can reuse the allocated memory space (the first memory space);
  • since tensor data t5 cannot reuse the allocated first memory space, a second memory space of a corresponding size is allocated for tensor data t5 according to the size of the memory space tensor data t5 needs to occupy. At this time, the allocated memory space includes the first memory space and the second memory space.
  • allocate memory space for the tensor data t6 may include: combining the constraint relationship between the tensor data t6 and other tensor data in Table 2 to determine whether the tensor data t6 can reuse the allocated memory space ( The first memory space and the second memory space), since the tensor data t6 can reuse the second memory space, the second memory space is allocated to the tensor data t6.
  • the implementation process of allocating memory space for tensor data t0 may include: combining the constraint relationship between tensor data t0 and the other tensor data in Table 2 to determine whether tensor data t0 can reuse the allocated memory space (the first memory space and the second memory space); since tensor data t0 cannot reuse the allocated memory space, a third memory space of a corresponding size is allocated for tensor data t0 according to the size of the memory space tensor data t0 needs to occupy; in this case, the allocated memory space includes the first memory space, the second memory space and the third memory space.
  • the implementation process of allocating memory space for tensor data t4 may include: combining the constraint relationship between tensor data t4 and the other tensor data in Table 2 to determine whether tensor data t4 can reuse the allocated memory space (the first memory space, the second memory space and the third memory space); since tensor data t4 cannot reuse the allocated memory space, a fourth memory space of a corresponding size is allocated for tensor data t4 according to the size of the memory space occupied by tensor data t4; in this case, the allocated memory space includes the first memory space, the second memory space, the third memory space and the fourth memory space.
  • the implementation process of allocating memory space for tensor data t2 may include: combining the constraint relationship between tensor data t2 and the other tensor data in Table 2 to determine whether tensor data t2 can reuse the allocated memory space (the first memory space, the second memory space, the third memory space and the fourth memory space); since tensor data t2 can reuse the first memory space, the first memory space is allocated to tensor data t2 (for example, memory space c in the first memory space is allocated to tensor data t2).
  • the implementation process may include: judging, according to the constraint relationship between tensor data t3 and the other tensor data in Table 2, whether tensor data t3 can reuse the allocated memory space (the first memory space, the second memory space, the third memory space and the fourth memory space); since tensor data t3 can reuse the first memory space, the first memory space is allocated to tensor data t3 (for example, memory space c in the first memory space can be allocated to tensor data t3).
  • the allocation process can be shown in Figure 4d.
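  • Putting the steps above together, a sketch of the greedy pass that walks the sorted tensor data and either reuses an already-allocated memory space or opens a new one; `conflicts` stands in for the constraint-table lookup and all names are assumptions:

      def allocate(sorted_tensors, conflicts):
          # `conflicts(t, u)` is True when t and u must not reuse the same
          # memory space (per the constraint relationship table).
          allocated = []        # each entry: {"first": int, "size": int, "tensors": []}
          next_first = 0
          for t in sorted_tensors:
              space = next((s for s in allocated
                            if s["size"] >= t.size
                            and not any(conflicts(t, u) for u in s["tensors"])),
                           None)
              if space is None:     # cannot reuse any allocated memory space
                  space = {"first": next_first, "size": t.size, "tensors": []}
                  next_first += t.size
                  allocated.append(space)
              space["tensors"].append(t)
          return allocated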
  • an independent memory space is allocated for each tensor data, and the relationship between tensor data and memory spaces is one-to-one, that is, the number of tensor data is the same as the number of memory spaces.
  • each memory space includes its corresponding first address and storage space size.
  • the memory space allocated for each tensor data can be shown in Table 4:
  • [3500, 4000[ means including 3500-3999, excluding 4000, and its storage space size is 500.
  • the maximum memory size that the entire neural network needs to occupy can be determined by using the allocated memory space. For example, based on the allocation process in Figure 4d, it can be determined that the maximum memory size that the neural network shown in Figure 1b needs to occupy is 4500Kb. Through this implementation, the maximum memory size required by the computer device to run the entire neural network can be determined, and the situation in which the allocated memory cannot support the normal operation of the neural network can be avoided.
  • after the corresponding memory space is allocated for each tensor data, whether the allocated memory space is correct is verified according to the constraint relationship corresponding to each tensor data; if it is not correct, memory space is re-allocated for the tensor data. For example, for tensor data t8, in the first memory space, determine whether the memory space corresponding to tensor data t8 is on the right side of the memory space corresponding to tensor data t1. As shown in Table 4, the memory space corresponding to tensor data t1 is [0,500[ and the memory space corresponding to tensor data t8 is [500,1500[, so it can be determined that the memory space corresponding to tensor data t8 is to the right of the memory space corresponding to tensor data t1, which means that the allocation is correct.
  • for tensor data t1, tensor data t7 and tensor data t8, determine whether the memory spaces corresponding to tensor data t1, tensor data t8 and tensor data t7 are contiguous memory spaces. As shown in Table 4, the memory spaces corresponding to tensor data t1, tensor data t8 and tensor data t7 are contiguous: the first is [0,500[, the second is [500,1500[ and the third is [1500,2500[. Based on these storage spaces, it can be determined that the memory spaces corresponding to tensor data t1, tensor data t8 and tensor data t7 are contiguous, which means that the allocation is correct. Through this implementation, unreasonable memory allocation can be avoided; an unreasonable memory allocation can be reflected in an allocation result that conflicts with the constraint relationship.
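  • A sketch of this verification pass, assuming each tensor data records the half-open interval [first, first + size[ it was assigned (the two constraint inputs are illustrative):

      def verify(assignments, right_of, contiguous_groups):
          # assignments: tensor id -> (first_address, size).
          # right_of: pairs (a, b) meaning a's space must lie to the right of b's.
          # contiguous_groups: lists of tensor ids whose spaces must be contiguous.
          for a, b in right_of:
              if assignments[a][0] < assignments[b][0] + assignments[b][1]:
                  return False
          for group in contiguous_groups:
              spans = sorted(assignments[t] for t in group)
              for (f1, s1), (f2, _s2) in zip(spans, spans[1:]):
                  if f1 + s1 != f2:
                      return False
          return True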
  • computation subtask 0 includes node a and node c, and its execution order is node a-node c;
  • computation subtask 1 includes node b, node d, node e and node f, and its execution order is node b - node d/node e - node f;
  • the calculation subtask 2 includes node g and node h, and its execution order is node g-node h.
  • the execution relationship between the calculation subtask 0 and the calculation subtask 1 is parallel, and the execution relationship between the calculation subtask 1 and the calculation subtask 2 is serial.
  • for each node, the upstream node corresponding to the node, and the output tensor data and input tensor data corresponding to the node, are determined.
  • the upstream node corresponding to each node, the output tensor data and input tensor data corresponding to each node can be shown in Table 5:
  • the paths include: node a - node b - node d - node f, node a - node b - node e - node f,
  • and node a - node c - node f; from these paths it can be determined that the upstream nodes of node f include node a, node b, node c, node d and node e.
  • the tensor data input to node f includes tensor data t5, tensor data t6 and tensor data t4, and the tensor data output by node f includes tensor data t7 and tensor data t dep .
  • the implementation process of determining the upstream nodes corresponding to other nodes, outputting tensor data, and inputting tensor data will not be repeated here.
  • for each tensor data, determine whether all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data; if so, it is determined that the first tensor data can reuse the memory space allocated for the second tensor data; if not, it is determined that the first tensor data cannot reuse the memory space allocated for the second tensor data.
  • the above constraint relationship may be carried in a constraint relationship table.
  • the first value may indicate that each tensor data can be multiplexed with other tensor data in the same memory space
  • the second value may indicate that the tensor data cannot reuse the same memory space with other tensor data
  • the third value indicates that each tensor data can be stored contiguously in the same memory space with other tensor data.
  • the first value "0" indicates that the tensor data can be multiplexed with other tensor data except itself in the same memory space;
  • the second value "1" indicates that the tensor data cannot reuse the same memory space with tensor data other than itself;
  • the third value "2" indicates that the tensor data can be stored contiguously in the same memory space with other tensor data except itself.
  • node b is the control selection node, and this node has two branches, one branch is: node b-node d-node f; one branch is: node b - Node e - Node f. In an operation of the neural network, only one branch is active.
  • the constraint relationship between tensor data t2 and tensor data t3 is that tensor data t2 and tensor data t3 do not require two independent memory spaces, that is, they can reuse the same memory space; likewise, for tensor data t5 and tensor data t6, the constraint relationship between them is that tensor data t5 and tensor data t6 do not need two independent memory spaces, that is, they can reuse the same memory space.
  • the M tensor data are sorted from large to small, and the sorting result of the M tensor data is obtained.
  • multiple tensor data that are constrained to be stored contiguously in the same memory space can be sorted as an independent whole.
  • the constraint amount corresponding to each tensor data can be obtained.
  • tensor data t0 cannot reuse the same memory space with the other tensor data (t1, t2, t3, t4, t5, t6, t7), so its constraint amount is 7;
  • tensor data t1 cannot reuse the same memory space with the other tensor data (t0, t2, t3, t4, t5, t6, t7, t8), so its constraint amount is 8;
  • tensor data t2 cannot reuse the same memory space with tensor data (t0, t1, t4, t5), so its constraint amount is 4;
  • tensor data t3 cannot reuse the same memory space with tensor data (t0, t1, t4, t6), so its constraint amount is 4; tensor data t4 cannot reuse the same memory space with tensor data (t0, t1, t2, t3, t5, t6, t7, t8), so its constraint amount is 8; tensor data t5 cannot reuse the same memory space with tensor data (t0, t1, t2, t4, t7), so its constraint amount is 5; the constraint amount of tensor data t6 is obtained in the same way.
  • the size of the memory space occupied by tensor data t0 is 500Kb; tensor data t1, 500Kb; tensor data t2, 500Kb; tensor data t3, 500Kb; tensor data t4, 500Kb; tensor data t5, 1000Kb; tensor data t6, 1000Kb; tensor data t7, 1000Kb; tensor data t8, 1000Kb.
  • the first memory space includes memory space a, memory space b and memory space c; memory space a, memory space b and memory space c are contiguous memory spaces, where memory space a is used to store tensor data t1 (that is, the size of memory space a is equal to the size of tensor data t1), memory space b is used to store tensor data t8 (that is, the size of memory space b is equal to the size of tensor data t8), and memory space c is used to store tensor data t7 (that is, the size of memory space c is equal to the size of tensor data t7); after that, memory space is allocated for tensor data t5, and the implementation process may include: combining the constraint relationship between tensor data t5 and the other tensor data to determine whether tensor data t5 can reuse the allocated memory space (the first memory space);
  • since tensor data t5 can reuse the first memory space, the first memory space is allocated to tensor data t5 (for example, memory space b in the first memory space can be reused, i.e., allocated to tensor data t5).
  • the implementation process of allocating memory space for tensor data t6 may include: combining the constraint relationship between tensor data t6 and the other tensor data in Table 7 to determine whether tensor data t6 can reuse the allocated memory space (the first memory space); since tensor data t6 can reuse the first memory space, the first memory space is allocated to tensor data t6 (for example, memory space b in the first memory space can be allocated to tensor data t6).
  • allocate memory space for the tensor data t0 may include: combining the constraints between the tensor data t0 and other tensor data in Table 7 to determine whether the tensor data t0 can reuse the allocated memory space ( The first memory space), since the tensor data t0 cannot reuse the allocated memory space, at this time, according to the size of the memory space that the tensor data t0 needs to occupy, a second memory space of the corresponding size is allocated for the tensor data t0.
  • the allocated memory space includes the first memory space and the second memory space.
  • the implementation process of allocating memory space for tensor data t4 may include: judging whether tensor data t4 can reuse the allocated memory space (the first memory space and the second memory space); since tensor data t4 cannot reuse the allocated memory space, a third memory space of a corresponding size is allocated for tensor data t4 according to the size of the memory space occupied by tensor data t4; in this case, the allocated memory space includes the first memory space, the second memory space and the third memory space.
  • the implementation process of allocating memory space for tensor data t2 may include: combining the constraint relationship between tensor data t2 and the other tensor data in Table 7 to determine whether tensor data t2 can reuse the allocated memory space (the first memory space, the second memory space and the third memory space); since tensor data t2 can reuse the first memory space, the first memory space is allocated to tensor data t2 (for example, memory space c in the first memory space can be allocated to tensor data t2).
  • the implementation process may include: judging, according to the constraint relationship between tensor data t3 and the other tensor data in Table 7, whether tensor data t3 can reuse the allocated memory space (the first memory space, the second memory space and the third memory space); since tensor data t3 can reuse the first memory space, the first memory space is allocated to tensor data t3 (for example, memory space c in the first memory space can be allocated to tensor data t3). Specifically, the allocation process can be shown in Figure 4f.
  • an independent memory space is allocated for each tensor data, and the relationship between tensor data and memory spaces is one-to-one, that is, the number of tensor data is the same as the number of memory spaces.
  • each memory space includes its corresponding first address and storage space size.
  • the memory space allocated for each tensor data can be shown in Table 8:
  • the maximum memory size that the entire neural network needs to occupy can be determined by using the allocated memory space.
  • the maximum memory size required for the neural network shown in Figure 4e is 3500Kb.
  • after the corresponding memory space is allocated for each tensor data, whether the allocated memory space is correct is verified according to the constraint relationship corresponding to each tensor data; if it is not correct, memory space is re-allocated for the tensor data. For example, for tensor data t8, in the first memory space, determine whether the memory space corresponding to tensor data t8 is on the right side of the memory space corresponding to tensor data t1. The memory space corresponding to tensor data t1 is [0,500[ and the memory space corresponding to tensor data t8 is [500,1000[,
  • so it can be determined that
  • the memory space corresponding to tensor data t8 is to the right of the memory space corresponding to tensor data t1, which means that the allocation is correct.
  • unreasonable memory allocation can be avoided.
  • an unreasonable memory allocation can be reflected in an allocation result that conflicts with the constraint relationship of the allocated memory space.
  • computation subtask 0 includes node a and node c, and its execution order is node a-node c;
  • computation subtask 1 includes node b, node d, node e and node f, and its execution order is node b - node d/node e - node f;
  • the calculation subtask 2 includes node g and node h, and its execution order is node g-node h.
  • the execution relationship among the calculation subtask 0, the calculation subtask 1, and the calculation subtask 2 is parallel.
  • the sequential execution order of each node is obtained, and each node is then encoded in sequence according to that execution order to obtain the identifier (for example, a serial number) corresponding to each node;
  • of two adjacent nodes, the identifier corresponding to the former node is smaller than the identifier corresponding to the latter node.
  • the identification of each node in the calculation subtask 0 can be as shown in Table 9:
  • when the End ID is greater than the Start ID, it means that the node corresponding to the End ID is the upstream node of the node corresponding to the Start ID.
  • computation subtask 0 includes node a and node c, and its execution order is node a-node c;
  • computation subtask 1 includes node b, node d, node e and node f, and its execution order is node b - node d/node e - node f;
  • the calculation subtask 2 includes node g and node h, and its execution order is node g-node h.
  • the execution relationship between the calculation subtask 0 and the calculation subtask 1 is parallel, and the execution relationship between the calculation subtask 1 and the calculation subtask 2 is serial.
  • the sequential execution order of each node is obtained, and each node is then encoded in sequence according to that execution order to obtain the identifier (e.g., serial number) corresponding to each node.
  • the sequential execution order of each node in the at least two computation subtasks is obtained, and the nodes in the at least two computation subtasks are encoded in sequence according to that execution order to obtain the identifier corresponding to each node; of two adjacent nodes, the identifier corresponding to the former node is smaller than the identifier corresponding to the latter node.
  • the identification of each node in the calculation subtask 2 can be as shown in Table 10:
  • when the End ID is greater than the Start ID, it means that the node corresponding to the End ID is the upstream node of the node corresponding to the Start ID.
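  • A sketch of the node-numbering step, assuming each computation subtask provides its nodes in execution order and that the subtasks have already been arranged (serial ones concatenated, parallel ones interleaved as needed) into one overall order before numbering; this is illustrative only:

      def assign_node_ids(subtasks_in_overall_order):
          # Number the nodes consecutively following the overall execution
          # order, so that of two adjacent nodes the former always receives
          # the smaller identifier.
          ids, next_id = {}, 0
          for subtask in subtasks_in_overall_order:
              for node in subtask:          # nodes already in execution order
                  ids[node] = next_id
                  next_id += 1
          return ids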
  • the computation graph corresponding to the neural network includes three tensor data, which can be represented as tensor data t000, tensor data t100, and tensor data t200, wherein the memory space occupied by the tensor data t000 The size is 1000Kb, the memory space occupied by tensor data t100 is 600Kb, and the memory space occupied by tensor data t200 is 450Kb.
  • the first value "0" indicates that the tensor data can reuse the same memory space with other data except itself; the second value "1" indicates that the tensor data cannot be reused Tensor data other than itself reuses the same memory space.
  • tensor data t000 can reuse the same memory space as tensor data t100, and can also reuse the same memory space as tensor data t200;
  • tensor data t100 can reuse the same memory space as tensor data t000, but cannot reuse the same memory space as tensor data t200;
  • tensor data t200 can reuse the same memory space as tensor data t000, but cannot reuse the same memory space as tensor data t100.
  • the memory allocation device allocates the memory space for the above three tensor data
  • a first memory space is created, and the first memory space is used to store the tensor data t000.
  • the first address of the created first memory space is a
  • the size of its storage space is the size of the memory space occupied by the tensor data t000
  • a second memory space is created, and the second memory space is used to store the tensor data t100 .
  • the first address of the created second memory space is a (that is, the same as the first address of the first memory space), and its storage space size is the size of the memory space occupied by tensor data t100.
  • the size of the memory space occupied by the tensor data t000 is 1000Kb
  • the size of the memory space occupied by the tensor data t100 is 600Kb, which means that the tensor data t100 reuses part of the memory space in the first memory space [ a, 600[.
  • a third memory space is created, and the third memory space is used to store tensor data t200. The tensor data t200 can reuse the memory space [a, 1000[ allocated for tensor data t000, but cannot reuse the memory space [a, 600[ allocated for tensor data t100. Therefore, the first address of the third memory space is 600, and its storage space size is the size of the memory space occupied by tensor data t200, which means that tensor data t200 reuses part of the memory space [600, 1000[ in the first memory space. Specifically, the allocation process can be shown in Figure 4g.
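  • A sketch of this allocation with partial reuse (the t000/t100/t200 example), taking the base address a as 0; the helper `can_reuse` and the field names are assumptions:

      def allocate_with_offsets(tensors, can_reuse):
          # Place each tensor data at the lowest first address at which it
          # does not overlap any previously placed tensor data whose memory
          # space it is not allowed to reuse.
          placed = []                                   # (tensor, first, size)
          for t in tensors:
              first, moved = 0, True
              while moved:
                  moved = False
                  for u, u_first, u_size in placed:
                      overlap = first < u_first + u_size and u_first < first + t.size
                      if overlap and not can_reuse(t, u):
                          first = u_first + u_size      # skip past the conflicting space
                          moved = True
              placed.append((t, first, t.size))
          return placed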
  • the above constraint relationship may also be embodied in: when the memory space allocated for one tensor data and the memory space allocated for other tensor data are non-contiguous memory spaces (for example, memory space 1 is allocated for tensor data 1 and memory space 2 is allocated for tensor data 2, and there is a gap between memory space 1 and memory space 2), whether the tensor data and the other tensor data satisfy the spatial co-location constraint, where the spatial co-location constraint is reflected in: if the i-th tensor data cannot reuse the same memory space as the j-th tensor data, and the i-th tensor data can reuse the same memory space as the k-th tensor data, then the i-th tensor data cannot reuse the same memory space as the l-th tensor data, where the l-th tensor data is the overlapping tensor data.
  • the memory allocation device sequentially allocates a memory space of a corresponding size to each tensor data based on the sorting result of the M tensor data.
  • the allocation and reuse of memory space can avoid unreasonable memory allocation, thereby saving the memory that the entire neural network needs to occupy, and optimizing the memory allocation of the neural network.
  • FIG. 5 is a schematic flowchart of a memory allocation method for a neural network provided by an embodiment of the present application.
  • the execution body of the method may be a server running the neural network, or may be a terminal device running the neural network.
  • here, the description takes the execution subject being a terminal device running the neural network as an example.
  • the method may include but is not limited to the following steps:
  • Step S501: obtain a computation graph corresponding to the neural network, where the computation graph includes N nodes and directed edges connecting different nodes, the directed edges of the computation graph carry tensor data, and the computation graph includes M tensor data, M being an integer greater than 1.
  • for the specific implementation of step S501, please refer to the foregoing description; details are not repeated here.
  • Step S502: based on the constraint relationship corresponding to each tensor data, allocate memory space to the M tensor data in sequence according to the execution order of the M tensor data in the neural network.
  • the implementation process of determining the constraint relationship corresponding to each tensor data in the computation graph may include: determining whether all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data; if so, it is determined that the first tensor data can reuse the memory space allocated for the second tensor data; if not, it is determined that the first tensor data cannot reuse the memory space allocated for the second tensor data.
  • the above method can be applied to a scenario in which multiple computation subtasks run in parallel (for example, the execution relationships among the multiple computation subtasks are all parallel; or, the execution relationships among the multiple computation subtasks include both serial and parallel). The terminal device obtains the constraint relationship corresponding to each tensor data, and then allocates memory to the tensor data in sequence according to the execution order of the tensor data in the neural network and the constraint relationship corresponding to each tensor data.
  • this can avoid errors in operator results caused by different computation subtasks reusing the same memory space in the parallel scenario, and can ensure the accuracy of the calculation result of the neural network.
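  • A deliberately simplified stand-in for such a constraint, only to illustrate the idea; the patent itself derives the constraint relationship from the upstream/downstream analysis described earlier, and the names below are assumptions:

      def parallel_conflict(t, u, subtask_of, is_parallel):
          # Simplified illustration: tensor data produced in two computation
          # subtasks that execute in parallel must not reuse the same memory
          # space, so parallel subtasks never overwrite each other's operands.
          a = subtask_of[t.production_node]
          b = subtask_of[u.production_node]
          return a != b and is_parallel(a, b)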
  • FIG. 6 is a schematic structural diagram of a memory allocation device 60 according to an embodiment of the present application.
  • the memory allocation device 60 shown in FIG. 6 may include:
  • the calculation graph includes M tensor data, where M is an integer greater than 1;
  • the allocation unit 602 is configured to sequentially allocate memory space to the M tensor data based on the sorting result of the M tensor data, wherein, if one tensor data in the M tensor data can reuse at least a part of the allocated memory space, at least a part of the memory space that can be reused by the tensor data is allocated to that tensor data; the allocated memory space is the memory space that has been allocated to the M tensor data before that tensor data;
  • the sorting result indicates the order in which memory space is allocated for the M tensor data, and the sorting result is related to the information of each tensor data in the M tensor data.
  • the allocating unit 602 is further configured to: if the allocated memory space cannot be reused for the tensor data, allocate another memory space for the tensor data, and the other The memory space is different from the allocated memory space.
  • the constraint relationship indicates at least one of the following relationships: the relationship between the available memory space of one tensor data and the available memory space of another tensor data is reusable, and the relationship between the available memory space of one tensor data The relationship between the available memory space and the available memory space of another tensor data is not reusable, and the relationship between the available memory space of one tensor data and the available memory space of another tensor data is not reusable and continuous.
  • the constraint relationship is carried in a constraint relationship table
  • the constraint relationship table includes identifiers of the M tensor data
  • a first value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is reusable
  • a second value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is non-reusable
  • a third value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is non-reusable and contiguous.
  • in the case that all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data, the first tensor data may reuse the memory space allocated for the second tensor data;
  • in the case that all the consuming nodes of the first tensor data are not upstream nodes of the production node of the second tensor data, or in the case that all the consuming nodes of the second tensor data are not downstream nodes of the production node of the first tensor data, the first tensor data cannot reuse the memory space allocated for the second tensor data; the first tensor data and the second tensor data are any two of the M tensor data; the consuming node is a node to which tensor data flows, and the production node is a node from which tensor data flows out.
  • the computation graph includes a plurality of computation subtasks, a computation subtask indicates a computation function through a group of nodes and edges related to the group of nodes, and the execution relationship among the plurality of computation subtasks is parallel execution; the device further includes:
  • the update calculation graph unit 604 is configured to, in one of the computation subtasks, if there is no directed edge between two adjacent nodes, add a directed edge between the two adjacent nodes to update the computation graph; each added directed edge carries corresponding tensor data; the two adjacent nodes are two adjacent nodes in the execution order of the computation subtask;
  • An information acquisition unit 606 is configured to acquire information of each tensor data based on the updated calculation graph.
  • the computation graph further includes a first computation subtask and a second computation subtask whose execution relationship is serial, and the execution order of the first computation subtask is before the second computation subtask.
  • the update computational graph unit 604 is also used for:
  • the identifier of the production node of the tensor data is smaller than the identifier of the consumption node of the tensor data; the production node of the tensor data and the consumption node of the tensor data are two adjacent nodes.
  • the identifier of each node in the computation graph is used to determine the information of each tensor data in the M tensor data.
  • the information of each tensor data indicates a constraint relationship corresponding to each tensor data
  • the apparatus further includes:
  • the first sorting unit 608 is configured to obtain the constraint amount corresponding to each of the M tensor data according to the constraint relationship corresponding to each tensor data; the constraint amount is the number of other tensor data that cannot reuse the same memory space as the tensor data; and to sort the M tensor data according to the constraint amounts corresponding to the M tensor data to obtain the sorting result of the M tensor data.
  • the information of each tensor data indicates the number of nodes to which each tensor data flows, and the apparatus further includes:
  • the second sorting unit 6010 is configured to sort the M pieces of tensor data according to the number of consuming nodes corresponding to the M pieces of tensor data, so as to obtain a sorting result of the M pieces of tensor data.
  • the apparatus further includes:
  • the third sorting unit 6012 is configured to use a heuristic algorithm to sort the M tensor data based on the information of each tensor data, so as to obtain the sorting of the M tensor data within a preset time period result.
  • the sorting result is an optimized sorting result, wherein the maximum memory size that the neural network corresponding to the optimized sorting result needs to occupy is smaller than the maximum memory size that the neural network needs to occupy as determined according to the sorting result before optimization.
  • the memory allocation device obtains the sorting result of the tensor data according to the information of each tensor data, and allocates a memory space of a corresponding size to each tensor data according to the sorting result. Compared with the prior art,
  • in which memory space is allocated and reused according to the execution sequence of the entire neural network, this can avoid unreasonable memory allocation, thereby saving the memory that the entire neural network needs to occupy and optimizing the memory allocation of the neural network.
  • a memory allocation apparatus 70 provided by an embodiment of the present application may be specifically a terminal device or a server.
  • the memory allocation device 70 may be embodied as a central control module in the server, or its functions are implemented by the central control module in the server.
  • the memory allocation apparatus 70 may be specifically a central control module in the terminal device, or its functions are implemented by the central control module in the terminal device.
  • the memory allocation apparatus may include a processor 701, a memory 702, a communication bus 703 and a communication interface 704, and the processor 701 is connected to the memory 702 and the communication interface 704 through the communication bus 703.
  • the processor 701 may adopt a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a graphics processor (Graphics Processing Unit, GPU), a neural network processor (Network Processing Unit, NPU) or one or more integrated circuits, used to execute related programs to execute the memory allocation method described in the method embodiments of the present application.
  • the processor 701 can also be an integrated circuit chip, which has signal processing capability. In the implementation process, each step of the memory allocation method of the present application may be completed by an integrated logic circuit of hardware in the processor 701 or instructions in the form of software.
  • the above-mentioned processor 701 may also be a general-purpose processor, a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit (ASIC), an off-the-shelf programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 702, and the processor 701 reads the information in the memory 702 and, in combination with its hardware, executes the memory allocation method of the method embodiments of the present application.
  • the memory 702 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 702 may store programs and data, for example, programs of the memory allocation method in the embodiments of the present application, and the like.
  • the processor 701 and the communication interface 704 are used to execute each step of the memory allocation method of the embodiment of the present application.
  • for example, when the program stored in the memory 702 for implementing the memory allocation method in the embodiments of the present application is executed.
  • the communication interface 704 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the memory allocation apparatus 70 and other devices or a communication network.
  • the trained neural network can be obtained through the communication interface 704 to realize information interaction with execution equipment, client equipment, user equipment or terminal equipment.
  • the memory allocation apparatus may further include an artificial intelligence processor 705, and the artificial intelligence processor 705 may be a neural network processor (Network Processing Unit, NPU), a tensor processor (Tensor Processing Unit, TPU), a graphics processor (Graphics Processing Unit, GPU), or another processor suitable for large-scale XOR processing.
  • the artificial intelligence processor 705 can be mounted on the main CPU (Host CPU) as a co-processor, and the main CPU assigns tasks to it.
  • the artificial intelligence processor 705 can implement one or more operations involved in the above-mentioned memory allocation method. For example, taking the NPU as an example, the core part of the NPU is an arithmetic circuit, and the controller controls the arithmetic circuit to extract the matrix data in the memory 702 and perform multiplication and addition operations.
  • the processor 701 is used to call data and program codes in the memory, and execute:
  • the computation graph includes N nodes and directed edges connecting different nodes, the directed edges of the computation graph carry tensor data, and the computation graph includes M Tensor data, the M is an integer greater than 1;
  • memory space is allocated to the M tensor data in sequence based on the sorting result of the M tensor data, wherein, if one tensor data of the M tensor data can reuse at least a part of the allocated memory space, at least a part of the memory space that the tensor data can reuse is allocated to that tensor data; the allocated memory space is the memory space that has been allocated to the M tensor data before that tensor data;
  • the sorting result indicates the order in which memory space is allocated for the M tensor data, and the sorting result is related to the information of each tensor data in the M tensor data;
  • the information of each tensor data indicates at least one of the following: the constraint relationship corresponding to each tensor data and the number of nodes to which each tensor data flows, where the constraint relationship indicates the relationship between the available memory space of one tensor data in the M tensor data and the available memory space of another tensor data.
  • processor 701 is also used for:
  • the allocated memory space cannot be reused for the tensor data, other memory space is allocated for the tensor data, and the other memory space is different from the allocated memory space.
  • the constraint relationship indicates at least one of the following relationships: the relationship between the available memory space of one tensor data and the available memory space of another tensor data is reusable, the relationship between the available memory space of one tensor data and the available memory space of another tensor data is non-reusable, and the relationship between the available memory space of one tensor data and the available memory space of another tensor data is non-reusable and contiguous.
  • the constraint relationship is carried in a constraint relationship table, and the constraint relationship table includes the identifiers of the M tensor data; in the constraint relationship table, a first value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is reusable, a second value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is non-reusable, and a third value indicates that the relationship between the available memory space of one tensor data and the available memory space of another tensor data is non-reusable and contiguous.
  • in the case that all the consuming nodes of the first tensor data are upstream nodes of the production node of the second tensor data, the first tensor data can reuse the memory space allocated for the second tensor data; in the case that all the consuming nodes of the first tensor data are not upstream nodes of the production node of the second tensor data, or in the case that all the consuming nodes of the second tensor data are not downstream nodes of the production node of the first tensor data, the first tensor data
  • cannot reuse the memory space allocated for the second tensor data; the first tensor data and the second tensor data are any two of the M tensor data; the consuming node is a node to which tensor data flows, and the production node is a node from which tensor data flows out.
  • the computation graph includes a plurality of computation subtasks, a computation subtask indicates a computation function through a group of nodes and edges related to the group of nodes, and the execution relationship among the plurality of computation subtasks is parallel execution; the processor 701 is also used for:
  • in one of the computation subtasks, if there is no directed edge between two adjacent nodes, adding a directed edge between the two adjacent nodes to update the computation graph; each added directed edge carries corresponding tensor data; the two adjacent nodes are two adjacent nodes in the execution order of the computation subtask;
  • the calculation graph further includes a first calculation subtask and a second calculation subtask whose execution relationship is serial, and the execution order of the first calculation subtask is before the second calculation subtask;
  • updating the computation graph by the processor 701 further includes:
  • the identifier of the production node of the tensor data is smaller than the identifier of the consumption node of the tensor data; the production node of the tensor data and the consumption node of the tensor data are two adjacent nodes.
  • the identifier of each node in the calculation graph is used to determine the information of each tensor data in the M tensor data.
  • the information of each tensor data indicates the constraint relationship corresponding to each tensor data, and the processor 701 is further configured to: obtain the constraint quantity corresponding to each of the M tensor data according to the constraint relationship corresponding to each tensor data, where the constraint quantity of a tensor data is the number of other tensor data that cannot reuse the same memory space as the tensor data; and sort the M tensor data according to the constraint quantities corresponding to the M tensor data, to obtain the sorting result of the M tensor data.
  • the information of each tensor data indicates the number of nodes to which each tensor data flows, and the processor 701 is further configured to: sort the M tensor data according to the number of consumer nodes corresponding to each of the M tensor data, to obtain the sorting result of the M tensor data.
  • the processor 701 is further configured to: sort the M tensor data using a heuristic algorithm based on the information of each tensor data, so as to obtain the sorting result of the M tensor data within a preset time period.
  • the sorting result is an optimized sorting result, where the maximum memory size that the neural network corresponding to the optimized sorting result needs to occupy is smaller than the maximum memory size that the neural network needs to occupy as determined according to the sorting result before optimization.
  • Embodiments of the present application also provide a computer storage medium. The computer-readable storage medium stores instructions that, when run on a computer or processor, cause the computer or processor to perform one or more steps of any of the methods described in the foregoing embodiments. If the component modules of the above-mentioned device are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, and the computer software product is stored in a computer-readable storage medium.
  • the above-mentioned computer-readable storage medium may be an internal storage unit of the device described in the foregoing embodiments, such as a hard disk or a memory.
  • the above-mentioned computer-readable storage medium can also be an external storage device of the above-mentioned equipment, such as an equipped plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), and the like.
  • the above-mentioned computer-readable storage medium may also include both an internal storage unit of the above-mentioned device and an external storage device.
  • the above-mentioned computer-readable storage medium is used to store the above-mentioned computer program and other programs and data required by the above-mentioned device.
  • the above-mentioned computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
  • the aforementioned storage medium includes various media that can store program codes, such as ROM, RAM, magnetic disk, or optical disk.
  • the modules in the apparatus of the embodiment of the present application may be combined, divided and deleted according to actual needs.
  • Computer-readable media may include computer-readable storage media, which corresponds to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (eg, according to a communication protocol) .
  • a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave.
  • Data storage media can be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this application.
  • the computer program product may comprise a computer-readable medium.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program codes .
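
The reuse condition stated earlier in this list (all consumer nodes of one tensor data being upstream of the other tensor data's production node) can be checked directly on the computation graph by an upstream-node search. The following is a minimal sketch, not the patented implementation; the node names follow Fig. 1b of the description, and the helper names are invented for the example.

```python
from collections import defaultdict

def upstream_nodes(graph, node):
    """All nodes from which there is a directed path to `node`."""
    reverse = defaultdict(set)
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].add(src)
    seen, stack = set(), [node]
    while stack:
        for p in reverse[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def may_reuse(graph, consumers_of_old, producer_of_new):
    """The tensor produced at `producer_of_new` may reuse the memory of an earlier
    tensor if every consumer node of that earlier tensor is upstream of it."""
    ups = upstream_nodes(graph, producer_of_new)
    return all(c in ups for c in consumers_of_old)

# Adjacency list in the style of Fig. 1b of the description.
graph = {'a': ['b', 'c'], 'b': ['d', 'e'], 'c': ['f', 'g'],
         'd': ['f'], 'e': ['f'], 'f': ['h'], 'g': ['h'], 'h': []}

# t2 is consumed only by node d, and t7 is produced by node f; d is upstream of f,
# so the memory allocated for t2 may be reused for t7.
print(may_reuse(graph, consumers_of_old=['d'], producer_of_new='f'))   # True
```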

Abstract

This application provides a memory allocation method, a related device, and a computer-readable storage medium. The method includes: obtaining a computation graph corresponding to a neural network; and allocating memory space to M tensor data in sequence based on a sorting result of the M tensor data, where if one tensor data of the M tensor data can reuse at least a part of the allocated memory space, the at least a part of the memory space that the tensor data can reuse is allocated to the tensor data, the allocated memory space is memory space that has been allocated to the M tensor data before the tensor data, the sorting result indicates an order in which memory space is allocated to the M tensor data, and the sorting result is related to information of each of the M tensor data. Implementing this application can avoid unreasonable memory allocation.

Description

内存分配方法、相关设备及计算机可读存储介质
本申请要求于2020年09月29日提交国家知识产权局、申请号为202011057095.2、申请名称为“内存分配方法、相关设备及计算机可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,尤其涉及一种内存分配方法、相关设备及计算机可读存储介质。
背景技术
在当前的计算机深度学习领域,为了取得更好的算法精度,深度学习神经网络越来越复杂,硬件能力限制了神经网络向更深的方向发展,必须进行内存的优化。为了实现内存优化,业界通常采用以下内存分配策略:
运行整个神经网络,然后按照整个神经网络运行的前后顺序,为整个神经网络分配内存。比如,神经网络在运行过程中,依次需要占用100M的内存空间、10M的内存空间和50M的内存空间。当神经网络申请100M的内存空间时,可为神经网络分配一个100M的内存空间,然后,当神经网络申请10M的内存空间时,判断是否可以复用上述已分配的10M的内存空间,如果可以复用,则不再为所申请的10M内存空间分配新的内存空间,而是复用上述100M的内存空间。同理,当神经网络申请50M的内存空间时,先判断该50M的内存空间是否可以复用已分配的100M内存空间,如果可以复用,则不再为所申请的50M的内存空间分配新的内存空间。
通过上述描述可以知道的是,在现有技术中,当神经网络申请一内存空间时,首先需要判断一下该申请的内存空间是否可以复用已分配的内存空间,如果可以,则直接分配该申请的内存空间复用已分配的内存空间,如果不可以,再为该内存空间申请分配新的内存空间。但是,若申请的10M内存空间和申请的50M内存空间都可复用已分配的100M内存空间,会出现申请的10M内存空间复用已分配的100M内存空间,而对神经网络额外分配一50M的内存空间,从而整个神经网络需占用150M的内存空间,导致整个神经网络占用的内存较大,内存分配不合理。
发明内容
本申请提供了一种内存分配方法、相关设备及计算机可读存储介质,可避免出现内存分配不合理。例如,该内存分配不合理可以体现在:整个神经网络占用的内存较大。
第一方面,提供了一种内存分配方法,该方法可以包括如下步骤:首先,获取神经网络对应的计算图;其中,计算图包括N个节点和连接不同节点的有向边,节点用于指示神经网络中的一种计算逻辑,有向边用于指示计算逻辑中张量数据的流向;计算图的有向边上承载有张量数据,计算图中包括M个张量数据,M为大于1整数;其次,基于M个张量数据的排序结果,依次给M个张量数据分配内存空间,其中,若M个张量数据中的一个张量数据 可复用已分配的内存空间中的至少一部分,则将张量数据可复用的至少一部分内存空间分配给张量数据,已分配的内存空间为在张量数据之前,已经分配给M个张量数据的内存空间,排序结果指示为M个张量数据分配内存空间时的顺序,排序结果与M个张量数据中每个张量数据的信息有关,每个张量数据的信息指示以下信息中的至少一种:每个张量数据对应的约束关系以及每个张量数据流向的节点的数量,约束关系指示M个张量数据中一个张量数据的可用内存空间分别与M个张量数据中的其他张量数据的可用内存空间的关系。其中,已经分配给M个张量数据的内存空间这句话中,是将M个张量数据作为一个整体,描述的是这个整体已分配的内存空间,这个整体里可能有一部分张量数据是还没有被分配到内存空间的。具体来说,是指,在给上述方法中描述的张量数据分配内存空间前,该M个张量数据中已经被分配了内存空间的一个或多个张量数据的内存空间,例如,按次序该给排序结果中的第m个张量数据分配内存空间,则已经分配给M个张量数据的内存空间就是前m-1个已分配的张量数据所分配到的内存空间,m小于M,大于1。
当然,对于排序结果中的第一个张量,由于没有已分配给M个张量数据的内存空间,直接分配内存空间即可,这是现有技术,不再展开描述。
通俗地来说,张量数据流向的节点即为消费节点。张量数据流出的节点即为生产节点。
需要说明的是,一个张量数据可以承载于不同的有向边上,一个张量数据也可以承载在一个有向边上。
实施本申请实施例,内存分配装置基于M个张量数据的排序结果依次为每个张量数据分配相应大小的内存空间,相较于现有技术中,按照整个神经网络运行的前后顺序,进行内存空间的分配和复用,可以避免出现内存分配不合理现象,从而可节省整个神经网络需要占用的内存,优化了神经网络的内存分配。
在一种可能的实现方式中,该方法还可以包括如下步骤:若张量数据不可复用已分配的内存空间,则为张量数据分配其他的内存空间,其他的内存空间与已分配的内存空间不同。
在一种可能的实现方式中,约束关系指示以下至少一种关系:一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。在实际应用中,上述约束关系具有不同的优先级,换句话说,当两个张量的可用内存空之间的关系是不可复用且连续时,必然也是不可复用的,那么这两个张量的可用内存空之间的关系在约束关系中指示为不可复用且连续。也就是可以理解为不可复用且连续的优先级高于不可复用。
在一种可能的实现方式中,约束关系承载于约束关系表,约束关系表中包括M个数据张量的标识,在约束关系表中,通过第一值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,通过第二值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,通过第三值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。具体地,第一值、第二值以及第三值可以为能够相互区分的数值,例如,第一值可以为“0”,第二值可以为“1”,第三值可以为“2”。结合前文,一种实现方式下,如果一个关系是不可复用且连续,约束关系表的这个关系只会被标记为“2”,而不是“1”和“2”。通过这一实现方式,为后续结合约束关系获取M个张量数据的排序结果提供了便利。进一步地,当通过该排序结果依次为张量数据分配相应大小的内存空间时,可以避免出现内存不合理现象。
在一些可能的实现方式中,在第一张量数据的所有消费节点是第二张量数据的生产节点 的上游节点的情况下,或,在所述第二张量数据的所有消费节点是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据可以复用为所述第二张量数据分配的内存空间;在所述第一张量数据的所有消费节点不是所述第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点不是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据不可以复用为所述第二张量数据分配的内存空间;所述第一张量数据和所述第二张量数据为所述M个张量数据中任意的两个;所述消费节点为张量数据流向的节点,所述生产节点为张量数据流出的节点。具体来说,节点A是节点B的上游节点是指,在计算图中可以通过一条或多条有向边从节点A到节点B,节点A和节点B之间可以由从节点A指向节点B的有向边连接,也可以由多条有向边以及这些有向边经过的节点连接。第一第一通过这一实现方式,可以确定每个张量数据的约束关系,为后续获取M个张量数据的排序结果提供了基础。当通过该排序结果依次为张量数据分配相应大小的内存空间时,可以避免出现内存不合理现象。
在一种可能的实现方式中,计算图包括多个计算子任务,计算子任务通过一组节点和与一组节点有关的边指示一种计算功能,多个计算子任务之间的执行关系为并行执行;该方法还可以包括如下步骤:在一个计算子任务中,若相邻两个节点之间没有有向边,则在相邻两个节点之间添加有向边,以更新计算图;其中,添加的每条有向边上承载有相应的张量数据;相邻两个节点为计算子任务中执行顺序相邻的两个节点;这里,执行顺序是指具有时序关系的顺序。基于更新后的计算图,获取每个张量数据的信息。实施本申请实施例,在计算子任务之间的执行关系全部为并行的情况下,在计算图中的一个计算子任务内,若相邻两个节点之间没有有向边,则在相邻两个节点之间添加一条有向边,以更新计算图,为后续基于计算图分析各节点的祖先关系(例如,上游节点)和确定每个节点各自对应的约束关系提供了基础。需要说明的是,多个计算子任务之间的执行关系是并行是指,执行该多个计算子任务所需的时间段在同一时间基准的情况下有重叠,不强调计算子任务在同一时间开始,和/或在同一时间上结束。在实际应用中,可以通过不同的处理器核来并行执行上述具有并行执行关系的计算子任务。
在一种可能的实现方式中,计算图还包括执行关系为串行的第一计算子任务和第二计算子任务,第一计算子任务的执行顺序在第二计算子任务之前;更新计算图的实现过程还可以包括如下步骤:若第一计算子任务的最后一个节点与第二计算子任务的第一个节点之间没有有向边,则在第一计算子任务的最后一个节点与第二计算子任务的第一个节点之间添加有向边。在计算子任务之间的执行关系包括串行和并行的情况下,通过这一实现方式可以更新计算图,为后续基于计算图分析各节点的祖先关系和确定每个节点各自对应的约束关系提供了基础。例如,计算子任务1与计算子任务之间的执行关系为串行是指,在处理器执行完计算子任务之后,才执行计算子任务2。又例如,计算子任务1与计算子任务2之间的执行关系为串行,计算子任务2与计算子任务3之间的执行关系为并行是指,可以将计算子任务1与计算子任务2看做一个整体,通过处理器核1来运行计算子任务1和计算子任务2,通过处理器核2来运行计算子任务3。处理器核1和处理器核2在执行上述计算子任务所需的时间段在同一时间基准的情况下有重叠。
在一种可能的实现方式中,在计算图中,张量数据的生产节点的标识小于该张量数据的消费节点的标识;张量数据的生产节点与该张量数据的消费节点为相邻的两个节点。通过这一实现方式可以确定每个节点各自对应的标识,为后续基于每个节点各自对应的标识分析各节点的祖先关系和确定每个节点各自对应的约束关系提供了基础。
在一种可能的实现方式中,计算图中各节点的标识用于确定M个张量数据中每个张量数据的信息。以每个张量数据的信息指示每个张量数据的约束关系为例,可以根据各节点的标识,分析各节点的祖先关系(祖先关系关系中可以反映哪些节点是生产节点,哪些节点是消费节点),然后,结合该祖先关系获取每个张量数据的约束关系。
在一种可能的实现方式中,每个张量数据的信息指示每个张量数据对应的约束关系,该方法还可以包括如下步骤:根据每个张量数据对应的约束关系获取M个张量数据各自对应的约束量;约束量为其他张量数据中不可以与张量数据复用同一个内存空间的张量数据的数量;根据M个张量数据各自对应的约束量,对M个张量数据排序,以得到M个张量数据的排序结果。
在一种可能的实现方式中,每个张量数据的信息指示每个张量数据流向的节点的数量,方法还包括:根据M个张量数据各自对应的消费节点数量,对M个张量数据排序,以得到M个张量数据的排序结果。
需要说明的是,在一些可能的实现方式中,还可以根据每个张量数据的信息中的至少两种对M个张量数据按照从大到小排序,得到M个张量数据的排序结果。例如,根据M个张量数据各自对应的约束量和每个张量数据各自对应的内存空间大小对M个张量数据按照从大到小排序,得到M个张量数据的排序结果。又例如,根据M个张量数据各自对应的约束量和每个张量数据各自对应的消费节点数量按照从大到小排序,得到M个张量数据的排序结果。
在一些可能的实现方式中,该方法还可以包括如下步骤:基于每个张量数据的信息使用启发式算法对M个张量数据进行排序,以在预设时间段内获取M个张量数据的排序结果。在一种可能的实现方式中,排序结果是优化后的排序结果,其中,优化后的排序结果对应的神经网络需要占用的最大内存大小小于根据优化前的排序结果确定的神经网络需要占用的最大内存大小。通过这一实现方式,由于优化后的排序结果确定神经网络需要占用的最大内存大小小于根据排序结果确定神经网络需要占用的最大内存大小,可以节省内存空间。
第二方面,本申请实施例还提供了一种内存分配方法,该方法可以包括如下步骤:首先,获取神经网络对应的计算图;其中,计算图包括N个节点和连接不同节点的有向边,节点用于指示神经网络中的一种计算逻辑,有向边用于指示计算逻辑中张量数据的流向;计算图的有向边上承载有张量数据,计算图中包括M个张量数据,M为大于1整数;其次,基于每个张量数据对应的约束关系,按照M个张量数据在神经网络中的执行顺序,依次给M个张量数据分配内存空间,其中,若M个张量数据中的一个张量数据可复用已分配的内存空间中的至少一部分,则将张量数据可复用的至少一部分内存空间分配给张量数据,已分配的内存空间为在张量数据之前,已经分配给M个张量数据的内存空间,约束关系指示M个张量数据中一个张量数据的可用内存空间与M个张量数据中的其他张量数据的可用内存空间的关系。实施本申请实施例,终端设备可以基于每个张量数据各自对应的约束关系,按照M个张量数据的执行顺序,依次给M个张量数据分配内存空间,可以避免在并行场景下,不同执行流中因算子复用同一个内存空间而带来的算子运算结果出错的情形,可以保证神经网络的计算结果的准确性。
在一种可能的实现方式中,该方法还可以包括如下步骤:若张量数据不可复用已分配的内存空间,则为张量数据分配其他的内存空间,其他的内存空间与已分配的内存空间不同。
总的来说,本申请提出的方法可以解决内存分配不合理的问题,例如,内存分配不合理可以体现在:避免为神经网络分配的内存过大,避免在并行场景下,不同执行流中因算子复 用同一个内存空间而带来的算子运算结果出错的情形,可以保证神经网络的计算结果的准确性。
第三方面,本申请实施例提供了一种内存分配装置,该装置可以包括:获取计算图单元,用于获取神经网络对应的计算图;其中,计算图包括N个节点和连接不同节点的有向边,计算图的有向边上承载有张量数据,计算图中包括M个张量数据,M为大于1的整数;分配单元,用于基于M个张量数据的排序结果,依次给M个张量数据分配内存空间,其中,若M个张量数据中的一个张量数据可复用已分配的内存空间中的至少一部分,则将张量数据可复用的至少一部分内存空间分配给张量数据,已分配的内存空间为在张量数据之前,已经分配给M个张量数据的内存空间,排序结果指示为M个张量数据分配内存空间时的顺序,排序结果与M个张量数据中每个张量数据的信息有关,每个张量数据的信息指示以下信息中的至少一种:每个张量数据对应的约束关系以及每个张量数据流向的节点的数量,约束关系指示M个张量数据中一个张量数据的可用内存空间分别与M个张量数据中的其他张量数据的可用内存空间的关系。
在一种可能的实现方式中,分配单元,还用于:若张量数据不可复用已分配的内存空间,则为张量数据分配其他的内存空间,其他的内存空间与已分配的内存空间不同。
在一种可能的实现方式中,约束关系指示以下至少一种关系:一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
在一种可能的实现方式中,约束关系承载于约束关系表,约束关系表中包括M个数据张量的标识,在约束关系表中,通过第一值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,通过第二值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,通过第三值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
在一种可能的实现方式中,在第一张量数据的所有消费节点是第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据可以复用为所述第二张量数据分配的内存空间;在所述第一张量数据的所有消费节点不是所述第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点不是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据不可以复用为所述第二张量数据分配的内存空间;所述第一张量数据和所述第二张量数据为所述M个张量数据中任意的两个;所述消费节点为张量数据流向的节点,所述生产节点为张量数据流出的节点。
在一种可能的实现方式中,计算图包括多个计算子任务,计算子任务通过一组节点和与一组节点有关的边指示一种计算功能,多个计算子任务之间的执行关系为并行执行;装置还包括:更新计算图单元,用于在一个计算子任务中,若相邻两个节点之间没有有向边,则在相邻两个节点之间添加有向边,以更新计算图;其中,添加的每条有向边上承载有相应的张量数据;相邻两个节点为计算子任务中执行顺序相邻的两个节点;获取信息单元,用于基于更新后的计算图,获取每个张量数据的信息。
在一种可能的实现方式中,计算图还包括执行关系为串行的第一计算子任务和第二计算子任务,第一计算子任务的执行顺序在第二计算子任务之前;更新计算图单元,还用于:若 第一计算子任务的最后一个节点与第二计算子任务的第一个节点之间没有有向边,则在第一计算子任务的最后一个节点与第二计算子任务的第一个节点之间添加有向边。
在一种可能的实现方式中,在所述计算图中,张量数据的生产节点的标识小于所述张量数据的消费节点的标识;所述张量数据的生产节点与所述张量数据的消费节点为相邻的两个节点。
在一种可能的实现方式中,所述计算图中各节点的标识用于确定所述M个张量数据中每个张量数据的信息。
在一种可能的实现方式中,所述每个张量数据的信息指示所述每个张量数据对应的约束关系,所述装置还包括:第一排序单元,用于根据所述每个张量数据对应的约束关系获取所述M个张量数据各自对应的约束量;所述约束量为其他张量数据中不可以与张量数据复用同一个内存空间的张量数据的数量;根据所述M个张量数据各自对应的约束量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
在一种可能的实现方式中,所述每个张量数据的信息指示所述每个张量数据流向的节点的数量,所述装置还包括:第二排序单元,用于根据所述M个张量数据各自对应的消费节点数量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
在一种可能的实现方式中,所述装置还包括:
第三排序单元,用于基于所述每个张量数据的信息使用启发式算法对所述M个张量数据进行排序,以在预设时间段内获取所述M个张量数据的排序结果。
在一种可能的实现方式中,所述排序结果是优化后的排序结果,其中,所述优化后的排序结果对应的所述神经网络需要占用的最大内存大小小于根据优化前的排序结果确定的所述神经网络需要占用的最大内存大小。
第四方面,本申请实施例还提供了一种内存分配装置,该装置可以包括:获取计算图单元,用于获取神经网络对应的计算图;其中,所述计算图包括N个节点和连接不同节点的有向边,所述计算图的有向边上承载有张量数据,所述计算图中包括M个张量数据,所述M为大于1的整数;分配单元,用于基于每个张量数据对应的约束关系,按照所述M个张量数据在所述神经网络中的执行顺序,依次给所述M个张量数据分配内存空间,其中,若所述M个张量数据中的一个张量数据可复用已分配的内存空间中的至少一部分,则将所述张量数据可复用的至少一部分内存空间分配给所述张量数据,所述已分配的内存空间为在所述张量数据之前,已经分配给所述M个张量数据的内存空间,所述约束关系指示所述M个张量数据中一个张量数据的可用内存空间与所述M个张量数据中的其他张量数据的可用内存空间的关系。
在一种可能的实现方式中,分配单元,还用于:若所述张量数据不可复用已分配的内存空间,则为所述张量数据分配其他的内存空间,所述其他的内存空间与所述已分配的内存空间不同。
第五方面,本申请实施例还提供一种内存分配设备,该内存分配设备可以包括存储器和处理器,所述存储器用于存储支持内存分配设备执行上述方法的计算机程序,所述计算机程序包括程序指令,所述处理器被配置用于调用所述程序指令,执行上述第一方面中任一项或第二方面中任一项的内存分配方法。
第六方面,本申请实施例还提供一种计算机可读存储介质,所述计算机存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时使所述处理器执行上述第一方面中任一项或第二方面中任一项的内存分配方法。
第七方面,本申请实施例还提供了一种计算机程序,所述计算机程序包括计算机软件指令,所述计算机软件指令当被计算机执行时使所述计算机执行如上述第一方面中任一项或第二方面中任一项的内存分配方法。
附图说明
图1a为本申请实施例提供的一种神经网络的计算图的结构示意图;
图1b为本申请实施例提供的另一种神经网络的计算图的结构示意图;
图1c为本申请实施例提供的一种计算图中各算子的执行顺序的示意图;
图1d为本申请实施例提供的一种计算图中各算子的执行顺序的示意图;
图2a为本申请实施例提供的一种神经网络的计算图以及计算图中各算子的执行顺序的示意图;
图2b为本申请实施例提供的一种为张量数据分配内存空间的示意图;
图2c为本申请实施例提供的一种在并行场景下神经网络的计算图以及计算图中各算子的执行顺序的示意图;
图3a为本申请实施例提供的一种内存分配设备的结构示意图;
图3b为本申请实施例提供的服务器或终端设备侧架构的示意图;
图3c为本申请实施例提供的一种网络架构示意图;
图3d为本申请实施例提供的一种有向无环图DAG的示意图;
图4a为本申请实施例提供的一种内存分配方法的流程示意图;
图4b为本申请实施例提供的一种卷积神经网络400的结构示意图;
图4c为本申请实施例提供的一种确定排序结果的示意图;
图4d为本申请实施例提供的一种已分配集合中内存空间的示意图;
图4e为本申请实施例提供的一种更新后的计算图的示意图;
图4f为本申请实施例提供的一种已分配集合中内存空间的示意图;
图4g为本申请实施例提供的一种已分配集合中内存空间的示意图;
图5为本申请实施例提供的另一种内存分配方法的流程示意图;
图6为本申请实施例提供的一种内存分配装置60的结构示意图;
图7为本申请实施例提供的一种内存分配设备70的结构示意图。
具体实施方式
下面结合附图对本申请实施例中的技术方案进行清楚、完整的描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。
本申请的说明书以及附图中的术语“第一”和“第二”等是用于区分不同的对象,或者用于区别对同一对象的不同处理,而不是用于描述对象的特定顺序。此外,本申请的描述中所提到的术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。例如包含了一些列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选 地还包括其他没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其他步骤或单元。需要说明的是,本申请实施例中,“示例性地”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性地”或者“例如”的任何实施例或设计方法不应被解释为比其他实施例或设计方案更优地或更具优势。确切而言,使用“示例性地”或者“例如”等词旨在以具体方式呈现相关概念。在本申请实施例中,“A和/或B”表示A和B,A或B两个含义。“A,和/或B,和/或C”表示A、B、C中的任一个,或者,表示A、B、C中的任两个,或者,表示A和B和C。
为了更好的理解本申请所描述的技术方案,下面先解释本申请实施例涉及的相关技术术语:
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距b为输入的运算单元,该运算单元的输出可以为:
$$h_{W,b}(x)=f\Big(\sum_{s=1}^{n} w_{s}x_{s}+b\Big)$$
其中,s=1、2、……n,n为大于1的自然数,w s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
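作为上式的一个简单数值示例（仅为示意性的草图：权重、输入取任意示例值，激活函数f取正文中提到的sigmoid函数）：

```python
import math

def neuron_output(x, w, b):
    """h = f(sum_s w_s * x_s + b), here with a sigmoid activation f."""
    z = sum(ws * xs for ws, xs in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

print(neuron_output(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1))   # ≈ 0.525
```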
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间可以是全连接的,也可以是非全连接的。在层与层之间是全连接的情况下,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
$$\vec{y}=\alpha(W\vec{x}+\vec{b})$$
其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，$W$是权重矩阵（也称系数），$\alpha(\cdot)$是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多，则系数$W$和偏移向量$\vec{b}$的数量也就很多了。这些参数在DNN中的定义如下所述：以系数$w$为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$w_{24}^{3}$。上标3代表系数$W$所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。
总结就是：第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。
需要注意的是,输入层是没有w参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量w形成的权重矩阵)。
(3)计算图
在本申请中,计算图是使用图结构对神经网络的计算过程进行描述的一种方式。如果计算有明显的模块性,并且模块之间有明显的时间上和逻辑上的依赖关系,通常可以使用有向图结构来进行描述。在实际应用中,图结构的基本元素有两个,分别为节点和有向边。神经网络模型可以抽象为张量数据和算子所组成的有向图结构。节点,又称为算子。顾名思义,有向边是指,带方向的边,其描述了算子与算子之间的指向方向,用于表征算子与算子之间的依赖关系。在本申请中,出于阐述的便利,以节点为例进行说明。
如图1a所示,计算图中包含8个算子和9个张量数据。具体地,8个算子分别为算子a、算子b、算子c、算子d、算子e、算子f、算子g和算子h,9个张量数据分别为张量数据t0、张量数据t1、张量数据t2、张量数据t3、张量数据t4、张量数据t5、张量数据t6、张量数据t7和张量数据t8。以算子a、算子b和算子c为例,对算子a来说,算子a与算子b之间存在一条有向边,该有向边的指向方向为算子a指向算子b,该有向边表示算子a生成的张量数据t0为算子b的输入;与此同时,算子a与算子c之间存在一条有向边,该有向边的指向方向为算子a指向算子c,该有向边表示算子a生成的张量数据t0为算子b的输入。
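The computation graph of Fig. 1a can be held directly as a directed graph whose edges carry tensor identifiers. The sketch below is one possible in-memory representation; the edge list is transcribed from the producer/consumer relations described for Fig. 1a/1b in this description (see Table 1 below), and it is only an illustration, not the framework's actual data structure.

```python
# One entry per (producer, consumer) directed edge; the tensor carried on the edge
# may appear on several edges (e.g. t0 flows to both b and c).
edges = [
    ('a', 'b', 't0'), ('a', 'c', 't0'), ('a', 'h', 't1'),
    ('b', 'd', 't2'), ('b', 'e', 't3'),
    ('c', 'f', 't4'), ('c', 'g', 't4'),
    ('d', 'f', 't5'), ('e', 'f', 't6'),
    ('f', 'h', 't7'), ('g', 'h', 't8'),
]

producer, consumers = {}, {}
for src, dst, t in edges:
    producer[t] = src
    consumers.setdefault(t, set()).add(dst)

print(producer['t0'], sorted(consumers['t0']))   # a ['b', 'c']
print(producer['t4'], sorted(consumers['t4']))   # c ['f', 'g']
```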
一般来说,对神经网络模型使用计算图的方式进行描述,有利于对整个神经网络计算任务进行整体的把握,与此同时,计算图的表达方式也方便对计算任务进行调度和并行执行。
在本申请中,计算图内的算子可以被分配到多个计算子任务,不同计算子任务之间可以并行,也可以串行。同一个计算子任务内的算子的运行顺序为串行。例如,如图1b所示,算子a和算子c被分配到计算子任务0内,算子b、算子d、算子e和算子f被分配到计算子任务1内,算子g和算子h被分配到计算子任务2内。关于如何确定计算图中计算子任务的数量请参考现有的实现方式,此处不多加限定。例如,可以通过对神经网络计算任务进行拆分,得到多个计算子任务,从而可以得到多个计算子图,一个计算子图即为一个计算子任务。
在一些实施例中,当计算子任务0、计算子任务1和计算子任务2处于并行状态时,不同计算子任务之间算子的执行顺序取决于算子间有向边的指向,举例来说,在计算子任务0内,算子之间的执行顺序为算子a-算子c;在计算子任务1内,算子之间的执行顺序为算子b-算子d/算子e-算子f;在计算子任务2内,算子之间的执行顺序为算子g-算子h。在图1b中,算子c和算子g在两个计算子任务中并行执行,由于算子c生成的张量数据t4为算子g的输入,所以,在算子c运行结束之后才会执行算子g。算子e和算子g在两个计算子任务中并行执行,对算子e来说,算子e生成的张量数据t6不是算子g的输入;对算子g来说,算子g生成的张量数据t8不是算子e的输入,也即:算子e和算子g之间没有张量数据的产生和消费,所以,并不限定算子e与算子g之间的执行顺序。具体地,该计算图中各算子的执行顺序可以如图1c所示。需要说明的是,在图1c中,并不限定算子d与算子e之间的执行顺序,基于此,将其表示为算子d/算子e。
在一些实施例中,当计算子任务0、计算子任务1处于并行状态,计算子任务1和计算子任务2处于串行状态时,不同计算子任务之间算子的执行顺序取决于算子间有向边的指向,举例来说,在计算子任务0内,算子之间的执行顺序为算子a-算子c;在计算子任务1内,算子之间的执行顺序为算子b-算子d/算子e-算子f;在计算子任务2内,算子之间的执行顺序为算子g-算子h。对计算子任务1和计算子任务2来说,在计算子任务1中的算子f运行结束之后才会运行计算子任务2中的算子g。具体地,该计算图中各算子的执行顺序可以如图1d所示。
(4)依赖关系
在本申请中,算子A依赖于算子B,表示算子A在开始它自己的计算任务前必须等待算子B对应的内核函数执行完毕。
(5)张量
在本申请中,张量仅仅是对存储的一块数据的特征描述,张量记录了数据的形状、类型等信息。
在本申请中,张量应该理解为张量数据,可以包括神经网络模型中的输入张量数据、输出张量数据,也可以包括特征张量数据等。
以人工智能深度学习框架TensorFlow为例，一般使用阶（rank）、形状（shape）和维数（dimension number）来描述张量的维度，其关系可以表示为下表所示：
阶 形状 维数 举例
0 [] 0-D 4
1 [D1] 1-D [2]
2 [D1,D2] 2-D [6,2]
3 [D1,D2,D3] 3-D [7,3,2]
… … … …
n [D1,D2,D3,…Dn] n-D 形状为[D1,D2,D3,…Dn]的张量
如上表所示,张量A=4,其表示一个数。
如上表所示,张量A=[6,2],其表示二维矩阵。具体地,该矩阵为6行2列的矩阵。
(6)已分配集合
在本申请中,已分配集合是指,存储神经网络的张量数据的过程中,已分配内存空间信息的集合。上述已分配集合也可以称为共享内存队列,本申请对此不作具体限定。
(7)神经网络的内存分配策略
第一种策略:称为In-Place策略,该策略是指神经网络中各节点的输入和输出共用一块内存空间。
第二种策略:称为Co-share策略,该策略是指某块内存空间可以供神经网络中的多个节点使用,当这些节点均执行完毕时,该内存空间的生命周期结束,此时该内存空间可供神经网络中的其他节点使用。比如,可预设内存空间A的生命周期为(1,2,3),代表该内存空间A可供节点1、节点2和节点3使用,当节点1、节点2以及节点3均执行完成,内存空间A的生命周期结束,此时,内存空间A可放置于空闲链表中,供神经网络中的其他节点使用。
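A minimal sketch of the Co-share idea just described: a memory block carries the set of nodes that form its life cycle, is returned to a free list once all of them have finished, and a later request may take any sufficiently large free block. The class and function names, the 100-unit size and the first-fit choice are illustrative assumptions, not the patented scheme.

```python
class CoShareBlock:
    def __init__(self, size, lifetime):
        self.size = size
        self.lifetime = set(lifetime)       # nodes that still need this block

blocks, free_list = [], []

def on_node_finished(node):
    """Release every block whose life cycle ends once `node` has executed."""
    for blk in blocks:
        blk.lifetime.discard(node)
        if not blk.lifetime and blk not in free_list:
            free_list.append(blk)

def request(size, lifetime):
    """Reuse a sufficiently large free block if possible, otherwise create a new one."""
    for blk in free_list:
        if blk.size >= size:
            free_list.remove(blk)
            blk.lifetime = set(lifetime)
            return blk
    blk = CoShareBlock(size, lifetime)
    blocks.append(blk)
    return blk

a = request(100, ['node1', 'node2', 'node3'])   # memory space A, life cycle (1, 2, 3)
for n in ['node1', 'node2', 'node3']:
    on_node_finished(n)
b = request(80, ['node4'])                      # a later node reuses the freed block
print(b is a, len(blocks))                      # True 1
```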
目前,针对上述第二种策略,内存分配的具体方法为:按照神经网络中节点执行的先后顺序进行内存空间的分配和复用,内存分配效果较差。
比如,神经网络在运行过程中,依次需要占用100M的内存空间、10M的内存空间和50M的内存空间。当神经网络申请100M的内存空间时,可为神经网络分配一个100M的内存空间,然后,当神经网络申请10M的内存空间时,判断是否可以复用上述已分配的10M的内存空间,如果可以复用,则不再为所申请的10M内存空间分配新的内存空间,而是复用上述100M的内存空间。同理,当神经网络申请50M的内存空间时,先判断该50M的内存空间是 否可以复用已分配的100M内存空间,如果可以复用,则不再为所申请的50M的内存空间分配新的内存空间。但是,若申请的10M内存空间和申请的50M内存空间都可复用已分配的100M内存空间,会出现申请的10M内存空间复用已分配的100M内存空间,而对神经网络额外分配一50M的内存空间,从而整个神经网络需占用150M的内存空间,导致整个神经网络占用的内存较大,内存分配不合理。
针对以上,本申请提供一种内存分配方法,该方法的主要原理为:根据每个张量数据的信息获取整个神经网络中多个张量数据的排序结果,其中,每个张量数据的信息可以包括每个张量数据需要占用内存空间大小、每个张量数据各自对应的约束关系以及消费算子中的至少一种,此外,对每个张量数据来说,均包括各自对应的标识,然后,按照排序结果依次为张量数据分配内存空间。采用上述方法,可以避免上述现有技术中出现的内存规划空间不合理的情况,从而可以节省整个神经网络需要占用的内存,优化了神经网络的内存分配。此外,该方法还可以解决并行场景下,不同计算子任务中因算子复用同一个内存空间而带来的算子运算结果出错的问题。
又比如,如图2a所示,整个神经网络包括8个节点,按照运行的先后顺序,索引分别为a至h。而通过预分析,可获得上述图1a所示的神经网络在运行时,按照算子的执行顺序{a,b,c,d,e,f,g,h},先后需占用5个内存空间,每个内存空间用于张量数据,分别为第一内存空间、第二内存空间、第三内存空间、第四内存空间以及第五内存空间。
具体来说,为张量数据预先分配内存的实现过程可以如图2b所示,当模拟执行算子a的运算逻辑之前,为张量数据t0分配第一内存空间,为张量数据t1分配第二内存空间。当模拟执行算子b的运算逻辑之前,为张量数据t2分配第三内存空间,为张量数据t3分配第四内存空间。当模拟执行算子c的运算逻辑之前,为张量数据t4分配第五内存空间。与此同时,释放第一内存空间,对接下来的张量数据来说,可以复用第一内存空间。当模拟执行算子d的运算逻辑之前,为张量数据t5分配上述可以复用的第一内存空间。与此同时,释放第三内存空间,对接下来的张量数据来说,可以复用第三内存空间。当模拟执行算子e的运算逻辑之前,为张量数据t6分配上述可以复用的第三内存空间。与此同时,释放第四内存空间,对接下来的张量数据来说,可以复用第四内存空间。当模拟执行算子f的运算逻辑之前,为张量数据t7分配上述可以复用的第四内存空间。与此同时,释放第一内存空间和第三内存空间,对接下来的张量数据来说,可以复用上述第一内存空间和第三内存空间。当模拟执行算子g的运算逻辑之前,为张量数据t8分配上述可以复用的第一内存空间。与此同时,释放第五内存空间。当模拟执行算子h的运算逻辑之前,释放第二内存空间、第四内存空间以及第一内存空间。与此同时,将神经网络的运算结果存储在设定的内存空间中。
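The Fig. 2b walk-through above can be reproduced with a small simulation that follows the execution order {a, …, h}, frees blocks at the points stated in the paragraph, and lets the next tensor take a previously freed block. The alloc/free schedule below is transcribed from the paragraph; treating all blocks as interchangeable and taking the earliest released block first is a simplifying assumption of this sketch.

```python
# (operator, tensors allocated before it runs, tensors whose blocks are released at the same time)
schedule = [
    ('a', ['t0', 't1'], []),
    ('b', ['t2', 't3'], []),
    ('c', ['t4'], ['t0']),
    ('d', ['t5'], ['t2']),
    ('e', ['t6'], ['t3']),
    ('f', ['t7'], ['t5', 't6']),
    ('g', ['t8'], ['t4']),
    ('h', [], ['t1', 't7', 't8']),
]

free_blocks, block_of, new_blocks = [], {}, 0
for op, allocs, frees in schedule:
    for t in allocs:
        if free_blocks:
            block_of[t] = free_blocks.pop(0)    # reuse the earliest released block
        else:
            block_of[t] = new_blocks            # otherwise open a new block
            new_blocks += 1
    for t in frees:
        free_blocks.append(block_of[t])

print(new_blocks)                               # 5 memory spaces in total, as in Fig. 2b
print(block_of['t5'], block_of['t8'])           # 0 0 -> t5 and t8 both reuse the first memory space
```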
如图2b所示,为整个神经网络确定的内存规划空间的大小为上述5个内存空间的大小之和。
在图1b所示的计算图中,该计算图包括三个计算子任务,这三个计算子任务分别代表不同的神经网络计算子任务。一般来说,这三个计算子任务之间的关系可以为串行,也可以为并行。在同一个计算子任务内,算子之间的执行顺序为串行。如图2c所示,当计算子任务0与计算子任务1之间的关系为并行时(例如,可用通过不同的处理器核来运行上述两个计算子任务),由于算子c与算子d之间的执行顺序为并行关系,如果在执行算子d的运算时,算子c的运算并没有完成,在这种情况下,如果张量数据t0和张量数据t5复用同一个内存空间(第一内存空间),会导致算子a生成的张量数据t0被覆盖,从而导致算子c的运算结果出错。可以理解的是,现有技术中,在计算图中存在并行计算子任务的情况下,图2b所示的内 存规划空间不合理,该内存规划空间不合理体现在:在两个并行计算子任务中,不同计算子任务中的算子复用同一个内存空间,会导致其中一个算子的运算结果出错。
针对以上,本申请提供另一种内存分配方法,该方法的主要原理为:在并行场景下,确定计算图中每个张量数据各自对应的约束关系,然后,基于每个张量数据各自对应的约束关系为张量数据分配内存空间。采用上述方法,可以避免上述现有技术中出现的内存规划空间不合理的情况,避免不同计算子任务中因算子复用同一个内存空间而带来的算子运算结果出错的情形,可以保证神经网络的计算结果的准确性。
为了便于更好的理解本申请,下面介绍几个本申请所描述的方法可以应用的应用场景:
如图3a所示,本申请所描述的方法可以应用在神经网络在线训练/推理,也可以应用在神经网络离线训练/推理。具体来说,在神经网络在线训练/推理的场景下,处理器CPU与人工智能处理器通过I/O总线进行通信,为处于运行状态的神经网络分配内存。在神经网络离线/推理的场景下,通用处理器通过获取硬盘中存储的神经网络离线文件,当调用该神经网络离线文件时,为其分配内存。
在本申请中,内存分配设备可以具体为服务器或终端设备。如图3b所示,服务器或终端设备侧中可以包括深度学习算法、深度学习框架、计算资源以及内存资源等。其中,深度学习算法可通过深度学习框架来调用计算资源和内存资源。
以卷积神经网络框架(Convolution Architecture For Fast Feature embedding,Caffe)为例,Caffe可以支持多种类型的深度学习框架、面向图像分类和图像分割,还可以支持卷积神经网络(Convolutional Neural Networks,CNN)、用于目标检测的卷积神经网络(Region-CNN,RCNN)、长短期记忆神经网络(Long Short-Term Memory,LSTM)和全连接神经网络设计。如图3c所示,深度学习算法中可以包括网络模型,深度学习框架中可包括NET类、层、Blob、任务管理和内存管理模块(syncmem),其中,在内存管理模块中可设置有MemModel模块,基于blob和内存管理模块的原有逻辑,可实现内存优化。
在本申请实施例中,上述网络模型可具体为神经网络的网络模型。NET类中可存储神经网络所对应的有向无环图(Directed acyclic graph,DAG),比如,如图3d所示,提供DAG的一种示例,在图3d的示例中,神经网络包括A、B、C、E、F以及G五个节点,节点A的输出参数(例如,张量数据),作为节点B的输入参数,节点B的输出参数,分别作为节点C和节点F的输入参数,节点C的输出参数,作为节点E的输入参数,节点E和节点F的输出参数,作为节点G的输入参数。层,用于存储神经网络所包括节点的信息,节点也可称为层。Blob用于存储神经网络的每个节点在运算过程中,所对应的输入参数、输出参数以及中间参数所占用内存空间的信息。内存管理模块用于对神经网络所占用内存空间的信息进行管理和分配。
请参见图4a,为本申请实施例提供的一种神经网络的内存分配方法的流程示意图。该方法的执行主体可以为运行神经网络的服务器,也可以为运行神经网络的终端设备。出于阐述的便利,以执行主体为运行神经网络的终端设备为例,进行说明。在图4a所示的方法流程示意图中,可设定整个神经网络在运行时,需要多个内存空间。如图4a所示,该方法可以包括但不限于如下步骤:
步骤S401、获取神经网络对应的计算图;其中,计算图包括N个节点和和连接不同节点的有向边,计算图的有向边上承载有张量数据,计算图中包括M个张量数据,M为大于1的整数。
在本申请实施例中,节点,用于指示神经网络中的一种计算逻辑,也即:实现某种特定功能的函数。在实际应用中,可以使用OP表示节点,tensor表示张量数据。
例如,以神经网络为卷积神经网络为例,该一种卷积神经网络的具体结构可以如图4b所示,该卷积神经网络(CNN)400可以包括输入层410,卷积层/池化层420(其中池化层为可选的)、全连接层430以及输出层440。这里,全连接层430是指全连接特性的网络结构。以隐含层1为例,可以通过隐含层1的输入数据与隐含层1对应的权值张量的乘积来表示全连接特性,例如,该全连接特性可以量化为ωx,其中,ω表示隐含层1对应的权值张量,x表示隐含层1的输入数据。具体来说,卷积层420,用于提取输入数据的特征,例如,当输入数据为图像时,卷积层420用于提取输入图像的特征,以减少输入图像带来的参数;全连接层430,用于整合卷积层420(或者池化层)中具有类别区分性的局部信息,例如,全连接层430可以连接卷积层420提取到的特征。在实际应用中,为了提升卷积神经网络400的网络性能,全连接层430中每个神经元的激励函数一般采用ReLU函数。最后一层全连接层430的输出值被传递给一个输出,例如,可以采用softmax逻辑回归(softmax regression)进行分类,从而可以得到处理结果。例如,该处理结果可以为图像的识别概率,从而可以通过输出层440输出该处理结果。
终端设备可以获取上述卷积神经网络对应的计算图,该计算图包括卷积节点,全连接节点(FC),激活节点(Relu)、池化节点(Pooling)、分类器节点(softmax)等。
在本申请实施例中,有向边可以用于表征节点与节点之间的连接关系,该有向边上承载有张量数据,有向边的指向用于反映张量数据的流向。
步骤S402、基于M个张量数据的排序结果,依次给M个张量数据分配内存空间,其中,若所述M个张量数据中的一个张量数据可复用已分配的内存空间中的至少一部分,则将所述张量数据可复用的至少一部分内存空间分配给所述张量数据,所述已分配的内存空间为在所述张量数据之前,已经分配给所述M个张量数据的内存空间。
在本申请实施例中,M个张量数据的排序结果指示为M个张量数据分配内存空间时的执行顺序,该排序结果与M个张量数据中的每个张量数据的信息有关,每个张量数据的信息指示以下信息中的至少一种:每个张量数据对应的约束关系以及每个张量数据流向的节点的数量。
在本申请中,张量数据可以包括输入张量数据、输出张量数据以及中间张量数据。
在本申请中,消费节点是指,在计算图中,消费张量数据的节点,也即张量数据流向的节点。顾名思义,“消费”是说,节点运算过程中对物质(例如,张量数据)的使用和消耗。
在本申请中,生产节点是指,在计算图中,生成张量数据的节点,也即张量数据流出的节点。顾名思义,“生产”是“消费”的逆过程,表示节点运算过程中输出的事物。
在本申请中,节点A是节点B的上游节点是指,在计算图中存在至少一条路径可以从节点A到节点B。例如,可以在计算图中,以节点B为起点,通过反向遍历的方式(也即:沿着有向边的反方向)获取节点B对应的上游节点。
在本申请中,该约束关系可以承载于约束关系表,在该约束关系表中,可以通过第一值指示各张量数据可以与其他张量数据复用同一个内存空间,通过第二值指示各张量数据不可以与其他张量数据复用同一个内存空间,通过第三值指示各张量数据可以与其他张量数据在同一个内存空间中进行连续存储。
具体地,第一值、第二值以及第三值可以为能够相互区分的数值,例如,第一值可以为“0”,第二值可以为“1”,第三值可以为“2”。
在一些实施例中,上述约束关系具有不同的优先级,换句话说,当两个张量的可用内存空之间的关系是不可复用且连续时,必然也是不可复用的,那么这两个张量的可用内存空之间的关系在约束关系中指示为不可复用且连续。也就是可以理解为不可复用且连续的优先级高于不可复用。
需要说明的是,在一些实施例中,该约束关系还可以不局限于约束关系表的表现形态,还可以通过其他数据结构来呈现。
在一些实施例中,确定计算图中每个张量数据各自对应的约束关系的实现过程可以包括:判断第一张量数据的所有消费节点是否为第二张量数据的生产节点的上游节点,若是,则确定第一张量数据可以复用为第二张量数据分配的内存空间;若否,则确定第一张量数据不可以复用为第二张量数据分配的内存空间。
在一些实施例中,确定计算图中每个张量数据各自对应的约束关系的实现过程可以包括:判断第二张量数据的所有消费节点是否为所述第一张量数据的生产节点的下游节点,若是,则确定第一张量数据可以复用为第二张量数据分配的内存空间;若否,则确定第一张量数据不可以复用为第二张量数据分配的内存空间。
在本申请中,张量数据A可以复用为张量数据B分配的至少一部分内存空间是指,张量数据A可以完全复用为张量数据B分配的内存空间,或者,张量数据A可以复用为张量数据B分配的一部分内存空间。
在本申请中,节点A是节点B的上游节点是指,在计算图中存在至少一条路径可以从节点A到节点B。例如,可以在计算图中,以节点B为起点,通过反向遍历的方式(也即:沿着有向边的反方向)获取节点B对应的上游节点。
在一些实施例中,确定计算图中每个张量数据需要占用的内存空间大小的实现过程可以包括:终端设备运行神经网络,记录神经网络中每个张量数据需要占用的内存空间大小,根据记录的每个张量数据需要占用的内存空间大小来确定神经网络处于运行状态时,每个张量数据需要占用的内存空间大小,为后续给张量数据分配相应的内存空间提供了基础。比如,整个神经网络包括节点1和节点2,终端设备通过人工智能处理器运行该神经网络,可记录在神经网络运行过程中,张量数据1需要占用的内存空间大小为1000Kb,张量数据2需要占用的内存空间大小为500Kb,从而可以根据记录的每个张量数据需要占用的内存空间大小来确定神经网络处于运行状态时,每个张量数据需要占用的内存空间大小。
此外需要说明的是,在本申请中,默认每个张量数据均有自身对应的标识,以图1b所示的计算图为例,在该计算图中包括8个算子和9个张量数据,其中,9个张量数据可以表示为张量数据t0、张量数据t1、张量数据t2、张量数据t3、张量数据t4、张量数据t5、张量数据t6、张量数据t7和张量数据t8。可以理解的是,每个张量数据各自对应的标识具有唯一性。
在一些实施例中,上述标识可以为一系列有顺序的号码,从而可以根据每个张量数据各自对应的标识确定张量数据的顺序。
在一些实施例中,可以根据每个张量数据对应的约束关系获取M个张量数据各自对应的约束量,该约束量为其他张量数据中不可以与张量数据复用同一个内存空间的张量数据的数量;之后,根据M个张量数据各自对应的约束量的大小,对M个张量数据按照从大到小排序,得到M个张量数据的排序结果。
在一些实施例中,可以根据M个张量数据各自对应的消费节点数量的大小,对M个张量数据按照从大到小排序,得到M个张量数据的排序结果。
可以理解的是,还可以根据每个张量数据的信息中的至少两种对M个张量数据按照从大 到小排序,得到M个张量数据的排序结果。例如,以每个张量数据的信息包括每个张量数据需要占用的内存空间大小和每个张量数据各自对应的约束关系为例,计算图中包括2个张量数据,分别为张量数据1和张量数据2,其中,张量数据1需要占用的内存空间大小为1000Kb,张量数据1与张量数据2之间的约束关系为张量数据1不可以与张量数据2复用同一个内存空间,其约束量为1;张量数据1需要占用的内存空间大小为500Kb,张量数据2与张量数据1之间的约束关系为张量数据2不可以与张量数据1复用同一个内存空间,其约束量为1。对上述2个张量数据按照从大到小进行排序,得到的排序结果:张量数据1、张量数据2。
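The combined ordering criterion described above can be written as a single sort key. Below is a sketch using the two-tensor example from this paragraph; the field names are illustrative, and the choice of constraint quantity first with required memory size as the tie-breaker is one possible combination of the criteria mentioned in the text.

```python
tensors = [
    {'name': 'tensor1', 'size_kb': 1000, 'constraints': 1},
    {'name': 'tensor2', 'size_kb': 500,  'constraints': 1},
]

# Sort in descending order: constraint quantity first, then required memory size.
order = sorted(tensors, key=lambda t: (t['constraints'], t['size_kb']), reverse=True)
print([t['name'] for t in order])   # ['tensor1', 'tensor2'], matching the sorting result in the text
```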
在一些实施例中,可以通过启发式算法对M个张量数据进行排序,以在预设时间段内获取M个张量数据的排序结果。这里,启发式算法是指,一个基于直观或经验构造的算法,在可接受的花费(指计算时间和空间)下给出待解决组合优化问题每一个实例的一个可行解,该可行解与最优解的偏离程度一般不能被预计。
如前所述,每个张量数据的信息包括每个张量数据各自对应的标识之外,还可以包括如下的一项或多项:每个张量数据需要占用的内存空间大小、每个张量数据各自对应的约束关系、每个张量数据流向的消费节点数量。在终端设备通过启发式算法对M个张量数据进行排序时,需要考虑每个张量数据所包含的信息之间的顺序,然后,以该顺序作为一个独立的个体进行排序。例如,当每个张量数据的信息包括每个张量数据各自对应的标识、每个张量数据需要占用的内存空间大小、每个张量数据各自对应的约束关系和每个张量数据流向的消费节点数量的情况下,这4个信息之间的混合排序结果可以包括632种。
举例来说,计算图中包括4个张量数据,分别为张量数据1、张量数据2、张量数据3、张量数据4和张量数据5,终端设备通过上述启发式算法对5个张量数据进行排序,获取在预设时间段内启发式算法确定的排序序列(例如,该排序序列为:张量数据2、张量数据3、张量数据4、张量数据1和张量数据5)为5个张量数据的排序结果。从而,终端设备可以根据确定好的排序结果为张量数据分配内存空间。
在一些实施例中,内存分配装置可以通过约束规划求解器CPsolver(Constraint Programmingsolver)调用启发式算法(例如,启发式算法包括确定性算法和随机算法)来进行M个张量数据排序。需要说明的是,上述排序结果可能为需要优化的排序结果,也可能为不需要优化的排序结果。
在一些实施例中,为了节省内存,上述排序结果为优化后的排序结果,其中,优化后的排序结果对应的神经网络需要占用的最大内存大小小于根据优化前的排序结果确定的神经网络需要占用的最大内存大小。例如,计算图中包括4个张量数据,分别为张量数据1、张量数据2、张量数据3、张量数据4和张量数据5,其中,张量数据1、张量数据2、张量数据3、张量数据4之间的排列顺序依次为:张量数据1、张量数据2、张量数据3、张量数据4,在该排序结果中,如图4c所示,张量数据5存在4种可能的位置(可能位置1、可能位置2、可能位置3以及可能位置4),其中,可能位置1为介于张量数据1和张量数据2之间的位置;可能位置2为介于张量数据2和张量数据3之间的位置;可能位置3为介于张量数据3和张量数据4之间的位置;可能位置4为位于张量数据4之后的位置。如图4c所示,上述4个潜在的可能位置各自对应着不同的内存空间。基于上述潜在的可能位置,内存分配装置根据不同的判断条件来确定张量数据5的位置,其中,该判断条件可以包括但不限于:对某一个张量数据来说,为该张量数据分配的内存空间对应的首地址最小或最大;对某一个张量数据来说,确定潜在可能位置对应的内存空间大小与张量数据需要占用的内存空间大小之间的差值满足阈值,例如,该阈值可以为0,也可以为其他数值。在该阈值为0的情况下,表示潜在 可能位置对应的内存空间大小等于张量数据需要占用的内存空间大小。当张量数据5在排序结果中的位置为可能位置1时,终端设备根据该排序结果为张量数据分配内存空间,例如,通过已分配的内存空间确定运行整个神经网络需要的最大内存大小为4500Kb;当张量数据5在排序结果中的位置为可能位置2时,终端设备根据该排序结果为张量数据分配内存空间,例如,通过已分配的内存空间确定运行整个神经网络需要的最大内存大小为3500Kb;当张量数据5在排序结果中的位置为可能位置3时,终端设备根据该排序结果为张量数据分配内存空间,例如,通过已分配的内存空间确定运行整个神经网络需要的最大内存大小为5000Kb;当张量数据5在排序结果中的位置为可能位置4时,终端设备根据该排序结果为张量数据分配内存空间,例如,通过已分配的内存空间确定运行整个神经网络需要的最大内存大小为4000Kb。
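The position search for 张量数据5 (tensor data 5) described above can be sketched as a loop over the candidate slots that keeps the slot giving the smallest resulting peak memory, which is one possible judgment condition; the text also mentions others, such as first-address or gap-size criteria. The peak values are the ones quoted in this paragraph, and `peak_memory_for` is only a stand-in for the real allocation simulation.

```python
base_order = ['t1', 't2', 't3', 't4']          # ordering already fixed for the first four tensors

# Peak memory sizes (KB) quoted in the text for each candidate slot of tensor t5.
quoted_peak = {1: 4500, 2: 3500, 3: 5000, 4: 4000}

def peak_memory_for(order):
    """Stand-in for the real allocation simulation: look up the quoted value."""
    return quoted_peak[order.index('t5')]

best_slot, best_peak = None, float('inf')
for slot in range(1, len(base_order) + 1):
    candidate = base_order[:slot] + ['t5'] + base_order[slot:]
    peak = peak_memory_for(candidate)
    if peak < best_peak:
        best_slot, best_peak = slot, peak

print(best_slot, best_peak)   # 2 3500: placing t5 between t2 and t3 gives the smallest peak
```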
下面结合具体的实例阐述如何为张量数据分配内存空间的:
在一些实施例中,以图1b所示的计算图为例,计算子任务0中包含节点a和节点c,其执行顺序为节点a-节点c;计算子任务1中包含节点b、节点d、节点e和节点f,其执行顺序为节点b-节点d/e-节点f;计算子任务2中包含节点g和节点h,其执行顺序为节点g-节点h。其中,计算子任务0、计算子任务1和计算子任务2之间的执行关系为并行。由于在每个计算子任务中,相邻两个节点之间均有有向边,此时,无需对该计算图进行调整。
其次,在该计算图中,确定每个节点对应的上游节点,每个节点对应的输出张量数据以及输入张量数据。以节点A为节点B的上游节点为例,表示在计算图中,存在至少一条路径可以从节点A到节点B。具体地,每个节点对应的上游节点,每个节点对应的输出张量数据以及输入张量数据可以如表1所示:
节点 上游节点 输出张量数据 输入张量数据
a - t0,t1 -
b a t2,t3 t0
c a t4 t0
d a,b t5 t2
e a,b t6 t3
f a,b,c,d,e t7 t5,t6,t4
g a,c t8 t4
h a,b,c,d,e,f,g - t1,t7,t8
表1
如表1所示,在图1b所示的计算图中,以节点a为例,节点a为起始节点,没有对应的上游节点,节点a在运算的过程中,可以得到输出张量数据t0和输出张量数据t1。又例如,以节点b为例,节点a为节点b的上游节点,这表示在计算图中存在一条路径从节点a到节点b,节点b在运算的过程中,其输入张量数据为t0,可以得到输出张量数据t2和输出张量数据t3。关于确定其他节点对应的上游节点,输出张量数据以及输入张量数据的实现过程,此处不多加赘述。
之后,确定每个张量数据各自对应的约束关系。例如,可以判断第一张量数据的所有消费节点是否为第二张量数据的生产节点的上游节点,若是,则确定第一张量数据可以复用为第二张量数据分配的内存空间;若否,则确定第一张量数据不可以复用为第二张量数据分配的内存空间。又例如,可以判断第二张量数据的所有消费节点是否为所述第一张量数据的生 产节点的下游节点,若是,则确定第一张量数据可以复用为第二张量数据分配的内存空间;若否,则确定第一张量数据不可以复用为第二张量数据分配的内存空间。
具体地,上述约束关系可以承载于约束关系表,在该约束关系表中,可以通过第一值指示各张量数据可以与其他张量数据复用同一个内存空间,通过第二值指示各张量数据不可以与其他张量数据复用同一个内存空间,通过第三值指示各张量数据可以与其他张量数据在同一个内存空间中进行连续存储。出于阐述的便利,通过第一值“0”指示张量数据可以与除自身之外的其他张量数据复用同一个内存空间;通过第二值“1”指示张量数据不可以与除自身之外的其他张量数据复用同一个内存空间;通过第三值“2”指示张量数据可以与除自身之外的其他张量数据在同一个内存空间中进行连续存储。需要说明的是,上述描述只是一种示例,不应构成限定。具体地,该约束关系表可以表示为如表2所示:
  t0 t1 t2 t3 t4 t5 t6 t7 t8
t0 - 1 1 1 1 1 1 1 1
t1 1 - 1 1 1 1 1 1 1
t2 1 1 - 1 1 1 0 0 1
t3 1 1 0 - 1 0 1 0 1
t4 1 1 1 1 - 1 1 1 1
t5 1 1 1 0 1 - 0 1 1
t6 1 1 0 1 1 0 - 1 1
t7 1 1 0 0 1 1 1 2 -
t8 1 2 1 1 1 1 1 - 1
表2
需要说明的是,在考虑是否复用内存空间的情况下,两两张量数据之间的约束关系是对称的,该对称性体现在:与自身的逆关系完全相同的那种关系。如表2所示,以确定张量数据t2与张量数据t8之间的约束关系为例,在图1b所示的计算图中,张量数据t7的生产节点为f,张量数据t2的消费节点为d,由于张量数据t2的消费节点d是张量数据t7的生产节点f的上游节点,此时,可以确定张量数据t7可以与张量数据t2复用同一个内存空间。在图1b所示的计算图中,在计算子任务1内,节点b为控制选择节点,该节点有两条分支,一条分支为:节点b-节点d-节点f;一条分支为:节点b-节点e-节点f。在神经网络的一次运算中,只有一条分支是有效的。对张量数据t2和张量数据t3来说,张量数据t2与张量数据t3之间的约束关系为张量数据t2和张量数据t3并不需要两个独立的内存空间,也即:可以复用同一个内存空间;对张量数据t5和张量数据t6来说,张量数据t5与张量数据t6之间的约束关系为张量数据t5和张量数据t6并不需要两个独立的内存空间,也即:可以复用同一个内存空间。
还需要说明的是,在考虑将多个张量数据存储在同一个连续的内存空间的情况下,两两张量数据之间的约束关系是不对称的。此时,需要考虑两两内存空间之间的有序性。
之后,以每个张量数据的信息包括每个张量数据需要占用的内存空间大小和每个张量数据各自对应的约束关系为例,根据每个张量数据需要占用的内存空间大小和每个张量数据各自对应的约束量的大小,对M个张量数据按照从大到小进行排序,得到M个张量数据的排序结果。在排序时,可以将约束关系为在同一个内存空间中进行连续存储的多个张量数据作为一个独立的整体进行排序。
具体地,基于表2所示的约束关系,可以得到每个张量数据各自对应的约束量,例如, 对张量数据t0来说,张量数据t0不可以与其他张量数据(t1、t2、t3、t4、t5、t6、t7、t8)复用同一个内存空间,其约束量为8;对张量数据t1来说,张量数据t1不可以与其他张量数据(t0、t2、t3、t4、t5、t6、t7、t8)复用同一个内存空间,其约束量为8;对张量数据t2来说,张量数据t2不可以与张量数据(t0、t1、t3、t4、t5、t8)复用同一个内存空间,其约束量为6;对张量数据t3来说,张量数据t3不可以与张量数据(t0、t1、t4、t6、t8)复用同一个内存空间,其约束量为5;对张量数据t4来说,张量数据t4不可以与张量数据(t0、t1、t2、t3、t5、t6、t7、t8),其约束量为8;对张量数据t5来说,张量数据t5不可以与张量数据(t0、t1、t2、t4、t7、t8)复用同一个内存空间,其约束量为6;对张量数据t6来说,张量数据t6不可以与张量数据(t0、t1、t3、t4、t7、t8),其约束量为6;对张量数据t7来说,张量数据t7不可以与张量数据(t0、t1、t4、t5、t6),其约束量为5;对张量数据t8来说,张量数据t8不可以与张量数据(t0、t2、t3、t4、t5、t6、t7),其约束量为7。
进一步地,张量数据t0需要占用的内存空间大小为500Kb;张量数据t1需要占用的内存空间大小为500Kb;张量数据t2需要占用的内存空间大小为500Kb;张量数据t3需要占用的内存空间大小为500Kb;张量数据t4需要占用的内存空间大小为500Kb;张量数据t5需要占用的内存空间大小为1000Kb;张量数据t6需要占用的内存空间大小为1000Kb;张量数据t7需要占用的内存空间大小为1000Kb;张量数据t8需要占用的内存空间大小为1000Kb。
从而可以对上述9个张量数据按照从大到小进行排序,该排序结果可以如表3所示:
t1,t8,t7 t5 t6 t0 t4 t2 t3
表3
需要说明的是,在排序时,将约束关系为在同一个内存空间中进行连续存储的多个张量数据作为一个单独的整体进行排序的方法只是一种示例,不应构成限定。在实际应用中,还可以将每个张量数据作为一个单独的整体进行排序。
那么,在这种情况下,首先,分配第一内存空间给张量数据t1、t7和t8,此时,已分配的内存空间包括第一内存空间;其中,第一内存空间包括内存空间a,第二内存空间b和第三内存空间c,内存空间a,内存空间b和内存空间c为连续的内存空间,其中,内存空间a用于存储张量数据t1(也即:内存空间a的大小等于张量数据t1的大小),内存空间b用于存储张量数据t7(也即:内存空间b的大小等于张量数据t7的大小),内存空间c用于存储张量数据t8(也即:内存空间c的大小等于张量数据t8的大小);之后,为张量数据t5分配内存空间,该实现过程可以包括:结合表2中张量数据t5与其他张量数据之间的约束关系判断张量数据t5是否可复用已分配的内存空间(第一内存空间),由于张量数据t5不可复用已分配的第一内存空间,此时,根据张量数据t5需要的内存空间大小为张量数据t5分配相应大小的第二内存空间,此时,已分配的内存空间包括第一内存空间和第二内存空间。之后,为张量数据t6分配内存空间,该实现过程可以包括:结合表2中张量数据t6与其他张量数据之间的约束关系判断张量数据t6是否可复用已分配的内存空间(第一内存空间和第二内存空间),由于张量数据t6可以复用第二内存空间,将第二内存空间分配给张量数据t6。之后,为张量数据t0分配内存空间,该实现过程可以包括:结合表2中张量数据t0与其他张量数据之间的约束关系判断张量数据t0是否可复用已分配的内存空间(第一内存空间和第二内存空间),由于张量数据t0不可复用已分配的内存空间,此时,根据张量数据t0需要占用的内存空间大小为张量数据t0分配相应大小的第三内存空间,在这种情况下,已分配的内存空间包括第一内存空间、第二内存空间和第三内存空间。之后,为张量数据t4分配内存空间,该实现过程可以包括:结合表2中张量数据t4与其他张量数据之间的约束关系判断张量数据t4 是否可复用已分配的内存空间(第一内存空间、第二内存空间和第三内存空间),由于张量数据t4不可复用已分配的内存空间,此时,根据张量数据t4需要占用的内存空间大小为张量数据t4分配相应大小的第四内存空间,在这种情况下,已分配的内存空间包括第一内存空间、第二内存空间、第三内存空间和第四内存空间。之后,为张量数据t2分配内存空间,该实现过程可以包括:结合表1中张量数据t2与其他张量数据之间的约束关系判断张量数据t2是否可复用已分配的内存空间(第一内存空间、第二内存空间、第三内存空间和第四内存空间),由于张量数据t2可复用第一内存空间,将第一内存空间分配给张量数据t2(例如,可以将第一内存空间中的内存空间c分配给张量数据t2)。之后,为张量数据t3分配内存空间,该实现过程可以包括:结合表2中张量数据t3与其他张量数据之间的约束关系判断张量数据是否可复用已分配的内存空间(第一内存空间、第二内存空间、第三内存空间和第四内存空间),由于张量数据可复用第一内存空间,将第一内存空间分配给张量数据t3(例如,可以将第一内存空间中的内存空间c分配给张量数据t3)。具体地,该分配过程可以如图4d所示。可以理解的是,通过这一实现方式,为每个张量数据均分配了一个独立的内存空间,张量数据与内存空间之间的关系为一一对应,也即:张量数据的数量与内存空间的数量是相同的。
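The whole walk-through above is, in effect, a greedy interval-placement loop: tensors are taken in the order of Table 3, the group t1/t8/t7 is kept contiguous, and each tensor is placed at the lowest offset that does not overlap any already placed tensor it is forbidden to share memory with. The sketch below uses the sizes and the shareable pairs of Table 2 and reproduces the offsets of Table 4 and the 4500 KB peak of Fig. 4d; it is an illustration of the idea, not the patented allocator.

```python
order = [['t1', 't8', 't7'], ['t5'], ['t6'], ['t0'], ['t4'], ['t2'], ['t3']]   # Table 3

sizes = {'t0': 500, 't1': 500, 't2': 500, 't3': 500, 't4': 500,
         't5': 1000, 't6': 1000, 't7': 1000, 't8': 1000}

# Unordered pairs allowed to share memory ("0" entries of Table 2); every other pair must not overlap.
sharable = {frozenset(p) for p in [('t2', 't6'), ('t2', 't7'), ('t2', 't3'),
                                   ('t3', 't5'), ('t3', 't7'), ('t5', 't6')]}

def conflict(a, b):
    return frozenset((a, b)) not in sharable

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

placed = {}   # tensor -> (offset, end) in KB

def place(group):
    """Put the tensors of `group` back-to-back at the lowest offset that avoids all conflicts."""
    deltas, acc = {}, 0
    for t in group:
        deltas[t], acc = acc, acc + sizes[t]
    offset = 0
    while True:
        clash = None
        for t in group:
            iv = (offset + deltas[t], offset + deltas[t] + sizes[t])
            for q, qiv in placed.items():
                if conflict(t, q) and overlaps(iv, qiv):
                    clash = qiv[1] - deltas[t]      # smallest shift that clears this block
                    break
            if clash is not None:
                break
        if clash is None:
            for t in group:
                placed[t] = (offset + deltas[t], offset + deltas[t] + sizes[t])
            return
        offset = clash

for group in order:
    place(group)

print(placed['t0'], placed['t2'])               # (3500, 4000) (1500, 2000), as in Table 4
print(max(end for _, end in placed.values()))   # 4500 -> the peak memory of Fig. 4d
```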
可以理解的是,在上述分配内存空间的过程中,每个内存空间包括各自对应的首地址和存储空间大小。具体地,为每个张量数据分配的内存空间可以如表4所示:
标识 张量数据大小 首地址 存储空间大小 排序结果
张量数据t0 500 3500 [3500,4000[ 6
张量数据t1 500 0 [0,500[ 1
张量数据t2 500 1500 [1500,2000[ 8
张量数据t3 500 1500 [1500,2000[ 9
张量数据t4 500 4000 [4000,4500[ 7
张量数据t5 1000 2500 [2500,3500[ 4
张量数据t6 1000 2500 [2500,3500[ 5
张量数据t7 1000 1500 [1500,2500[ 3
张量数据t8 1000 500 [500,1500[ 2
表4
如表4所示,[3500,4000[表示包括3500-3999,不包括4000,其存储空间大小为500。
在一些实施例中,在确定了已分配的内存空间之后,通过已分配的内存空间,可以确定整个神经网络需要占用的最大内存大小。例如,基于图4d的分配过程,可以确定图1b所示的神经网络需要占用的最大内存大小为4500Kb。通过这一实现方式,可以确定计算机设备运行整个神经网络需要的最大内存大小,可以避免分配的内存无法支持神经网络正常运行的情形。
在一些实施例中,在为每个张量数据分配了相应的内存空间之后,根据每个张量数据各自对应的约束关系验证已分配的内存空间是否正确,若否,则重新为M个张量数据分配内存空间。例如,对张量数据t8来说,在第一内存空间中,判断张量数据t8对应的内存空间是否在张量数据t1对应的内存空间的右边,例如,如表4所示,张量数据t1对应的内存空间为[0,500[,张量数据t8对应的内存空间为[500,1500[,基于张量数据t1对应的内存空间以及张量数据t8对应的内存空间,可以确定张量数据t8对应的内存空间在张量数据t1对应的内存空间的右边,这意味着该分配是正确的。又例如,对张量数据t1、张量数据t7以及张量数据 t8来说,判断张量数据t1、张量数据t8以及张量数据t7各自对应的内存空间是否为连续的内存空间,例如,如表4所示,张量数据t1、张量数据t8以及张量数据t7各自对应的内存空间为连续的内存空间,其中,第一内存空间为[0,500[,第二内存空间为[500,1500[,第三内存空间为[1500,2500[,基于张量数据t1、张量数据t8以及张量数据t7各自对应的存储空间,可以确定张量数据t1、张量数据t8以及张量数据t7各自对应的内存空间为一段连续的内存空间,这意味着该分配是正确的。通过这一实现方式,可以避免出现内存分配不合理的情形。例如,该内存分配不合理可以体现在:已分配的内存空间中出现与约束关系互斥的分配结果。
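The verification step just described reduces to two checks over the final offsets: pairs that must not reuse the same memory must not overlap, and tensors constrained to be contiguous must sit back-to-back in the stated order. A sketch with the Table 4 offsets follows; the function names are illustrative.

```python
allocation = {'t1': (0, 500), 't8': (500, 1500), 't7': (1500, 2500)}   # offsets from Table 4

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def check_no_reuse(alloc, forbidden_pairs):
    """Tensors that must not share memory must have non-overlapping intervals."""
    return all(not overlaps(alloc[a], alloc[b]) for a, b in forbidden_pairs)

def check_contiguous(alloc, chain):
    """Each tensor in `chain` must start exactly where the previous one ends."""
    return all(alloc[chain[i]][1] == alloc[chain[i + 1]][0] for i in range(len(chain) - 1))

print(check_no_reuse(allocation, [('t1', 't8'), ('t1', 't7'), ('t8', 't7')]))   # True
print(check_contiguous(allocation, ['t1', 't8', 't7']))   # True: [0,500[, [500,1500[, [1500,2500[
```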
在一些实施例中,以图1b所示的计算图为例,计算子任务0中包含节点a和节点c,其执行顺序为节点a-节点c;计算子任务1中包含节点b、节点d、节点e和节点f,其执行顺序为节点b-节点d/e-节点f;计算子任务2中包含节点g和节点h,其执行顺序为节点g-节点h。其中,计算子任务0与计算子任务1之间的执行关系为并行,计算子任务1与计算子任务2之间的执行关系为串行。由于在计算图中的同一个计算子任务内,相邻两个节点之间均有有向边,此时,无需对每个计算子任务对应的计算图进行调整。针对执行关系为串行的两个计算子任务(例如,计算子任务1和计算子任务2),由于计算子任务1的最后一个节点f与计算子任务2的第一个节点g之间没有有向边,此时,在计算子任务1的最后一个节点f与计算子任务2的第一个节点g之间添加一条有向边,得到更新后的计算图如图4e所示。
其次,在该计算图中,确定每个节点对应的上游节点,每个节点对应的输出张量数据以及输入张量数据。具体地,每个节点对应的上游节点,每个节点对应的输出张量数据以及输入张量数据可以如表5所示:
节点 上游节点 输出张量数据 输入张量数据
a - t0,t1 -
b a t2,t3 t0
c a t4 t0
d a,b t5 t2
e a,b t6 t3
f a,b,c,d,e t7,t_dep t5,t6,t4
g a,b,c,d,e,f t8 t4,t_dep
h a,b,c,d,e,f,g - t1,t7,t8
表5
如表5所示,在图4e所示的计算图中,以节点f为例,由于在计算图中,存在如下路径:节点a-节点b-节点d-节点f,节点a-节点b-节点e-节点f,节点a-节点c-节点f,可以确定节点f的上游节点可以包括节点a、节点b、节点c、节点d、节点e。输入节点f的张量数据包括张量数据t5,张量数据t6以及张量数据t4,节点f输出的张量数据包括张量数据t7和张量数据t dep。关于确定其他节点对应的上游节点,输出张量数据以及输入张量数据的实现过程,此处不多加赘述。
之后,确定每个张量数据各自对应的约束关系。例如,可以判断第一张量数据的所有消费节点是否为第二张量数据的生产节点的上游节点,若是,则确定第一张量数据可以复用为第二张量数据分配的内存空间;若否,则确定第一张量数据不可以复用为第二张量数据分配的内存空间。又例如,可以判断第二张量数据的所有消费节点是否为所述第一张量数据的生产节点的下游节点,若是,则确定第一张量数据可以复用为第二张量数据分配的内存空间; 若否,则确定第一张量数据不可以复用为第二张量数据分配的内存空间。
具体地,上述约束关系可以承载于约束关系表,在该约束关系表中,可以通过第一值指示各张量数据可以与其他张量数据复用同一个内存空间,通过第二值指示各张量数据不可以与其他张量数据复用同一个内存空间,通过第三值指示各张量数据可以与其他张量数据在同一个内存空间中进行连续存储。出于阐述的便利,通过第一值“0”指示张量数据可以与除自身之外的其他张量数据复用同一个内存空间;通过第二值“1”指示张量数据不可以与除自身之外的其他张量数据复用同一个内存空间;通过第三值“2”指示张量数据可以与除自身之外的其他张量数据在同一个内存空间中进行连续存储。需要说明的是,上述描述只是一种示例,不应构成限定。具体地,该数据结构可以表示为如表6所示:
  t0 t1 t2 t3 t4 t5 t6 t7 t8
t0 - 1 1 1 1 1 1 1 0
t1 1 - 1 1 1 1 1 1 1
t2 1 1 - 0 1 1 0 0 0
t3 1 1 0 - 1 0 1 0 0
t4 1 1 1 1 - 1 1 1 1
t5 1 1 1 0 1 - 0 1 0
t6 1 1 0 1 1 0 - 1 0
t7 1 1 0 0 1 1 1 2 -
t8 0 2 0 0 1 0 0 - 1
表6
需要说明的是,在考虑是否复用内存空间的情况下,两两张量数据之间的约束关系是对称的,该对称性体现在:与自身的逆关系完全相同的那种关系。如表2所示,以确定张量数据t2与张量数据t8之间的约束关系为例,在图4e所示的计算图中,张量数据t7的生产节点为f,张量数据t2的消费节点为d,由于张量数据t2的消费节点d是张量数据t7的生产节点f的上游节点,此时,可以确定张量数据t7可以与张量数据t2复用同一个内存空间。在图4e所示的计算图中,在计算子任务1内,节点b为控制选择节点,该节点有两条分支,一条分支为:节点b-节点d-节点f;一条分支为:节点b-节点e-节点f。在神经网络的一次运算中,只有一条分支是有效的。对张量数据t2和张量数据t3来说,张量数据t2与张量数据t3之间的约束关系为张量数据t2和张量数据t3并不需要两个独立的内存空间,也即:可以复用同一个内存空间;对张量数据t5和张量数据t6来说,张量数据t5与张量数据t6之间的约束关系为张量数据t5和张量数据t6并不需要两个独立的内存空间,也即:可以复用同一个内存空间。
还需要说明的是,在考虑将多个张量数据存储在连续的内存空间的情况,两两张量数据之间的约束关系是不对称的。进一步地,需要考虑两两内存空间之间的有序性。
之后,以每个张量数据的信息包括每个张量数据需要占用的内存空间大小和每个张量数据各自对应的约束关系为例,根据每个张量数据需要占用的内存空间大小和每个张量数据各自对应的约束量的大小,对M个张量数据按照从大到小进行排序,得到M个张量数据的排序结果。在排序时,可以将约束关系为在同一个内存空间中进行连续存储的多个张量数据作为一个独立的整体进行排序。
基于表6所示的约束关系,可以得到每个张量数据各自对应的约束量,例如,对张量数据t0来说,张量数据t0不可以与其他张量数据(t1、t2、t3、t4、t5、t6、t7)复用同一个内 存空间,其约束量为7;对张量数据t1来说,张量数据t1不可以与其他张量数据(t0、t2、t3、t4、t5、t6、t7、t8)复用同一个内存空间,其约束量为8;对张量数据t2来说,张量数据t2不可以与张量数据(t0、t1、t4、t5)复用同一个内存空间,其约束量为4;对张量数据t3来说,张量数据t3不可以与张量数据(t0、t1、t4、t6)复用同一个内存空间,其约束量为4;对张量数据t4来说,张量数据t4不可以与张量数据(t0、t1、t2、t3、t5、t6、t7、t8),其约束量为8;对张量数据t5来说,张量数据t5不可以与张量数据(t0、t1、t2、t4、t7)复用同一个内存空间,其约束量为5;对张量数据t6来说,张量数据t6不可以与张量数据(t0、t1、t3、t4、t7),其约束量为5;对张量数据t7来说,张量数据t7不可以与张量数据(t0、t1、t4、t5、t6),其约束量为5;对张量数据t8来说,张量数据t8不可以与张量数据(t4、t8),其约束量为2。
进一步地,张量数据t0需要占用的内存空间大小为500Kb;张量数据t1需要占用的内存空间大小为500Kb;张量数据t2需要占用的内存空间大小为500Kb;张量数据t3需要占用的内存空间大小为500Kb;张量数据t4需要占用的内存空间大小为500Kb;张量数据t5需要占用的内存空间大小为1000Kb;张量数据t6需要占用的内存空间大小为1000Kb;张量数据t7需要占用的内存空间大小为1000Kb;张量数据t8需要占用的内存空间大小为1000Kb。
从而可以对上述9个张量数据按照从大到小进行排序,该排序结果可以如表7所示:
t1,t8,t7 t5 t6 t0 t4 t2 t3
表7
需要说明的是,在排序时,将约束关系为在同一个内存空间中进行连续存储的多个张量数据作为一个单独的整体进行排序的方法只是一种示例,不应构成限定。在实际应用中,还可以将每个张量数据作为一个单独的整体进行排序。
那么,在这种情况下,首先,分配第一内存空间给张量数据t1、t7和t8,此时,已分配包括第一内存空间;其中,第一内存空间包括内存空间a,内存空间b和内存空间c,内存空间a,内存空间b和内存空间c为连续的内存空间,其中,内存空间a用于存储张量数据t1(也即:内存空间a的大小等于张量数据t1的大小),内存空间b用于存储张量数据t8(也即:内存空间b的大小等于张量数据t8的大小),内存空间c用于存储张量数据t7(也即:内存空间c的大小等于张量数据t7的大小);之后,为张量数据t5分配内存空间,该实现过程可以包括:结合表7中张量数据t5与其他张量数据之间的约束关系判断张量数据t5是否可复用已分配的内存空间(第一内存空间),由于张量数据t5可以复用第一内存空间,将第二内存空间分配为张量数据t5(例如,可以将第一内存空间中的内存空间b分配给张量数据t5)。之后,为张量数据t6分配内存空间,该实现过程可以包括:结合表7中张量数据t6与其他张量数据之间的约束关系判断张量数据t6是否可复用已分配的内存空间(第一内存空间),由于张量数据t6可以复用第一内存空间,将第一内存空间分配为张量数据t6(例如,可以将第一内存空间中的内存空间b分配给张量数据t6)。之后,为张量数据t0分配内存空间,该实现过程可以包括:结合表7中张量数据t0与其他张量数据之间的约束关系判断张量数据t0是否可复用已分配的内存空间(第一内存空间),由于张量数据t0不可复用已分配的内存空间,此时,根据张量数据t0需要占用的内存空间大小为张量数据t0分配相应大小的第二内存空间,在这种情况下,已分配的内存空间包括第一内存空间和第二内存空间。之后,为张量数据t4分配内存空间,该实现过程可以包括:结合表7中张量数据t4与其他张量数据之间的约束关系判断张量数据t4是否可复用已分配的内存空间(第一内存空间和第二内存空间),由于张量数据t4不可复用已分配结合中的内存空间,此时,根据张量数据t4需要占用的内存 空间大小为张量数据t4分配相应大小的第三内存空间,在这种情况下,已分配的内存空间包括第一内存空间、第二内存空间和第三内存空间。之后,为张量数据t2分配内存空间,该实现过程可以包括:结合表7中张量数据t2与其他张量数据之间的约束关系判断张量数据t2是否可复用已分配的内存空间(第一内存空间、第二内存空间和第三内存空间),由于张量数据t2可复用第一内存空间,将第一内存空间分配给张量数据t2(例如,可以将第一内存空间中的第三内内存空间分配给张量数据t2)。之后,为张量数据t3分配内存空间,该实现过程可以包括:结合表7中张量数据t3与其他张量数据之间的约束关系判断张量数据是否可复用已分配的内存空间(第一内存空间、第二内存空间和第三内存空间),由于张量数据可复用第一内存空间,将第一内存空间分配给张量数据t3(例如,可以将第一内存空间中的第三内存空间分配给张量数据t3)。具体地,该分配过程可以如图4f所示。可以理解的是,通过这一实现方式,为每个张量数据均分配了一个独立的内存空间,张量数据与内存空间之间的关系为一一对应,也即:张量数据的数量与内存空间的数量是相同的。
可以理解的是,在上述分配内存空间的过程中,每个内存空间包括各自对应的首地址和存储空间大小。具体地,为每个张量数据分配的内存空间可以如表8所示:
标识 张量数据大小 首地址 存储空间大小 排序结果
t0 500 2500 [2500,3000[ 6
t1 500 0 [0,500[ 1
t2 500 1500 [1500,2000[ 8
t3 500 1500 [1500,2000[ 9
t4 500 3000 [3000,3500[ 7
t5 1000 500 [500,1500[ 4
t6 1000 500 [500,1000[ 5
t7 1000 1500 [1500,2500[ 3
t8 1000 500 [500,1000[ 2
表8
在一些实施例中,在确定了已分配的内存空间之后,通过已分配的内存空间,可以确定整个神经网络需要占用的最大内存大小。例如,图4e所示的神经网络需要占用的最大内存大小为3500Kb。通过这一实现方式,可以确定计算机设备运行整个神经网络需要的最大内存大小,可以避免分配的内存无法支持神经网络正常运行的情形。
在一些实施例中,在为每个张量数据分配了相应的内存空间之后,根据每个张量数据各自对应的约束关系验证已分配的内存空间是否正确,若否,则重新为M个张量数据分配内存空间。例如,对张量数据t8来说,在第一内存空间中,判断张量数据t8对应的内存空间是否在张量数据t1对应的内存空间的右边,例如,如表4所示,张量数据t1对应的内存空间为[0,500[,张量数据t8对应的内存空间为[500,1000[,基于张量数据t1对应的内存空间以及张量数据t8对应的内存空间,可以确定张量数据t8对应的内存空间在张量数据t1对应的内存空间的右边,这意味着该分配是正确的。通过这一实现方式,可以避免出现内存分配不合理的情形。例如,该内存分配不合理可以体现在:已分配的内存空间中出现与约束关系互斥的分配结果。
在一些实施例中,以图1b所示的计算图为例,计算子任务0中包含节点a和节点c,其执行顺序为节点a-节点c;计算子任务1中包含节点b、节点d、节点e和节点f,其执行顺 序为节点b-节点d/e-节点f;计算子任务2中包含节点g和节点h,其执行顺序为节点g-节点h。其中,计算子任务0、计算子任务1与计算子任务2之间的执行关系为并行。在计算图中的同一个计算子任务内,获取各节点的先后执行顺序,然后,根据各节点的先后执行顺序对同一个节点依次进行编码,得到每个节点对应的标识(例如,序号),在相邻两个节点中,前一个对应的标识小于后一个节点对应的标识。例如,以图1b中的计算子任务0为例,计算子任务0内每个节点的标识可以如表9所示:
Figure PCTCN2021119829-appb-000009
表9
需要说明的是,表9所示的标识具有唯一性。
例如,若End ID大于Start ID,表示End ID对应的节点为Start ID对应的节点的上游节点。之后,确定每个张量数据各自对应的约束关系。关于如何根据每个张量数据的信息对M个张量数据进行排序,以及如何为张量数据分配内存空间的具体实现请参考前述描述,此处不多加赘述。
在一些实施例中,以图1b所示的计算图为例,计算子任务0中包含节点a和节点c,其执行顺序为节点a-节点c;计算子任务1中包含节点b、节点d、节点e和节点f,其执行顺序为节点b-节点d/e-节点f;计算子任务2中包含节点g和节点h,其执行顺序为节点g-节点h。其中,计算子任务0与计算子任务1之间的执行关系为并行,计算子任务1与计算子任务2之间的执行关系为串行。在计算图中的同一个计算子任务内,获取各节点的先后执行顺序,然后,根据各节点的先后执行顺序对同一个节点依次进行编码,得到每个节点对应的标识(例如,序号)。在执行关系为串行的至少两个计算子任务内,获取上述至少两个计算子任务内各节点的先后执行顺序,并根据各节点的先后执行顺序对上述至少两个计算子任务内的节点依次进行编码,得到每个节点各自对应的标识;在相邻两个节点中,前一个节点对应的标识小于后一个节点对应的标识。
例如,以图1b中的计算子任务2为例,计算子任务2内每个节点的标识可以如表10所示:
Figure PCTCN2021119829-appb-000010
表10
需要说明的是,表10所示的标识具有唯一性。
例如,若End ID大于Start ID,表示End ID对应的节点为Start ID对应的节点的上游节点。之后,确定每个张量数据各自对应的约束关系。关于如何根据每个张量数据的信息对M个张量数据进行排序,以及如何为张量数据分配内存空间的具体实现请参考前述描述,此处不多加赘述。
在一些实施例中,神经网络对应的计算图中包含3个张量数据,分别可以表示为张量数据t000、张量数据t100和张量数据t200,其中,张量数据t000需要占用的内存空间大小为 1000Kb,张量数据t100需要占用的内存空间大小为600Kb,张量数据t200需要占用的内存空间大小为450Kb。
进一步地,上述3个张量数据各自对应的约束关系可以如表11所示:
  t000 t100 t200
t000 - 0 0
t100 0 - 1
t200 0 1 -
表11
在表11所示的约束关系表中,通过第一值“0”表示张量数据可以除自身之外的其他数据复用同一个内存空间;通过第二值“1”表示张量数据不可以除自身之外的其他张量数据复用同一个内存空间。关于如何确定每个张量数据各自对应的约束关系,请参考前述描述,此处不多加赘述。
如表11所示,对张量数据t000来说,张量数据t000可以与张量数据t100复用同一个内存空间,也可以与张量数据t200复用同一个内存空间;对张量数据t100来说,张量数据t100可以与张量数据t000复用同一个内存空间,不可以与张量数据t200复用同一个内存空间;对张量数据t200来说,张量数据t200可以与张量数据t000复用同一个内存空间,不可以与张量数据t100复用同一个内存空间。
之后,根据每个张量数据各自需要占用的内存空间大小对上述3个张量数据按照从大到小进行排序,得到排序结果,具体地,该排序结果可以如表12所示:
t000 t100 t200
表12
那么,在这种情况下,内存分配装置为上述3个张量数据分配内存空间时,首先,创建第一内存空间,第一内存空间用于存储张量数据t000。例如,创建好的第一内存空间的首地址为a,其存储空间大小为张量数据t000需要占用的内存空间大小;之后,创建第二内存空间,第二内存空间用于存储张量数据t100。由于张量数据t100可以复用为张量数据t000分配的内存空间,此时,创建的第二内存空间的首地址为a(也即:与第一内存空间的首地址相同),其存储空间的大小为张量数据t100需要占用的内存空间。如前所述,张量数据t000需要占用的内存空间大小为1000Kb,张量数据t100需要占用的内存空间大小为600Kb,这意味着张量数据t100复用第一内存空间中的部分内存空间[a,600[。之后,创建第三内存空间,第三内存空间用于存储张量数据t200。由于张量数据t200可以复用为张量数据t000分配的内存空间[a,1000[,不可以复用为张量数据t100分配的内存空间[a,600[,因此,第三内存空间的首地址为600,其存储空间大小为张量数据t200需要占用的内存空间大小,这意味着张量数据t200复用第一内存空间中的部分内存空间[600,1000[。具体地,该分配过程可以如图4g所示。
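The three-tensor example above follows the same rule on a smaller scale: each tensor takes the lowest first address that avoids every tensor it may not share memory with, so t100 reuses the front of t000's space and t200 starts at address 600. Below is a sketch with the quoted sizes and constraint; the names and the simple shift-right search are illustrative assumptions.

```python
sizes = {'t000': 1000, 't100': 600, 't200': 450}   # KB, from the example above
forbidden = {frozenset(('t100', 't200'))}          # the only pair that must not share memory
order = ['t000', 't100', 't200']                   # sorted by required memory size, as in Table 12

placed = {}
for t in order:
    offset, moved = 0, True
    while moved:                                   # push the tensor right past any excluded block
        moved = False
        for q, (qs, qe) in placed.items():
            if frozenset((t, q)) in forbidden and offset < qe and qs < offset + sizes[t]:
                offset, moved = qe, True
    placed[t] = (offset, offset + sizes[t])

print(placed)   # t000: (0, 1000), t100: (0, 600), t200: (600, 1050) -> t200's first address is 600
```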
需要说明的是，在一些实施例中，上述约束关系还可以体现在：张量数据分配的内存空间与除自身之外的其他张量数据分配的内存空间是非连续的内存空间（例如，为张量数据1分配内存空间1，为张量数据2分配内存空间2，其中，内存空间1和内存空间2之间存在空间间隔），张量数据是否与除自身之外的其他张量数据满足空间同位约束，其中，空间同位约束体现在：第i张量数据不可以与第j张量数据复用同一个内存空间，第i张量数据可以与第k张量数据复用同一个内存空间，则第i张量数据不可以与第l张量数据复用同一个内存空间，其中，第l张量数据为第j张量数据与第k张量数据之间产生重叠的张量数据，等等。
实施本申请实施例,内存分配装置基于M个张量数据的排序结果依次为每个张量数据分配相应大小的内存空间,相较于现有技术中,按照整个神经网络运行的前后顺序,进行内存空间的分配和复用,可以避免出现内存分配不合理现象,从而可节省整个神经网络需要占用的内存,优化了神经网络的内存分配。
请参见图5,为本申请实施例提供的一种神经网络的内存分配方法的流程示意图。该方法的执行主体可以为运行神经网络的服务器,也可以为运行神经网络的终端设备。出于阐述的便利,以执行主体为运行神经网络的终端设备为例,进行说明。在图5所示的方法流程示意图中,可设定整个神经网络在运行时,需要多个内存空间。如图5所示,该方法可以包括但不限于如下步骤:
步骤S501、获取神经网络对应的计算图;其中,计算图包括N个节点和连接不同节点的有向边,计算图的有向边上承载有张量数据,计算图中包括M个张量数据,M为大于1的整数。
关于步骤S501的具体实现,请参考前述描述,此处不多加赘述。
步骤S502、基于每个张量数据对应的约束关系,按照所述M个张量数据在所述神经网络中的执行顺序,依次给所述M个张量数据分配内存空间。
在本申请实施例中,确定计算图中每个张量数据各自对应的约束关系的实现过程可以包括:可以判断第一张量数据的所有消费节点是否为第二张量数据的生产节点的上游节点,若是,则确定第一张量数据可以复用为第二张量数据分配的内存空间;若否,则确定第一张量数据不可以复用为第二张量数据分配的内存空间。又例如,可以判断第二张量数据的所有消费节点是否为所述第一张量数据的生产节点的下游节点,若是,则确定第一张量数据可以复用为第二张量数据分配的内存空间;若否,则确定第一张量数据不可以复用为第二张量数据分配的内存空间。
在本申请实施例中,上述方法可以应用于多个计算子任务并行的场景(例如,多个计算子任务之间的执行关系为全部并行;又例如,多个计算子任务之间的执行关系包括串行和并行),终端设备获取每个张量数据各自对应的约束关系,之后,结合神经网络中张量数据的执行顺序以及每个张量数据各自对应的约束关系依次为张量数据分配内存,可以避免在并行场景下,不同计算子任务中因算子复用同一个内存空间而带来的算子运算结果出错的情形,可以保证神经网络的计算结果的准确性。
上文图1a-图5详细描述了本申请实施例涉及的内存分配方法。下面结合附图介绍本申请实施例涉及的装置。
图6为本申请实施例中一种内存分配装置60的结构示意图。图6所示的内存分配装置60可以包括:
获取计算图单元600,用于获取神经网络对应的计算图;其中,所述计算图包括N个节点和连接不同节点的有向边,所述计算图的有向边上承载有张量数据,所述计算图中包括M个张量数据,所述M为大于1的整数;分配单元602,用于基于所述M个张量数据的排序结果,依次给所述M个张量数据分配内存空间,其中,若所述M个张量数据中的一个张量数据可复用已分配的内存空间中的至少一部分,则将所述张量数据可复用的至少一部分内存空间分配给所述张量数据,所述已分配的内存空间为在所述张量数据之前,已经分配给所述M个张量数据的内存空间,所述排序结果指示为所述M个张量数据分配内存空间时的顺序,所 述排序结果与所述M个张量数据中每个张量数据的信息有关,所述每个张量数据的信息指示以下信息中的至少一种:所述每个张量数据对应的约束关系以及所述每个张量数据流向的节点的数量,所述约束关系指示所述M个张量数据中一个张量数据的可用内存空间分别与所述M个张量数据中的其他张量数据的可用内存空间的关系。在一种可能的实现方式中,所述分配单元602,还用于:若所述张量数据不可复用已分配的内存空间,则为所述张量数据分配其他的内存空间,所述其他的内存空间与所述已分配的内存空间不同。
在一种可能的实现方式中,所述约束关系指示以下至少一种关系:一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
在一种可能的实现方式中,所述约束关系承载于约束关系表,所述约束关系表中包括所述M个数据张量的标识,在所述约束关系表中,通过第一值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,通过第二值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,通过第三值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
在一种可能的实现方式中,在第一张量数据的所有消费节点是第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据可以复用为所述第二张量数据分配的内存空间;在所述第一张量数据的所有消费节点不是所述第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点不是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据不可以复用为所述第二张量数据分配的内存空间;所述第一张量数据和所述第二张量数据为所述M个张量数据中任意的两个;所述消费节点为张量数据流向的节点,所述生产节点为张量数据流出的节点。
在一种可能的实现方式中,所述计算图包括多个计算子任务,所述计算子任务通过一组节点和与所述一组节点有关的边指示一种计算功能,所述多个计算子任务之间的执行关系为并行执行;所述装置还包括:
更新计算图单元604,用于在一个所述计算子任务中,若相邻两个节点之间没有有向边,则在所述相邻两个节点之间添加有向边,以更新所述计算图;其中,添加的每条有向边上承载有相应的张量数据;所述相邻两个节点为所述计算子任务中执行顺序相邻的两个节点;
获取信息单元606,用于基于所述更新后的计算图,获取每个张量数据的信息。
在一种可能的实现方式中,所述计算图还包括执行关系为串行的第一计算子任务和第二计算子任务,所述第一计算子任务的执行顺序在所述第二计算子任务之前;所述更新计算图单元604,还用于:
若第一计算子任务的最后一个节点与第二计算子任务的第一个节点之间没有有向边,则在所述第一计算子任务的最后一个节点与所述第二计算子任务的第一个节点之间添加有向边。
在一种可能的实现方式中,在所述计算图中,张量数据的生产节点的标识小于所述张量数据的消费节点的标识;所述张量数据的生产节点与所述张量数据的消费节点为相邻的两个节点。
在一种可能的实现方式中,所述计算图中各节点的标识用于确定所述M个张量数据中每个张量数据的信息。
在一种可能的实现方式中,所述每个张量数据的信息指示所述每个张量数据对应的约束关系,所述装置还包括:
第一排序单元608,用于根据所述每个张量数据对应的约束关系获取所述M个张量数据各自对应的约束量;所述约束量为其他张量数据中不可以与张量数据复用同一个内存空间的张量数据的数量;根据所述M个张量数据各自对应的约束量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
在一种可能的实现方式中,所述每个张量数据的信息指示所述每个张量数据流向的节点的数量,所述装置还包括:
第二排序单元6010,用于根据所述M个张量数据各自对应的消费节点数量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
在一种可能的实现方式中,所述装置还包括:
第三排序单元6012,用于基于所述每个张量数据的信息使用启发式算法对所述M个张量数据进行排序,以在预设时间段内获取所述M个张量数据的排序结果。
在一种可能的实现方式中,所述排序结果是优化后的排序结果,其中,所述优化后的排序结果对应的所述神经网络需要占用的最大内存大小小于根据优化前的排序结果确定的所述神经网络需要占用的最大内存大小。
本申请实施例中,各个的单元的具体实现可以参见上述实施例中的相关描述,此处不再赘述。
实施本申请实施例,内存分配装置根据每个张量数据的信息获取M各张量数据的排序结果,从而根据该排序结果为每个张量数据分配相应大小的内存空间,相较于现有技术中,按照整个神经网络运行的前后顺序,进行内存空间的分配和复用,可以避免出现内存分配不合理现象,从而可节省整个神经网络需要占用的内存,优化了神经网络的内存分配。
如图7所示,本申请实施例提供的一种内存分配装置70,该内存分配装置70可具体为终端设备或服务器。在一些实施例中,内存分配装置70可以具体为服务器中的中控模块,或者其功能由服务器中的中控模块实现。在一些实施例中,内存分配装置70可以具体为终端设备中的中控模块,或者其功能由终端设备中的中控模块实现。如图7所示,该内存分配装置可以包括处理器701、存储器702、通信总线703和通信接口704,处理器701通过通信总线连接存储器702和通信接口703。
处理器701可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),图形处理器(Graphics Processing Unit,GPU)、神经网络处理器(Network Processing Unit,NPU)或者一个或多个集成电路,用于执行相关程序,以执行本申请方法实施例的所描述的内存分配方法。
处理器701还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的内存分配方法的各个步骤可以通过处理器701中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器701还可以是通用处理器、数字信号处理器(Digital Signal Processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可 以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器701,处理器701读取存储器702中的信息,结合其硬件执行本申请方法实施例的内存分配方法。
存储器702可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。存储器702可以存储程序和数据,例如本申请实施例中内存分配方法的程序等。当存储器701中存储的程序被处理器702执行时,处理器701和通信接口704用于执行本申请实施例的内存分配方法的各个步骤。
例如,本申请实施例中用于实现本申请实施例中内存分配方法的程序等。
通信接口704使用例如但不限于收发器一类的收发装置,来实现内存分配装置700与其他设备或通信网络之间的通信。例如,可以通过通信接口704获取训练好的神经网络,以实现与执行设备、客户设备、用户设备或者终端设备等的信息交互。
可选地,该内存分配装置还可以包括人工智能处理器705,人工智能处理器705可以是神经网络处理器(Network Processing Unit,NPU),张量处理器(Tensor Processing Unit,TPU),或者图形处理器(Graphics Processing Unit,GPU)等一切适合用于大规模异或运算处理的处理器。人工智能处理器705可以作为协处理器挂载到主CPU(Host CPU)上,由主CPU为其分配任务。人工智能处理器705可以实现上述内存分配方法中涉及的一种或多种运算。例如,以NPU为例,NPU的核心部分为运算电路,通过控制器控制运算电路提取存储器702中的矩阵数据并进行乘加运算。
处理器701用于调用存储器中的数据和程序代码,执行:
获取神经网络对应的计算图;其中,所述计算图包括N个节点和连接不同节点的有向边,所述计算图的有向边上承载有张量数据,所述计算图中包括M个张量数据,所述M为大于1的整数;
基于所述M个张量数据的排序结果,依次给所述M个张量数据分配内存空间,其中,若所述M个张量数据中的一个张量数据可复用已分配的内存空间中的至少一部分,则将所述张量数据可复用的至少一部分内存空间分配给所述张量数据,所述已分配的内存空间为在所述张量数据之前,已经分配给所述M个张量数据的内存空间,所述排序结果指示为所述M个张量数据分配内存空间时的顺序,所述排序结果与所述M个张量数据中每个张量数据的信息有关,所述每个张量数据的信息指示以下信息中的至少一种:所述每个张量数据对应的约束关系以及所述每个张量数据流向的节点的数量,所述约束关系指示所述M个张量数据中一个张量数据的可用内存空间分别与所述M个张量数据中的其他张量数据的可用内存空间的关系。
其中,所述处理器701还用于:
若所述张量数据不可复用已分配的内存空间,则为所述张量数据分配其他的内存空间,所述其他的内存空间与所述已分配的内存空间不同。
其中,所述约束关系指示以下至少一种关系:一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
其中,所述约束关系承载于约束关系表,所述约束关系表中包括所述M个数据张量的标识,在所述约束关系表中,通过第一值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,通过第二值指示一个张量数据的可用内存空间与另一个张 量数据的可用内存空间的关系是不可复用,通过第三值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
其中,在第一张量数据的所有消费节点是第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据可以复用为所述第二张量数据分配的内存空间;在所述第一张量数据的所有消费节点不是所述第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点不是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据不可以复用为所述第二张量数据分配的内存空间;所述第一张量数据和所述第二张量数据为所述M个张量数据中任意的两个;所述消费节点为张量数据流向的节点,所述生产节点为张量数据流出的节点。
其中,所述计算图包括多个计算子任务,所述计算子任务通过一组节点和与所述一组节点有关的边指示一种计算功能,所述多个计算子任务之间的执行关系为并行执行;所述处理器701还用于:
在一个所述计算子任务中,若相邻两个节点之间没有有向边,则在所述相邻两个节点之间添加有向边,以更新所述计算图;其中,添加的每条有向边上承载有相应的张量数据;所述相邻两个节点为所述计算子任务中执行顺序相邻的两个节点;
基于所述更新后的计算图,获取每个张量数据的信息。
其中,所述计算图还包括执行关系为串行的第一计算子任务和第二计算子任务,所述第一计算子任务的执行顺序在所述第二计算子任务之前;所述处理器701更新所述计算图,还包括:
若第一计算子任务的最后一个节点与第二计算子任务的第一个节点之间没有有向边,则在所述第一计算子任务的最后一个节点与所述第二计算子任务的第一个节点之间添加有向边
其中,在所述计算图中,张量数据的生产节点的标识小于所述张量数据的消费节点的标识;所述张量数据的生产节点与所述张量数据的消费节点为相邻的两个节点。
其中,所述计算图中各节点的标识用于确定所述M个张量数据中每个张量数据的信息。
其中,所述每个张量数据的信息指示所述每个张量数据对应的约束关系,所述处理器701还用于:
根据所述每个张量数据对应的约束关系获取所述M个张量数据各自对应的约束量;所述约束量为其他张量数据中不可以与张量数据复用同一个内存空间的张量数据的数量;
根据所述M个张量数据各自对应的约束量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
其中,所述每个张量数据的信息指示所述每个张量数据流向的节点的数量,所述处理器701还用于:
根据所述M个张量数据各自对应的消费节点数量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
所述处理器701还用于:
基于所述每个张量数据的信息使用启发式算法对所述M个张量数据进行排序,以在预设时间段内获取所述M个张量数据的排序结果。
其中,所述排序结果是优化后的排序结果,其中,所述优化后的排序结果对应的所述神经网络需要占用的最大内存大小小于根据优化前的排序结果确定的所述神经网络需要占用的最大内存大小。
应理解,各个器件的实现还可以对应参照上述内存分配方法实施例中的相应描述,本申请实施例不再赘述。
本申请实施例还提供了一种计算机存储介质,该计算机可读存储介质中存储有指令,当其在计算机或处理器上运行时,使得计算机或处理器执行上述任一个实施例所述方法中的一个或多个步骤。上述装置的各组成模块如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在所述计算机可读取存储介质中,基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机产品存储在计算机可读存储介质中。
上述计算机可读存储介质可以是前述实施例所述的设备的内部存储单元,例如硬盘或内存。上述计算机可读存储介质也可以是上述设备的外部存储设备,例如配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,上述计算机可读存储介质还可以既包括上述设备的内部存储单元也包括外部存储设备。上述计算机可读存储介质用于存储上述计算机程序以及上述设备所需的其他程序和数据。上述计算机可读存储介质还可以用于暂时地存储已经输出或者将要输出的数据。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,可通过计算机程序来指令相关的硬件来完成,该计算机的程序可存储于计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。而前述的存储介质包括:ROM、RAM、磁碟或者光盘等各种可存储程序代码的介质。
本申请实施例方法中的步骤可以根据实际需要进行顺序调整、合并和删减。
本申请实施例装置中的模块可以根据实际需要进行合并、划分和删减。
可以理解,本领域普通技术人员可以意识到,结合本申请各个实施例中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本领域技术人员能够领会,结合本申请各个实施例中公开描述的各种说明性逻辑框、模块和算法步骤所描述的功能可以硬件、软件、固件或其任何组合来实施。如果以软件来实施,那么各种说明性逻辑框、模块、和步骤描述的功能可作为一或多个指令或代码在计算机可读媒体上存储或传输,且由基于硬件的处理单元执行。计算机可读媒体可包含计算机可读存储媒体,其对应于有形媒体,例如数据存储媒体,或包括任何促进将计算机程序从一处传送到另一处的媒体(例如,根据通信协议)的通信媒体。以此方式,计算机可读媒体大体上可对应于(1)非暂时性的有形计算机可读存储媒体,或(2)通信媒体,例如信号或载波。数据存储媒体可为可由一或多个计算机或一或多个处理器存取以检索用于实施本申请中描述的技术的指令、代码和/或数据结构的任何可用媒体。计算机程序产品可包含计算机可读媒体。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信 连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (32)

  1. 一种内存分配方法,其特征在于,包括:
    获取神经网络对应的计算图;其中,所述计算图包括N个节点和连接不同节点的有向边,所述计算图的有向边上承载有张量数据,所述计算图中包括M个张量数据,所述M为大于1的整数;
    基于所述M个张量数据的排序结果,依次给所述M个张量数据分配内存空间,其中,若所述M个张量数据中的一个张量数据可复用已分配的内存空间中的至少一部分,则将所述张量数据可复用的至少一部分内存空间分配给所述张量数据,所述已分配的内存空间为在所述张量数据之前,已经分配给所述M个张量数据的内存空间,所述排序结果指示为所述M个张量数据分配内存空间时的顺序,所述排序结果与所述M个张量数据中每个张量数据的信息有关,所述每个张量数据的信息指示以下信息中的至少一种:所述每个张量数据对应的约束关系以及所述每个张量数据流向的节点的数量,所述约束关系指示所述M个张量数据中一个张量数据的可用内存空间分别与所述M个张量数据中的其他张量数据的可用内存空间的关系。
  2. 如权利要求1所述的方法,其特征在于,若所述张量数据不可复用已分配的内存空间,则为所述张量数据分配其他的内存空间,所述其他的内存空间与所述已分配的内存空间不同。
  3. 如权利要求1或2所述的方法,其特征在于,所述约束关系指示以下至少一种关系:一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
  4. 如权利要求3所述的方法,其特征在于,所述约束关系承载于约束关系表,所述约束关系表中包括所述M个数据张量的标识,在所述约束关系表中,通过第一值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,通过第二值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,通过第三值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
  5. 如权利要求1-4任一项所述的方法,其特征在于,在第一张量数据的所有消费节点是第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据可以复用为所述第二张量数据分配的内存空间;在所述第一张量数据的所有消费节点不是所述第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点不是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据不可以复用为所述第二张量数据分配的内存空间;所述第一张量数据和所述第二张量数据为所述M个张量数据中任意的两个;所述消费节点为张量数据流向的节点,所述生产节点为张量数据流出的节点。
  6. 如权利要求1-5任一项所述的方法,其特征在于,所述计算图包括多个计算子任务,所述计算子任务通过一组节点和与所述一组节点有关的边指示一种计算功能,所述多个计算子 任务之间的执行关系为并行执行;所述方法还包括:
    在一个所述计算子任务中,若相邻两个节点之间没有有向边,则在所述相邻两个节点之间添加有向边,以更新所述计算图;其中,添加的每条有向边上承载有相应的张量数据;所述相邻两个节点为所述计算子任务中执行顺序相邻的两个节点;
    基于所述更新后的计算图,获取每个张量数据的信息。
  7. 如权利要求6所述的方法,其特征在于,所述计算图还包括执行关系为串行的第一计算子任务和第二计算子任务,所述第一计算子任务的执行顺序在所述第二计算子任务之前;所述更新所述计算图,还包括:
    若第一计算子任务的最后一个节点与第二计算子任务的第一个节点之间没有有向边,则在所述第一计算子任务的最后一个节点与所述第二计算子任务的第一个节点之间添加有向边。
  8. 如权利要求1-5任一项所述的方法,其特征在于,在所述计算图中,张量数据的生产节点的标识小于所述张量数据的消费节点的标识;所述张量数据的生产节点与所述张量数据的消费节点为相邻的两个节点。
  9. 如权利要求8所述的方法,其特征在于,所述计算图中各节点的标识用于确定所述M个张量数据中每个张量数据的信息。
  10. 如权利要求1所述的方法,其特征在于,所述每个张量数据的信息指示所述每个张量数据对应的约束关系,所述方法还包括:
    根据所述每个张量数据对应的约束关系获取所述M个张量数据各自对应的约束量;所述约束量为其他张量数据中不可以与张量数据复用同一个内存空间的张量数据的数量;
    根据所述M个张量数据各自对应的约束量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
  11. 如权利要求1所述的方法,其特征在于,所述每个张量数据的信息指示所述每个张量数据流向的节点的数量,所述方法还包括:
    根据所述M个张量数据各自对应的消费节点数量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
  12. 如权利要求1所的方法,其特征在于,所述方法还包括:
    基于所述每个张量数据的信息使用启发式算法对所述M个张量数据进行排序,以在预设时间段内获取所述M个张量数据的排序结果。
  13. 如权利要求12所述的方法,其特征在于,所述排序结果是优化后的排序结果,其中,所述优化后的排序结果对应的所述神经网络需要占用的最大内存大小小于根据优化前的排序结果确定的所述神经网络需要占用的最大内存大小。
  14. 一种内存分配方法,其特征在于,包括:
    获取神经网络对应的计算图;其中,所述计算图包括N个节点和连接不同节点的有向边,所述计算图的有向边上承载有张量数据,所述计算图中包括M个张量数据,所述M为大于1的整数;
    基于每个张量数据对应的约束关系,按照所述M个张量数据在所述神经网络中的执行顺序,依次给所述M个张量数据分配内存空间,其中,若所述M个张量数据中的一个张量数据可复用已分配的内存空间中的至少一部分,则将所述张量数据可复用的至少一部分内存空间分配给所述张量数据,所述已分配的内存空间为在所述张量数据之前,已经分配给所述M个张量数据的内存空间,所述约束关系指示所述M个张量数据中一个张量数据的可用内存空间与所述M个张量数据中的其他张量数据的可用内存空间的关系。
  15. 如权利要求14所述的方法,其特征在于,若所述张量数据不可复用已分配的内存空间,则为所述张量数据分配其他的内存空间,所述其他的内存空间与所述已分配的内存空间不同。
  16. 一种内存分配装置,其特征在于,包括:
    获取计算图单元,用于获取神经网络对应的计算图;其中,所述计算图包括N个节点和连接不同节点的有向边,所述计算图的有向边上承载有张量数据,所述计算图中包括M个张量数据,所述M为大于1的整数;
    分配单元,用于基于所述M个张量数据的排序结果,依次给所述M个张量数据分配内存空间,其中,若所述M个张量数据中的一个张量数据可复用已分配的内存空间中的至少一部分,则将所述张量数据可复用的至少一部分内存空间分配给所述张量数据,所述已分配的内存空间为在所述张量数据之前,已经分配给所述M个张量数据的内存空间,所述排序结果指示为所述M个张量数据分配内存空间时的顺序,所述排序结果与所述M个张量数据中每个张量数据的信息有关,所述每个张量数据的信息指示以下信息中的至少一种:所述每个张量数据对应的约束关系以及所述每个张量数据流向的节点的数量,所述约束关系指示所述M个张量数据中一个张量数据的可用内存空间分别与所述M个张量数据中的其他张量数据的可用内存空间的关系。
  17. 如权利要求16所述的装置,其特征在于,所述分配单元,还用于:
    若所述张量数据不可复用已分配的内存空间,则为所述张量数据分配其他的内存空间,所述其他的内存空间与所述已分配的内存空间不同。
  18. 如权利要求16或17所述的装置,其特征在于,所述约束关系指示以下至少一种关系:一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
  19. 如权利要求18所述的装置,其特征在于,所述约束关系承载于约束关系表,所述约束关系表中包括所述M个数据张量的标识,在所述约束关系表中,通过第一值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是可复用,通过第二值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用,通过第三 值指示一个张量数据的可用内存空间与另一个张量数据的可用内存空间的关系是不可复用且连续。
  20. 如权利要求16-19任一项所述的装置,在第一张量数据的所有消费节点是第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据可以复用为所述第二张量数据分配的内存空间;在所述第一张量数据的所有消费节点不是所述第二张量数据的生产节点的上游节点的情况下,或,在所述第二张量数据的所有消费节点不是所述第一张量数据的生产节点的下游节点的情况下,所述第一张量数据不可以复用为所述第二张量数据分配的内存空间;所述第一张量数据和所述第二张量数据为所述M个张量数据中任意的两个;所述消费节点为张量数据流向的节点,所述生产节点为张量数据流出的节点。
  21. 如权利要求16-20任一项所述的装置,其特征在于,所述计算图包括多个计算子任务,所述计算子任务通过一组节点和与所述一组节点有关的边指示一种计算功能,所述多个计算子任务之间的执行关系为并行执行;所述装置还包括:
    更新计算图单元,用于在一个所述计算子任务中,若相邻两个节点之间没有有向边,则在所述相邻两个节点之间添加有向边,以更新所述计算图;其中,添加的每条有向边上承载有相应的张量数据;所述相邻两个节点为所述计算子任务中执行顺序相邻的两个节点;
    获取信息单元,用于基于所述更新后的计算图,获取每个张量数据的信息。
  22. 如权利要求21所述的装置,其特征在于,所述计算图还包括执行关系为串行的第一计算子任务和第二计算子任务,所述第一计算子任务的执行顺序在所述第二计算子任务之前;所述更新计算图单元,还用于:
    若第一计算子任务的最后一个节点与第二计算子任务的第一个节点之间没有有向边,则在所述第一计算子任务的最后一个节点与所述第二计算子任务的第一个节点之间添加有向边。
  23. 如权利要求16-20任一项所述的装置,其特征在于,在所述计算图中,张量数据的生产节点的标识小于所述张量数据的消费节点的标识;所述张量数据的生产节点与所述张量数据的消费节点为相邻的两个节点。
  24. 如权利要求23所述的装置,其特征在于,所述计算图中各节点的标识用于确定所述M个张量数据中每个张量数据的信息。
  25. 如权利要求16所述的装置,其特征在于,所述每个张量数据的信息指示所述每个张量数据对应的约束关系,所述装置还包括:
    第一排序单元,用于根据所述每个张量数据对应的约束关系获取所述M个张量数据各自对应的约束量;所述约束量为其他张量数据中不可以与张量数据复用同一个内存空间的张量数据的数量;根据所述M个张量数据各自对应的约束量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
  26. 如权利要求16所述的装置,其特征在于,所述每个张量数据的信息指示所述每个张量数据流向的节点的数量,所述装置还包括:
    第二排序单元,用于根据所述M个张量数据各自对应的消费节点数量,对所述M个张量数据排序,以得到所述M个张量数据的排序结果。
  27. 如权利要求16所述的装置,其特征在于,所述装置还包括:
    第三排序单元,用于基于所述每个张量数据的信息使用启发式算法对所述M个张量数据进行排序,以在预设时间段内获取所述M个张量数据的排序结果。
  28. 如权利要求27所述的装置,其特征在于,所述排序结果是优化后的排序结果,其中,所述优化后的排序结果对应的所述神经网络需要占用的最大内存大小小于根据优化前的排序结果确定的所述神经网络需要占用的最大内存大小。
  29. A memory allocation apparatus, comprising:
    a computational graph obtaining unit, configured to obtain a computational graph corresponding to a neural network, where the computational graph comprises N nodes and directed edges connecting different nodes, a directed edge of the computational graph carries tensor data, the computational graph comprises M pieces of tensor data, and M is an integer greater than 1; and
    an allocation unit, configured to allocate memory space to the M pieces of tensor data in sequence based on a constraint relationship corresponding to each piece of tensor data and in an execution sequence of the M pieces of tensor data in the neural network, where if one piece of the M pieces of tensor data can reuse at least a part of allocated memory space, the at least a part of the memory space that can be reused by the piece of tensor data is allocated to the piece of tensor data, the allocated memory space is memory space that has been allocated to the M pieces of tensor data before the piece of tensor data, and the constraint relationship indicates a relationship between available memory space of one piece of the M pieces of tensor data and available memory space of the other pieces of the M pieces of tensor data.
  30. The apparatus according to claim 29, wherein the allocation unit is further configured to:
    if the piece of tensor data cannot reuse the allocated memory space, allocate other memory space to the piece of tensor data, where the other memory space is different from the allocated memory space.
  31. A memory allocation apparatus, comprising a processor and a memory, wherein
    the memory is configured to store a computer program, and the computer program comprises program instructions; and
    the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 15.
  32. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 15.
PCT/CN2021/119829 2020-09-29 2021-09-23 Memory allocation method, related device, and computer-readable storage medium WO2022068663A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21874324.3A EP4209902A4 (en) 2020-09-29 2021-09-23 MEMORY ALLOCATION METHOD, ASSOCIATED APPARATUS AND COMPUTER-READABLE STORAGE MEDIUM
US18/127,300 US20230236888A1 (en) 2020-09-29 2023-03-28 Memory allocation method, related device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011057095.2 2020-09-29
CN202011057095.2A CN114327844A (zh) 2020-09-29 2020-09-29 Memory allocation method, related device, and computer-readable storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/127,300 Continuation US20230236888A1 (en) 2020-09-29 2023-03-28 Memory allocation method, related device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2022068663A1 true WO2022068663A1 (zh) 2022-04-07

Family

ID=80949565

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119829 WO2022068663A1 (zh) 2020-09-29 2021-09-23 内存分配方法、相关设备及计算机可读存储介质

Country Status (4)

Country Link
US (1) US20230236888A1 (zh)
EP (1) EP4209902A4 (zh)
CN (1) CN114327844A (zh)
WO (1) WO2022068663A1 (zh)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117149398A (zh) * 2022-05-20 2023-12-01 北京希姆计算科技有限公司 一种内存分配的方法和装置
CN115018064A (zh) * 2022-06-27 2022-09-06 中国科学技术大学 一种计算节点的空间分配方法及装置
CN115080240B (zh) * 2022-06-29 2023-10-10 美的集团(上海)有限公司 语音处理模型的部署方法、电子设备及存储介质
CN118132245A (zh) * 2022-12-01 2024-06-04 马上消费金融股份有限公司 资源分配方法、装置及电子设备
CN116700996B (zh) * 2023-08-04 2023-11-07 北京燧原智能科技有限公司 一种神经网络的内存分配方法、装置、设备及介质
CN116893904B (zh) * 2023-09-11 2023-12-26 腾讯科技(深圳)有限公司 神经网络模型的内存管理方法、装置、设备、介质及产品


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180088996A1 (en) * 2016-09-23 2018-03-29 Apple Inc. Systems and Methods of Memory Allocation for Neural Networks
CN108829610A (zh) * 2018-04-02 2018-11-16 浙江大华技术股份有限公司 一种神经网络前向计算过程中的内存管理方法及设备
CN110597616A (zh) * 2018-06-13 2019-12-20 华为技术有限公司 一种神经网络的内存分配方法及装置
CN110490313A (zh) * 2019-08-14 2019-11-22 北京中科寒武纪科技有限公司 一种内存复用方法及其相关产品
CN111488221A (zh) * 2020-06-29 2020-08-04 北京一流科技有限公司 静态网络中的内存空间预配系统及其方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4209902A4

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819084A (zh) * 2022-04-26 2022-07-29 北京百度网讯科技有限公司 模型推理方法、装置、设备及存储介质
CN114819084B (zh) * 2022-04-26 2024-03-01 北京百度网讯科技有限公司 模型推理方法、装置、设备及存储介质
CN115268936A (zh) * 2022-09-27 2022-11-01 之江实验室 一种用于计算图编译的优化方法及装置
WO2024211755A1 (en) * 2023-04-05 2024-10-10 Nvidia Corporation Tensor dimension ordering techniques
CN117032954A (zh) * 2023-07-17 2023-11-10 北京泛睿科技合伙企业(有限合伙) 针对终端训练模型的内存优化方法、系统、设备及介质
CN117032954B (zh) * 2023-07-17 2024-04-26 北京泛睿科技合伙企业(有限合伙) 针对终端训练模型的内存优化方法、系统、设备及介质

Also Published As

Publication number Publication date
US20230236888A1 (en) 2023-07-27
CN114327844A (zh) 2022-04-12
EP4209902A4 (en) 2024-01-10
EP4209902A1 (en) 2023-07-12

Similar Documents

Publication Publication Date Title
WO2022068663A1 (zh) Memory allocation method, related device, and computer-readable storage medium
CN104036451B (zh) 基于多图形处理器的模型并行处理方法及装置
CN113449857B (zh) 一种数据处理方法和数据处理设备
WO2024114399A1 (zh) 分布式执行深度学习任务的优化方法和分布式系统
Ooi et al. SINGA: A distributed deep learning platform
CN112084038B (zh) 神经网络的内存分配方法及装置
CN104035751B (zh) 基于多图形处理器的数据并行处理方法及装置
Karloff et al. A model of computation for MapReduce
JáJá Parallel algorithms
US20190130268A1 (en) Tensor radix point calculation in a neural network
CN110347636B (zh) 数据执行体及其数据处理方法
US20190138373A1 (en) Multithreaded data flow processing within a reconfigurable fabric
US11144291B1 (en) Loop-oriented neural network compilation
US20240211397A1 (en) Processor Cluster Address Generation
Xiao et al. Plasticity-on-chip design: Exploiting self-similarity for data communications
CN114492782A (zh) 基于强化学习的神经网络的片上核心编译映射方法及装置
CN115168281B (zh) 一种基于禁忌搜索算法的神经网络片上映射方法和装置
CN115269204B (zh) 一种用于神经网络编译的内存优化方法及装置
US12079734B1 (en) Compilation time reduction for memory and compute bound neural networks
CN112764893A (zh) 数据处理方法和数据处理系统
Krömer et al. A comparison of many-threaded differential evolution and genetic algorithms on CUDA
Ravikumar et al. Acceleration of Image Processing and Computer Vision Algorithms
US20190130276A1 (en) Tensor manipulation within a neural network
Nichols et al. MagmaDNN: accelerated deep learning using MAGMA
Shu et al. Design of deep learning accelerated algorithm for online recognition of industrial products defects

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21874324

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021874324

Country of ref document: EP

Effective date: 20230405

NENP Non-entry into the national phase

Ref country code: DE