
CN116204847A - Computational graph optimization method, apparatus, and device

Info

Publication number: CN116204847A
Application number: CN202111431551.XA
Authority: CN (China)
Prior art keywords: node set, node, memory, producer, tensor
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 高雄, 吴一迪, 程彬, 张兆创
Current Assignee: Huawei Technologies Co., Ltd. (the listed assignees may be inaccurate)
Original Assignee: Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority applications: CN202111431551.XA; PCT/CN2022/133304 (published as WO2023093689A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a computational graph optimization method, apparatus, and device, relating to the field of artificial intelligence. The method can obtain a second computational graph from a first computational graph based on parameters and the data dependencies between nodes in the first computational graph; the second computational graph is then merged with the first computational graph to obtain a new computational graph, operators in the new computational graph are fused to obtain an optimized computational graph, and finally the optimized computational graph is executed. Optimizing the computational graph by combining recomputation with operator fusion in this way significantly reduces memory occupation without introducing large recomputation overhead, and solves the problem that a network with one or more oversized tensors cannot execute.

Description

Computational graph optimization method, apparatus, and device
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a device for optimizing a computational graph.
Background
Artificial intelligence (AI) comprises the theory, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-machine interaction, recommendation and search, basic AI theory, and the like.
At present, with the continuous development of computer technology, AI networks have been widely applied. Moreover, AI networks are becoming more and more complex, and the scale of their network models shows an increasing trend: the number of network layers, the number of parameters, the size of data sets, and so on keep growing, so the memory consumption of the network models of AI networks grows accordingly. Because the memory of current computer hardware is relatively small, usually 16 gigabytes (GB) or 32 GB, current hardware has difficulty supporting a network model whose memory consumption keeps increasing. Therefore, how to reduce the memory consumption of the network model of an AI network is a technical problem that currently needs to be solved.
Disclosure of Invention
The application provides a computational graph optimization method, apparatus, device, computer storage medium, computer program product, and chip. The computational graph is optimized by combining recomputation with operator fusion, which significantly reduces memory occupation without introducing large recomputation overhead, and solves the problem that a network with one or more oversized tensors cannot execute.
In a first aspect, the present application provides a computational graph optimization method, including: obtaining a second computational graph from a first computational graph based on parameters and the data dependencies between nodes in the first computational graph, where the parameters include one or more of an operator fusion rule, a memory threshold for the memory occupied by a tensor output by a single node in the first computational graph, a peak memory threshold corresponding to the first computational graph, and a data mutation threshold corresponding to nodes in the first computational graph; the peak memory threshold is a threshold for the memory occupied by all tensors that must be resident at the instant a node is executed during execution of the computational graph, and the data mutation threshold is a threshold for the data mutation produced by at least one node in the first computational graph; merging the second computational graph with the first computational graph to obtain a third computational graph, where in the third computational graph a first directed edge connects a first node outputting a first tensor to a second node inputting the first tensor, and a second directed edge connects a third node outputting a second tensor to a fourth node inputting the second tensor; the first directed edge points from the first node to the second node, the second directed edge points from the third node to the fourth node, the first node and the fourth node correspond to nodes in the first computational graph, and the second node and the third node correspond to nodes in the second computational graph; fusing operators in the third computational graph to obtain a fourth computational graph; and executing the fourth computational graph. In this way, the second computational graph that needs to be recomputed is screened out of the first computational graph based on the parameters, the second computational graph is merged with the first computational graph to obtain a new computational graph, operator fusion is performed on the new computational graph, and the fused computational graph is executed. Optimizing the computational graph by combining recomputation with operator fusion significantly reduces memory occupation without introducing large recomputation overhead, and solves the problem that a network with one or more oversized tensors cannot execute. A minimal sketch of these four steps follows.
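For illustration only, the Python sketch below outlines the four steps on a toy dict-based graph representation {"nodes": set, "edges": set of (src, dst)}; every name is hypothetical and the bodies are placeholders, not the patent's actual implementation.

```python
def select_recompute_subgraphs(g1, params):
    # Step 1: screen the first node sets by the parameters and build N
    # recomputation subgraphs (the "second computational graph").
    return []  # placeholder: returns no subgraphs

def merge_graphs(g1, subgraphs):
    # Step 2: union the graphs; the real merge also adds, for each subgraph,
    # the first directed edge (from the g1 node producing its input tensor)
    # and the second directed edge (to the g1 node consuming its output).
    g3 = {"nodes": set(g1["nodes"]), "edges": set(g1["edges"])}
    for sg in subgraphs:
        g3["nodes"] |= sg["nodes"]
        g3["edges"] |= sg["edges"]
    return g3

def fuse_operators(g):
    # Step 3: apply the operator fusion rules; identity placeholder.
    return g

def optimize(g1, params):
    g2 = select_recompute_subgraphs(g1, params)  # second computational graph
    g3 = merge_graphs(g1, g2)                    # third computational graph
    g4 = fuse_operators(g3)                      # fourth computational graph
    return g4                                    # step 4: g4 is then executed

print(optimize({"nodes": {"a", "b"}, "edges": {("a", "b")}}, params={}))
```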
In one possible implementation, obtaining the second computational graph from the first computational graph based on the parameters and the data dependencies between nodes in the first computational graph specifically includes: obtaining N first node sets from the producers contained in the first computational graph based on the parameters and the data dependencies between nodes in the first computational graph, where N is a positive integer greater than or equal to 1, each first node set contains at least one node, and the nodes contained in one first node set are all directly or indirectly related to one tensor output by one producer in the first computational graph and all directly or indirectly related to at least one tensor input by that producer; and recomputing each first node set to obtain N recomputation subgraphs, the N recomputation subgraphs forming the second computational graph. In this way, the second computational graph can be obtained from the first computational graph according to the parameters and the data dependencies between nodes in the first computational graph.
In one possible implementation, the parameter is the memory threshold, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold. In this way, a node set whose output tensor occupies excessive memory can automatically be taken as a required node set, alleviating the memory occupation problem of a single operator and/or tensor.
In one possible implementation, the parameter is the peak memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory of each node between that producer and consumer, when executed, is higher than the peak memory threshold. In this way, a node set whose peak memory occupation is excessive can automatically be taken as a required node set, alleviating the memory occupation problem of a single operator and/or tensor.
In one possible implementation, the parameter is the data mutation threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, a node set in which a data mutation occurs can automatically be taken as a required node set, alleviating the excessive memory occupation caused by data mutations.
In one possible implementation, the parameter is the operator fusion rule; either the first node set can be fused with its corresponding consumer, or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by means of the operator fusion rule.
In one possible implementation, the parameters are the memory threshold and the peak memory threshold; no indirect path exists between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; and/or an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory of each node between that producer and consumer, when executed, is higher than the peak memory threshold. In this way, the required node set can be screened out by combining the memory threshold and the peak memory threshold, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the memory threshold and the data mutation threshold; the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, and/or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be screened out by combining the memory threshold and the data mutation threshold, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the memory threshold and the operator fusion rule; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the memory threshold and the operator fusion rule, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the peak memory threshold and the data mutation threshold; an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory of each node between that producer and consumer, when executed, is higher than the peak memory threshold; and/or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be screened out by combining the peak memory threshold and the data mutation threshold, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the peak memory threshold and the operator fusion rule; the first node set meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory of each node between that producer and consumer, when executed, is higher than the peak memory threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the peak memory threshold and the operator fusion rule, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the data mutation threshold and the operator fusion rule; the first node set meets one or more of the following conditions: the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the data mutation threshold and the operator fusion rule, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the memory threshold, the peak memory threshold, and the data mutation threshold; the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory of each node between that producer and consumer, when executed, is higher than the peak memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be screened out by combining the memory threshold, the peak memory threshold, and the data mutation threshold, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the memory threshold, the peak memory threshold, and the operator fusion rule; the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory of each node between that producer and consumer, when executed, is higher than the peak memory threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the memory threshold, the peak memory threshold, and the operator fusion rule, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the memory threshold, the data mutation threshold, and the operator fusion rule; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the memory threshold, the data mutation threshold, and the operator fusion rule, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the peak memory threshold, the data mutation threshold, and the operator fusion rule; the first node set meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory of each node between that producer and consumer, when executed, is higher than the peak memory threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the peak memory threshold, the data mutation threshold, and the operator fusion rule, improving screening efficiency and accuracy.
In one possible implementation, the parameters are the memory threshold, the peak memory threshold, the data mutation threshold, and the operator fusion rule; the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory of each node between that producer and consumer, when executed, is higher than the peak memory threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the memory threshold, the peak memory threshold, the data mutation threshold, and the operator fusion rule, improving screening efficiency and accuracy. The sketch below expresses these parameter combinations as a single predicate.
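The parameter combinations above can be read as one predicate over a candidate node set, evaluated with whichever thresholds are enabled. The sketch below is one hedged reading: the field names are invented, and "deviation" is taken as a difference (the text also allows a ratio).

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    output_mem: int              # memory of the node set's output tensor
    input_mem: int               # memory of its input tensor(s)
    peak_mem: int                # peak memory while nodes between producer and consumer execute
    indirect_path: bool          # does an indirect path exist between producer and consumer?
    fusable_with_consumer: bool  # can the node set fuse with its consumer?

def is_first_node_set(c, mem_thr=None, peak_thr=None, jump_thr=None, use_fusion_rule=False):
    # Passing None disables a threshold; the node set qualifies if it meets
    # one or more of the enabled conditions, mirroring the text above.
    conds = []
    if mem_thr is not None:
        conds.append((not c.indirect_path) and c.output_mem > mem_thr)
    if peak_thr is not None:
        conds.append(c.indirect_path and c.peak_mem > peak_thr)
    if jump_thr is not None:
        conds.append(c.output_mem - c.input_mem > jump_thr)
    if use_fusion_rule:
        conds.append(c.fusable_with_consumer or c.indirect_path)
    return any(conds)

c = Candidate(output_mem=8 << 30, input_mem=1 << 20, peak_mem=0,
              indirect_path=False, fusable_with_consumer=False)
print(is_first_node_set(c, mem_thr=2 << 30))  # True: oversized output tensor
```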
In one possible implementation, obtaining at least one first node set from the producers contained in the first computational graph based on the parameters and the data dependencies between nodes in the first computational graph specifically includes:
obtaining a first list based on the data dependencies between nodes in the first computational graph, where the first list contains the correspondence between producers and consumers in the first computational graph; obtaining candidate node sets according to the tensor output by each producer in the first list, where the candidate node sets include at least the first node set, and for any producer in the first list, the set of nodes within that producer that are directly or indirectly related to the tensor it outputs is taken as one node set; and selecting the first node set from the candidate node sets based on the parameters. In this way, the first node set can be obtained from the producers in the first computational graph, as sketched below.
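As a rough illustration of this screening pipeline, the sketch below builds the producer-to-consumer correspondence (the "first list") from tensor-labelled edges and collects, starting from a producer's output node, every in-subgraph node directly or indirectly related to the output tensor; all names and the edge representation are assumptions.

```python
from collections import defaultdict

def build_first_list(edges):
    # edges: iterable of (producer, tensor, consumer) triples;
    # returns the producer -> [(output tensor, consumer), ...] correspondence.
    first_list = defaultdict(list)
    for producer, tensor, consumer in edges:
        first_list[producer].append((tensor, consumer))
    return dict(first_list)

def candidate_node_set(preds_in_producer, output_node):
    # preds_in_producer: node -> predecessor nodes *inside* the producer
    # subgraph; walk backwards from the node emitting the output tensor to
    # collect everything directly or indirectly related to it.
    seen, stack = set(), [output_node]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(preds_in_producer.get(n, []))
    return seen

print(build_first_list([("fused_op1", "B", "fused_op2")]))
print(candidate_node_set({"Div": ["Scatter"], "Scatter": ["Gather"]}, "Div"))
# -> {'Div', 'Scatter', 'Gather'}
```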
In one possible implementation, recomputing each first node set to obtain N recomputation subgraphs specifically includes: copying the producer subgraph corresponding to any node set of the N first node sets; and deleting, from the copied producer subgraph, the nodes other than those contained in that node set, and deleting the edges unrelated to that node set, to obtain the recomputation subgraph corresponding to that node set. A minimal sketch follows.
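A minimal sketch of the copy-and-prune step, assuming the same toy dict-based subgraph representation as above; not the patent's actual data structures.

```python
import copy

def recompute_subgraph(producer_graph, node_set):
    # producer_graph: {"nodes": set, "edges": set of (src, dst)}
    sub = copy.deepcopy(producer_graph)                 # copy the producer subgraph
    sub["nodes"] &= set(node_set)                       # drop nodes outside the node set
    sub["edges"] = {(u, v) for (u, v) in sub["edges"]   # drop unrelated edges
                    if u in node_set and v in node_set}
    return sub

g = {"nodes": {"Gather", "Scatter", "Div", "Other"},
     "edges": {("Gather", "Scatter"), ("Scatter", "Div"), ("Other", "Div")}}
print(recompute_subgraph(g, {"Gather", "Scatter", "Div"}))
```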
In one possible implementation, merging the second computational graph with the first computational graph to obtain the third computational graph specifically includes: for each of the N recomputation subgraphs, constructing a directed edge between the node of the first computational graph that outputs a first target tensor and the node of the recomputation subgraph that inputs that first target tensor, and a directed edge between the node of the recomputation subgraph that outputs a second target tensor and the node of the first computational graph that inputs that second target tensor, to obtain the third computational graph, as sketched below.
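A minimal sketch of the merge, again on the toy representation; the bindings argument, which says where each recomputation subgraph attaches, is an invented convenience.

```python
def merge(g1, recompute_subgraphs, bindings):
    # bindings[i] = (first_node, second_node, third_node, fourth_node) in the
    # terminology of the first aspect: first/fourth are nodes of g1,
    # second/third are nodes of the i-th recomputation subgraph.
    g3 = {"nodes": set(g1["nodes"]), "edges": set(g1["edges"])}
    for sub, (first, second, third, fourth) in zip(recompute_subgraphs, bindings):
        g3["nodes"] |= sub["nodes"]
        g3["edges"] |= sub["edges"]
        g3["edges"].add((first, second))   # first directed edge: g1 -> recomputation copy
        g3["edges"].add((third, fourth))   # second directed edge: recomputation copy -> g1
    return g3
```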
In one possible implementation, before the second computational graph that needs to be recomputed is obtained from the first computational graph based on the parameters and the data dependencies between nodes in the first computational graph, the method further includes: fusing operators in the first computational graph. This improves the performance of operators in the computational graph; in addition, this first round of operator fusion reduces the number of operators, which reduces the cost of analyzing operators in the subsequent recomputation process; furthermore, it enlarges the scope of operator analysis in the subsequent recomputation process and improves the recomputation effect.
In a second aspect, the present application provides a computation graph optimization apparatus, including: at least one memory for storing a program; at least one processor for executing the memory-stored program, the processor being adapted to perform the method provided in the first aspect when the memory-stored program is executed.
In a third aspect, the present application provides an apparatus comprising: at least one memory for storing a program; at least one processor for executing the memory-stored program, the processor being adapted to perform the method provided in the first aspect when the memory-stored program is executed.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which, when run on an electronic device, causes the electronic device to perform the method provided in the first aspect.
In a fifth aspect, the present application provides a computer program product for causing an electronic device to perform the method provided in the first aspect when the computer program product is run on the electronic device.
In a sixth aspect, the present application provides a chip including at least one processor and an interface; the interface is configured to provide program instructions or data to the at least one processor; and the at least one processor is configured to execute the program instructions to implement the method provided in the first aspect.
It will be appreciated that the advantages of the second to sixth aspects may be found in the relevant description of the first aspect, and are not described here again.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence subject framework provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a training process of a network model of an AI network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a training process of a network model of another AI network provided in an embodiment of the present application;
FIG. 4 is a schematic diagram comparing operators before and after fusion according to an embodiment of the present application;
FIG. 5 is a schematic diagram comparing peak memory before and after operator fusion according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a graph-kernel fusion process under an AI open-source computing framework provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of an operator fusion process provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of topological ordering of operators in a computational graph according to an embodiment of the present application;
FIG. 9 is a schematic diagram of steps for screening a recomputation node set in a memory-based manner according to an embodiment of the present application;
FIG. 10 is a schematic diagram of another set of steps for screening a recomputation node set in a memory-based manner according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a computational graph provided by an embodiment of the present application;
FIG. 12 is a schematic diagram comparing peak memory before and after recomputation of a computational graph according to an embodiment of the present application;
FIG. 13 is a schematic diagram of steps for screening a recomputation node set through operator fusion rules according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a process for determining a recomputation graph provided by an embodiment of the present application;
FIG. 15 is a schematic diagram of a process for merging the determined recomputation graph with the computational graph obtained after performing operator fusion on the original computational graph according to an embodiment of the present application;
FIG. 16 is a schematic diagram of an operator fusion process provided by an embodiment of the present application;
FIG. 17 is a schematic diagram comparing the memory required before and after optimizing the computational graph through operator fusion and recomputation according to an embodiment of the present application;
FIG. 18 is a flowchart of a computational graph optimization method according to an embodiment of the present application;
FIG. 19 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The term "and/or" herein is an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. The symbol "/" herein indicates that the associated object is or is a relationship, e.g., A/B indicates A or B.
The terms "first" and "second" and the like in the description and in the claims are used for distinguishing between different objects and not for describing a particular sequential order of objects. For example, the first response message and the second response message, etc. are used to distinguish between different response messages, and are not used to describe a particular order of response messages.
In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise specified, "a plurality of" means two or more; for example, a plurality of processing units means two or more processing units, and a plurality of elements means two or more elements.
For ease of understanding, technical terms involved in the present solution will be described first.
(1) Graph-kernel fusion
Graph-kernel fusion is a network performance optimization technique specific to MindSpore. It can automatically analyze and optimize the logic of an existing network computational graph and, combined with the capabilities of the target hardware, perform optimizations such as computation simplification and substitution, operator splitting and fusion, and operator-specialized compilation, so as to improve the utilization of device computing resources and achieve overall optimization of network performance. Compared with traditional optimization techniques, graph-kernel fusion has unique advantages such as joint optimization across operator boundaries, cross-layer collaboration with MindSpore AKG (a polyhedral-based operator compiler), and just-in-time compilation. In addition, the entire optimization process is completed automatically once the user enables the corresponding configuration; network developers need no additional awareness, so users can focus on implementing the network algorithm.
(2) Tensor (Tensor)
A tensor is a data structure used to represent data in an AI framework. A tensor is generally described by attributes such as its shape (representing the size and dimensionality of the data), its data type (e.g., float32, int32), and its memory address. A minimal sketch of such a structure follows.
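The field names below are illustrative stand-ins, not any framework's actual tensor type.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Tensor:
    shape: Tuple[int, ...]  # size and dimensionality of the data
    dtype: str              # data type, e.g. "float32" or "int32"
    addr: int               # memory address the data lives at

t = Tensor(shape=(2, 3), dtype="float32", addr=0x1000)
print(t)
```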
(3) Operator
In the field of deep learning, an operator generally refers to a computation unit that can perform simple or complex computations such as addition, subtraction, multiplication, and division; on acceleration hardware, computation is generally performed in units of operators.
(4) Memory multiplexing
Memory multiplexing is an optimization technique. In a deep learning network, physical memory can be multiplexed by pointing the tensors of different layers to the same address, thereby reducing memory occupation.
(5) Calculation map
A computational graph is a computational function expressed as a directed graph with operators as nodes. A computational graph is mainly represented by nodes and edges. A node in the computational graph may represent an operator, and an edge between two operators may represent the relationship between them. Edges may have directions, and the direction of an edge may be the flow direction of data between operators. For example, when the edge between operator a and operator b points from operator a to operator b, the tensor output by operator a is an input of operator b. In an AI framework, the computational function expressed by the computational graph sequentially calls the operator nodes of the directed graph on the input tensor to obtain the final output tensor. A toy illustration follows.
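Here the operators and kernels are made up; the adjacency dict encodes directed edges, and the runner calls the kernels on the input in topological order.

```python
graph = {"a": ["b"], "b": ["c"], "c": []}  # adjacency: operator -> successors

kernels = {"a": lambda x: x + 1, "b": lambda x: x * 2, "c": lambda x: x - 3}

def run_chain(order, kernels, x):
    # call the operator nodes in (topological) order on the input tensor
    for node in order:
        x = kernels[node](x)
    return x

print(run_chain(["a", "b", "c"], kernels, 10))  # ((10 + 1) * 2) - 3 = 19
```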
(6) Peak memory
Peak memory refers to the memory occupied by all tensors that must be resident at the instant a certain operator is executed during execution of the computational graph. For example, referring to FIG. 5 (A), at the instant operator D is executed, the tensor output by operator A, the tensor output by operator C, and the tensor to be output by operator D must all be resident, so the peak memory at that instant is T0+T1+T2; here, the input tensor of operator B and the output tensor of operator C are multiplexed into one memory. A numeric reading of this example, with made-up sizes, follows.
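A small numeric reading of the FIG. 5 (A) example; the tensor sizes are invented.

```python
def peak_memory(live_sets, sizes):
    # live_sets: operator -> tensors that must be resident at the instant it runs
    return max(sum(sizes[t] for t in live) for live in live_sets.values())

# At the instant D runs, A's output (T0), C's output (T1), and D's pending
# output (T2) are all live, so the peak there is T0 + T1 + T2.
sizes = {"T0": 4, "T1": 4, "T2": 4}                   # made-up sizes, e.g. in MB
print(peak_memory({"D": {"T0", "T1", "T2"}}, sizes))  # 12
```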
(7) The producer
Of two computational subgraphs in the computational graph that have a data dependency, the producer is the one that generates data, i.e., the operator that generates data. For example, referring to FIG. 4, take operator O1 and operator O2: they have a data dependency in which the tensor output by operator O1 is the input tensor of operator O2; therefore, operator O1 may be referred to as the producer.
(8) Consumer
Of two computational subgraphs in the computational graph that have a data dependency, the consumer is the one that uses data, i.e., the operator that uses data. For example, referring to FIG. 4, take operator O1 and operator O2: they have a data dependency in which the tensor output by operator O1 is the input tensor of operator O2; therefore, operator O2 may be referred to as the consumer.
(9) Node set
A node set refers to a set of one or more nodes. The nodes included in a node set are all directly or indirectly related to one tensor output by a producer in the computational graph, and all directly or indirectly related to at least one tensor input by that producer. It can also be understood that, when a node set includes a plurality of nodes, the computational graph composed of those nodes may obtain at least one tensor from outside and output one tensor to outside; that is, at least one of the nodes may obtain an externally input tensor, and only one node may output a tensor to outside, while the tensors output by the other nodes can only be used inside that computational graph.
For example, as shown in FIG. 11, Gather, Scatter, and Div may form a node set. These three nodes form a computational graph in which Gather, Scatter, and Div are connected in sequence: there is an edge between Gather and Scatter and an edge between Scatter and Div. Each of the three nodes is directly or indirectly related to the tensor B output by the corresponding producer fused_op1, and directly or indirectly related to the tensor A input by that producer. In other words, the computational graph composed of these three nodes can obtain one tensor A from outside through Gather and output one tensor B to outside through Div; neither Gather nor Scatter can output a tensor to outside the computational graph, and the tensors they output can only be used inside it. As shown in FIG. 14 (B), Gather alone may also constitute a node set; the computational graph corresponding to this node set may be as shown in FIG. 14 (C), obtaining a tensor A from outside and outputting a tensor B to outside.
(10) Data dependency between nodes
When data required by one node needs to be provided by another node, the two nodes have a data dependency relationship. For example, when the tensor output by one node is the tensor input by another node, then there is a data dependency between the two nodes.
(11) Data mutation threshold
The data mutation threshold refers to a threshold for the data mutation produced by at least one node in the computational graph. For a single node, the data mutation threshold is a threshold on the deviation (such as a difference or a ratio) between the memory occupied by the tensor the node outputs and the memory occupied by the tensor the node inputs. For a plurality of nodes, the plurality of nodes may form a node set as described above, and the data mutation threshold is a threshold on the deviation (such as a difference or a ratio) between the memory occupied by the output tensor corresponding to that node set and the memory occupied by the at least one input tensor corresponding to it. The output tensor corresponding to a node set is the tensor that the computational graph formed by its nodes outputs to the outside, and the input tensors corresponding to the node set are the tensors that this computational graph obtains from the outside. A small sketch of this check follows.
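In this sketch, summing several input-tensor memories is an assumption; both the difference and the ratio readings of "deviation" are shown.

```python
def exceeds_data_mutation(output_mem, input_mems, threshold, as_ratio=False):
    # deviation between the output-tensor memory and the input-tensor memory
    total_in = sum(input_mems)
    deviation = output_mem / max(total_in, 1) if as_ratio else output_mem - total_in
    return deviation > threshold

print(exceeds_data_mutation(1 << 30, [1 << 20], threshold=1 << 29))           # difference reading
print(exceeds_data_mutation(1 << 30, [1 << 20], threshold=8, as_ratio=True))  # ratio reading
```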
Next, a description will be given of a technical solution in the embodiment of the present application.
By way of example, FIG. 1 illustrates a schematic diagram of an artificial intelligence subject framework that describes the overall workflow of an artificial intelligence system, applicable to general artificial intelligence field requirements.
The above-described artificial intelligence topic framework is described below in terms of two dimensions, the "Intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects a list of processes from the acquisition of data to the processing. For example, there may be general procedures of intelligent information awareness, intelligent information representation and formation, intelligent reasoning, intelligent decision making, intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" gel process.
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry from the underlying infrastructure of personal intelligence, information (provisioning and processing technology implementation), to the industrial ecological process of the system.
(1) Infrastructure:
the infrastructure provides computing capability support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the base platform. Communicating with the outside through the sensor; the computing power is provided by a smart chip (CPU, NPU, GPU, ASIC, FPGA and other hardware acceleration chips); the basic platform comprises a distributed computing framework, a network and other relevant platform guarantees and supports, and can comprise cloud storage, computing, interconnection and interworking networks and the like. For example, the sensor and external communication obtains data that is provided to a smart chip in a distributed computing system provided by the base platform for computation.
(2) Data
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to the internet of things data of the traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capability
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and achieve practical deployment. The main application fields include intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe cities, intelligent terminals, and the like.
It should be noted that, the operator fusion process referred to in the present application may be located in the data processing stage in (3) above.
By way of example, FIG. 2 shows the training process of the network model of an AI network. The training process of the network model may include a forward computation process and a backward computation process. As shown in FIG. 2, the part to the left of the broken line x is the forward computation process, the part to the right of the broken line x is the backward computation process, and each block in the figure represents a tensor. In FIG. 2, tensors in the same row multiplex the same memory address; for example, tensor A and tensor A_b multiplex one memory address, tensor B and tensor B_b multiplex one memory address, ..., tensor F and tensor F_b multiplex one memory address, and so on. In FIG. 2, tensors A_b, B_b, C_b, D_b, E_b, and F_b are the reverse-operator data corresponding to tensors A, B, C, D, E, and F, respectively. During the training of the network model, forward computation is performed before backward computation, and the tensors obtained by forward computation are stored in the memory of the corresponding computing device. In the backward computation, the computation of tensor A_b depends on tensors In and B_b, the computation of tensor B_b depends on tensors A and C_b, the computation of tensor C_b depends on tensors B and D_b, the computation of tensor D_b depends on tensors C and E_b, and so on. In the backward computation process, after a tensor is computed, the memory of the corresponding forward-computed tensor is released; for example, after tensor B_b is computed, the memory occupied by tensor A is released, so that the subsequently computed tensor A_b can reuse the memory that tensor A occupied. It can be seen that, in the training process of the network model shown in FIG. 2, the data obtained by forward computation occupies memory for a long time, which results in excessive memory consumption of the network model.
In general, in FIG. 2, the memory occupation problem can be alleviated by recomputation or by operator fusion. With recomputation, the memory occupied by some of the results of forward computation in FIG. 2 is released in advance, trading time for space, and those results are recomputed during the subsequent backward computation; this shortens the lifecycle of that data and reduces memory occupation. Specifically, as shown in FIG. 3, the forward-computed tensor A may continuously occupy memory until tensor B_b is obtained in the backward computation, and the forward-computed tensor D may continuously occupy memory until tensor E_b is obtained in the backward computation. The forward-computed tensors B, C, E, and F can be released immediately after they are used in the forward computation, and the computation that produced each of them is re-executed during the backward computation: the memory occupied by tensor B can be released once tensor C is obtained in the forward computation, the memory occupied by tensor C can be released once tensor D is obtained, tensor C can be recomputed when tensor D_b is computed backward, and tensor B can be recomputed when tensor C_b is computed backward. In this way, memory occupation can be reduced by recomputation. With continued reference to FIG. 2, if each block occupies one unit of memory and the forward computation ends when tensor F is obtained, the data held at the end of forward computation occupies 7 units of memory in total; in FIG. 3, since tensors B, C, E, and F are released immediately after use during forward computation, these four tensors are considered to occupy no memory, and the data held at the end of forward computation occupies only 3 units of memory in total. Thus, memory occupation can be reduced by recomputation.
However, recomputation usually requires manually specifying recomputation points or automatically solving for them. Manually specifying recomputation points requires the user to identify the recomputation split points, which demands high user expertise, must be configured separately for different models and different hardware backends, and is inefficient. Schemes that automatically solve for recomputation points either have long solution times (on the order of hours, exceeding the duration of most training tasks themselves) or require statistics collected at run time and cannot be combined with operator fusion. In addition, recomputation can only be applied to data dependencies between the forward and backward passes, not to data dependencies within the forward or backward pass, so its applicable scenarios are limited. Furthermore, recomputation does not solve the memory occupation problem of a single operator or a single oversized tensor. For example, with continued reference to FIG. 3, if the data size of tensor A is too large, the memory it occupies will be large, and in this case a large amount of memory will remain continuously occupied.
With operator fusion, network performance is improved by fusing several adjacent operators into one operator, which can also reduce memory usage. Specifically, as shown in FIG. 4, FIG. 4 (A) is the computation pattern before fusion: tensor A is input to operator O1 to obtain tensor B, tensor B is input to operator O2 to obtain tensor C, and tensor C is input to operator O3 to obtain tensor D. Using preset fusion rules, the three operators O1, O2, and O3 may be fused into one operator, and the computation that obtains tensor D from tensor A is converted into the form shown in FIG. 4 (B): tensor A is input to the fused operator (O1+O2+O3) to directly obtain tensor D. Comparing FIG. 4 (A) and (B), FIG. 4 (A) has 4 tensors that need to occupy memory, while FIG. 4 (B) has only 2. Thus, operator fusion can reduce memory occupation.
However, limited by memory size and hardware processing capability, operator fusion can only handle spatially adjacent operators, and the fusion range and scale are limited. Moreover, a large amount of computation lies between the forward operators and the reverse operators, so forward computation results cannot be fused directly with their reverse counterparts and still need to be temporarily kept in memory. In addition, in some multi-input scenarios, operator fusion extends the lifecycles of different tensors to the whole fused operator and actually increases memory occupation. Illustratively, as shown in FIG. 5, FIG. 5 (A) shows the execution order of the operators before fusion, the lifecycles of each operator's input and output tensors, and the peak memory when each operator executes; FIG. 5 (B) shows the same after operator fusion. In FIG. 5, operator A, operator B, and operator C may be fused into operator E. In FIG. 5 (A), the tensor output by operator B multiplexes one memory with the tensor input by operator A, the tensor output by operator C multiplexes one memory with the tensor input by operator B, and the tensor output by operator D multiplexes one memory with the tensor input by operator C. In FIG. 5 (B), the tensor output by operator D may multiplex one memory with the tensor input by operator E; while operator C is executing (which now amounts to operator E executing), the memories T0, T1, and T2 cannot be multiplexed, and only a new memory, T3, can be requested. As can be seen from FIG. 5 (A), the peak memory when using the computational graph before operator fusion is (T0+T1+T2); as can be seen from FIG. 5 (B), the peak memory after operator fusion is (T0+T1+T2+T3). The peak memory after operator fusion is thus significantly higher than before.
As can be seen from the above analysis, neither recomputation alone nor operator fusion alone can solve the problem that a network fails to execute due to high peak memory occupation caused by long-lifecycle tensors crossing long path dependencies, nor the problem that a network fails to execute because a single operator or a single tensor exceeds the memory limit.
In view of this, the present application provides a computational graph optimization method, which mainly obtains, from the computational graph to be optimized, a recomputation graph that needs to be recomputed, merges the obtained recomputation graph with the computational graph to be optimized to generate a new computational graph, and finally performs operator fusion on the new computational graph. By combining recomputation with operator fusion in this way, memory occupation is significantly reduced without introducing large recomputation overhead (including compilation time and run time), solving the problem that a network with one or more oversized tensors cannot execute.
By way of example, FIG. 6 shows a schematic diagram of the graph-kernel fusion process under an AI open-source computing framework. The AI open-source computing framework shown in FIG. 6 may be MindSpore. As shown in FIG. 6, the framework may include a front-end representation layer (not shown), a MindSpore layer, and a MindSpore Auto Kernel Generator layer (MindSpore AKG layer). The front-end representation layer may provide at least one application programming interface (API) for the user. When using the framework, a user can select the network model 401 they wish to use at the front-end representation layer and configure various data through it. Thereafter, the MindSpore layer can build the network model and automatic differentiation into a user-configured MindSpore Expression (ME) front-end representation, and then automatically build a computational graph based on that ME front-end representation. Common optimizations may first be performed on the computational graph, for example eliminating unnecessary code and optimizing data that occurs multiple times; back-end optimization is then performed, for example selecting a more suitable expression according to the characteristics of different hardware. After the back-end optimization, an optimized computational graph can be generated; this process of obtaining the optimized computational graph can be understood as the graph-kernel fusion process. The computational graph optimized by graph-kernel fusion can contain several operator definition descriptions, which describe the computation process of each operator in the graph. These can be passed to MindSpore AKG in JSON format for compilation and optimization, and a code generator is called to generate back-end hardware code and compile it into a back-end Kernel. A back-end Kernel can be understood as the expression of an operator on hardware, for example the binary file corresponding to the operator. After graph and operator compilation is completed, the MindSpore layer can call the Kernels in the order given by the computational graph and execute the corresponding computational graph.
The present application mainly focuses on the back-end optimization part of the MindSpore layer. By combining recomputation with operator fusion, the back-end optimization part of the MindSpore layer can reduce the memory consumption of the network model, and can therefore provide performance and memory optimization for different types of hardware backends (such as graphics processors, neural network processors, and central processing units).
The following describes in detail the calculation graph optimization scheme provided in the present application, based on the MindSpore AI open source computing framework described above and with reference to the accompanying drawings. It should be understood that the MindSpore AI open source computing framework described in the embodiments of the present application is only schematically illustrative; the calculation graph optimization method provided in the embodiments of the present application is not limited to this framework and may also be applied to other frameworks, which still fall within the protection scope of the present application.
By way of example, the calculation graph optimization scheme may be, but is not limited to being, run on any apparatus, device, platform, or cluster of devices having computing and processing capabilities. The calculation graph optimization scheme mainly comprises two steps: setting a configuration item interface, and calculation graph optimization. The configuration item interface is mainly used for the user to configure some necessary parameters for use in the calculation graph optimization process; calculation graph optimization is mainly a process of combining operator fusion with recalculation. Both are described in detail below.
(I) Setting the configuration item interface
MindSpore may be configured with a Context module and a graph fusion module. The Context module may be used to receive the user's global settings for the running process. The graph fusion module may be, but is not limited to being, responsible for graph-kernel fusion of computational graphs and operators. The graph fusion module may be, but is not limited to, an AI compiler module configured in MindSpore, which may be responsible for compilation optimization of the computational graph and operators.
In the embodiment of the present application, two configuration item interfaces may be, but are not limited to being, added in the Context module. One configuration item interface may be used to configure the memory threshold and/or the peak memory threshold, and the other may be used to configure the data threshold. The user can set the values of these two configuration items according to the actual network and data set scenario. When not set by the user, system default values may be employed, or values may be automatically generated from the hardware environment; for example, the memory size of the graphics processing unit (GPU) may be, but is not limited to being, used as the memory threshold and/or the peak memory threshold.
It can be understood that the configuration items can be set in real time or preset, depending on the actual situation, which is not limited herein.
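By way of illustration only, the following Python sketch shows what such a configuration item interface might look like. The class and method names (OptimizationContext, set_memory_thresholds, set_data_threshold) are hypothetical stand-ins, not the actual MindSpore Context API, and the default device memory size is an assumed value.

```python
# Hypothetical sketch of the two configuration item interfaces; not the real
# MindSpore Context API.

class OptimizationContext:
    """Holds the user-configurable thresholds for calculation graph optimization."""

    ASSUMED_GPU_MEMORY_BYTES = 32 * 1024**3  # assumed device memory fallback

    def __init__(self):
        self.memory_threshold = None       # per-tensor memory threshold (bytes)
        self.peak_memory_threshold = None  # peak memory threshold (bytes)
        self.data_threshold = None         # data (mutation) threshold (bytes)

    def set_memory_thresholds(self, memory=None, peak=None):
        """Configuration item interface 1: memory and/or peak memory threshold."""
        self.memory_threshold = memory
        self.peak_memory_threshold = peak

    def set_data_threshold(self, data):
        """Configuration item interface 2: data threshold."""
        self.data_threshold = data

    def resolve(self):
        """When the user sets nothing, fall back to defaults derived from hardware."""
        if self.memory_threshold is None:
            self.memory_threshold = self.ASSUMED_GPU_MEMORY_BYTES
        if self.peak_memory_threshold is None:
            self.peak_memory_threshold = self.ASSUMED_GPU_MEMORY_BYTES
        if self.data_threshold is None:
            self.data_threshold = 0


ctx = OptimizationContext()
ctx.set_memory_thresholds(memory=40, peak=100)  # byte values from the later examples
ctx.set_data_threshold(10)
ctx.resolve()
```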
(II) Computational graph optimization
The process of calculation graph optimization is mainly a process of combining operator fusion with recalculation. This process may be performed, but is not limited to being performed, in the graph fusion module configured in MindSpore. The graph fusion module can read the memory threshold and/or peak memory threshold and the data threshold configured by the user in the Context module. The calculation graph optimization process mainly comprises primary operator fusion, recalculation, and secondary operator fusion, which are described in detail below.
(1) Primary operator fusion
When the calculation graph is optimized, operator fusion can first be performed on the obtained calculation graph. By way of example but not limitation, based on preset operator fusion rules (such as an elementwise+elementwise fusion rule, an elementwise+reduction fusion rule, a segment+elementwise fusion rule, etc.), and according to the classes of the operators in the calculation graph (such as computation-intensive, memory-access-intensive, etc.) and the back-end features of the hardware, the calculation graph may be divided into a plurality of calculation subgraphs, where each calculation subgraph may correspond to one operator obtained after fusion (hereinafter referred to as a "fusion operator"), thereby obtaining an optimized calculation graph, i.e., a primary fusion subgraph. This improves the performance of the operators in the calculation graph. In addition, primary operator fusion reduces the number of operators, which reduces the cost of analyzing operators in the subsequent recalculation process. It also enlarges the scope of operator analysis in the subsequent recalculation process, improving the recalculation effect. For example, referring to fig. 11, assume that before operator fusion the operator Div exhibits no data mutation, while after operator fusion the output tensor B corresponding to fuseOp1 is far greater than the input tensor A. Before operator fusion, it is therefore difficult for the operator Div to be selected into a recalculation node set by way of data mutation; after operator fusion, the fused operator fuseOp1 containing the operator Div exhibits a data mutation, so the set of operators related to the output tensor B can be used as a recalculation node set. Operator fusion thus enlarges the scope of operator analysis in the recalculation process.
For example, as shown in fig. 7, fig. 7 (A) is an initially obtained calculation graph including 6 operators, which are Slice, Gather, ScatterAdd, MatMul, Mul, and Div. In fig. 7 (A), the output of Slice is tensor A; the input of Gather is tensor A and its output is tensor B; the input of ScatterAdd is tensor B and its output is tensor C; the input of MatMul is tensor C and its output is tensor D; the input of Mul is tensor D and its output is tensor E; the inputs of Div are tensor B and tensor D. If the preset operator fusion rules are: fuse Gather with ScatterAdd, and fuse Mul with Div, then the calculation graph shown in fig. 7 (B) can be obtained after performing operator fusion on fig. 7 (A) based on these rules. Fig. 7 (B) includes 4 calculation subgraphs, which correspond respectively to 4 fusion operators: Slice, FuseOp1, MatMul, and FuseOp2. In fig. 7 (B), the output of Slice is tensor A; the input of FuseOp1 is tensor A, and its outputs are tensor B and tensor C; the input of MatMul is tensor C, and its output is tensor D; the inputs of FuseOp2 are tensor D and tensor B. It will be appreciated that since fig. 7 (B) is a calculation graph obtained by operator fusion, each operator in fig. 7 (B) may be referred to as a "fusion operator".
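By way of illustration only, the following Python sketch replays the fig. 7 fusion; the dictionary graph encoding and the pairwise fusion-rule table are assumptions made for the sketch, not any framework's real data structures.

```python
# Sketch of primary operator fusion on the fig. 7 graph: group each rule pair
# into one fusion operator and drop the edges internal to a group.

graph = {                       # producer -> consumers, per fig. 7 (A)
    "Slice": ["Gather"],
    "Gather": ["ScatterAdd", "Div"],
    "ScatterAdd": ["MatMul"],
    "MatMul": ["Mul", "Div"],
    "Mul": [],
    "Div": [],
}
fusion_rules = [("Gather", "ScatterAdd"), ("Mul", "Div")]

def fuse(graph, rules):
    group = {n: n for n in graph}                 # node -> fusion operator name
    for i, (a, b) in enumerate(rules, 1):
        group[a] = group[b] = f"FuseOp{i}"
    fused = {}
    for node, consumers in graph.items():
        src = group[node]
        fused.setdefault(src, set())
        for c in consumers:
            if group[c] != src:                   # skip edges inside a fusion group
                fused[src].add(group[c])
    return {k: sorted(v) for k, v in fused.items()}

print(fuse(graph, fusion_rules))
# {'Slice': ['FuseOp1'], 'FuseOp1': ['FuseOp2', 'MatMul'],
#  'MatMul': ['FuseOp2'], 'FuseOp2': []}  -> matches fig. 7 (B)
```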
It can be appreciated that in the embodiment of the present application, primary operator fusion is optional and may be applied according to the actual situation. In some embodiments, this step may be performed, followed by the subsequent recalculation step. In other embodiments, this step may be skipped and the subsequent recalculation step performed directly.
(2) Recalculation
The recalculation process generally includes one or more of: building a producer-consumer list, determining a recalculation candidate set, screening recalculation node sets from the recalculation candidate set, determining a recalculation graph, and generating a new calculation graph, as described in detail below.
A) Building a producer-consumer list
According to the data dependency relationships among the nodes in the calculation graph, the nodes in the calculation graph are first topologically sorted, for example, sorted according to the dependency relationships among the operators in the calculation graph, and a producer-consumer list is then obtained from the sorting result. Of two calculation subgraphs with a data dependency, the producer is the subgraph, i.e., the operator, that generates the data, and the consumer is the subgraph, i.e., the operator, that uses the data.
For example, as shown in fig. 8, fig. 8 (A) is a calculation subgraph obtained after performing primary operator fusion. The nodes in the calculation graph shown in fig. 8 (A) can be topologically sorted based on the data dependency relationships between them, giving the sorting result shown in fig. 8 (B). In fig. 8 (B): Slice and FuseOp1 have a data dependency, FuseOp1 has data dependencies with MatMul and FuseOp2 respectively, MatMul and Div have a data dependency, and Div and FuseOp2 have a data dependency. For Slice and FuseOp1, Slice generates data and FuseOp1 uses the data generated by Slice, so Slice is the producer and FuseOp1 is the consumer. Similarly, for FuseOp1 and MatMul, FuseOp1 is the producer and MatMul is the consumer; for FuseOp1 and FuseOp2, FuseOp1 is the producer and FuseOp2 is the consumer; for MatMul and Div, MatMul is the producer and Div is the consumer; for Div and FuseOp2, Div is the producer and FuseOp2 is the consumer. Thus, from fig. 8 (B), a producer-consumer list corresponding to the calculation graph in fig. 8 (A) can be obtained, as shown in Table 1.
TABLE 1
Producer Consumer
Slice FuseOp1
FuseOp1 MatMul
FuseOp1 FuseOp2
MatMul Div
Div FuseOp2
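As a rough illustration of step A), the sketch below topologically sorts the once-fused graph of fig. 8 with Python's standard graphlib module and emits the producer-consumer pairs of Table 1; the dictionary encoding of the graph is an assumption made for the sketch.

```python
# Sketch: build the producer-consumer list from a topological sort.
from graphlib import TopologicalSorter

graph = {                       # producer -> consumers, per fig. 8 (B)
    "Slice": ["FuseOp1"],
    "FuseOp1": ["MatMul", "FuseOp2"],
    "MatMul": ["Div"],
    "Div": ["FuseOp2"],
    "FuseOp2": [],
}

# graphlib expects predecessor sets, so invert the consumer lists first.
preds = {n: set() for n in graph}
for producer, consumers in graph.items():
    for c in consumers:
        preds[c].add(producer)

order = TopologicalSorter(preds).static_order()
producer_consumer = [(p, c) for p in order for c in graph[p]]
for p, c in producer_consumer:
    print(p, "->", c)           # reproduces the five rows of Table 1
```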
B) Determining a recalculation candidate set
After the producer-consumer list is constructed, for any producer in the list, the set of nodes in the producer that are directly or indirectly related to a tensor output by that producer can be used as one element of the recalculation candidate set; such an element may be referred to as a recalculation node set. A recalculation candidate set is thereby obtained. A recalculation node set may include one node or a plurality of nodes.
For example, with continued reference to FIG. 8, take the producer FuseOp1 as an example; its output tensors are tensor B and tensor C. The node directly related to tensor B is Gather, and there is no node indirectly related to tensor B. The node directly related to tensor C is ScatterAdd, and the node indirectly related to tensor C is Gather. Thus, the recalculation candidate set obtained from the producer FuseOp1 is [(Gather), (Gather, ScatterAdd)]. This recalculation candidate set comprises two recalculation node sets: one is (Gather), comprising one node, namely the node Gather; the other is (Gather, ScatterAdd), comprising two nodes, namely the nodes Gather and ScatterAdd.
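A rough sketch of step B) for the producer FuseOp1 follows; the internal structure of FuseOp1 (Gather producing tensor B, ScatterAdd consuming B and producing tensor C) is read off fig. 8, and the dictionary encoding is an assumption of the sketch.

```python
# Sketch: per output tensor of a producer, collect the internal nodes directly
# or indirectly related to that tensor; each such set is one candidate.

internal_preds = {"Gather": [], "ScatterAdd": ["Gather"]}   # internal dependencies
output_tensor_defs = {"B": "Gather", "C": "ScatterAdd"}     # tensor -> defining node

def related_nodes(node, preds):
    """The node itself plus every internal node it depends on."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(preds[n])
    return seen

candidates = [sorted(related_nodes(defining_node, internal_preds))
              for defining_node in output_tensor_defs.values()]
print(candidates)  # [['Gather'], ['Gather', 'ScatterAdd']]
```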
C) Screening recalculation node sets from the recalculation candidate set
The manner of screening recalculation node sets from the recalculation candidate set may include one or more of screening based on memory size, screening based on data mutation, screening based on operator fusion rules, and the like. The three manners can be used alone or combined arbitrarily, depending on the actual situation, which is not limited herein. These three manners are described separately below.
a) Memory size based screening
When screening based on the memory size, the required recalculation node sets corresponding to a producer can be screened from the recalculation candidate set according to the memory size occupied by the output tensor of each recalculation node set in the recalculation candidate set. Specifically, as shown in fig. 9, in S901, the memory occupied by the output tensor corresponding to each recalculation node set in the recalculation candidate set may be compared with the memory threshold or peak memory threshold set in the earlier stage. In S902, when the memory occupied by the output tensor corresponding to a recalculation node set is greater than or equal to the memory threshold or peak memory threshold set in the earlier stage, that recalculation node set is determined to be a required recalculation node set and may be retained. In S903, when the memory occupied by the output tensor corresponding to a recalculation node set is smaller than the memory threshold or peak memory threshold set in the earlier stage, it is determined that the recalculation node set is not a required recalculation node set and it may be discarded. The required recalculation node sets are thereby screened from the recalculation candidate set. In one example, the output tensor corresponding to a recalculation node set refers to a tensor output by the producer corresponding to the recalculation node set, with each node in the recalculation node set directly or indirectly related to that tensor. The input tensor corresponding to a recalculation node set refers to at least one tensor input to the producer corresponding to the recalculation node set, with each node in the recalculation node set directly or indirectly related to each tensor in the at least one tensor.
As another possible implementation, in the process of obtaining the producer-consumer list, it may also be recorded whether an indirect path exists between the producer and the consumer, i.e., whether a tensor output by the producer also indirectly serves as an input tensor of the consumer in addition to directly serving as one. For example, with continued reference to fig. 8, for the producer-consumer pair FuseOp1-FuseOp2, the tensor B output by the producer FuseOp1 directly serves as an input tensor of the consumer FuseOp2, and the tensor C output by the producer FuseOp1 indirectly serves as an input of FuseOp2 via MatMul and Div, so an indirect path exists between FuseOp1 and FuseOp2.
For any producer-consumer pair, as shown in fig. 10, it may be judged at S1001 whether an indirect path exists between the producer and the consumer. When no indirect path exists, the required recalculation node sets corresponding to the producer can be screened from the recalculation candidate set according to the memory size occupied by the output tensor of each recalculation node set corresponding to the producer. Specifically, in S1002, the memory occupied by the output tensor corresponding to each such recalculation node set may be compared with the memory threshold set in the earlier stage. In S1003, when the memory occupied by the output tensor corresponding to a recalculation node set is greater than or equal to the memory threshold, the recalculation node set is determined to be a required recalculation node set and may be retained. In S1004, when the memory occupied by the output tensor corresponding to a recalculation node set is smaller than the memory threshold, it is determined not to be a required recalculation node set and may be discarded. The required recalculation node sets are thereby screened from the recalculation candidate set.
For example, with continued reference to fig. 8, for the producer-consumer pair MatMul-Div, there is no indirect path between the two. Suppose the memory threshold set in the earlier stage is 40 bytes, and the tensor output by the producer MatMul occupies 50 bytes of memory. Because there is only a direct path between MatMul and Div and the memory occupied by the tensor output by the producer MatMul is larger than the memory threshold, MatMul can be used as a required recalculation node set.
Illustratively, the memory occupied by the tensor output by a producer may be determined based on, but not limited to, the data size of that tensor. After the calculation graph is determined, the structure (such as dimensions and type) of the tensor output by each operator in the calculation graph can be determined, so the memory size occupied by each output tensor can be determined from its structure. For example, if the dimensions of a tensor are [4,4,3,2] and its type is float32, then since each float32 element occupies 4 bytes, the memory size occupied by the tensor is: 4x4x3x2x4 = 384 bytes.
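Expressed in code, the computation above is a one-liner; the dtype-size table below is a standard convention and not specific to any framework.

```python
# Sketch: memory occupied by a tensor, from its dimensions and element type.
from math import prod

DTYPE_BYTES = {"float16": 2, "float32": 4, "int64": 8}

def tensor_bytes(shape, dtype):
    """Product of the dimensions times the bytes per element."""
    return prod(shape) * DTYPE_BYTES[dtype]

print(tensor_bytes([4, 4, 3, 2], "float32"))  # 4x4x3x2x4 = 384
```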
With continued reference to FIG. 10, in S1005, when an indirect path exists between the producer and the consumer, the execution-time peak memory of each operator on the indirect path between the producer and the consumer may be calculated. Next, in S1006, it is judged whether the peak memory corresponding to at least one operator is greater than or equal to the peak memory threshold set in the earlier stage. Next, in S1007, when the peak memory corresponding to at least one operator is greater than or equal to the peak memory threshold, the set of operators in the producer that are directly and indirectly related to each tensor the producer inputs to the consumer may be used as a required recalculation node set. When an indirect path exists between the producer and the consumer but the peak memories corresponding to all the operators on it are smaller than the peak memory threshold, the set of operators directly and indirectly related to the tensor the producer inputs to the consumer may be regarded as not being a required recalculation node set, and may be discarded at S1008. The required recalculation node sets are thus screened from the recalculation candidate set.
For example, as shown in fig. 11, fuseOp1 and fuseOp2 may constitute a producer-consumer pair in fig. 11. An indirect path exists between fuseOp1 and fuseOp2, and the operator on the indirect path is MatMul. If the memory during MatMul execution (i.e., the sum of the memories occupied by tensor B, tensor C, tensor D, and tensor E) is greater than the peak memory threshold set in the earlier stage, the set of operators directly and indirectly related to tensor B and the set of operators directly and indirectly related to tensor E may be used as required recalculation node sets. The set of operators directly and indirectly related to tensor B is (Gather), and the set of operators directly and indirectly related to tensor E is (Gather, ScatterAdd, Div), so the recalculation node sets screened here are (Gather) and (Gather, ScatterAdd, Div), respectively.
In some embodiments, to improve the accuracy of screening based on the peak memory threshold, the comparison between the memory occupied by the tensor the producer inputs to the consumer (i.e., the tensor output by the producer) and the memory occupied by the producer's input tensors related to that tensor can be used as an additional judgment condition in screening. That is, when an indirect path exists between the producer and the consumer, when the memory occupied by the tensor the producer inputs to the consumer is larger than the memory occupied by the related input tensors required by the producer, and when the peak memory corresponding to at least one operator on the indirect path between the producer and the consumer is greater than or equal to the peak memory threshold set in the earlier stage, the set of operators in the producer directly and indirectly related to each tensor the producer inputs to the consumer can be used as a required recalculation node set.
For example, as shown in fig. 12 (A), for the producer-consumer pair of operator 1 and operator 5, an indirect path exists between the two, the peak memory at the time of execution of operator 4 is the highest, and that peak memory is (T1+T2+T3+T4). If the peak memory when operator 4 executes exceeds the peak memory threshold set in the earlier stage, operator 1 can be used as a required recalculation node set. As shown in fig. 12 (B), when operator 1 is taken as the required recalculation node set, the recalculated peak memory is (T0+T2+T3+T4). If T1 is greater than T0, the peak memory corresponding to the calculation graph is reduced after recalculation; if T1 is less than T0, the peak memory corresponding to the calculation graph is increased after recalculation; and if T1 equals T0, the peak memory corresponding to the calculation graph is unchanged after recalculation, but the system overhead is increased by the recalculation. Therefore, operator 1 can be taken as the required recalculation node set when T1 is greater than T0.
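The two branches of the fig. 9/fig. 10 flow can be summarized in a short predicate; the candidate encoding below and the threshold values are assumptions made for the sketch.

```python
# Sketch of memory-size screening: no indirect path -> compare the producer's
# output tensor memory with the memory threshold; indirect path -> compare the
# per-operator execution peak memories on the path with the peak threshold.

def keep_candidate(candidate, memory_threshold, peak_memory_threshold):
    """candidate: dict with has_indirect_path, output_bytes, path_peak_bytes."""
    if not candidate["has_indirect_path"]:
        return candidate["output_bytes"] >= memory_threshold          # S1002-S1004
    return any(peak >= peak_memory_threshold                          # S1005-S1008
               for peak in candidate["path_peak_bytes"])

# MatMul-Div from fig. 8: direct path only, 50-byte output, 40-byte threshold.
matmul = {"has_indirect_path": False, "output_bytes": 50, "path_peak_bytes": []}
print(keep_candidate(matmul, memory_threshold=40, peak_memory_threshold=100))  # True
```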
b) Data mutation based screening
When screening based on data mutation, it may be determined whether a data mutation exists for a recalculation node set in the recalculation candidate set. If the deviation value (such as a difference, a ratio, etc.) between the memory occupied by the output tensor corresponding to the recalculation node set and the memory occupied by the at least one input tensor corresponding to the recalculation node set is greater than or equal to the data threshold set in the earlier stage, indicating that a data mutation exists, the recalculation node set is retained; otherwise, it is discarded. This addresses the situation of excessive memory occupation caused by data mutation.
For example, with continued reference to FIG. 11, suppose the recalculation node sets in the recalculation candidate set are [(Gather), (Gather-ScatterAdd-Div)]. If the memory occupied by tensor B minus the memory occupied by tensor A is larger than the data threshold set in the earlier stage, a data mutation exists for the recalculation node set (Gather), which may therefore be retained. If the memory occupied by tensor E minus the memory occupied by tensor A is smaller than the data threshold set in the earlier stage, the recalculation node set (Gather-ScatterAdd-Div) has no data mutation and may be discarded. The recalculation node set thus screened is (Gather).
Regarding eliminating excessive memory occupation caused by data mutation: as shown in fig. 12, fig. 12 (A) shows the execution order of the operators before recalculation, the life cycles of the input and output tensors of each operator, and the peak memory of each operator at execution time, and fig. 12 (B) shows the same after recalculation. Comparing fig. 12 (A) and (B), when operator 4 is executing, the peak memory before recalculation is T1+T2+T3+T4, and the peak memory after recalculation is T0+T2+T3+T4. If T1 minus T0 is greater than the data threshold set in the earlier stage, and operator 1 is one recalculation node set in the determined recalculation candidate set, then when operator 4 executes, the peak memory after recalculation is significantly smaller than the peak memory before recalculation; the peak memory can therefore be significantly reduced in this case, and operator 1 can be used as a screened recalculation node set. If T1 minus T0 is smaller than the data threshold, the difference between the peak memory after recalculation and before recalculation when operator 4 executes is small, so operator 1 may or may not be used as a screened recalculation node set.
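A minimal sketch of the data-mutation test follows, using the difference as the deviation value; the byte sizes are invented for illustration.

```python
# Sketch: keep a recalculation node set when its output tensor exceeds its
# input tensor(s) in memory by at least the data threshold.

def has_data_mutation(output_bytes, input_bytes, data_threshold):
    """Difference-based deviation; a ratio-based test would work similarly."""
    return output_bytes - input_bytes >= data_threshold

# Candidate (Gather): tensor B is far larger than tensor A -> keep.
print(has_data_mutation(output_bytes=1000, input_bytes=10, data_threshold=100))  # True
# Candidate (Gather-ScatterAdd-Div): tensor E is close to tensor A -> discard.
print(has_data_mutation(output_bytes=12, input_bytes=10, data_threshold=100))    # False
```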
c) Operator fusion rule-based screening
When screening recalculation node sets based on operator fusion rules, as shown in fig. 13 (A), in S1311 it may be judged, based on preset operator fusion rules, whether a recalculation node set in the recalculation candidate set can be fused with the consumer corresponding to that recalculation node set. At S1312, if they can be fused, the recalculation node set is retained. At S1313, if they cannot be fused, the recalculation node set may be discarded. In one example, S1311 may also be understood as judging whether the node in the recalculation node set that outputs a tensor externally can be fused with its corresponding consumer.
As another possible implementation, as shown in fig. 13 (B), at S1321 it may be judged, based on preset operator fusion rules, whether a recalculation node set in the recalculation candidate set can be fused with its corresponding consumer. At S1322, if they can be fused, the recalculation node set is retained. At S1323, if they cannot be fused, it may be judged whether an indirect path exists between the producer and the consumer corresponding to the recalculation node set. When an indirect path exists, the recalculation node set may be retained, i.e., S1322 is executed. At S1324, when no indirect path exists, the recalculation node set may be discarded. The recalculation node sets are thus screened out. When the operator fusion rules are not met but an indirect path exists, recalculation can still shorten the life cycle of a tensor occupying excessive memory, bringing a possible benefit of reducing peak memory. In one example, S1321 may also be understood as judging whether the node in the recalculation node set that outputs a tensor externally can be fused with its corresponding consumer.
For example, with continued reference to fig. 12, assume that the recalculation node set in the determined recalculation candidate set is operator 1, and operator 1 and operator 5 cannot be fused. Since the fusion rules are not met between operator 1 and operator 5 but an indirect path exists between them, operator 1 can still be used as a required recalculation node set. Comparing fig. 12 (A) and (B), if the memory T1 occupied by the tensor output by operator 1 is too large, the life cycle of that tensor is greatly shortened in fig. 12 (B), and the peak memory corresponding to the whole calculation graph is then obviously reduced.
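The fig. 13 (B) decision reduces to two booleans; the fusibility flag stands in for the preset operator fusion rules and is an assumption of the sketch.

```python
# Sketch of operator-fusion-rule screening, fig. 13 (B) variant.

def keep_by_fusion_rule(can_fuse_with_consumer, has_indirect_path):
    if can_fuse_with_consumer:      # S1322: keep
        return True
    return has_indirect_path        # S1323: keep only if an indirect path exists

# Fig. 12 example: operator 1 cannot fuse with operator 5, but an indirect
# path exists, so operator 1 is still kept as a recalculation node set.
print(keep_by_fusion_rule(can_fuse_with_consumer=False, has_indirect_path=True))  # True
```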
D) Determining a recalculation map
After the recalculation node sets are screened out, a recalculation graph can be determined from each determined recalculation node set. For example, an edge between a recalculation node set and the consumer corresponding to the producer to which the recalculation node set belongs may be taken as a recalculation point in the calculation graph. The producer subgraphs corresponding to the recalculation points are then copied, where the number of copied producer subgraphs is equal to or greater than the number of screened recalculation node sets. Finally, in each producer subgraph, all nodes other than the recalculation node set corresponding to that producer subgraph are deleted, along with the edges unrelated to that recalculation node set, thereby obtaining the recalculation graph corresponding to each recalculation node set. Because the recalculation node sets corresponding to the recalculation graphs obtained in this way have a low probability of duplication, repeated computation is reduced, the computation amount is reduced, and recalculation overhead is saved. In one example, deleting all nodes in a producer subgraph other than its corresponding recalculation node set may be understood as deleting the operators in the producer subgraph that are unrelated to the tensor output by the recalculation node set.
In one example, for any recalculation node set, the producer to which that recalculation node set belongs may be copied directly in the calculation graph, obtaining the producer subgraph corresponding to that recalculation node set. Then, all nodes in the producer subgraph other than the corresponding recalculation node set are deleted, along with the edges unrelated to that recalculation node set, obtaining the recalculation graph corresponding to each recalculation node set.
For example, as shown in fig. 14, fig. 14 (A) is a fusion subgraph obtained by performing primary operator fusion on an initial calculation graph. If the memory occupied by tensor B in fig. 14 (A) is higher than the memory threshold configured by the user, it may be determined that the recalculation node set is (Gather) and that the edge corresponding to tensor B is a recalculation point. Next, the producer subgraph corresponding to the recalculation point may be copied, i.e., FuseOp1 is copied, obtaining the fusion operator shown in fig. 14 (B). Copying the producer subgraph corresponding to the recalculation point can be understood as copying, from the fusion subgraph obtained by primary operator fusion, the fusion operator that outputs tensor B. Next, in fig. 14 (B), the operators other than the recalculation node set Gather, namely ScatterAdd, may be deleted, together with the edges unrelated to the recalculation node set Gather, namely the edge between the operators ScatterAdd and Gather and the edge corresponding to the output tensor of the operator ScatterAdd, thereby obtaining the recalculation graph shown in fig. 14 (C).
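The fig. 14 pruning can be sketched as follows; the subgraph encoding (node to input/output tensor lists) is an assumption made for the sketch.

```python
# Sketch of step D): copy the producer subgraph, then delete the nodes outside
# the recalculation node set together with their edges.
import copy

producer_subgraph = {                      # FuseOp1 from fig. 14 (A)
    "Gather": {"inputs": ["A"], "outputs": ["B"]},
    "ScatterAdd": {"inputs": ["B"], "outputs": ["C"]},
}
recalc_node_set = {"Gather"}

def build_recalc_graph(subgraph, keep):
    g = copy.deepcopy(subgraph)            # copy the producer subgraph
    for node in list(g):
        if node not in keep:               # delete ScatterAdd and, with it,
            del g[node]                    # the edges unrelated to Gather
    return g

print(build_recalc_graph(producer_subgraph, recalc_node_set))
# {'Gather': {'inputs': ['A'], 'outputs': ['B']}}  -> fig. 14 (C)
```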
E) Generating a new computational graph
After the recalculation graphs are obtained, each recalculation graph is combined with the original calculation graph according to the relationships between the nodes in each recalculation graph and the nodes in the original calculation subgraph, generating a new calculation graph. A data dependency exists between the node of the recalculation graph that inputs a first tensor and the node of the original calculation subgraph that outputs that first tensor, and a data dependency exists between the node of the recalculation graph that outputs a second tensor and the node of the original calculation subgraph that inputs that second tensor. It can be understood that, for the original calculation subgraph: if operator fusion has been performed on the original calculation graph before the recalculation step, the original calculation subgraph is the subgraph obtained after operator fusion; if no operator fusion has been performed before the recalculation step, it is the original calculation graph itself.
For example, a directed edge between a node of each recalculation graph inputting the first target tensor and a node of the original calculation graph outputting the first target tensor may be respectively constructed, and a directed edge between a node of each recalculation graph outputting the second target tensor and a node of the original calculation graph inputting the second target tensor may be respectively constructed, so as to obtain a new calculation graph.
For example, as shown in FIG. 15, fig. 15 (A) is the initial calculation graph (i.e., the original calculation graph); after primary operator fusion is performed on it, the primary fusion subgraph shown in fig. 15 (B) can be obtained. After the primary fusion subgraph shown in fig. 15 (B) is recalculated, the recalculation graph shown in fig. 15 (C) can be obtained. Since the input of the recalculation graph shown in fig. 15 (C) is tensor A and its output is tensor B, and the operator outputting tensor A in fig. 15 (B) is the operator Slice while the operator inputting tensor B is the operator FuseOp2, the node of the recalculation graph in fig. 15 (C) that inputs tensor A can be connected to the operator Slice in fig. 15 (B), i.e., a directed edge between the two can be constructed, and the node of the recalculation graph in fig. 15 (C) that outputs tensor B can be connected to the operator FuseOp2 in fig. 15 (B), i.e., a directed edge between the two can be constructed. A new calculation graph, shown in fig. 15 (D), is thus obtained by combining the recalculation graph in fig. 15 (C) with the primary fusion subgraph in fig. 15 (B).
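The merge of fig. 15 amounts to adding two directed edges; the edge-list encoding and the name Gather' for the copied node are assumptions made for the sketch.

```python
# Sketch of step E): wire the recalculation graph into the fused graph.

edges = [                                   # primary fusion subgraph, fig. 15 (B)
    ("Slice", "FuseOp1", "A"),
    ("FuseOp1", "MatMul", "C"),
    ("MatMul", "FuseOp2", "D"),
]

def merge_recalc_graph(edges, recalc_node, in_tensor, out_tensor,
                       producer_of, consumer_of):
    merged = list(edges)
    # Directed edge: original producer of A -> node of the recalc graph inputting A.
    merged.append((producer_of[in_tensor], recalc_node, in_tensor))
    # Directed edge: node of the recalc graph outputting B -> original consumer of B.
    merged.append((recalc_node, consumer_of[out_tensor], out_tensor))
    return merged

new_graph = merge_recalc_graph(edges, "Gather'", "A", "B",
                               producer_of={"A": "Slice"},
                               consumer_of={"B": "FuseOp2"})
print(new_graph[-2:])  # [('Slice', "Gather'", 'A'), ("Gather'", 'FuseOp2', 'B')]
```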
(3) Secondary operator fusion
Operator fusion (i.e., secondary fusion) is performed on the obtained new calculation graph to obtain a new fusion subgraph, thereby obtaining the optimized calculation graph. For example, the calculation graph may be divided into a plurality of calculation subgraphs according to the classes of the operators in the calculation graph (such as computation-intensive, memory-access-intensive, etc.) and the back-end features of the hardware, where each calculation subgraph may correspond to one operator obtained after fusion (hereinafter referred to as a "fusion operator"). The optimized calculation graph, i.e., the secondary fusion subgraph, is thereby obtained; this is the finally optimized calculation graph, which can then be executed. In one example, the finally optimized calculation graph may first be compiled and the compiled calculation graph then executed, or the calculation graph may be executed directly; in either case, what is executed is essentially the optimized calculation graph, only in a different form.
For example, as shown in fig. 16, a new calculation map obtained after recalculation is shown in fig. 16 (a). If the preset operator fusion rule is: gather and Mul are fused, and then the fusion subgraph shown in (B) of FIG. 16 can be obtained after the operator fusion of (A) of FIG. 16 is performed based on the operator fusion rule.
In some embodiments, the nodes described in the embodiments of the present application may refer to, but are not limited to, operators, and the operators described in the embodiments of the present application may likewise refer to, but are not limited to, nodes, depending on the actual situation.
In this way, a fusion subgraph is obtained by performing operator fusion on the calculation graph; the recalculation graph that needs to be recalculated is then determined from the fusion subgraph; the determined recalculation graph is combined with the fusion subgraph to generate a new calculation graph; and finally operator fusion is performed on the new calculation graph to obtain the required calculation graph. Memory occupation is thereby remarkably reduced without introducing large recalculation overhead, solving the problem that a network with one or more ultra-large tensors cannot execute.
For ease of understanding, the process of reducing memory usage by the above scheme is illustrated below.
Illustratively, as shown in fig. 17, M and L represent the memory sizes occupied by the tensors output by operators, where M represents a medium size, such as 56.8 MB, and L represents an oversized tensor, such as 27 GB. The calculation graph shown in fig. 17 (A) is a calculation graph without any processing, i.e., the initial calculation graph. In fig. 17 (A), the operator Slice and the operator ScatterAdd multiplex one memory, and the operator Gather and the operator Div multiplex one memory, so the memory size required for the calculation graph shown in fig. 17 (A) is: M1+M2+L1+L2.
The calculation graph shown in fig. 17 (B) is obtained by performing primary operator fusion on the calculation graph shown in fig. 17 (A) based on the preset operator fusion rules, i.e., it is the primary fusion subgraph. In fig. 17 (B), the operator Slice and the operator MatMul multiplex one memory, so the memory size required for the calculation graph shown in fig. 17 (B) is: M1+M2+L1+L2.
The calculation graph shown in fig. 17 (C) is the new calculation graph, obtained by recalculating the calculation graph shown in fig. 17 (B). In fig. 17 (C), memory cannot be multiplexed between operators, so the memory size required for the calculation graph shown in fig. 17 (C) is: M1+M2+M3+L1+L2.
The calculation graph shown in fig. 17 (D) is obtained by performing secondary operator fusion on the calculation graph shown in fig. 17 (C) based on the preset operator fusion rules, i.e., it is the secondary fusion subgraph, and it is the required calculation graph. In fig. 17 (D), memory cannot be multiplexed between operators, so the memory size required for the calculation graph shown in fig. 17 (D) is: M1+M2+M3+L1.
Assuming that M1=M2=M3 and L1=L2, the memory size occupied by the calculation graph shown in fig. 17 (A) is 2M+2L, that shown in fig. 17 (B) is 2M+2L, that shown in fig. 17 (C) is 3M+2L, and that shown in fig. 17 (D) is 3M+L. Since M represents a medium size and L represents an oversized tensor, the calculation graph shown in fig. 17 (D) occupies (L-M) less memory than the calculation graph shown in fig. 17 (A). Comparing fig. 17 (A) and (B), it can also be seen that the memory size occupied by the calculation graph does not change after primary operator fusion.
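The saving can be checked with a few lines of arithmetic, taking the illustrative sizes M = 56.8 MB and L = 27 GB from above and treating them as binary megabytes and gigabytes (an assumption of the sketch).

```python
# Sketch: memory accounting for fig. 17 with the example tensor sizes.
M = 56.8 * 2**20            # medium tensor
L = 27 * 2**30              # oversized tensor

before = 2 * M + 2 * L      # fig. 17 (A) and (B)
after = 3 * M + L           # fig. 17 (D), the optimized graph
print(f"saved {(before - after) / 2**30:.1f} GiB")  # about L - M, i.e. 26.9 GiB
```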
Further, in the recalculation process, the recalculation points can be determined based on the memory size, data mutation, fusion rules, and other manners, so that not only can memory occupation be effectively reduced and the efficiency of identifying recalculation points improved, but tensors occupying larger memory can also be converted into internal storage (such as register space).
Next, the calculation graph optimization method provided in the embodiments of the present application is described based on the calculation graph optimization scheme described above. It will be appreciated that this method may be performed by any apparatus, device, platform, or cluster of devices having computing and processing capabilities, and that it is another expression of the calculation graph optimization scheme described above; the two can be understood in combination. For part or all of the method, reference may be made, but is not limited, to the description of that scheme.
Referring to fig. 18, fig. 18 is a flowchart of a calculation map optimization method according to an embodiment of the present application. It is understood that the method may be performed by any apparatus, device, platform, cluster of devices having computing, processing capabilities. As shown in fig. 18, the calculation map optimization method may include the steps of:
s1801, obtaining a second calculation graph from a first calculation graph based on a data dependency relationship between parameters and nodes in the first calculation graph, wherein the parameters comprise one or more of an operator fusion rule, a memory threshold value of memory occupied by a tensor output by a single node in the first calculation graph, a peak memory threshold value corresponding to the first calculation graph and a data mutation threshold value corresponding to the node in the first calculation graph, the peak memory threshold value is a threshold value of memory occupied by all tensors required to be used at the moment of executing one node in the calculation graph executing process, and the data mutation threshold value is a threshold value of data mutation generated by at least one node in the first calculation graph.
Specifically, when optimizing the calculation graph, the second calculation graph may be derived from the first calculation graph based on the parameters and the data dependency relationships between the nodes in the first calculation graph. The parameters include one or more of the operator fusion rules, the memory threshold of the memory occupied by a tensor output by a single node in the first calculation graph, the peak memory threshold corresponding to the first calculation graph, and the data mutation threshold corresponding to the nodes in the first calculation graph. The peak memory threshold is a threshold on the memory occupied by all tensors that need to be used at the moment one node is executed during execution of the calculation graph, and the data mutation threshold is a threshold on the data mutation generated by at least one node in the first calculation graph. For example, when the computing framework corresponding to the calculation graph is the MindSpore AI open source computing framework, the user may set the memory threshold, the peak memory threshold, and the data mutation threshold through the Context module configured in MindSpore. For example, the data mutation threshold may also be referred to as the data threshold. In some embodiments, a single node may be understood as one node, or as a node obtained by fusing a plurality of nodes.
As a possible implementation manner, when obtaining the second calculation graph, N first node sets may be obtained from producers included in the first calculation graph based on the data dependency relationship between the parameters and the nodes in the first calculation graph, where N is a positive integer greater than or equal to 1, and the first node sets include at least one node, where the nodes included in one first node set are directly related or indirectly related to one tensor output by one producer in the first calculation graph, and are directly related or indirectly related to at least one tensor input by one producer in the first calculation graph. Then, each first node set is recalculated to obtain N recalculation subgraphs, wherein the N recalculation subgraphs form a second calculation graph. Thus, a second calculation map is obtained. For example, the second computational graph may include N recalculation sub-graphs. Illustratively, the first set of nodes may be understood as the desired set of recalculation nodes described above, and the recalculation subgraph may be understood as the recalculation graph described above.
In one example, the recalculation of each first node set to obtain N recalculation subgraphs may be accomplished as follows. Specifically, for any node set in the N first node sets, a producer sub-graph (which may also be referred to as a producer) corresponding to the any node set may be copied first. Then, deleting nodes except the node contained in any node set in the producer subgraph and deleting edges irrelevant to any node set in the producer subgraph, thereby obtaining a recalculation subgraph corresponding to any node set. And traversing each node set in the N first node sets to obtain N recalculation sub-graphs.
For obtaining the first node set, a first list may be obtained based on the data dependency relationship between the nodes in the first computation graph, where the first list includes a correspondence relationship between the producer and the consumer in the first computation graph. Illustratively, the first list may be the producer-consumer list described above, which may be as shown in Table 1 above, and this step may be understood as the step of constructing the producer-consumer list in the recalculation step described above.
Then, a candidate node set is obtained according to the tensor output by each producer in the first list, wherein the candidate node set at least comprises the first node set, and for any producer in the first list, the set of nodes which are in any producer and are directly related and indirectly related to the tensor output by any producer is taken as one node set. This step is understood, among others, to be the step of determining the set of recalculation alternatives in the recalculation step described above. Wherein the set of alternative nodes may be understood as the recalculated set of alternatives described above.
Finally, a first node set is selected from the candidate node set based on the parameters. This step may be understood as the step, in the recalculation process described above, of screening recalculation node sets from the recalculation candidate set based on one or more of memory size, data mutation, operator fusion rules, and the like.
In one example, the parameter is the memory threshold; the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold. In this way, the node set whose output tensor occupies excessive memory can automatically be used as the required node set, addressing the problem of a single operator and/or tensor occupying excessive memory.
In one example, the parameter is the peak memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory of each node between the producer and the consumer corresponding to the first node set when executed is higher than the peak memory threshold. In this way, the node set corresponding to excessive peak memory occupation can automatically be used as the required node set, addressing the problem of a single operator and/or tensor occupying excessive memory.
In one example, the parameter is a data mutation threshold; the offset value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data abrupt change threshold. Thus, the node set with the data mutation can be automatically used as the needed node set, and the problem of overlarge memory occupation caused by the data mutation is solved.
In one example, the parameter is an operator fusion rule; the first node set and the consumers corresponding to the first node set can be fused; alternatively, the first set of nodes cannot be fused with the consumer corresponding to the first set of nodes, and an indirect path exists between the producer and consumer corresponding to the first set of nodes. Thus, the needed node set can be screened out in a mode of operator fusion rules.
In one example, the parameters are a memory threshold and a peak memory threshold; an indirect path does not exist between a producer and a consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than a memory threshold; and/or, an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory of each node between the producer and the consumer corresponding to the first node set is higher than the peak memory threshold value when the node is executed. Therefore, the required node set can be screened out in a mode of combining the memory threshold value and the peak memory threshold value, and the screening efficiency and accuracy are improved.
In one example, the parameters are a memory threshold and a data mutation threshold; the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, and/or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold. Therefore, the required node set can be screened out in a mode of combining the memory threshold value and the data mutation threshold value, and the screening efficiency and accuracy are improved.
In one example, the parameters are memory thresholds and operator fusion rules; the first set of nodes meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or the consumers corresponding to the first node set and the first node set can be fused, or the consumers corresponding to the first node set and the first node set can not be fused, and an indirect path exists between the producer and the consumer corresponding to the first node set. Therefore, the needed node set can be screened out in a mode of combining the memory threshold value and the operator fusion rule, and the screening efficiency and accuracy are improved.
In one example, the parameters are a peak memory threshold and a data mutation threshold; an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory of each node between the producer and the consumer corresponding to the first node set is higher than a peak memory threshold value when the node is executed; and/or, a deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than a data mutation threshold value. Therefore, the required node set can be screened out in a mode of combining the peak memory threshold value and the data mutation threshold value, and the screening efficiency and accuracy are improved.
In one example, the parameters are the peak memory threshold and the operator fusion rule; the first set of nodes meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory of each node between the producer and the consumer corresponding to the first node set when executed is higher than the peak memory threshold, or the first node set and the consumer corresponding to the first node set can be fused, or the first node set and the consumer corresponding to the first node set cannot be fused and an indirect path exists between the producer and the consumer corresponding to the first node set. Therefore, the required node set can be screened out by combining the peak memory threshold with the operator fusion rule, improving screening efficiency and accuracy.
In one example, the parameters are a data mutation threshold and an operator fusion rule; the first set of nodes meets one or more of the following conditions: the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold, or the consumers corresponding to the first node set and the first node set can be fused, or the consumers corresponding to the first node set and the first node set can not be fused, and an indirect path exists between the producer corresponding to the first node set and the consumers. Therefore, the needed node set can be screened out in a mode of combining the data mutation threshold value and the operator fusion rule, and the screening efficiency and accuracy are improved.
In one example, the parameters are a memory threshold, a peak memory threshold, and a data mutation threshold; the first set of nodes meets one or more of the following conditions: there is no indirect path between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or there is an indirect path between the producer and the consumer corresponding to the first node set, and the peak memory when each node between the producer and the consumer corresponding to the first node set executes is higher than the peak memory threshold, or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. Therefore, the required node set can be screened out in a mode of combining the memory threshold, the peak memory threshold and the data mutation threshold, and the screening efficiency and accuracy are improved.
In one example, the parameters are a memory threshold, a peak memory threshold, and an operator fusion rule; the first set of nodes meets one or more of the following conditions: there is no indirect path between the producer and consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or there is an indirect path between the producer and consumer corresponding to the first node set, and the peak memory when each node between the producer and consumer corresponding to the first node set executes is higher than the peak memory threshold, or the consumer corresponding to the first node set and the first node set can be fused, or the consumer corresponding to the first node set and the first node set cannot be fused, and there is an indirect path between the producer and the consumer corresponding to the first node set. Therefore, the required node set can be screened out in a mode of combining the memory threshold, the peak memory threshold and the operator fusion rule, and the screening efficiency and accuracy are improved.
In one example, the parameters are memory thresholds, data mutation thresholds, and operator fusion rules; the first set of nodes meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold, or the consumers corresponding to the first node set and the first node set can be fused, or the consumers corresponding to the first node set and the first node set cannot be fused, and an indirect path exists between the producer and the consumer corresponding to the first node set. Therefore, the needed node set can be screened out in a mode of combining the memory threshold, the data mutation threshold and the operator fusion rule, and the screening efficiency and accuracy are improved.
In one example, the parameters are a peak memory threshold, a data mutation threshold, and an operator fusion rule; the first set of nodes meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set, the peak memory when each node between the producer and the consumer corresponding to the first node set executes is higher than a peak memory threshold, or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than a data mutation threshold, or the consumer corresponding to the first node set and the first node set can be fused, or the consumer corresponding to the first node set and the first node set cannot be fused, and an indirect path exists between the producer and the consumer corresponding to the first node set. Therefore, the required node set can be screened out in a mode of combining the peak memory threshold value, the data mutation threshold value and the operator fusion rule, and the screening efficiency and accuracy are improved.
In one example, the parameters are a memory threshold, a peak memory threshold, a data mutation threshold, and an operator fusion rule; the first set of nodes meets one or more of the following conditions: there is no indirect path between the producer and consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or there is an indirect path between the producer and consumer corresponding to the first node set, and the peak memory when each node between the producer and consumer corresponding to the first node set executes is higher than the peak memory threshold, or the offset value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold, or the consumer corresponding to the first node set and the consumer corresponding to the first node set can be fused, or the first node set and the consumer corresponding to the first node set can not be fused, and there is an indirect path between the producer and the consumer corresponding to the first node set. Therefore, the required node set can be screened out in a mode of combining the memory threshold value, the peak memory threshold value, the data mutation threshold value and the operator fusion rule, and the screening efficiency and accuracy are improved.
In some embodiments, the operators in the first computation graph may be fused before S1801, which improves the performance of the operators in the computation graph. In addition, this first operator fusion reduces the number of operators, lowering the cost of analyzing operators in the subsequent recalculation process, and it enlarges the scope over which operators can be analyzed during recalculation, improving the recalculation effect. Illustratively, the operators in the first computation graph may be fused based on, but not limited to, a preset operator fusion rule.
After the second computation graph is obtained, S1802 may be executed.
S1802, merging the second computation graph with the first computation graph to obtain a third computation graph, wherein in the third computation graph, a first directed edge is arranged between a first node outputting the first tensor and a second node inputting the first tensor, a second directed edge is arranged between a third node outputting the second tensor and a fourth node inputting the second tensor, the first directed edge points from the first node to the second node, the second directed edge points from the third node to the fourth node, the first node and the fourth node correspond to nodes in the first computation graph, and the second node and the third node correspond to nodes in the second computation graph.
Specifically, after the second computation graph is obtained, it may be combined with the first computation graph to obtain the third computation graph, with the directed edges arranged as described above. The process of obtaining the third computation graph may be understood as the step of generating a new computation graph in the recalculation procedure described earlier.
In one example, when the second computation graph is composed of N recalculation subgraphs, merging the second computation graph with the first computation graph may comprise: for each of the N recalculation subgraphs, constructing a directed edge between the node of the first computation graph that outputs a first target tensor and the node of the recalculation subgraph that inputs the first target tensor, and constructing a directed edge between the node of the recalculation subgraph that outputs a second target tensor and the node of the first computation graph that inputs the second target tensor, thereby obtaining the third computation graph, as sketched below.
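As a hedged illustration of this merging step, the sketch below wires copied recalculation subgraphs back into the original graph along the shared tensors. Subgraph, producer_of, and consumer_of are invented names for this example only, not structures defined by the application:

```python
from dataclasses import dataclass, field

@dataclass
class Subgraph:
    nodes: list
    edges: set = field(default_factory=set)           # (src, dst) node-id pairs
    input_nodes: dict = field(default_factory=dict)   # tensor name -> node consuming it
    output_nodes: dict = field(default_factory=dict)  # tensor name -> node producing it

def merge_graphs(first, recalc_subgraphs, producer_of, consumer_of):
    """Build the third graph. producer_of[t] / consumer_of[t] are the nodes of
    the first graph that output / input tensor t."""
    third = Subgraph(nodes=list(first.nodes), edges=set(first.edges))
    for sub in recalc_subgraphs:
        third.nodes += sub.nodes
        third.edges |= sub.edges
        for t, n in sub.input_nodes.items():
            third.edges.add((producer_of[t], n))      # first directed edge: original producer -> copy
        for t, n in sub.output_nodes.items():
            third.edges.add((n, consumer_of[t]))      # second directed edge: copy -> original consumer
    return third
```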
S1803, fusing operators in the third computation graph to obtain a fourth computation graph.
Specifically, the operators in the third computation graph may be fused based on a preset operator fusion rule, so as to obtain a fourth computation graph. Illustratively, the fourth computation graph may also be referred to as the optimized computation graph. A toy illustration of such a rule-based fusion pass follows.
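The fusion rule itself is implementation-defined; as one deliberately simple, made-up example (not the rule used by this application), the toy pass below greedily fuses runs of consecutive element-wise operators in a topologically ordered operator list. The same kind of pass could also serve as the optional pre-fusion applied before S1801:

```python
ELEMENTWISE = {"add", "mul", "relu", "cast"}          # toy rule: these ops may fuse

def fuse_operators(ops):
    """ops: topologically ordered list of (name, op_type, inputs) tuples.
    Consecutive element-wise operators are grouped; each group would become
    one fused operator in the fourth graph."""
    groups, current = [], []
    for op in ops:
        if current and op[1] in ELEMENTWISE and current[-1][1] in ELEMENTWISE:
            current.append(op)                        # extend the current fusion group
        else:
            if current:
                groups.append(current)                # close the previous group
            current = [op]
    if current:
        groups.append(current)
    return groups

# matmul stays alone; add and relu fuse into one group:
print(fuse_operators([("m", "matmul", ["x", "w"]),
                      ("a", "add", ["m", "b"]),
                      ("r", "relu", ["a"])]))
```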
S1804, executing the fourth computation graph.
Specifically, after the fourth computation graph is obtained, it may be executed.
In one example, to execute the fourth computation graph, the graph may first be compiled and the compiled graph then executed, or the graph may be executed directly; in either case, the fourth computation graph is executed in substance, and only the form of execution differs.
In this way, recalculation is performed on the original computation graph to obtain a recalculation graph, the recalculation graph and the original computation graph are combined into a new computation graph, and operator fusion is then applied to the new computation graph to obtain the optimized computation graph; thereafter, the optimized computation graph may be executed. Optimizing the computation graph by combining recalculation with operator fusion significantly reduces memory occupation without introducing large recalculation overhead, and solves the problem that a network with one or more ultra-large tensors cannot execute. A purely illustrative end-to-end driver follows.
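Putting S1801-S1804 together, a driver might look as follows; every helper passed in is a placeholder meant only to show the call order, not a real framework API:

```python
def optimize(first_graph, params, recalc_pass, merge_pass, fusion_pass, executor):
    second = recalc_pass(first_graph, params)         # S1801: derive the recalculation graph
    third = merge_pass(first_graph, second)           # S1802: combine into a new graph
    fourth = fusion_pass(third)                       # S1803: operator fusion
    return executor(fourth)                           # S1804: compile-and-run, or run directly

# Trivial stand-ins, just to exercise the call order:
result = optimize("G1", {},
                  lambda g, p: "G2",
                  lambda g1, g2: "G3",
                  lambda g: "G4",
                  lambda g: f"executed {g}")
print(result)  # -> executed G4
```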
Based on the method described in the above embodiments, the embodiments of the present application further provide a chip. Referring to fig. 19, fig. 19 is a schematic structural diagram of a chip according to an embodiment of the present application. As shown in fig. 19, chip 1900 includes one or more processors 1901 and interface circuits 1902. Optionally, the chip 1900 may also include a bus 1903. Wherein:
the processor 1901 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuits in hardware or by software instructions in the processor 1901. The processor 1901 may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods and steps disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The interface circuit 1902 may be used for sending or receiving data, instructions, or information. The processor 1901 may process the data, instructions, or other information received through the interface circuit 1902, and may send processing-completion information out through the interface circuit 1902.
Optionally, the chip further comprises a memory, which may include read only memory and random access memory, and provides operating instructions and data to the processor. A portion of the memory may also include non-volatile random access memory (NVRAM).
Optionally, the memory stores executable software modules or data structures and the processor may perform corresponding operations by invoking operational instructions stored in the memory (which may be stored in an operating system).
Optionally, the interface circuit 1902 may be used to output the results of execution by the processor 1901.
The functions corresponding to the processor 1901 and the interface circuit 1902 may be implemented by a hardware design, a software design, or a combination of hardware and software, which is not limited herein.
It will be appreciated that the steps of the method embodiments described above may be performed by logic circuitry in the form of hardware in a processor or instructions in the form of software.
It is to be appreciated that the processor in embodiments of the present application may be a central processing unit (central processing unit, CPU), but may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), field programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. The general purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
The method steps in the embodiments of the present application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.

Claims (26)

1. A method of computational graph optimization, the method comprising:
obtaining a second computation graph from a first computation graph based on a data dependency relationship between parameters and nodes in the first computation graph, wherein the parameters comprise one or more of an operator fusion rule, a memory threshold of the memory occupied by a tensor output by a single node in the first computation graph, a peak memory threshold corresponding to the first computation graph, and a data mutation threshold corresponding to the nodes in the first computation graph, wherein the peak memory threshold is a threshold of the memory occupied by the tensors that need to be used at the moment one node is executed during execution of the computation graph, and the data mutation threshold is a threshold of the data mutation generated by at least one node in the first computation graph;
merging the second computation graph with the first computation graph to obtain a third computation graph, wherein in the third computation graph, a first directed edge is arranged between a first node outputting a first tensor and a second node inputting the first tensor, a second directed edge is arranged between a third node outputting a second tensor and a fourth node inputting the second tensor, the first directed edge points from the first node to the second node, the second directed edge points from the third node to the fourth node, the first node and the fourth node correspond to nodes in the first computation graph, and the second node and the third node correspond to nodes in the second computation graph;
fusing operators in the third computation graph to obtain a fourth computation graph;
and executing the fourth computation graph.
2. The method according to claim 1, wherein the obtaining a second computation graph from the first computation graph based on the data dependency between the parameters and the nodes in the first computation graph specifically includes:
obtaining N first node sets from producers contained in the first computation graph based on the data dependency relationship between the parameters and the nodes in the first computation graph, wherein N is a positive integer greater than or equal to 1, each first node set comprises at least one node, and the nodes contained in one first node set are directly or indirectly related to one tensor output by one producer in the first computation graph and directly or indirectly related to at least one tensor input by that producer;
and re-calculating each first node set to obtain N re-calculation sub-graphs, wherein the N re-calculation sub-graphs form the second calculation graph.
3. The method of claim 2, wherein the parameter is the memory threshold;
and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold.
4. The method of claim 2, wherein the parameter is the peak memory threshold;
an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory of each node between the producer and the consumer corresponding to the first node set is higher than the peak memory threshold value when the node is executed.
5. The method of claim 2, wherein the parameter is the data mutation threshold;
and the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold value.
6. The method of claim 2, wherein the parameter is the operator fusion rule;
the first node set and the consumer corresponding to the first node set can be fused;
or, the first node set and the consumer corresponding to the first node set cannot be fused, and an indirect path exists between the producer and the consumer corresponding to the first node set.
7. The method of claim 2, wherein the parameters are the memory threshold and the peak memory threshold;
an indirect path does not exist between a producer and a consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold;
and/or, an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory of each node between the producer and the consumer corresponding to the first node set is higher than the peak memory threshold value when the node is executed.
8. The method of claim 2, wherein the parameters are the memory threshold and the data mutation threshold;
and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, and/or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold.
9. The method of claim 2, wherein the parameters are the memory threshold and the operator fusion rule;
The first set of nodes meets one or more of the following conditions:
the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or the first node set and the consumer corresponding to the first node set can be fused, or the first node set and the consumer corresponding to the first node set cannot be fused and an indirect path exists between the producer and the consumer corresponding to the first node set.
10. The method of claim 2, wherein the parameters are the peak memory threshold and the data mutation threshold;
an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory of each node between the producer and the consumer corresponding to the first node set is higher than the peak memory threshold value when the node is executed;
and/or, a deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold.
11. The method of claim 2, wherein the parameters are the peak memory threshold and the operator fusion rule;
The first set of nodes meets one or more of the following conditions:
an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory when each node between the producer and the consumer corresponding to the first node set executes is higher than the peak memory threshold, or the first node set and the consumer corresponding to the first node set can be fused, or the first node set and the consumer corresponding to the first node set cannot be fused and an indirect path exists between the producer and the consumer corresponding to the first node set.
12. The method of claim 2, wherein the parameters are the data mutation threshold and the operator fusion rule;
the first set of nodes meets one or more of the following conditions:
the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold, or the first node set and the consumer corresponding to the first node set can be fused, or the first node set and the consumer corresponding to the first node set cannot be fused and an indirect path exists between the producer and the consumer corresponding to the first node set.
13. The method of claim 2, wherein the parameters are the memory threshold, the peak memory threshold, and the data mutation threshold;
the first set of nodes meets one or more of the following conditions:
and no indirect path exists between the producer and the consumer corresponding to the first node set, the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or an indirect path exists between the producer and the consumer corresponding to the first node set, the peak memory of each node between the producer and the consumer corresponding to the first node set when executed is higher than the peak memory threshold, or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold.
14. The method of claim 2, wherein the parameters are the memory threshold, the peak memory threshold, and the operator fusion rule;
the first set of nodes meets one or more of the following conditions:
no indirect path exists between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory when each node between the producer and the consumer corresponding to the first node set executes is higher than the peak memory threshold, or the first node set and the consumer corresponding to the first node set can be fused, or the first node set and the consumer corresponding to the first node set cannot be fused and an indirect path exists between the producer and the consumer corresponding to the first node set.
15. The method of claim 2, wherein the parameters are the memory threshold, the data mutation threshold, and the operator fusion rule;
the first set of nodes meets one or more of the following conditions:
the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold, or the first node set and the consumer corresponding to the first node set can be fused, or the first node set and the consumer corresponding to the first node set cannot be fused and an indirect path exists between the producer and the consumer corresponding to the first node set.
16. The method of claim 2, wherein the parameters are the peak memory threshold, the data mutation threshold, and the operator fusion rule;
the first set of nodes meets one or more of the following conditions:
an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory when each node between the producer and the consumer corresponding to the first node set executes is higher than the peak memory threshold, or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold, or the first node set and the consumer corresponding to the first node set can be fused, or the first node set and the consumer corresponding to the first node set cannot be fused and an indirect path exists between the producer and the consumer corresponding to the first node set.
17. The method of claim 2, wherein the parameters are the memory threshold, the peak memory threshold, the data mutation threshold, and the operator fusion rule;
The first set of nodes meets one or more of the following conditions:
no indirect path exists between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, or an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory when each node between the producer and the consumer corresponding to the first node set executes is higher than the peak memory threshold, or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by the at least one input tensor corresponding to the first node set is higher than the data mutation threshold, or the first node set and the consumer corresponding to the first node set can be fused, or the first node set and the consumer corresponding to the first node set cannot be fused and an indirect path exists between the producer and the consumer corresponding to the first node set.
18. The method according to any one of claims 2-17, wherein the obtaining N first node sets from producers contained in the first computation graph based on the data dependency relationship between the parameters and the nodes in the first computation graph specifically comprises:
obtaining a first list based on the data dependency relationship between the nodes in the first computation graph, wherein the first list comprises the correspondence between producers and consumers in the first computation graph;
obtaining alternative node sets according to the tensor output by each producer in the first list, wherein the alternative node sets at least comprise the first node set, and for any producer in the first list, the set of nodes in that producer that are directly or indirectly related to the tensor output by that producer is taken as one alternative node set;
and selecting the first node set from the alternative node sets based on the parameters.
19. The method according to any one of claims 2-18, wherein said recalculating each of said first node sets to obtain N recalculation sub-graphs specifically comprises:
copying a producer subgraph corresponding to any node set of the N first node sets;
deleting, from the copied producer subgraph, the nodes other than the nodes contained in that node set, and deleting the edges irrelevant to that node set, to obtain the recalculation subgraph corresponding to that node set.
20. The method according to any one of claims 2-19, wherein the merging the second computational graph with the first computational graph to obtain a third computational graph, specifically comprises:
and respectively constructing a directed edge between a node of each recalculation subgraph of the N recalculation subgraphs, which inputs a first target tensor, and a node of the first calculation graph, which outputs the first target tensor, and a directed edge between a node of each recalculation subgraph of the N recalculation subgraphs, which outputs a second target tensor, and a node of the first calculation graph, which inputs the second target tensor, so as to obtain the third calculation graph.
21. The method according to any one of claims 1-20, wherein before the obtaining a second computation graph from the first computation graph based on the data dependency relationship between the parameters and the nodes in the first computation graph, the method further comprises:
and fusing operators in the first computational graph.
22. A computational graph optimization apparatus, comprising:
at least one memory for storing a program;
at least one processor for executing the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method of any one of claims 1-21.
23. An apparatus, comprising:
at least one memory for storing a program;
at least one processor for executing the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method of any one of claims 1-21.
24. A computer readable storage medium storing a computer program which, when run on an electronic device, causes the electronic device to perform the method of any one of claims 1-21.
25. A computer program product, characterized in that the computer program product, when run on an electronic device, causes the electronic device to perform the method of any of claims 1-21.
26. A chip comprising at least one processor and an interface;
the interface is used for providing program instructions or data for the at least one processor;
the at least one processor is configured to execute the program instructions to implement the method of any one of claims 1-21.