WO2024120050A1 - Operator fusion method used for neural network, and related apparatus - Google Patents
- Publication number: WO2024120050A1
- PCT application: PCT/CN2023/127261 (CN2023127261W)
- Authority: WIPO (PCT)
- Prior art keywords: fused, subgraph, computing, computing operation, graph
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
Definitions
- the present application relates to the field of artificial intelligence technology, and in particular to an operator fusion method and related devices for neural networks.
- By defining an artificial intelligence (AI) model and solving for the model's parameters (a process that can be called model training), a unique piece of computational logic is determined. After conversion, this computational logic can be applied to inference calculations (which can also be called model inference or model use). This computational logic can be represented by a graph: the computational graph.
- a computational graph usually contains many operators. During model training or inference, each operator incurs memory-access overhead, scheduling overhead, and execution overhead, so too many operators in the computational graph seriously degrade execution efficiency. The number of operators can usually be reduced by operator fusion.
- the present application provides an operator fusion method and related devices for a neural network, which improve the chip's resource utilization and the computing speed of the neural network model when the model runs on the chip, thereby improving the performance of the neural network in training or inference.
- an embodiment of the present application provides an operator fusion method for a neural network, in which:
- the computational graph is used to describe the connection relationship between multiple operators in the neural network model, and each of the multiple operators is used to perform at least one computing operation.
- determine at least two subgraphs to be fused in the computational graph, where each of the at least two subgraphs to be fused includes at least one operator.
- when the first subgraph to be fused satisfies a fusion condition, the at least two operators included in the first subgraph to be fused are fused into one operator; the first subgraph to be fused is any one of the at least two subgraphs to be fused.
- the utilization rate of the resources required by the first subgraph to be fused is related to the memory size of the set chip and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
- the computational graph is first divided into at least two subgraphs to be fused (any subgraph to be fused can be referred to as the first subgraph to be fused).
- when the first subgraph to be fused includes at least two operators and the utilization rate of the resources required to run it on the set chip is greater than the utilization rate threshold, the at least two operators included in the first subgraph to be fused are fused into one operator.
- since the utilization rate of the resources required to run the first subgraph to be fused on the set chip is related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused, the fusion process takes the memory of the set chip into account. This improves the chip's resource utilization and the computing speed of the neural network model when the model runs on the chip, and thus improves the performance of the neural network in training or inference.
- the operator fusion method for a neural network also includes: when the first subgraph to be fused does not meet the fusion condition, splitting the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused; when the second subgraph to be fused meets the fusion condition, fusing at least two operators included in the second subgraph to be fused into one operator; and/or, when the third subgraph to be fused meets the fusion condition, fusing at least two operators included in the third subgraph to be fused into one operator.
- when the first subgraph to be fused does not meet the fusion condition, this means that if the at least two operators included in the first subgraph to be fused were fused, the resulting neural network model would perform poorly when run on the chip and resource utilization would be low.
- in that case, the first subgraph to be fused is split into the second subgraph to be fused and the third subgraph to be fused, and the at least two operators included in the second subgraph to be fused and the at least two operators included in the third subgraph to be fused are each fused in combination with the fusion condition. This improves the resource utilization of the neural network model obtained after the chip executes the operator fusion, thereby improving the computing speed of the neural network model.
- splitting the first subgraph to be fused into the second subgraph to be fused and the third subgraph to be fused includes: starting from a first output-type computing operation of the first subgraph to be fused, traversing each computing operation of the first subgraph to be fused by reverse search, where the first output-type computing operation is any output-type computing operation of the first subgraph to be fused that satisfies the condition that the number of computing operations taking its operation result as input is less than a first quantity threshold; and, when the currently traversed second computing operation meets the splitting condition, taking the subgraph of the first subgraph to be fused that has the second computing operation as its root node as the second subgraph to be fused, and taking the part of the first subgraph to be fused other than that subgraph as the third subgraph to be fused. The second computing operation meets the splitting condition if its computing type is a set type, or if its computing type is the same as the computing type of the computing operation at its parent node.
- in this way, the types of the various computing operations included in the first subgraph to be fused are considered: when the computing type of the second computing operation is a set type, or is the same as the computing type of the computing operation at its parent node, the split is made at that operation, and the second subgraph to be fused and the third subgraph to be fused are obtained.
- the operator fusion method for a neural network further includes: determining the amount of resources required for the first subgraph to be fused to run on a set chip based on the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in at least two operators in the first subgraph to be fused.
- in this way, the utilization rate of the resources required for the first subgraph to be fused to run on the set chip is calculated in a manner that fully considers the performance of the chip, which improves the chip's resource utilization and thus the running speed of the neural network model.
- determining the amount of resources required for the first subgraph to be fused to run on the set chip based on the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in the at least two operators of the first subgraph to be fused includes: when the data types generated by the computing operations included in the at least two operators are the same, determining the number of memory shares required to run the first subgraph to be fused, the number of memory shares being the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output computing operations included in those operators, where an output computing operation is one whose computing result is used as input by fewer than a second quantity threshold of computing operations; and then determining, according to the memory size of the set chip, the number of memory shares required to run the first subgraph to be fused, and the running time corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation.
- the memory reuse capability supported by the operation instructions of the set chip is either reusable or non-reusable. Since the data types generated by each computing operation are the same, reuse is effective. Therefore, in the reusable case, the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output computing operations included in those operators is determined as the number of memory shares required to run the at least two operators of the first subgraph to be fused.
- the running rate of a computing operation can be determined based on the number of memory shares and the time required for the computing operation, and the amount of resources required for each computing operation to run on the set chip is then determined from that rate.
- only one computing operation can be executed at a time, so the minimum amount of resources among the computing operations is taken as the amount of resources required for the first subgraph to be fused to run on the set chip.
- the memory size required for the third computing operation to run on the chip is obtained, and then divided by the time required to execute the third computing operation to obtain the running rate of the third computing operation, so as to determine the amount of resources required for each computing operation to run on the set chip according to the running rate.
- the amount of resources includes bandwidth and/or computing power. Converting each running rate into the amount of resources required for the corresponding computing operation to run on the set chip includes: when the amount of resources required for the fourth computing operation to run on the set chip includes bandwidth, where the fourth computing operation is any computing operation included in the at least two operators of the first subgraph to be fused, converting the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the set chip according to the conversion relationship between running rate and bandwidth; or, when the amount of resources required for the fifth computing operation to run on the set chip includes computing power, where the fifth computing operation is any computing operation included in the at least two operators of the first subgraph to be fused, converting the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the set chip according to the conversion relationship between running rate and computing power.
- in this way, the bandwidth and/or computing power required for a computing operation to run on the set chip can be determined according to the respective conversion relationships, and the resource requirement of the whole subgraph to be fused can then be derived from these per-operation requirements.
- an embodiment of the present application provides an operator fusion device for a neural network, the device comprising:
- a transmission unit used to obtain a neural network model
- a processing unit used to determine a computational graph corresponding to the neural network model, where the computational graph is used to describe a connection relationship between multiple operators in the neural network model; each of the multiple operators is used to perform at least one computational operation;
- the processing unit is further used to determine at least two subgraphs to be fused in the computation graph, wherein any of the at least two subgraphs to be fused includes at least one operator;
- the processing unit is further configured to, when the first subgraph to be fused satisfies a fusion condition, fuse at least two operators included in the first subgraph to be fused into one operator;
- the first sub-graph to be fused is any one of the at least two sub-graphs to be fused
- the fusion conditions include: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is related to the memory size of the set chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- an embodiment of the present application provides a computer device, including a processor and a memory;
- a memory for storing computer program instructions
- the processor executes the computer program instructions in the memory to perform the method provided in any one of the aforementioned aspects or any possible implementation manner of any one of the aspects.
- an embodiment of the present application further provides a computer-readable storage medium in which a software program is stored; when the software program is read and executed by one or more processors, it can implement the method provided by any design of any of the foregoing aspects.
- the present application provides a computer program product including computer instructions which, when executed by a computing device, cause the computing device to perform the method provided in any one of the aforementioned aspects or in any possible implementation of any one of the aspects.
- the computer program product may be a software installation package, and when it is necessary to use the method provided in any one of the aforementioned aspects or in any possible implementation of any one of the aspects, the computer program product may be downloaded and executed on a computing device.
- the present application also provides a computer chip, which is connected to a memory, and the chip is used to read and execute a software program stored in the memory to execute a method provided in any of the foregoing aspects or any possible implementation of any of the aspects.
- FIG1 is a schematic diagram of a calculation graph provided in an embodiment of the present application.
- FIG2 is a flow chart of an operator fusion method for a neural network provided in an embodiment of the present application.
- FIG3 is a flow chart of an operator fusion method for a neural network provided in an embodiment of the present application.
- FIG4 is a flow chart of a method for calculating the amount of resources required for a first subgraph to be fused to run on a set chip provided in an embodiment of the present application;
- FIG5 is a schematic diagram of the structure of a subgraph to be fused provided in an embodiment of the present application.
- FIG6 is a schematic diagram of an operator fusion process provided in an embodiment of the present application.
- FIG7 is a schematic diagram of splitting operators to be fused provided in an embodiment of the present application.
- FIG8 is a schematic diagram of the structure of an operator fusion device for a neural network provided in an embodiment of the present application.
- FIG9 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.
- graphics card memory is a component used to store graphics information to be processed. It stores rendering data that has been processed by, or is about to be fetched by, the graphics card chip. In the scenario of neural network model training, it stores training samples; in the scenario of neural network model inference or application, it stores the data to be processed.
- the memory reuse capability supported by the chip's computing instructions describes the reuse of instruction memory and mainly includes two types: instruction memory reusable and instruction memory non-reusable.
- parallelism refers to the maximum number of instructions or data executed in parallel.
- instruction parallelism In the instruction pipeline, executing multiple instructions at the same time is called instruction parallelism.
- bandwidth refers to the amount of data transmitted per unit time by the operators included in the computational graph.
- bandwidth represents the parallelism of data transfer. Specifically, it refers to the amount of data transferred from memory 1 to memory 2 per unit time. For example, if data is transferred from a hard disk to a memory stick or from a memory stick to a hard disk, bandwidth refers to the data transfer rate between the hard disk and the memory stick.
- computing power refers to the number of floating-point operations per second (FLOPS) that the operators included in the computational graph can perform.
- computing power represents the degree of parallelism of data calculation.
- Computing power refers to the number of floating-point operations that an AI chip can complete per second, and is an indicator of hardware computing speed. It is often used to estimate the execution performance of computers, especially in the field of scientific computing that uses a large number of floating-point operations.
- the computational graph is presented as a directed graph, which defines the flow of data, the calculation of data, and the interdependence between various calculations.
- Figure 1 is a schematic diagram of the structure of a computational graph provided by the present application.
- the computational graph of the AI model consists of operators and edges. Among them, each operator is used to perform at least one computational operation. If the computational graph is in the form of a tree graph, a computational operation can be represented by a computational node, and each computational operation is used to represent the applied mathematical operation (operation).
- the starting point of the data input can be used as an input node, and the end point of the data output or the end point of reading/writing a persistent variable can be used as an output node.
- the operator is the basic computing unit of the AI model.
- the edge is used to represent the input/output relationship between operators or nodes, and the edge can transmit a multidimensional data array whose size can be dynamically adjusted, wherein the multidimensional data array whose size can be dynamically adjusted is a tensor.
- the data structure of the tensor can be used to represent the data in the model, that is, a tensor can correspond to an n-dimensional array or list, where n is an integer greater than or equal to zero.
- Tensors have two attributes: dimension and rank.
- tensors can circulate between operators in the computational graph.
- neural network compilers, or deep learning compilers, are specialized compilers in the field of machine learning. They are used to deploy neural network training or inference on AI chips.
- an operator may also be referred to as a computing task, an operation (OP), a computing layer, etc.
- a data dimension may also be referred to as a dimension, a shape, etc.
- the AI model in the embodiment of the present application may be a neural network model, etc.
- the AI model can be independently deployed on one or more computing devices.
- each computing device runs all functional operators of the AI model, and each computing device can independently execute the functions of the AI model.
- it can be combined with a load balancing module to evenly share the request volume among multiple computing devices.
- disaster recovery can be achieved.
- the AI models on other computing devices can continue to provide services as usual.
- the various functional operators of an AI model can be distributed and deployed on multiple computing devices, and the multiple computing devices can collaboratively run the AI model according to the data dependency relationship of the AI model.
- the computing device can be an AI chip (for example, a central processing unit (CPU), graphics processing unit (GPU), embedded neural network processing unit (NPU), field programmable gate array (FPGA), or application-specific integrated circuit (ASIC) chip), a graphics card, a physical server, etc.
- the AI development framework is a tool library that allows AI developers to quickly develop AI models.
- the AI development framework encapsulates a variety of callable operators and also contains the tools required for AI model development, training, and deployment. During the construction, training, and reasoning of AI models, the encapsulated operators in the AI framework can be called through the application program interface (API), and then combined with some simple driver codes to complete the corresponding operations.
- the AI development framework in the industry is usually open source.
- Typical AI development frameworks used for developing deep learning models also known as deep learning frameworks, include PaddlePaddle, Tensorflow, Caffe, Theano, MXNet, Torch, and PyTorch. Developers can install the AI development framework locally and then develop AI models locally, or use the AI development framework on online platforms (for example, online open source framework platforms, public cloud AI basic development platforms, etc.) to develop AI models.
- Model architecture adjustment is a method to fundamentally optimize the performance of the model. It mainly involves adjusting the algorithm structure of the AI model, such as changing the operator type in the AI model, changing the connection relationship between layers of the AI model, etc.
- the deep learning framework is the first layer in the entire deep learning ecosystem. In TensorFlow and MXNet, neural network calculations are further split into various common operators for tensor data.
- the deep learning framework needs to concretize the deep learning tasks expressed by the computational graph structure mapped by the neural network into instructions and data that can be executed by AI chips.
- the deep learning framework uses operators as specific elements to implement computing tasks, and provides a kernel function (Kernel) for each operator to be executed on the AI chip.
- the deep learning framework schedules the execution of the kernel function corresponding to each operator in the computational graph to complete the calculation of the entire neural network.
- the operators in the computational graph mapped from the neural network are implemented on the AI chip through kernel functions in an "off-chip storage → on-chip computing → off-chip storage" mode; that is, the input and output data of the operators in the neural network are stored in global storage, and the kernel function needs to read the input data from global storage, complete the calculation, and store the results back to global storage.
- the kernel functions of two or more consecutive operators in the computational graph corresponding to the neural network are merged into a new kernel function, so that the computing tasks corresponding to these operators require only one scheduling overhead, and a large amount of data transmission from external memory (DRAM) to on-chip memory and from on-chip memory back to external memory can be eliminated. For example, in the ResNet-18 neural network, if all operators could be fused together, 99.6% of data transmission could be eliminated.
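- To make the saving concrete, here is a minimal NumPy sketch (an illustration of kernel fusion in general, not the patented method; the function names are hypothetical). The unfused pipeline materializes its intermediate result in a global buffer, while the fused version computes the same output in one pass:

```python
import numpy as np

def unfused(x, b):
    # Two separate "kernels": each reads its input from global memory
    # and writes its full result back to global memory.
    t = x + b                   # kernel 1: intermediate t round-trips off-chip
    return np.maximum(t, 0.0)   # kernel 2: reads t back in

def fused(x, b):
    # One merged kernel: the intermediate sum never leaves on-chip
    # registers/cache, saving one round trip to external memory.
    return np.maximum(x + b, 0.0)

x = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)
assert np.allclose(unfused(x, b), fused(x, b))
```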
- operator fusion is mainly divided into two categories: manual fusion and automatic fusion.
- Manual fusion: identify the operators to be fused, write a new custom operator that fuses the corresponding operators together, and register the corresponding fusion rules in the framework.
- manual fusion only supports the fusion of specific operators in specific topology scenarios, and the manual adaptation workload is large.
- an embodiment of the present application provides an operator fusion method.
- the calculation graph is divided into multiple subgraphs to be fused, and the operators are fused based on the subgraphs to be fused;
- the hardware information of the AI chip is considered, so that after the operator fusion, not only the data transmission is reduced, but also the data scale of the intermediate data of the fused operator can be matched with the actual data scale stored in the on-chip memory, and the power consumption overhead required for the on-chip memory is also controlled within a reasonable range.
- Such a design optimizes the neural network, improves execution efficiency, and improves the performance of network training and inference.
- FIG2 is a flowchart of an operator fusion method for a neural network proposed by the present application, which is applied to a fusion device, which may be an electronic device or a server.
- the neural network model may be stored on the set chip or in the electronic device.
- the method includes at least the following steps:
- the set chip can be an AI chip.
- the electronic device obtains a neural network model and determines a calculation graph corresponding to the neural network model.
- the neural network models that are widely used include BP neural network, Hopfield network, ART network and Kohonen network.
- the neural network model obtained can be any one of them, or it can be other neural network models.
- the computational graph is used to describe the connection relationship between multiple operators in the neural network model, and each of the multiple operators is used to perform at least one computing operation.
- the computation graph includes 2 operators, operator 1 includes 3 computation operations, and operator 2 includes 3 computation operations.
- if the obtained neural network model is a commonly used model, its computational graph can be looked up directly among the computational graphs of commonly used models; if the obtained neural network model is a customized neural network model obtained by modifying a commonly used model, a method for extracting a computational graph from a neural network model can be applied to obtain the computational graph of the customized model.
- S202 The electronic device determines at least two subgraphs to be fused in the computation graph.
- the computation graph is divided into at least two subgraphs to be fused, wherein any of the at least two subgraphs to be fused includes at least one operator.
- the methods of determining at least two subgraphs to be fused in the computation graph may include the following:
- the input of a convolution-type computing operation is usually an add-type computing operation, so the operators to which the convolution-type computing operation and the add-type computing operation belong can be divided into one subgraph to be fused.
- any one of the above three methods can be applied to divide the computational graph into at least two subgraphs to be fused; a sketch of the pattern-based grouping follows.
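- A minimal sketch of the pattern-based grouping described above, assuming a hypothetical `Operator` structure in which each operator records the upstream operators feeding it (the other partitioning methods are not detailed here):

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    op_type: str
    inputs: list = field(default_factory=list)  # upstream operators

def partition(operators):
    """Group a convolution-type operator with the add-type operator
    feeding it into one subgraph to be fused; every other operator
    becomes a single-operator subgraph."""
    grouped, subgraphs = set(), []
    for op in operators:
        if op.name in grouped:
            continue
        group = [op] + [p for p in op.inputs
                        if op.op_type == "conv" and p.op_type == "add"]
        grouped.update(o.name for o in group)
        subgraphs.append(group)
    return subgraphs

add = Operator("add0", "add")
conv = Operator("conv0", "conv", [add])
relu = Operator("relu0", "relu", [conv])
print([[o.name for o in g] for g in partition([relu, conv, add])])
# [['relu0'], ['conv0', 'add0']]
```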
- the first subgraph to be fused is any one of the at least two subgraphs to be fused.
- the fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold.
- the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is related to the memory size of the set chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- the first subgraph to be fused includes only one operator, the first subgraph to be fused is a single operator subgraph and operator fusion cannot be performed. Therefore, one of the fusion conditions is that the first subgraph to be fused includes at least two operators. In addition, another fusion condition is that the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold.
- the utilization threshold is preset, for example, 0.8.
- the utilization of the amount of resources required for the first subgraph to be fused to run on the set chip is the ratio of the amount of resources required for the first subgraph to be fused to run on the set chip to the maximum amount of resources of the set chip.
- the maximum amount of resources of the set chip can be obtained from the factory parameters of the set chip, or calculated according to the factory parameters.
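- Putting the two fusion conditions together, a minimal sketch (all names are hypothetical; the required-resource figure comes from the estimation procedure described below, and the chip maximum from its factory parameters):

```python
def satisfies_fusion_condition(num_operators: int,
                               required: float,
                               chip_max: float,
                               threshold: float = 0.8) -> bool:
    # Condition 1: the subgraph to be fused must contain >= 2 operators.
    # Condition 2: the utilization of the resources it needs on the set
    # chip (required / maximum) must exceed the utilization threshold.
    if num_operators < 2:
        return False
    return required / chip_max > threshold

# e.g. a 3-operator subgraph needing 90 GFLOPS on a 100-GFLOPS chip:
assert satisfies_fusion_condition(3, 90.0, 100.0)
```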
- the electronic device sends the neural network model obtained after the operator fusion to the set chip.
- since the neural network model runs on the set chip, the neural network model obtained after operator fusion is sent to the set chip.
- the neural network model running instruction can be used for neural network model training or neural network model inference. After the chip obtains the running instruction of the neural network model, it can process the training data or the inference data.
- the fusion condition includes that the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold.
- the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip can be calculated first.
- the amount of resources includes computing power and/or bandwidth.
- the amount of resources required for the first subgraph to be fused to run on the set chip is determined according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in at least two operators in the first subgraph to be fused.
- the memory size of the set chip usually refers to the on-chip memory.
- the memory reuse capability supported by the operation instructions of the set chip usually includes two cases: reusable and non-reusable. If the computational graph is in the form of a tree, a computing operation can be represented by a computing node, and the total number of computing operations included in the at least two operators of the first subgraph to be fused is the total number of computing nodes.
- the electronic device receives hardware information from the set chip. The hardware information includes the memory reuse capability supported by the set chip's operation instructions and the memory size of the set chip.
- the electronic device determines the number of memory shares required to run the first subgraph to be fused when the memory reuse capability supported by the set chip's operation instructions is reusable and the data types generated by the computing operations included in the at least two operators of the first subgraph to be fused are the same. The number of memory shares is the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output computing operations included in those operators.
- an output computing operation is one whose computing result is used as input by fewer than a second quantity threshold of computing operations. For example, if the second quantity threshold is 2, the number of computing operations that use the output computing operation's result as input is 0 or 1.
- the total number p of computing operations included in the at least two operators of the first subgraph to be fused and the number q of output computing operations included in those operators are determined. If the number of computing operations taking an output computing operation's result as input is 0 or 1, the node where that computing operation is located can be reused, so the difference p - q is the number of memory shares F required to run the at least two operators of the first subgraph to be fused.
- for example, the total number of computing operations included in the at least two operators of the first subgraph to be fused is 3: memory 1 stores data b, memory 2 stores data c, and the computing type of the computing operation is a sum operation. With reuse, data c is added to data b, and the sum, data a, can be written back into the memory holding data b.
- when memory reuse is not supported, the node where a computing operation is located cannot be reused, so the total number p of computing operations included in the at least two operators of the first subgraph to be fused is determined to be the number of memory shares required to run those operators.
- for example, the total number p of computing operations included in the at least two operators of the first subgraph to be fused is 3: memory 1 stores data b, memory 2 stores data c, and the computing type of the computing operation is a sum operation. Data b, data c, and the summed data a each require their own memory, so the number of memory shares required to run the first subgraph to be fused is also 3.
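- The two cases reduce to a short sketch (a hypothetical helper; p and q are the counts defined above):

```python
def memory_shares(total_ops: int, output_ops: int, reusable: bool) -> int:
    """Number of memory shares f needed to run the subgraph to be fused.

    If the chip's instructions support memory reuse and every operation
    produces the same data type, an output operation's result can
    overwrite an input's memory, so f = p - q; otherwise every
    operation's result needs its own share, so f = p.
    """
    return total_ops - output_ops if reusable else total_ops

# The example above: p = 3 operations, q = 1 output operation.
print(memory_shares(3, 1, reusable=True))   # 2 (a reuses b's memory)
print(memory_shares(3, 1, reusable=False))  # 3 (b, c, a each need memory)
```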
- FIG. 5 is a schematic diagram of the first subgraph to be fused.
- 51 is an output-type computing operation whose calculation result is used as input by 0 computing operations; 52 is an output-type computing operation whose calculation result is used as input by 1 computing operation; 53 is an output-type computing operation whose calculation result is used as input by 1 computing operation.
- FIG5 includes addition operation, subtraction operation, multiplication operation, division operation, square root operation, back propagation operation and data conversion operation, etc.
- the form of each operation is described with an example:
- An example of an addition operation is add_14float32[1,2,1,1,16], where add is the calculation type of the calculation operation, 14 is the array dimension of the calculation operation, float32 is the data type of the calculation operation, and [1,2,1,1,16] is the data tensor of the calculation operation.
- An example of a subtraction operation is sub_3float32[1,2,1,1,16], where sub is the calculation type of the calculation operation, 3 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor.
- An example of a multiplication operation is mul_1float32[1,2,1,1,16], where mul is the calculation type of the calculation operation, 1 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor.
- An example of a division operation is div_6float32[1,2,1,1,16], where div is the calculation type of the calculation operation, 6 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor.
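- For readability, here is a hypothetical parser for descriptors in this format (the field layout is inferred from the examples above, not specified by the application):

```python
import re

def parse_op(descriptor: str):
    """Split a descriptor such as 'add_14float32[1,2,1,1,16]' into the
    calculation type, array dimension, data type, and data tensor shape."""
    m = re.fullmatch(r"([a-z]+)_(\d+)(\w+?)\[([\d,]+)\]", descriptor)
    if m is None:
        raise ValueError(f"unrecognized descriptor: {descriptor}")
    calc_type, dim, dtype, shape = m.groups()
    return calc_type, int(dim), dtype, [int(s) for s in shape.split(",")]

print(parse_op("add_14float32[1,2,1,1,16]"))
# ('add', 14, 'float32', [1, 2, 1, 1, 16])
```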
- the electronic device determines the running rate corresponding to each computing operation included in the first subgraph to be fused according to the memory size of the set chip, the number of memory shares required to run the first subgraph to be fused, and the running time corresponding to the type of each computing operation included in the first subgraph to be fused.
- different computing operations usually have different corresponding running rates.
- the third computing operation is any computing operation included in the at least two operators in the first subgraph to be fused.
- one memory share is used to store the calculation result of one computing operation. Therefore, the memory size M of the set chip is divided by the number of memory shares f required to run the first subgraph to be fused, and the resulting ratio is the memory size required for the third computing operation to run on the chip.
- that is, m = M / f, where the third computing operation is any computing operation included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the set chip, M is the memory size of the set chip, and f is the number of memory shares required to run the first subgraph to be fused.
- if the first subgraph to be fused includes S computing operations, the number of running rates determined is also S.
- the ratio of the memory size required for the third computing operation to run on the chip to the time required to execute the third computing operation may be determined as the running rate corresponding to the third computing operation, that is, V = m / t.
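- In code form, a minimal sketch of the two ratios just described (m = M / f, then V = m / t; the units in the example are assumptions for illustration):

```python
def running_rate(chip_memory: float, shares: int, op_time: float) -> float:
    """Running rate V of one computing operation: m = M / f is the memory
    available to the operation, and V = m / t where t is the time needed
    to execute the operation."""
    m = chip_memory / shares  # memory share available to this operation
    return m / op_time

# e.g. M = 1 MiB of on-chip memory, f = 2 shares, t = 10 microseconds:
v = running_rate(1 * 1024 * 1024, 2, 10e-6)  # bytes per second
```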
- the electronic device converts the running rate corresponding to each computing operation included in the first subgraph to be fused into the appropriate units, obtaining the amount of resources required for each computing operation included in the first subgraph to be fused to run on the set chip.
- the amount of resources required for the fourth computing operation to run on the set chip includes bandwidth
- the fourth computing operation is any computing operation included in the at least two operators in the first sub-graph to be fused. According to the conversion relationship between the running rate and the bandwidth, the running rate of the fourth computing operation is converted into the bandwidth required for the fourth computing operation to run on the set chip.
- the amount of resources required for the fifth computing operation to run on the set chip includes computing power
- the fifth computing operation is any one of the computing operations included in at least two operators in the first sub-graph to be fused. According to the conversion relationship between the running rate and the computing power, the running rate of the fifth computing operation is converted into the computing power required for the fifth computing operation to run on the set chip.
- S404 The electronic device uses the minimum amount of resources among the amounts of resources required for each computing operation included in the first subgraph to be fused to run on the set chip as the amount of resources required for the first subgraph to be fused to run on the set chip.
- the minimum amount of resources required for each computing operation to run on the set chip is the amount of resources required for the first subgraph to be fused to run on the set chip.
- the bandwidth BW0 and computing power F0 required for the first subgraph to be fused to run on the set chip can be obtained.
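- A sketch of this last step, assuming the chip-specific rate-to-bandwidth and rate-to-computing-power conversion relationships are supplied as functions (the factors below are placeholders, not real chip parameters):

```python
def subgraph_resources(rates, to_bandwidth, to_compute):
    """Convert each operation's running rate into the bandwidth and
    computing power it needs on the set chip, then take the minimum:
    only one operation executes at a time, so the smallest per-operation
    requirement is taken as the requirement of the whole subgraph."""
    bw0 = min(to_bandwidth(v) for v in rates)
    f0 = min(to_compute(v) for v in rates)
    return bw0, f0

bw0, f0 = subgraph_resources(
    [1e9, 2e9, 5e8],
    to_bandwidth=lambda v: v,      # placeholder: bytes/s -> bytes/s
    to_compute=lambda v: v / 4,    # placeholder: bytes/s -> FLOPS
)
```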
- when the first subgraph to be fused does not meet the fusion condition, it means that if the at least two operators included in the first subgraph to be fused were fused, the set chip's performance would not reach a better state when running the resulting neural network model. In this case, the first subgraph to be fused is split into the second subgraph to be fused and the third subgraph to be fused, and the fusion condition is then applied to judge the second subgraph to be fused and the third subgraph to be fused respectively.
- for the fusion condition and the fusion process, refer to the description of the first subgraph to be fused; they are not elaborated here.
- the first subgraph to be fused includes four operators. If the four operators are fused, the fusion condition is not met. Then the subgraph is divided to obtain the second subgraph to be fused and the third subgraph to be fused, wherein the second subgraph to be fused includes operator 1, operator 2 and operator 3; the third subgraph to be fused includes operator 4. If the second subgraph to be fused meets the fusion condition, operator 1, operator 2 and operator 3 of the second subgraph to be fused are fused.
- the third subgraph to be fused is a single operator subgraph, and operator fusion cannot be performed.
- FIG 7 is a schematic diagram of the splitting of the subgraphs to be fused.
- the first output-type computing operation is any output-type computing operation of the first subgraph to be fused
- the first output-type computing operation satisfies that the number of computing operations that use the operation result of the first output-type computing operation as input is less than the first quantity threshold.
- the first quantity threshold may be 1, and the number of computing operations using the calculation result of the first output type computing operation as input is 0, that is, there is no computing operation using the calculation result of the first output type computing operation as input.
- the first subgraph to be fused is split into the second subgraph to be fused and the third subgraph to be fused.
- the subgraph with the second computing operation as the root node in the first subgraph to be fused is taken as the second subgraph to be fused, and the subgraphs other than the subgraph with the second computing operation as the root node in the first subgraph to be fused are taken as the third subgraph to be fused.
- the second computing operation meets the splitting condition if the computing type of the second computing operation is a set type (such as a back propagation type) or the computing type of the second computing operation is the same as the computing type of the computing operation of the parent node of the second computing operation.
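- A minimal sketch of the split, assuming a tree-shaped subgraph of hypothetical `Node` objects and using `"backprop"` as a stand-in for the set type (the application names back propagation as one example):

```python
class Node:
    """A computing operation in a tree-shaped subgraph to be fused."""
    def __init__(self, op_type, inputs=()):
        self.op_type = op_type
        self.inputs = list(inputs)  # operations whose results feed this one
        self.parent = None
        for child in self.inputs:
            child.parent = self

def split(output_op, set_types=frozenset({"backprop"})):
    """Reverse search from an output-type computing operation: walk toward
    the inputs and stop at the first operation whose type is a set type or
    equals its parent's type. The subtree rooted there is the second
    subgraph to be fused; the remainder is the third."""
    stack = list(output_op.inputs)
    while stack:
        op = stack.pop()
        if op.op_type in set_types or op.op_type == op.parent.op_type:
            op.parent.inputs.remove(op)  # detach subtree from the remainder
            return op, output_op
        stack.extend(op.inputs)
    return None, output_op  # no split point found

# Example: the inner 'add' has the same type as its parent, so the
# split happens there.
inner = Node("add", [Node("mul"), Node("sub")])
root = Node("add", [inner, Node("div")])
second, third = split(root)
print(second.op_type)                      # add
print([c.op_type for c in third.inputs])   # ['div']
```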
- the embodiment of the present application also provides an operator fusion device for a neural network, which is used to execute the operator fusion method for a neural network in the method embodiment shown above.
- the operator fusion device for a neural network includes a transmission unit 801 and a processing unit 802.
- a transmission unit 801 is used to obtain a neural network model
- a processing unit 802 is used to determine a computation graph corresponding to the neural network model, where the computation graph is used to describe the connection relationship between multiple operators in the neural network model; each of the multiple operators is used to perform at least one computation operation;
- the processing unit 802 is further configured to determine at least two subgraphs to be fused in the computation graph, wherein any of the at least two subgraphs to be fused includes at least one operator;
- the processing unit 802 is further configured to, when the first subgraph to be fused meets a fusion condition, fuse at least two operators included in the first subgraph to be fused into one operator;
- the first sub-graph to be fused is any one of the at least two sub-graphs to be fused
- the fusion conditions include: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is related to the memory size of the set chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- processing unit 802 is further configured to:
- the processing unit 802 is specifically configured to:
- the first output type computing operation is any output type computing operation of the first subgraph to be fused, and the first output type computing operation satisfies that the number of computing operations that use the operation result of the first output type computing operation as input is less than a first quantity threshold;
- the subgraph with the second calculation operation as the root node in the first subgraph to be fused is used as the second subgraph to be fused, and the subgraphs in the first subgraph to be fused except the subgraph with the second calculation operation as the root node are used as the third subgraph to be fused;
- the second computing operation satisfies the splitting condition if the computing type of the second computing operation is a set type or the computing type of the second computing operation is the same as the computing type of the computing operation of the parent node of the second computing operation.
- processing unit 802 is further configured to:
- the amount of resources required for the first subgraph to be fused to run on the set chip is determined according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in at least two operators in the first subgraph to be fused.
- the processing unit 802 is specifically configured to:
- determine the number of memory shares required to run the first subgraph to be fused, where the number of memory shares is the difference between the total number of calculation operations included in the at least two operators of the first subgraph to be fused and the number of output calculation operations included in those operators; an output calculation operation is one whose calculation result is used as input by fewer than the second quantity threshold of calculation operations;
- the minimum amount of resources among the amounts of resources required for each computing operation included in the first subgraph to be fused to run on the set chip is used as the amount of resources required for the first subgraph to be fused to run on the set chip.
- the processing unit 802 is specifically configured to:
- the third computing operation is any computing operation included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the set chip, M is the memory size of the set chip, and f is the number of memory shares required to run the first subgraph to be fused;
- V is the running rate of the third computing operation
- t is the time required to execute the third computing operation
- the amount of resources includes bandwidth and/or computing power
- the amount of resources required for the fourth computing operation to run on the set chip includes bandwidth
- the fourth computing operation is any computing operation included in the at least two operators in the first sub-graph to be fused
- the processing unit 802 is specifically used to: convert the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the set chip according to the conversion relationship between the running rate and the bandwidth; or,
- the amount of resources required for the fifth computing operation to run on the set chip includes computing power.
- the fifth computing operation is any one of the computing operations included in at least two operators in the first sub-graph to be fused.
- the processing unit 802 is specifically used to: convert the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the set chip according to the conversion relationship between the running rate and the computing power.
- the division of modules in the embodiments of the present application is schematic and is only a logical function division. There may be other division methods in actual implementation.
- the functional modules in the embodiments of the present application may be integrated into a processing module, or each module may exist physically separately, or two or more modules may be integrated into one module.
- the above-mentioned integrated modules may be implemented in the form of hardware or in the form of software functional modules.
- the above method can be implemented in whole or in part by software, hardware, firmware or any other combination.
- the above method can be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- when the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
- the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions can be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.).
- the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media.
- the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium.
- the semiconductor medium can be a solid state drive (SSD).
- the device 900 shown in FIG. 9 includes at least one processor 901 , a memory 902 , and optionally, a communication interface 903 .
- the memory 902 may be a volatile memory, such as a random access memory; the memory may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD), or the memory 902 may be any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
- the memory 902 may be a combination of the above memories.
- connection medium between the processor 901 and the memory 902 is not limited in the embodiment of the present application.
- the processor 901 may be a CPU, or other general-purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field programmable gate arrays (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, artificial intelligence chips, chips on chips, etc.
- a general-purpose processor may be a microprocessor or any conventional processor, etc.
- an independent data transceiver module may also be provided, such as a communication interface 903, for transmitting and receiving data; when the processor 901 communicates with other devices, data may be transmitted through the communication interface 903.
- when the computing device takes the form shown in FIG. 9 , the processor 901 in FIG. 9 can call the computer-executable instructions stored in the memory 902 so that the computing device can execute the operator fusion method for a neural network in any of the above method embodiments.
- the functions/implementation processes of the transmission unit 801 and the processing unit 802 in FIG. 8 can be implemented by the processor 901 in FIG. 9 calling the computer-executable instructions stored in the memory 902.
- alternatively, the function/implementation process of the processing unit 802 in FIG. 8 can be implemented by the processor 901 in FIG. 9 calling the computer-executable instructions stored in the memory 902, and the transmission function/implementation process of the transmission unit 801 in FIG. 8 can be implemented by the communication interface 903 in FIG. 9.
- the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) that contain computer-usable program code.
- these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
- these computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The embodiments of the present application provide an operator fusion method for a neural network and a related apparatus. The method comprises: obtaining a neural network model and determining a computational graph corresponding to the neural network model, the computational graph describing the connection relationships between a plurality of operators in the neural network model, each operator performing at least one computing operation; in order to improve fusion efficiency, determining at least two subgraphs to be fused within the computational graph; and, when a first subgraph to be fused comprises at least two operators and the utilization rate of the amount of resources required for the first subgraph to be fused to run on a set chip is greater than a utilization rate threshold, fusing the at least two operators included in the first subgraph to be fused into a single operator. The utilization rate required by the first subgraph to be fused is related to the memory size of the set chip and to the total number of computing operations included in the first subgraph to be fused; considering the memory size of the chip allows a fused operator to fully utilize the resources of the chip, thereby further improving the operation speed of the neural network model.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 9, 2022, with application number 202211584001.6 and entitled "An operator fusion method for a neural network and a related apparatus", the entire contents of which are incorporated herein by reference.
The present application relates to the field of artificial intelligence technology, and in particular to an operator fusion method for a neural network and a related apparatus.
By defining an artificial intelligence (AI) model and solving the model's parameters (which can be called model training), a unique piece of computational logic can be determined. After conversion, this computational logic can be applied to inference computation (which can also be called model inference or model use). This computational logic can be represented by a graph, and that graph is the computational graph.
A computational graph usually contains many operators. During model training or inference, each operator incurs memory-access overhead, scheduling overhead and execution overhead. An excessive number of operators in the computational graph therefore seriously affects execution efficiency. Usually, the number of operators can be reduced by operator fusion.
In the related art there are two approaches, manual fusion and automatic fusion. In either approach, operators are classified according to their characteristics and then fused according to fusion rules. However, neither approach can guarantee the runtime performance of the neural network model obtained from the fused operators.
Summary of the invention
The present application provides an operator fusion method for a neural network and a related apparatus, which improve the chip's resource utilization and the neural network model's operation speed when the model runs on the chip, thereby improving the performance of the neural network in training or inference.
In a first aspect, an embodiment of the present application provides an operator fusion method for a neural network, in which:
First, a neural network model is obtained, and a computational graph corresponding to the neural network model is determined; the computational graph describes the connection relationships between multiple operators in the neural network model, and each of the multiple operators performs at least one computing operation. Second, at least two subgraphs to be fused are determined in the computational graph, where each of the at least two subgraphs to be fused includes at least one operator. Finally, when a first subgraph to be fused includes at least two operators and the utilization rate of the amount of resources required for the first subgraph to be fused to run on a set chip is greater than a utilization rate threshold, the at least two operators included in the first subgraph to be fused are fused into one operator, the first subgraph to be fused being any one of the at least two subgraphs to be fused. The utilization rate required by the first subgraph to be fused is related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused.
Through the above method, operators are fused on the basis of subgraphs to be fused: the computational graph is first divided into at least two subgraphs to be fused (any one of which can be called the first subgraph to be fused), and when the first subgraph to be fused includes at least two operators and the utilization rate of the amount of resources it requires to run on the set chip is greater than the utilization rate threshold, the at least two operators it includes are fused into one operator. Since this utilization rate is related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused, the fusion process takes the memory of the set chip into account, which improves the chip's resource utilization and the neural network model's operation speed when the model runs on the chip, and thus improves the performance of the neural network in training or inference.
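For illustration only, the following Python sketch summarizes the flow of the first aspect under assumed names; the classes and helpers (Subgraph, Chip, operator_fusion) are hypothetical and are not defined by this application:

```python
from dataclasses import dataclass

@dataclass
class Subgraph:
    operators: list   # the operators contained in this subgraph to be fused
    required: float   # amount of resources it needs to run on the set chip

@dataclass
class Chip:
    max_resources: float  # maximum resource amount, from the chip's factory specifications

def meets_fusion_condition(sg: Subgraph, chip: Chip, threshold: float = 0.8) -> bool:
    # Both parts of the fusion condition: at least two operators, and the
    # utilization rate of the required resource amount above the threshold.
    return len(sg.operators) >= 2 and sg.required / chip.max_resources > threshold

def operator_fusion(subgraphs, chip):
    # Fuse every candidate subgraph that satisfies the fusion condition;
    # non-qualifying subgraphs are left unchanged.
    result = []
    for sg in subgraphs:
        if meets_fusion_condition(sg, chip):
            result.append(Subgraph(operators=["fused"], required=sg.required))
        else:
            result.append(sg)
    return result

print(operator_fusion([Subgraph(["op1", "op2"], 90.0)], Chip(max_resources=100.0)))
```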
In some exemplary embodiments, the operator fusion method for a neural network further includes: when the first subgraph to be fused does not meet the fusion condition, splitting the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused; when the second subgraph to be fused meets the fusion condition, fusing the at least two operators included in the second subgraph to be fused into one operator; and/or, when the third subgraph to be fused meets the fusion condition, fusing the at least two operators included in the third subgraph to be fused into one operator.
Through the above method, if the first subgraph to be fused does not meet the fusion condition, this indicates that fusing the operators included in the first subgraph to be fused at this point would leave the chip running the resulting neural network model with poor performance and low resource utilization. In this case, the first subgraph to be fused is split into a second subgraph to be fused and a third subgraph to be fused, and the operators included in each of them are fused separately, subject to the fusion condition, so as to improve the resource utilization of the chip running the neural network model obtained after operator fusion and thus the operation speed of the neural network model.
In some exemplary embodiments, splitting the first subgraph to be fused into the second subgraph to be fused and the third subgraph to be fused includes: starting from a first output-type computing operation of the first subgraph to be fused, traversing the computing operations of the first subgraph to be fused by reverse search, where the first output-type computing operation is any output-type computing operation of the first subgraph to be fused and satisfies the condition that the number of computing operations taking its operation result as input is less than a first quantity threshold; and, when the currently traversed second computing operation meets the splitting condition, taking the subgraph of the first subgraph to be fused rooted at the second computing operation as the second subgraph to be fused, and taking the remainder of the first subgraph to be fused as the third subgraph to be fused. The second computing operation meets the splitting condition when its computing type is a set type, or when its computing type is the same as the computing type of the computing operation that is its parent node.
Through the above method, when the first subgraph to be fused is split, the types of the computing operations it includes are considered. When the computing type of the second computing operation is a set type, or is the same as the computing type of the computing operation that is its parent node, the second and third subgraphs to be fused are obtained. The fusion condition is then used to determine whether the operators in the second and third subgraphs to be fused can be fused, further improving the resource utilization of the chip running the neural network model obtained after operator fusion, and thus the operation speed of the neural network model.
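A minimal Python sketch of this split, assuming a simple node structure (the Op class and its fields op_type, inputs and parent are illustrative names, not defined by this application):

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    op_type: str
    inputs: list = field(default_factory=list)  # operations whose results feed this one
    parent: "Op" = None                         # the operation consuming this one's result

def find_split_root(output_op: Op, set_types: set):
    # Reverse search from an output-type computing operation: return the first
    # operation whose computing type is a set type or equals its parent's type.
    # The subgraph rooted at the returned operation becomes the second subgraph
    # to be fused; the remainder forms the third subgraph to be fused.
    stack = list(output_op.inputs)
    while stack:
        op = stack.pop()
        if op.op_type in set_types or (op.parent and op.op_type == op.parent.op_type):
            return op
        stack.extend(op.inputs)
    return None

# e.g. an add fed by another add: the inner add meets the splitting condition
root = Op("add")
child = Op("add", parent=root)
root.inputs = [child]
print(find_split_root(root, set_types={"conv"}).op_type)  # add
```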
In some exemplary embodiments, the operator fusion method for a neural network further includes: determining the amount of resources required for the first subgraph to be fused to run on the set chip according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
The above method considers not only the total number of computing operations included in the at least two operators of the first subgraph to be fused, a property of the neural network model itself, but also the memory size of the chip and the memory reuse capability supported by the set chip's operation instructions. Computing the resource utilization from the resource amount obtained in this way fully accounts for the chip's capabilities, improves the chip's resource utilization, and thus improves the operation speed of the neural network model.
In some exemplary embodiments, determining the amount of resources required for the first subgraph to be fused to run on the set chip according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in the at least two operators of the first subgraph to be fused includes: when the computing operations included in the at least two operators produce data of the same type, determining the number of memory copies required to run the first subgraph to be fused, the number of memory copies being the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output-type computing operations included in those operators, where an output-type computing operation is one whose operation result is taken as input by a number of computing operations smaller than a second quantity threshold; determining, according to the memory size of the set chip, the number of memory copies required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused; performing unit conversion on the running rate corresponding to each computing operation to obtain the amount of resources required for that computing operation to run on the set chip; and taking the smallest of these resource amounts as the amount of resources required for the first subgraph to be fused to run on the set chip.
Through the above method, when the number of memory copies required to run the at least two operators of the first subgraph to be fused is calculated, both cases of the memory reuse capability supported by the set chip's operation instructions, reusable and non-reusable, are considered. Since reuse is only effective when the computing operations all produce data of the same type, in the reusable case the number of memory copies required is the difference between the total number of computing operations included in the at least two operators and the number of output-type computing operations among them. Since the running rate of a computing operation varies with the amount of resources, the running rate can be determined from the number of memory copies and the duration of the computing operation, and the amount of resources each computing operation requires on the set chip can then be determined. In addition, only one computing operation can execute at any given moment, so the smallest resource amount among the computing operations is taken as the amount of resources required for the first subgraph to be fused to run on the set chip.
In some exemplary embodiments, the memory size required for a third computing operation to run on the set chip satisfies m = M/f, where the third computing operation is any computing operation included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the set chip, M is the memory size of the set chip, and f is the number of memory copies required to run the first subgraph to be fused; and the running rate of the third computing operation satisfies V = m/t, where V is the running rate of the third computing operation and t is the duration required to execute the third computing operation.
Through the above method, the memory size required for the third computing operation to run on the chip is obtained, and dividing it by the duration required to execute the third computing operation yields the running rate of the third computing operation, from which the amount of resources each computing operation requires on the set chip can be determined.
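A worked numeric example of m = M/f and V = m/t, with hypothetical values chosen purely for illustration (none of these numbers come from the application):

```python
M = 2 * 1024 * 1024  # memory size of the set chip: 2 MiB of on-chip memory (assumed)
f = 2                # memory copies required to run the first subgraph to be fused
t = 5e-6             # duration of the third computing operation: 5 microseconds (assumed)

m = M / f            # memory available to the operation: 1 MiB
V = m / t            # running rate: about 2.1e11 bytes per second
print(m, V)
```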
In some exemplary embodiments, the amount of resources includes bandwidth and/or computing power, and converting each running rate into the amount of resources required for the corresponding computing operation to run on the set chip includes: when the amount of resources required for a fourth computing operation to run on the set chip includes bandwidth, where the fourth computing operation is any computing operation included in the at least two operators of the first subgraph to be fused, converting the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the set chip according to the conversion relationship between running rate and bandwidth; or, when the amount of resources required for a fifth computing operation to run on the set chip includes computing power, where the fifth computing operation is any computing operation included in the at least two operators of the first subgraph to be fused, converting the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the set chip according to the conversion relationship between running rate and computing power.
Through the above method, since conversion relationships exist between running rate and bandwidth and between running rate and computing power, the bandwidth and/or computing power required for a computing operation to run on the set chip can be determined from the respective conversion relationship, and the bandwidth and/or computing power required for the first subgraph to be fused to run on the set chip can in turn be determined from the bandwidth and/or computing power required by each computing operation.
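The application states only that such conversion relationships exist without fixing their form; the sketch below assumes simple linear conversions for illustration:

```python
def rate_to_bandwidth(v_bytes_per_s: float) -> float:
    # For a data-movement operation, the running rate in bytes per second can
    # arguably be read directly as the bandwidth the operation requires.
    return v_bytes_per_s

def rate_to_compute(v_bytes_per_s: float, bytes_per_element: int,
                    flops_per_element: float) -> float:
    # Assumed conversion: elements processed per second times floating-point
    # operations per element yields the required computing power in FLOPS.
    return (v_bytes_per_s / bytes_per_element) * flops_per_element

print(rate_to_compute(2.1e11, bytes_per_element=4, flops_per_element=2.0))
```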
In a second aspect, an embodiment of the present application provides an operator fusion apparatus for a neural network, the apparatus comprising:
a transmission unit, configured to obtain a neural network model;
a processing unit, configured to determine a computational graph corresponding to the neural network model, the computational graph describing the connection relationships between multiple operators in the neural network model, each of the multiple operators performing at least one computing operation;
the processing unit is further configured to determine at least two subgraphs to be fused in the computational graph, where each of the at least two subgraphs to be fused includes at least one operator;
the processing unit is further configured to, when a first subgraph to be fused meets a fusion condition, fuse the at least two operators included in the first subgraph to be fused into one operator;
the first subgraph to be fused is any one of the at least two subgraphs to be fused;
the fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on a set chip is greater than a utilization rate threshold, this utilization rate being related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused.
In a third aspect, an embodiment of the present application provides a computer device, comprising a processor and a memory;
the memory is configured to store computer program instructions;
the processor executes the computer program instructions in the memory to perform the method provided in any of the foregoing aspects or in any possible implementation of any of the aspects.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a software program which, when read and executed by one or more processors, implements the method provided by any design of any of the foregoing aspects.
In a fifth aspect, the present application provides a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to perform the method provided in any of the foregoing aspects or in any possible implementation of any of the aspects. The computer program product may be a software installation package; when the method provided in any of the foregoing aspects or in any possible implementation thereof needs to be used, the computer program product may be downloaded and executed on a computing device.
In a sixth aspect, the present application further provides a computer chip connected to a memory, the chip being configured to read and execute a software program stored in the memory to perform the method provided in any of the foregoing aspects or in any possible implementation of any of the aspects.
FIG. 1 is a schematic diagram of a computational graph provided in an embodiment of the present application;
FIG. 2 is a flowchart of an operator fusion method for a neural network provided in an embodiment of the present application;
FIG. 3 is a flowchart of an operator fusion method for a neural network provided in an embodiment of the present application;
FIG. 4 is a flowchart of a method for calculating the amount of resources required for a first subgraph to be fused to run on a set chip, provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of the structure of a subgraph to be fused provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of an operator fusion process provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of the splitting of operators to be fused provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of the structure of an operator fusion apparatus for a neural network provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application are further described in detail below with reference to the accompanying drawings. In the description of the embodiments of the present application, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature.
To facilitate understanding, exemplary descriptions of concepts related to the present application are provided for reference.
(1) Video memory
Also known as graphics card memory, video memory is the component used to store the graphics information to be processed, holding rendering data that the graphics chip has processed or is about to fetch. In neural network model training scenarios it stores the training samples; in neural network model inference or application scenarios it stores the data to be processed.
(2) Memory reuse capability
The memory reuse capability supported by the chip's operation instructions describes whether instruction memory can be reused, and mainly covers two cases: instruction memory is reusable, and instruction memory is not reusable.
(3) Parallelism
In computer architecture, parallelism refers to the maximum number of instructions or data items executed in parallel. In an instruction pipeline, executing multiple instructions at the same time is called instruction parallelism.
(4) Bandwidth
For a computational graph, bandwidth refers to the amount of data transmitted per unit time by the operators included in the computational graph.
For an AI chip, bandwidth characterizes the parallelism of data movement. Specifically, it is the amount of data transferred from memory 1 to memory 2 per unit time. For example, if data is transferred from a hard disk to a memory module, or from a memory module to a hard disk, the bandwidth is the data transfer rate between the hard disk and the memory module.
(5) Computing power
For a computational graph, computing power refers to the number of floating-point operations per second (FLOPS) that the operators included in the computational graph can perform.
For an AI chip, computing power characterizes the parallelism of data computation. It is the number of floating-point operations the AI chip can complete per second, a measure of hardware computing speed. It is often used to estimate a computer's execution performance, especially in scientific computing, which involves large numbers of floating-point operations.
(6) Computational graph
By defining an artificial intelligence (AI) model and solving the model's parameters (which can be called model training), a unique piece of computational logic can be determined. After conversion, this computational logic can be applied to inference computation (which can also be called model inference or model use). This computational logic can be represented by a graph, and that graph is the computational graph.
A computational graph is a directed graph that defines how data flows, how data is computed, and the interdependencies between the various computations. FIG. 1 is a schematic diagram of the structure of a computational graph provided by the present application. The computational graph of an AI model consists of operators and edges. Each operator performs at least one computing operation; if the computational graph takes the form of a tree, a computing operation can be represented by a computing node, and each computing operation represents a mathematical operation being applied. In addition, when an operator applies a mathematical operation, the starting point of the data input can serve as an input node, and the end point of the data output, or the end point for reading/writing persistent variables, can serve as an output node; the operator is the basic computing unit of the AI model. Edges represent input/output relationships between operators or nodes, and an edge can carry a multi-dimensional data array whose size can be adjusted dynamically, i.e., a tensor. The tensor data structure can be used to represent the data in the model: a tensor can correspond to an n-dimensional array or list, where n is an integer greater than or equal to zero. A tensor has two attributes, dimension and rank. In addition, tensors can flow between the operators of the computational graph.
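A minimal sketch of this structure in Python (the class and field names are illustrative assumptions, not part of any framework named here):

```python
from dataclasses import dataclass, field

@dataclass
class Tensor:
    shape: tuple            # a dynamically sizable multi-dimensional array; rank = len(shape)
    dtype: str = "float32"

@dataclass
class Operator:
    name: str
    compute_ops: list = field(default_factory=list)  # each operator performs >= 1 computing operation
    inputs: list = field(default_factory=list)       # incoming edges (tensors)
    outputs: list = field(default_factory=list)      # outgoing edges (tensors)

# A two-operator graph in the spirit of FIG. 1: the tensor produced by op1
# flows along an edge into op2.
t = Tensor(shape=(1, 2, 1, 1, 16))
op1 = Operator("op1", compute_ops=["add"], outputs=[t])
op2 = Operator("op2", compute_ops=["mul"], inputs=[t])
```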
(7) Compiler
A neural network compiler or deep learning compiler, for example, is a domain-specific compiler in the field of machine learning, used to deploy the training or inference of a neural network onto an AI chip.
In the embodiments of the present application, an operator may also be referred to as a computing task, an operation (OP), a computing layer, etc., and a data dimension may also be referred to as a dimension, a shape, etc. In addition, the AI model in the embodiments of the present application may be a neural network model, for example.
In the deployment phase of an AI model, the AI model can be deployed independently on one or more computing devices. When it is deployed on multiple computing devices, each computing device runs all of the AI model's functional operators and can execute the AI model's functions independently. On the one hand, a load-balancing module can then be used so that the multiple computing devices share the request load evenly; on the other hand, disaster tolerance is achieved: when one computing device fails, the AI models on the other computing devices can continue to provide services as usual.
Alternatively, the functional operators of one AI model can be deployed in a distributed manner across multiple computing devices, which run the AI model cooperatively according to the AI model's data dependencies. It should be understood that a computing device may be an AI chip (for example, a chip including a central processing unit (CPU), a graphics processing unit (GPU), an embedded neural-network processing unit (NPU), a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)), a graphics card, a physical server, etc.
An AI development framework is a tool library that enables AI developers to develop AI models quickly. An AI development framework encapsulates a variety of callable operators and also contains the tools required for AI model development, training and deployment. During AI model construction, training and inference, the operators encapsulated in the AI framework can be called through application program interface (API) calls and combined with some simple driver code to complete the corresponding operations.
AI development frameworks in the industry are usually open source. Typical AI development frameworks for developing deep learning models, also called deep learning frameworks, include PaddlePaddle, TensorFlow, Caffe, Theano, MXNet, Torch and PyTorch. Developers can install an AI development framework locally and develop AI models locally, or use an AI development framework on an online platform (for example, an online open-source framework platform or a public cloud AI basic development platform) to develop AI models.
Model architecture adjustment is a way of fundamentally optimizing a model's performance. It mainly adjusts the algorithmic structure of the AI model, for example by changing the operator types in the AI model or changing the connections between the AI model's layers.
The deep learning framework is the first layer of the entire deep learning ecosystem. In TensorFlow and MXNet, neural network computation is further split into various common operators oriented toward tensor data, and the deep learning framework must concretize the deep learning task expressed by the computational-graph structure onto which the neural network is mapped into instructions and data that the AI chip can execute. In this process, the deep learning framework uses operators as the concrete elements that carry out computing tasks, providing each operator with a kernel function (kernel) to be executed on the AI chip. Following the computational graph, the deep learning framework schedules the execution of the kernel function corresponding to each operator in the graph to complete the computation of the entire neural network.
The operators in the computational graph onto which the neural network is mapped are implemented on the AI chip through kernel functions, following an "off-chip storage → on-chip computation → off-chip storage" pattern: the input and output data of the operators in the neural network are stored in global storage, and a kernel function must read its input data from global storage, complete the computation, and store the result back into global storage. This causes two problems. First, each operator's memory accesses for input and output data cannot be avoided by optimization within the operator. Second, each operator incurs launch overhead, and even more so for heterogeneous computing devices beyond the AI chip. To solve these problems, the kernel functions of two or more consecutive operators in the computational graph corresponding to the neural network are merged into a new kernel function, so that the computing tasks corresponding to these operators require only one scheduling overhead. A large amount of data transmission from external memory (DRAM) to on-chip memory, and from on-chip memory back to external memory, can thus be eliminated. For example, in the ResNet-18 neural network, if all operators could be fused together, data transmission could be reduced by 99.6%.
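The saving can be estimated with a rough accounting sketch (an assumption-level model for illustration, not a figure from this application): each intermediate tensor of a fused kernel chain no longer needs one write to external memory and one read back.

```python
def eliminated_dram_traffic(intermediate_tensor_bytes):
    # Each intermediate tensor saves one write to DRAM plus one read back
    # into on-chip memory once its producer and consumer are fused.
    return sum(2 * n for n in intermediate_tensor_bytes)

# e.g. fusing a three-operator chain with two 1 MiB intermediate tensors
print(eliminated_dram_traffic([1 << 20, 1 << 20]))  # 4194304 bytes avoided
```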
However, it is difficult to fuse all the operators of a real neural network together. Among the reasons is the practical mismatch between the size of on-chip memory and the scale of data the neural network processes: the area overhead of an AI chip cannot be too large, so the area overhead of its on-chip memory is correspondingly limited, and the power consumption required by the on-chip memory must also remain within a reasonable range. These constraints limit the scale of data that can be stored on the AI chip. Consequently, if all the operators of a neural network were fused together, the scale of the fused operators' intermediate data would not match the scale of data the on-chip memory can actually hold.
In the related art, operator fusion falls mainly into two categories: manual fusion and automatic fusion.
(1) Manual fusion: the operators to be fused are identified and fused together into a new custom operator, and the corresponding fusion rules must be registered in the framework. However, manual fusion only supports the fusion of specific operators in specific topology scenarios, and the manual adaptation workload is large.
(2) Automatic fusion: fusion between specific operators, such as fusing a convolution operator with a ReLU operator, or fusion between operators of a specific type, such as element-wise operators. The compiler can automatically fuse the corresponding operators into one operator during graph optimization. In practice, however, fusing all operators together turns out not to give the best performance.
On this basis, an embodiment of the present application provides an operator fusion method. On the one hand, the computational graph is divided into multiple subgraphs to be fused, and operators are fused on the basis of these subgraphs; on the other hand, the hardware information of the AI chip is considered during fusion, so that after fusion not only is data transmission reduced, but the scale of the fused operators' intermediate data also matches the scale of data the on-chip memory actually stores, and the power consumption required by the on-chip memory is kept within a reasonable range. This design optimizes the neural network, improves execution efficiency, and improves the performance of network training and inference.
The embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 2 is a flowchart of an exemplary operator fusion method for a neural network proposed by the present application. The method is applied to a fusion device, which may be, for example, an electronic device or a server. The neural network model may be stored on the set chip or in the electronic device.
Taking as an example the case where the neural network model is stored on the set chip, the fusion device is an electronic device, and the computing device running the neural network model is the set chip, the method includes at least the following steps:
S200: the set chip sends the neural network model to the electronic device.
The set chip may be an AI chip.
S201: the electronic device obtains the neural network model and determines the computational graph corresponding to the neural network model.
Widely used neural network models include the BP neural network, the Hopfield network, the ART network and the Kohonen network. In the embodiments of the present application, the obtained neural network model may be any one of these, or another neural network model. The computational graph describes the connection relationships between multiple operators in the neural network model, each of which performs a computing operation.
In a specific example, referring to FIG. 3, the computational graph includes 2 operators: operator 1 includes 3 computing operations, and operator 2 includes 3 computing operations.
If the obtained neural network model is a commonly used model, its computational graph can be looked up directly among the computational graphs corresponding to commonly used models. If the obtained neural network model is a custom model derived by modifying a commonly used model, its computational graph can be obtained by applying a method that abstracts a neural network model into a computational graph.
S202: the electronic device determines at least two subgraphs to be fused in the computational graph.
In the embodiments of the present application, unlike the related art, in which operators are fused directly operator by operator according to operator type, the computational graph is divided into at least two subgraphs to be fused, where each of the at least two subgraphs to be fused includes at least one operator.
Exemplarily, the at least two subgraphs to be fused in the computational graph may be determined in the following ways (a sketch of way (1) follows the list):
(1) By the number of operators:
For example, 10 connected operators are taken as one subgraph to be fused.
(2) By the types of common computing operations:
For example, the computing operation that takes a convolution-type computing operation's result as input is usually of the add type, so the operators to which the convolution-type and add-type computing operations belong can be placed in the same subgraph to be fused.
(3) By the inputs and outputs and by the amount of data the computing operations can process.
In the embodiments of the present application, the computational graph may be divided into at least two subgraphs to be fused in any of, but not limited to, the above three ways.
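A sketch of way (1), under the assumption that the operators are available in topological order (the function name, group size and list representation are illustrative only):

```python
def partition_by_operator_count(operators_in_topo_order, group_size=10):
    # Cut the operator sequence into candidate subgraphs of a fixed size
    # (10 connected operators, per the example above). Ways (2) and (3)
    # would instead cut on computing-operation types or on data volume.
    return [operators_in_topo_order[i:i + group_size]
            for i in range(0, len(operators_in_topo_order), group_size)]

print(partition_by_operator_count(list(range(25))))  # three candidate subgraphs
```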
S203: when a first subgraph to be fused meets the fusion condition, the electronic device fuses the at least two operators included in the first subgraph to be fused into one operator.
The first subgraph to be fused is any one of the at least two subgraphs to be fused. The fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold. This utilization rate is related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused.
If the first subgraph to be fused includes only one operator, it is a single-operator subgraph and operator fusion cannot be performed. Therefore, one part of the fusion condition is that the first subgraph to be fused includes at least two operators; the other part is that the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold.
The utilization rate threshold is preset, for example to 0.8. The utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is the ratio of that resource amount to the maximum resource amount of the set chip, and the maximum resource amount of the set chip can be obtained from, or calculated from, the chip's factory specifications.
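A numeric illustration of this check, with hypothetical figures (none of these values come from the application):

```python
required = 90e9   # resources the first subgraph to be fused needs on the set chip, e.g. 90 GB/s
chip_max = 100e9  # maximum resource amount from the set chip's factory specifications
threshold = 0.8   # the preset utilization rate threshold

utilization = required / chip_max  # 0.9
print(utilization > threshold)     # True: the subgraph's operators may be fused
```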
S204: the electronic device sends the neural network model obtained after operator fusion to the set chip.
Since the neural network model runs on the set chip, the neural network model obtained after operator fusion is sent to the set chip.
S205: the set chip obtains a neural network model running instruction and runs the operator-fused neural network model.
The neural network model running instruction may be used for neural network model training or for neural network model inference. After obtaining the running instruction for the neural network model, the set chip can process the training data or the inference data.
Based on the above embodiments, the fusion condition includes that the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold. To calculate this utilization rate, the amount of resources required for the first subgraph to be fused to run on the set chip can be calculated first.
Referring to FIG. 4, the process of calculating the amount of resources required for the first subgraph to be fused to run on the set chip is described below.
The amount of resources includes computing power and/or bandwidth.
The amount of resources required for the first subgraph to be fused to run on the set chip is determined according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
The memory size of the set chip usually refers to the on-chip memory. The memory reuse capability supported by the operation instructions of the set chip usually covers two cases: reusable and non-reusable. If the computational graph takes the form of a tree, a computing operation can be represented by a computing node, and the total number of computing operations included in the at least two operators of the first subgraph to be fused is then the total number of computing nodes.
S400: the electronic device receives hardware information from the set chip.
The hardware information includes the memory reuse capability supported by the operation instructions of the set chip and the memory size of the set chip.
S401,电子设备在设定芯片的运算指令所支持的内存复用能力为可复用,且第一待融合子图中至少两个算子所包括的计算操作产生的数据类型相同时,确定运行第一待融合子图所需的内存份数。S401, the electronic device determines the number of memory copies required to run the first sub-graph to be fused when the memory reuse capability supported by the chip's operation instructions is set to be reusable and the data types generated by the calculation operations included in at least two operators in the first sub-graph to be fused are the same.
其中,内存份数为第一待融合子图中至少两个算子所包括的计算操作的总数量与第一待融合子图中至少两个算子所包括的输出型计算操作的数量的差。输出型计算操作满足以输出型计算操作的运算结果作为输入的计算操作的数量小于第二数量阈值。例如,数量阈值为2,也就是说,以输出型计算操作的运算结果作为输入的计算操作的数量为0或者1。The number of memory copies is the difference between the total number of computing operations included in at least two operators in the first subgraph to be fused and the number of output computing operations included in at least two operators in the first subgraph to be fused. The output computing operation satisfies that the number of computing operations that use the computing results of the output computing operations as input is less than the second quantity threshold. For example, the quantity threshold is 2, that is, the number of computing operations that use the computing results of the output computing operations as input is 0 or 1.
通常情况下,第一待融合子图中至少两个算子所包括的计算操作产生的数据类型相同时,比如均为int8类型,设定芯片的运算指令所支持的内存复用能力为可用的情况是有效的。也就是说,如果第一待融合子图中至少两个算子所包括的计算操作产生的数据类型不相同,也无法利用设定芯片的运算指令所支持的内存复用能力。Generally, when the data types generated by the calculation operations included in at least two operators in the first subgraph to be fused are the same, for example, both are int8 types, it is effective to set the memory reuse capability supported by the chip's operation instructions to be available. In other words, if the data types generated by the calculation operations included in at least two operators in the first subgraph to be fused are different, the memory reuse capability supported by the chip's operation instructions cannot be used.
在这种情况中,确定第一待融合子图中至少两个算子所包括的计算操作的总数量p,第一待融合子图中至少两个算子所包括的输出型计算操作的数量q。以输出型计算操作的运算结果作为输入的计算操作的数量为0或者1,则计算操作所在的节点可复用,因此,差值p-q为运行第一待融合子图中至少两个算子所需的内存分数F。In this case, the total number p of computing operations included in the at least two operators in the first subgraph to be fused and the number q of output computing operations included in the at least two operators in the first subgraph to be fused are determined. If the number of computing operations with the operation results of the output computing operations as input is 0 or 1, the nodes where the computing operations are located can be reused, so the difference p-q is the memory fraction F required to run the at least two operators in the first subgraph to be fused.
In a concrete example, take the computing operation a = b + c. The total number of computing operations included in the at least two operators of the first subgraph to be fused is 3; memory share 1 stores data b, memory share 2 stores data c, and the computation type of the operation is summation. In the reusable case, data c can be added onto data b in place to form the sum a; that is, data b and data a can share one memory share, and no additional share needs to be set aside to store a. In this example, the required number of memory shares is therefore 3 - 1 = 2.
If the number of computing operations that take an output-type computing operation's result as input is greater than 1, the node holding that computing operation cannot be reused, and p is then the number of memory shares required to run the at least two operators of the first subgraph to be fused. Likewise, if the memory reuse capability supported by the designated chip's operation instructions is non-reusable, the total number p of computing operations included in the at least two operators of the first subgraph to be fused is determined as the number of memory shares required to run them.
In another concrete example, again with a = b + c, the total number p of computing operations included in the at least two operators of the first subgraph to be fused is 3; memory share 1 stores data b, memory share 2 stores data c, and the computation type is summation. In the non-reusable case, data b, data c, and the sum a each require their own memory share, so the number of memory shares required to run the operators of the first subgraph to be fused is also 3.
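The two cases above can be condensed into one hedged sketch that reuses the OpNode/Subgraph structures defined earlier. The rule f = p - q (reusable, uniform data type) versus f = p (otherwise) and the threshold of 2 follow the text; identifying output-type operations purely by consumer count is an assumption read off the definition above:

```python
SECOND_QUANTITY_THRESHOLD = 2  # per the example above: 0 or 1 downstream consumers

def memory_shares(sg: Subgraph, instructions_support_reuse: bool) -> int:
    p = sg.total_ops()
    same_dtype = len({n.dtype for n in sg.nodes}) == 1
    if not instructions_support_reuse or not same_dtype:
        return p  # reuse cannot be exploited: one memory share per operation
    # q: output-type computing operations, i.e. operations whose results are
    # consumed by fewer than SECOND_QUANTITY_THRESHOLD downstream operations.
    q = sum(1 for n in sg.nodes if len(n.consumers) < SECOND_QUANTITY_THRESHOLD)
    return p - q  # f = p - q
```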
Refer to Figure 5, a schematic diagram of the first subgraph to be fused. Here, 51 is an output-type computing operation whose result is taken as input by 0 computing operations; 52 is an output-type computing operation whose result is taken as input by 1 computing operation; and 53 is likewise an output-type computing operation whose result is taken as input by 1 computing operation.
In addition, Figure 5 includes addition, subtraction, multiplication, division, square-root, back-propagation, and data-conversion computing operations, among others. The form of each computing operation is illustrated below with an example:
An addition computing operation is, for example, add_14float32[1,2,1,1,16], where add is the computation type, 14 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A subtraction computing operation is, for example, sub_3float32[1,2,1,1,16], where sub is the computation type, 3 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A multiplication computing operation is, for example, mul_1float32[1,2,1,1,16], where mul is the computation type, 1 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A division computing operation is, for example, div_6float32[1,2,1,1,16], where div is the computation type, 6 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A square-root computing operation is, for example, sqrt_5float32[1,2,1,1,16], where sqrt is the computation type, 5 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A back-propagation computing operation is, for example, broadcast_tenser_1float32[256,2,112,112,16], where broadcast_tenser is the computation type, 1 is the array dimension, float32 is the data type, and [256,2,112,112,16] is the data tensor of the computing operation.
A data-conversion computing operation is, for example, cast_0float32[256,2,112,112,16], where cast is the computation type, 0 is the array dimension, float32 is the data type, and [256,2,112,112,16] is the data tensor of the computing operation.
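As a reading aid only, the descriptor grammar can be inferred from the examples above as "&lt;type&gt;_&lt;array dimension&gt;&lt;dtype&gt;[&lt;data tensor&gt;]"; this layout is an assumption, not defined by the patent. A small parser splits such strings into their four fields:

```python
import re

# Assumed layout: "<type>_<array dimension><dtype>[<data tensor>]".
DESCRIPTOR = re.compile(r"^([a-z_]+?)_(\d+)([a-z]+\d+)\[([\d,]+)\]$")

def parse_op_descriptor(desc: str) -> dict:
    m = DESCRIPTOR.match(desc)
    if m is None:
        raise ValueError(f"unrecognized descriptor: {desc}")
    op_type, dim, dtype, tensor = m.groups()
    return {
        "type": op_type,                               # computation type, e.g. "add"
        "dimension": int(dim),                         # array dimension field
        "dtype": dtype,                                # data type, e.g. "float32"
        "tensor": [int(x) for x in tensor.split(",")]  # data tensor
    }

print(parse_op_descriptor("add_14float32[1,2,1,1,16]"))
# -> {'type': 'add', 'dimension': 14, 'dtype': 'float32', 'tensor': [1, 2, 1, 1, 16]}
```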
S402: The electronic device determines, according to the memory size of the designated chip, the number of memory shares required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused when run on the designated chip.
In some possible designs, different computing operations usually have different running rates. Taking a third computing operation as an example, the calculation of its running rate is described below; the third computing operation is any one of the computing operations included in the at least two operators of the first subgraph to be fused.
(1) The ratio of the memory size of the designated chip to the number of memory shares required to run the operators of the first subgraph to be fused is determined as the memory size required for the third computing operation to run on the chip.
One memory share stores the result of one computing operation; therefore, dividing the memory size M of the designated chip by the number of memory shares f required to run the operators of the first subgraph to be fused gives the memory size required for the third computing operation to run on the chip.
Thus the memory size required for the third computing operation to run on the designated chip satisfies m = M/f, where the third computing operation is any computing operation included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the designated chip, M is the memory size of the designated chip, and f is the number of memory shares required to run the first subgraph to be fused.
(2) The ratio of the memory size required for the third computing operation to run on the chip to the duration required to execute the third computing operation is determined as the running rate corresponding to the third computing operation.
Since different computing operations take different durations to execute, if there are S computing operations, S running rates are determined. For the third computing operation, for example, the ratio of the memory size it requires to run on the chip to the duration required to execute it can be determined as its running rate.
For example, the running rate of the third computing operation satisfies V = m/t, where V is the running rate of the third computing operation and t is the duration required to execute it.
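Steps (1) and (2) amount to two one-line formulas. A sketch follows, where M, f, and t are assumed inputs in consistent units (for example, bytes and seconds):

```python
def op_memory(M: float, f: int) -> float:
    """Memory available to one computing operation on the chip: m = M / f."""
    return M / f

def op_running_rate(M: float, f: int, t: float) -> float:
    """Running rate of a computing operation: V = m / t."""
    return op_memory(M, f) / t

# e.g. a 2 MiB on-chip memory, f = 2 memory shares, an operation taking 1 microsecond:
print(op_running_rate(2 * 1024 * 1024, 2, 1e-6))  # ~1.05e12 (bytes per second)
```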
S403: The electronic device performs unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused, obtaining the amount of resources required for each such computing operation to run on the designated chip.
Here, the amount of resources required for a fourth computing operation to run on the designated chip includes bandwidth; the fourth computing operation is any one of the computing operations included in the at least two operators of the first subgraph to be fused. According to the conversion relationship between running rate and bandwidth, the running rate of the fourth computing operation is converted into the bandwidth it requires to run on the designated chip.
For example, the conversion relationship between running rate and bandwidth is BW = h*k, where k is the running rate and BW is the bandwidth.
Alternatively, the amount of resources required for a fifth computing operation to run on the designated chip includes computing power; the fifth computing operation is any one of the computing operations included in the at least two operators of the first subgraph to be fused. According to the conversion relationship between running rate and computing power, the running rate of the fifth computing operation is converted into the computing power it requires to run on the designated chip.
For example, the conversion relationship between running rate and computing power is F = s*k, where k is the running rate and F is the computing power.
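Both conversions are linear in the running rate. A sketch, where the coefficients h and s are assumed chip-specific constants supplied by the platform (the text does not define them beyond the two formulas):

```python
def rate_to_bandwidth(k: float, h: float) -> float:
    """Convert running rate k to required bandwidth: BW = h * k."""
    return h * k

def rate_to_compute_power(k: float, s: float) -> float:
    """Convert running rate k to required computing power: F = s * k."""
    return s * k
```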
S404: The electronic device takes the minimum amount of resources among those required for each computing operation included in the first subgraph to be fused to run on the designated chip as the amount of resources required for the first subgraph to be fused to run on the designated chip.
Since only one computing operation executes at any given moment, the minimum of the amounts of resources required by the individual computing operations to run on the designated chip is the amount of resources required for the first subgraph to be fused to run on the designated chip. With this design, the bandwidth BW0 and computing power F0 required for the first subgraph to be fused to run on the designated chip are obtained.
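Putting S403 and S404 together, a sketch (rates is the list of per-operation running rates from S402; h and s are the assumed conversion coefficients from the previous sketch):

```python
from typing import Iterable, Tuple

def subgraph_resources(rates: Iterable[float], h: float, s: float) -> Tuple[float, float]:
    rates = list(rates)
    # Only one computing operation runs at a time, so the subgraph's
    # requirement is the minimum over its operations.
    bw0 = min(rate_to_bandwidth(k, h) for k in rates)
    f0 = min(rate_to_compute_power(k, s) for k in rates)
    return bw0, f0  # bandwidth BW0 and computing power F0
```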
In another possible design, if the first subgraph to be fused does not meet the fusion condition, this indicates that, if the at least two operators included in the first subgraph to be fused were fused, the designated chip's performance would not reach a favorable state when the resulting neural network model runs on it. In that case, the first subgraph to be fused is split into a second subgraph to be fused and a third subgraph to be fused, and the fusion condition is then applied to each of them in turn. When the second subgraph to be fused meets the fusion condition, the at least two operators it includes are fused into one operator; when the third subgraph to be fused meets the fusion condition, the at least two operators it includes are fused into one operator. For the fusion condition and the fusion process, refer to the description of the first subgraph to be fused, which is not repeated here.
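The overall flow can be summarized as a recursive fuse-or-split driver. This is a sketch under stated assumptions: meets_fusion_condition and split are supplied callables standing in for the utilization check and the reverse-search split described below, and the actual fusion of each returned subgraph's operators is left to the caller:

```python
from typing import Callable, List, Tuple

def fuse_or_split(
    sg: Subgraph,
    meets_fusion_condition: Callable[[Subgraph], bool],
    split: Callable[[Subgraph], Tuple[Subgraph, Subgraph]],
) -> List[Subgraph]:
    """Return the list of subgraphs whose operators should each be fused."""
    if sg.total_ops() < 2:
        return [sg]  # too small to fuse (the condition requires at least two operators)
    if meets_fusion_condition(sg):
        return [sg]  # fuse all of this subgraph's operators into one operator
    second, third = split(sg)  # second and third subgraphs to be fused
    return (fuse_or_split(second, meets_fusion_condition, split)
            + fuse_or_split(third, meets_fusion_condition, split))
```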
Referring to Figure 6, a schematic diagram of an operator fusion process is shown. In this example, the first subgraph to be fused includes four operators; fusing all four does not meet the fusion condition. The subgraph is then split, yielding a second subgraph to be fused that includes operator 1, operator 2, and operator 3, and a third subgraph to be fused that includes operator 4. The second subgraph to be fused meets the fusion condition, so its operator 1, operator 2, and operator 3 are fused. The third subgraph to be fused is a single-operator subgraph, so no operator fusion can be performed on it.
Refer to Figure 7, a schematic diagram of splitting the subgraph to be fused. During splitting, starting from a first output-type computing operation of the first subgraph to be fused (such as 71), the computing operations of the first subgraph to be fused are traversed by reverse search. The first output-type computing operation is any output-type computing operation of the first subgraph to be fused, and it satisfies the condition that the number of computing operations taking its operation result as input is less than a first quantity threshold.
The first quantity threshold may be 1, in which case the number of computing operations that take the first output-type computing operation's result as input is 0; that is, no computing operation takes the result of the first output-type computing operation as input.
When the type of a second computing operation reached during traversal meets the splitting condition, the first subgraph to be fused is split into the second subgraph to be fused and the third subgraph to be fused: the subgraph of the first subgraph to be fused rooted at the second computing operation becomes the second subgraph to be fused, and the remainder of the first subgraph to be fused becomes the third subgraph to be fused. The second computing operation meets the splitting condition when its computation type is a set type (for example, the back-propagation type) or when its computation type is the same as that of the computing operation that is its parent node.
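A sketch of the split-point search under these assumptions: the subgraph is a tree using the OpNode structure defined earlier, each node's single consumer is its parent node, and SET_TYPES is a hypothetical placeholder for the patent's "set type" (for example, the back-propagation type). This illustrates the stated rule, not the patent's authoritative algorithm:

```python
from typing import Optional

SET_TYPES = {"broadcast_tenser"}  # hypothetical placeholder for the set type

def find_second_op(first_output_op: OpNode) -> Optional[OpNode]:
    """Reverse search from a first output-type computing operation."""
    stack = list(first_output_op.inputs)
    while stack:
        op = stack.pop()
        parent = op.consumers[0] if op.consumers else None  # in a tree: one parent
        if op.op_type in SET_TYPES or (parent is not None
                                       and op.op_type == parent.op_type):
            # Splitting condition met: the subtree rooted at op becomes the
            # second subgraph to be fused; the remainder, the third.
            return op
        stack.extend(op.inputs)
    return None  # no split point found along this traversal
```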
Based on the same inventive concept as the method embodiments, an embodiment of this application further provides an operator fusion apparatus for a neural network, configured to perform the operator fusion method for a neural network in the method embodiments shown above; for the related features, refer to those method embodiments, which are not repeated here. As shown in Figure 8, the operator fusion apparatus for a neural network includes a transmission unit 801 and a processing unit 802.
The transmission unit 801 is configured to obtain a neural network model.

The processing unit 802 is configured to determine a computation graph corresponding to the neural network model, where the computation graph describes the connection relationships among multiple operators in the neural network model, and each of the multiple operators performs at least one computing operation.

The processing unit 802 is further configured to determine at least two subgraphs to be fused in the computation graph, where any one of the at least two subgraphs to be fused includes at least one operator.

The processing unit 802 is further configured to, when a first subgraph to be fused meets the fusion condition, fuse the at least two operators included in the first subgraph to be fused into one operator.

The first subgraph to be fused is any one of the at least two subgraphs to be fused.

The fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization of the amount of resources required for the first subgraph to be fused to run on the designated chip is greater than a utilization threshold, where that utilization is related to the memory size of the designated chip and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
In some exemplary implementations, the processing unit 802 is further configured to:

when the first subgraph to be fused does not meet the fusion condition, split the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused;

when the second subgraph to be fused meets the fusion condition, fuse the at least two operators included in the second subgraph to be fused into one operator; and/or,

when the third subgraph to be fused meets the fusion condition, fuse the at least two operators included in the third subgraph to be fused into one operator.
In some exemplary implementations, the processing unit 802 is specifically configured to:

starting from a first output-type computing operation of the first subgraph to be fused, traverse the computing operations of the first subgraph to be fused by reverse search, where the first output-type computing operation is any output-type computing operation of the first subgraph to be fused and satisfies that the number of computing operations taking its operation result as input is less than a first quantity threshold;

when a second computing operation currently reached in the traversal meets the splitting condition, take the subgraph of the first subgraph to be fused rooted at the second computing operation as the second subgraph to be fused, and take the subgraphs of the first subgraph to be fused other than the subgraph rooted at the second computing operation as the third subgraph to be fused;

where the second computing operation meets the splitting condition when its computation type is a set type or its computation type is the same as that of the computing operation that is its parent node.
In some exemplary implementations, the processing unit 802 is further configured to:

determine the amount of resources required for the first subgraph to be fused to run on the designated chip according to the memory size of the designated chip, the memory reuse capability supported by the designated chip's operation instructions, and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
In some exemplary implementations, the processing unit 802 is specifically configured to:

when the memory reuse capability supported by the designated chip's operation instructions is reusable and the computing operations included in the at least two operators of the first subgraph to be fused produce data of the same type, determine the number of memory shares required to run the first subgraph to be fused, the number of memory shares being the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output-type computing operations among them, where an output-type computing operation is one whose operation result is taken as input by fewer computing operations than a second quantity threshold;

determine, according to the memory size of the designated chip, the number of memory shares required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused when run on the designated chip;

perform unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused to obtain the amount of resources required for each such computing operation to run on the designated chip;

take the minimum amount of resources among those required for each computing operation included in the first subgraph to be fused to run on the designated chip as the amount of resources required for the first subgraph to be fused to run on the designated chip.
In some exemplary implementations, the processing unit 802 is specifically configured such that the memory size required for a third computing operation to run on the designated chip satisfies m = M/f;

where the third computing operation is any one of the computing operations included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the designated chip, M is the memory size of the designated chip, and f is the number of memory shares required to run the first subgraph to be fused;

the running rate of the third computing operation satisfies V = m/t;

where V is the running rate of the third computing operation and t is the duration required to execute the third computing operation.
In some exemplary implementations, the amount of resources includes bandwidth and/or computing power.

The amount of resources required for a fourth computing operation to run on the designated chip includes bandwidth, the fourth computing operation being any one of the computing operations included in the at least two operators of the first subgraph to be fused; the processing unit 802 is specifically configured to convert, according to the conversion relationship between running rate and bandwidth, the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the designated chip. Alternatively,

the amount of resources required for a fifth computing operation to run on the designated chip includes computing power, the fifth computing operation being any one of the computing operations included in the at least two operators of the first subgraph to be fused; the processing unit 802 is specifically configured to convert, according to the conversion relationship between running rate and computing power, the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the designated chip.
It should be noted that the division into modules in the embodiments of this application is schematic and is merely a division by logical function; other divisions are possible in actual implementations. The functional modules in the embodiments of this application may be integrated into one processing module, may each exist physically on their own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
The above methods may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented by software, they may be realized wholly or partly in the form of a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center containing one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
In a simple embodiment, those skilled in the art will appreciate that the electronic device or server in the embodiments may take the form shown in Figure 9.
The apparatus 900 shown in Figure 9 includes at least one processor 901 and a memory 902, and optionally may further include a communication interface 903.
The memory 902 may be a volatile memory, such as a random access memory; it may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 902 may be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 902 may also be a combination of the above memories.
The embodiments of this application do not limit the specific connection medium between the processor 901 and the memory 902.
The processor 901 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, an artificial intelligence chip, a system on a chip, or the like. A general-purpose processor may be a microprocessor or any conventional processor. In the apparatus of Figure 9, an independent data transceiver module, such as the communication interface 903, may also be provided for sending and receiving data; when communicating with other devices, the processor 901 may transmit data through the communication interface 903.
In one possible application scenario, the computing device takes the form shown in Figure 9, and the processor 901 in Figure 9 may invoke the computer-executable instructions stored in the memory 902 so that the computing device can perform the operator fusion method for a neural network in any of the above method embodiments.
Specifically, the functions/implementation processes of the transmission unit 801 and the processing unit 802 of Figure 8 may both be implemented by the processor 901 in Figure 9 invoking the computer-executable instructions stored in the memory 902. Alternatively, the functions/implementation processes of the processing unit of Figure 8 may be implemented by the processor 901 in Figure 9 invoking the computer-executable instructions stored in the memory 902, while the transmission functions/implementation processes of Figure 8 may be implemented through the communication interface 903 in Figure 9.
Those skilled in the art will appreciate that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and variations to this application without departing from its scope. Thus, if these modifications and variations of this application fall within the scope of the claims of this application and their equivalent technologies, this application is intended to encompass them as well.
Claims (17)
- An operator fusion method for a neural network, characterized by comprising:
obtaining a neural network model, and determining a computation graph corresponding to the neural network model, wherein the computation graph is used to describe connection relationships among multiple operators in the neural network model, and each of the multiple operators is used to perform at least one computing operation;
determining at least two subgraphs to be fused in the computation graph, wherein any one of the at least two subgraphs to be fused includes at least one operator; and
when a first subgraph to be fused meets a fusion condition, fusing at least two operators included in the first subgraph to be fused into one operator;
wherein the first subgraph to be fused is any one of the at least two subgraphs to be fused; and
the fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization of the amount of resources required for the first subgraph to be fused to run on a designated chip is greater than a utilization threshold, the utilization of the amount of resources required for the first subgraph to be fused to run on the designated chip being related to the memory size of the designated chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- The method according to claim 1, characterized in that the method further comprises:
when the first subgraph to be fused does not meet the fusion condition, splitting the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused;
when the second subgraph to be fused meets the fusion condition, fusing at least two operators included in the second subgraph to be fused into one operator; and/or,
when the third subgraph to be fused meets the fusion condition, fusing at least two operators included in the third subgraph to be fused into one operator.
- The method according to claim 2, characterized in that splitting the first subgraph to be fused into the second subgraph to be fused and the third subgraph to be fused comprises:
starting from a first output-type computing operation of the first subgraph to be fused, traversing the computing operations of the first subgraph to be fused by reverse search, wherein the first output-type computing operation is any output-type computing operation of the first subgraph to be fused and satisfies that the number of computing operations taking the operation result of the first output-type computing operation as input is less than a first quantity threshold;
when a second computing operation currently reached in the traversal meets a splitting condition, taking the subgraph of the first subgraph to be fused rooted at the second computing operation as the second subgraph to be fused, and taking the subgraphs of the first subgraph to be fused other than the subgraph rooted at the second computing operation as the third subgraph to be fused;
wherein the second computing operation meets the splitting condition when the computation type of the second computing operation is a set type or the computation type of the second computing operation is the same as that of the computing operation that is the parent node of the second computing operation.
- The method according to any one of claims 1 to 3, characterized in that the method further comprises:
determining the amount of resources required for the first subgraph to be fused to run on the designated chip according to the memory size of the designated chip, the memory reuse capability supported by the operation instructions of the designated chip, and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- The method according to claim 4, characterized in that determining the amount of resources required for the first subgraph to be fused to run on the designated chip according to the memory size of the designated chip, the memory reuse capability supported by the operation instructions of the designated chip, and the total number of computing operations included in the at least two operators in the first subgraph to be fused comprises:
when the memory reuse capability supported by the operation instructions of the designated chip is reusable and the computing operations included in the at least two operators in the first subgraph to be fused produce data of the same type, determining the number of memory shares required to run the first subgraph to be fused, the number of memory shares being the difference between the total number of computing operations included in the at least two operators in the first subgraph to be fused and the number of output-type computing operations included in the at least two operators in the first subgraph to be fused, wherein an output-type computing operation satisfies that the number of computing operations taking the operation result of the output-type computing operation as input is less than a second quantity threshold;
determining, according to the memory size of the designated chip, the number of memory shares required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused when the designated chip runs the first subgraph to be fused;
performing unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused to obtain the amount of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip; and
taking the minimum amount of resources among the amounts of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip as the amount of resources required for the first subgraph to be fused to run on the designated chip.
- The method according to claim 5, characterized in that the memory size required for a third computing operation to run on the designated chip satisfies m = M/f;
wherein the third computing operation is any one of the computing operations included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the designated chip, M is the memory size of the designated chip, and f is the number of memory shares required to run the first subgraph to be fused;
the running rate of the third computing operation satisfies V = m/t;
wherein V is the running rate of the third computing operation, and t is the duration required to execute the third computing operation.
- The method according to claim 5 or 6, characterized in that the amount of resources includes bandwidth and/or computing power;
and performing unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused to obtain the amount of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip comprises:
the amount of resources required for a fourth computing operation to run on the designated chip including bandwidth, the fourth computing operation being any one of the computing operations included in the at least two operators in the first subgraph to be fused, converting, according to the conversion relationship between running rate and bandwidth, the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the designated chip; or,
the amount of resources required for a fifth computing operation to run on the designated chip including computing power, the fifth computing operation being any one of the computing operations included in the at least two operators in the first subgraph to be fused, converting, according to the conversion relationship between running rate and computing power, the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the designated chip.
- An operator fusion apparatus for a neural network, characterized by comprising:
a transmission unit, configured to obtain a neural network model; and
a processing unit, configured to determine a computation graph corresponding to the neural network model, wherein the computation graph is used to describe connection relationships among multiple operators in the neural network model, and each of the multiple operators is used to perform at least one computing operation;
the processing unit being further configured to determine at least two subgraphs to be fused in the computation graph, wherein any one of the at least two subgraphs to be fused includes at least one operator;
the processing unit being further configured to, when a first subgraph to be fused meets a fusion condition, fuse at least two operators included in the first subgraph to be fused into one operator;
wherein the first subgraph to be fused is any one of the at least two subgraphs to be fused; and
the fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization of the amount of resources required for the first subgraph to be fused to run on a designated chip is greater than a utilization threshold, the utilization of the amount of resources required for the first subgraph to be fused to run on the designated chip being related to the memory size of the designated chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- The operator fusion apparatus according to claim 8, characterized in that the processing unit is further configured to:
when the first subgraph to be fused does not meet the fusion condition, split the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused;
when the second subgraph to be fused meets the fusion condition, fuse at least two operators included in the second subgraph to be fused into one operator; and/or,
when the third subgraph to be fused meets the fusion condition, fuse at least two operators included in the third subgraph to be fused into one operator.
- The operator fusion apparatus according to claim 9, characterized in that the processing unit is specifically configured to:
starting from a first output-type computing operation of the first subgraph to be fused, traverse the computing operations of the first subgraph to be fused by reverse search, wherein the first output-type computing operation is any output-type computing operation of the first subgraph to be fused and satisfies that the number of computing operations taking the operation result of the first output-type computing operation as input is less than a first quantity threshold;
when a second computing operation currently reached in the traversal meets a splitting condition, take the subgraph of the first subgraph to be fused rooted at the second computing operation as the second subgraph to be fused, and take the subgraphs of the first subgraph to be fused other than the subgraph rooted at the second computing operation as the third subgraph to be fused;
wherein the second computing operation meets the splitting condition when the computation type of the second computing operation is a set type or the computation type of the second computing operation is the same as that of the computing operation that is the parent node of the second computing operation.
- The operator fusion apparatus according to any one of claims 8 to 10, characterized in that the processing unit is further configured to:
determine the amount of resources required for the first subgraph to be fused to run on the designated chip according to the memory size of the designated chip, the memory reuse capability supported by the operation instructions of the designated chip, and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- The operator fusion apparatus according to claim 11, characterized in that the processing unit is specifically configured to:
when the memory reuse capability supported by the operation instructions of the designated chip is reusable and the computing operations included in the at least two operators in the first subgraph to be fused produce data of the same type, determine the number of memory shares required to run the first subgraph to be fused, the number of memory shares being the difference between the total number of computing operations included in the at least two operators in the first subgraph to be fused and the number of output-type computing operations included in the at least two operators in the first subgraph to be fused, wherein the number of computing operations taking the operation result of an output-type computing operation as input is less than a second quantity threshold;
determine, according to the memory size of the designated chip, the number of memory shares required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused when the designated chip runs the first subgraph to be fused;
perform unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused to obtain the amount of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip; and
take the minimum amount of resources among the amounts of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip as the amount of resources required for the first subgraph to be fused to run on the designated chip.
- The operator fusion apparatus according to claim 12, characterized in that the memory size required for a third computing operation to run on the designated chip satisfies m = M/f;
wherein the third computing operation is any one of the computing operations included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the designated chip, M is the memory size of the designated chip, and f is the number of memory shares required to run the first subgraph to be fused;
the running rate of the third computing operation satisfies V = m/t;
wherein V is the running rate of the third computing operation, and t is the duration required to execute the third computing operation.
- The operator fusion apparatus according to claim 12 or 13, characterized in that the amount of resources includes bandwidth and/or computing power;
the amount of resources required for a fourth computing operation to run on the designated chip includes bandwidth, the fourth computing operation being any one of the computing operations included in the at least two operators in the first subgraph to be fused, and the processing unit is specifically configured to: convert, according to the conversion relationship between running rate and bandwidth, the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the designated chip; or,
the amount of resources required for a fifth computing operation to run on the designated chip includes computing power, the fifth computing operation being any one of the computing operations included in the at least two operators in the first subgraph to be fused, and the processing unit is specifically configured to: convert, according to the conversion relationship between running rate and computing power, the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the designated chip.
- An operator fusion device, comprising a processor and a memory, wherein the memory is configured to store computer program instructions, and the processor invokes the computer program instructions in the memory to perform the method according to any one of claims 1 to 7.
- A computer program product comprising instructions which, when run by a computing device, cause the computing device to perform the method according to any one of claims 1 to 7.
- A computer-readable storage medium comprising computer program instructions which, when executed by a computing device, cause the computing device to perform the method according to any one of claims 1 to 7.
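The memory-share counting rule in claim 12 above can be illustrated with a small sketch. All names here (`Op`, `count_memory_shares`, the default consumer-count threshold) are illustrative assumptions, not identifiers from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """One computing operation in a subgraph to be fused (illustrative)."""
    name: str
    consumers: list = field(default_factory=list)  # ops that read this op's result

def count_memory_shares(ops, second_threshold=1):
    """Memory shares f = total operations - output-type operations.

    Hedged reading of the claim: an operation is 'output-type' when the
    number of operations consuming its result is below `second_threshold`.
    """
    output_ops = [op for op in ops if len(op.consumers) < second_threshold]
    return len(ops) - len(output_ops)

# Tiny chain a -> b -> c, where only c's result leaves the subgraph.
a, b, c = Op("a"), Op("b"), Op("c")
a.consumers, b.consumers = [b], [c]      # c has no consumers inside the subgraph
print(count_memory_shares([a, b, c]))    # 3 ops - 1 output-type op -> 2 shares
```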
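The two formulas in claim 13 compose directly: each operation gets an equal slice m = M/f of the chip memory, and its running rate follows from how long the operation takes. A minimal numeric sketch, with made-up values for the chip memory, the share count, and the execution time:

```python
def per_op_memory(chip_memory_bytes: float, memory_shares: int) -> float:
    """m = M / f: the memory slice available to any single operation."""
    return chip_memory_bytes / memory_shares

def running_rate(m_bytes: float, t_seconds: float) -> float:
    """V = m / t: bytes handled per second by one operation."""
    return m_bytes / t_seconds

# Hypothetical numbers: 2 MiB of on-chip memory, f = 2 shares, t = 50 us.
m = per_op_memory(2 * 1024 * 1024, 2)  # -> 1 MiB per operation
V = running_rate(m, 50e-6)             # -> ~2.1e10 bytes/s
print(m, V)
```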
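Claim 14's unit conversion and claim 12's minimum rule can be read together as sketched below. The assumption that a running rate maps one-to-one to a bandwidth demand, and scales to computing power by a FLOPs-per-byte factor, is mine for illustration; the claims only state that a conversion relationship exists:

```python
def rate_to_bandwidth(rate_bytes_per_s: float) -> float:
    """Bandwidth view: the running rate is taken directly as a bandwidth demand."""
    return rate_bytes_per_s

def rate_to_compute(rate_bytes_per_s: float, flops_per_byte: float) -> float:
    """Computing-power view: scale the byte rate by an assumed FLOPs-per-byte factor."""
    return rate_bytes_per_s * flops_per_byte

# Per-operation running rates in bytes/s (hypothetical values).
rates = [2.1e10, 1.0e10, 3.5e10]

# Convert every operation's rate, then take the minimum over all operations
# as the whole subgraph's requirement (the minimum rule of claim 12).
subgraph_bandwidth = min(rate_to_bandwidth(r) for r in rates)   # -> 1.0e10 bytes/s
subgraph_compute = min(rate_to_compute(r, 4.0) for r in rates)  # -> 4.0e10 FLOP/s
print(subgraph_bandwidth, subgraph_compute)
```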
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211584001.6A CN118171683A (en) | 2022-12-09 | 2022-12-09 | Operator fusion method and related device for neural network |
CN202211584001.6 | 2022-12-09 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024120050A1 (en) | 2024-06-13 |
Family
ID=91347334
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/127261 WO2024120050A1 (en) | Operator fusion method used for neural network, and related apparatus | 2022-12-09 | 2023-10-27 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118171683A (en) |
WO (1) | WO2024120050A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110490309A (en) * | 2019-08-14 | 2019-11-22 | Beijing Zhongke Cambricon Technology Co., Ltd. | Operator fusion method for a neural network and related product |
CN111260019A (en) * | 2020-02-18 | 2020-06-09 | Shenzhen Corerain Technologies Co., Ltd. | Data processing method, device and equipment of neural network model and storage medium |
US20210182036A1 (en) * | 2019-12-12 | 2021-06-17 | Huawei Technologies Co., Ltd. | Hardware platform specific operator fusion in machine learning |
CN114239669A (en) * | 2021-04-14 | 2022-03-25 | Wuxi Jiangnan Institute of Computing Technology | Data multiplexing method based on operator fusion on heterogeneous many-core architecture |
CN114970814A (en) * | 2022-05-17 | 2022-08-30 | Beijing Lynxi Technology Co., Ltd. | Processing method and processing device of neural network computation graph |
- 2022-12-09: Application CN202211584001.6A filed in China; published as CN118171683A (status: pending)
- 2023-10-27: International application PCT/CN2023/127261 filed; published as WO2024120050A1 (status: unknown)
Also Published As
Publication number | Publication date |
---|---|
CN118171683A (en) | 2024-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11018979B2 (en) | System and method for network slicing for service-oriented networks | |
Lin et al. | | Modeling and optimization of performance and cost of serverless applications | |
US9244735B2 (en) | Managing resource allocation or configuration parameters of a model building component to build analytic models to increase the utility of data analysis applications | |
CN110046704B (en) | Deep network acceleration method, device, equipment and storage medium based on data stream | |
EP4268077A1 (en) | Methods, systems, articles of manufacture and apparatus to optimize resources in edge networks | |
US11025561B2 (en) | Systems and methods for computing infrastructure resource allocation | |
CN110537193A (en) | The quick calculating of convolutional neural networks | |
CN114915629A (en) | Information processing method, device, system, electronic equipment and storage medium | |
CN111522640A (en) | Parallel execution method and equipment of computational graph | |
CN116501503B (en) | Architecture mapping method and device for load task, computer equipment and medium | |
JP2023535168A (en) | Run-time environment determination for software containers | |
Kaya et al. | | Seamless computation offloading for mobile applications using an online learning algorithm | |
WO2023222047A1 (en) | Processing method and processing unit for neural network computing graph, and device and medium | |
WO2024120050A1 (en) | Operator fusion method used for neural network, and related apparatus | |
CN115469931B (en) | Instruction optimization method, device, system, equipment and medium of loop program | |
CN113535346A (en) | Method, device and equipment for adjusting number of threads and computer storage medium | |
US20230229514A1 (en) | Intelligent orchestration of classic-quantum computational graphs | |
CN116048759A (en) | Data processing method, device, computer and storage medium for data stream | |
Guo et al. | | Hierarchical design space exploration for distributed CNN inference at the edge | |
CN116701091A (en) | Method, electronic device and computer program product for deriving logs | |
CN118034695A (en) | Calculation map compiling method, compiling device, calculating device and storage medium | |
CN115361332A (en) | Processing method and device for fault-tolerant routing, processor and electronic equipment | |
Qiu et al. | | Virtual network function deployment algorithm based on graph convolution deep reinforcement learning | |
Liang et al. | | HPA: hierarchical placement algorithm for multi-cloud microservices applications | |
US11252061B1 (en) | Distributed computation of graph embeddings for network traffic in virtual networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23899631; Country of ref document: EP; Kind code of ref document: A1 |