WO2024120050A1 - Operator fusion method used for neural network, and related apparatus - Google Patents
- Publication number: WO2024120050A1
- PCT application: PCT/CN2023/127261 (CN2023127261W)
- Authority: WIPO (PCT)
- Prior art keywords: fused, subgraph, computing, computing operation, graph
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
Definitions
- the present application relates to the field of artificial intelligence technology, and in particular to an operator fusion method and related devices for neural networks.
- By defining an artificial intelligence (AI) model and solving for the model's parameters (a process that can be called model training), a unique piece of computational logic is determined. After conversion, this computational logic can be applied to inference calculations (which can also be called model inference or model use). This computational logic can be represented by a graph: the computational graph.
- a computational graph usually contains many operators. During model training or inference, each operator incurs memory-access overhead, scheduling overhead, and execution overhead, so too many operators in the computational graph seriously degrade execution efficiency. The number of operators can usually be reduced by operator fusion.
- the present application provides an operator fusion method and related devices for a neural network, which improve the chip's resource utilization and the computing speed of the neural network model when the model runs on the chip, thereby improving the performance of the neural network in training or inference.
- an embodiment of the present application provides an operator fusion method for a neural network, in which:
- the computational graph is used to describe the connection relationship between multiple operators in the neural network model, and each of the multiple operators is used to perform at least one computing operation.
- determine at least two subgraphs to be fused in the computational graph, where each of the at least two subgraphs to be fused includes at least one operator.
- when the first subgraph to be fused satisfies a fusion condition, the at least two operators included in the first subgraph to be fused are fused into one operator; the first subgraph to be fused is any one of the at least two subgraphs to be fused.
- the utilization rate of the resources required by the first subgraph to be fused is related to the memory size of the set chip and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
- the computational graph is first divided into at least two subgraphs to be fused (any subgraph to be fused can be referred to as the first subgraph to be fused).
- when the first subgraph to be fused includes at least two operators and the utilization rate of the resources required to run it on the set chip is greater than the utilization rate threshold, the at least two operators included in the first subgraph to be fused are fused into one operator.
- since the utilization rate of the resources required to run the first subgraph to be fused on the set chip is related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused, the fusion process takes the memory of the set chip into account. This improves the chip's resource utilization and the computing speed of the neural network model when the model runs on the chip, and thus improves the performance of the neural network in training or inference.
- the operator fusion method for a neural network also includes: when the first subgraph to be fused does not meet the fusion condition, splitting the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused; when the second subgraph to be fused meets the fusion condition, fusing at least two operators included in the second subgraph to be fused into one operator; and/or, when the third subgraph to be fused meets the fusion condition, fusing at least two operators included in the third subgraph to be fused into one operator.
- when the first subgraph to be fused does not meet the fusion condition, this means that if the at least two operators included in the first subgraph to be fused were fused, the resulting neural network model would perform poorly when run on the chip and resource utilization would be low.
- in that case, the first subgraph to be fused is split into the second subgraph to be fused and the third subgraph to be fused, and the at least two operators included in the second subgraph to be fused and the at least two operators included in the third subgraph to be fused are each fused in combination with the fusion condition. This improves the resource utilization of the neural network model obtained after the chip executes the operator fusion, thereby improving the computing speed of the neural network model.
- splitting the first subgraph to be fused into the second subgraph to be fused and the third subgraph to be fused includes: starting from a first output-type computing operation of the first subgraph to be fused, traversing each computing operation of the first subgraph to be fused by reverse search, where the first output-type computing operation is any output-type computing operation of the first subgraph to be fused that satisfies the condition that the number of computing operations taking its operation result as input is less than a first quantity threshold; and, when the currently traversed second computing operation meets the splitting condition, taking the subgraph of the first subgraph to be fused that has the second computing operation as its root node as the second subgraph to be fused, and taking the part of the first subgraph to be fused other than that subgraph as the third subgraph to be fused. The second computing operation meets the splitting condition if its computing type is a set type, or if its computing type is the same as the computing type of the computing operation at its parent node.
- in this way, the types of the various computing operations included in the first subgraph to be fused are considered: when the computing type of the second computing operation is a set type, or is the same as the computing type of the computing operation at its parent node, the split is made at that operation, and the second subgraph to be fused and the third subgraph to be fused are obtained.
- the operator fusion method for a neural network further includes: determining the amount of resources required for the first subgraph to be fused to run on a set chip based on the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in at least two operators in the first subgraph to be fused.
- in this way, the utilization rate of the resources required for the first subgraph to be fused to run on the set chip is calculated in a manner that fully considers the performance of the chip, which improves the chip's resource utilization and thus the running speed of the neural network model.
- determining the amount of resources required for the first subgraph to be fused to run on the set chip based on the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in the at least two operators of the first subgraph to be fused includes: when the data types generated by the computing operations included in the at least two operators are the same, determining the number of memory shares required to run the first subgraph to be fused, the number of memory shares being the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output computing operations included in those operators, where an output computing operation is one whose computing result is used as input by fewer than a second quantity threshold of computing operations; and then determining, according to the memory size of the set chip, the number of memory shares required to run the first subgraph to be fused, and the running time corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation.
- the memory reuse capability supported by the operation instructions of the set chip is either reusable or non-reusable. Since the data types generated by each computing operation are the same, reuse is effective. Therefore, in the reusable case, the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output computing operations included in those operators is determined as the number of memory shares required to run the at least two operators of the first subgraph to be fused.
- the running rate of a computing operation can be determined based on the number of memory shares and the time required for the computing operation, and the amount of resources required for each computing operation to run on the set chip is then determined from that rate.
- only one computing operation can be executed at a time, so the minimum amount of resources among the computing operations is taken as the amount of resources required for the first subgraph to be fused to run on the set chip.
- the memory size required for the third computing operation to run on the chip is obtained, and then divided by the time required to execute the third computing operation to obtain the running rate of the third computing operation, so as to determine the amount of resources required for each computing operation to run on the set chip according to the running rate.
- the amount of resources includes bandwidth and/or computing power. Converting each running rate into the amount of resources required for the corresponding computing operation to run on the set chip includes: when the amount of resources required for the fourth computing operation to run on the set chip includes bandwidth, where the fourth computing operation is any computing operation included in the at least two operators of the first subgraph to be fused, converting the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the set chip according to the conversion relationship between running rate and bandwidth; or, when the amount of resources required for the fifth computing operation to run on the set chip includes computing power, where the fifth computing operation is any computing operation included in the at least two operators of the first subgraph to be fused, converting the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the set chip according to the conversion relationship between running rate and computing power.
- in this way, the bandwidth and/or computing power required for a computing operation to run on the set chip can be determined according to the respective conversion relationships, and the resource requirement of the whole subgraph to be fused can then be derived from these per-operation requirements.
- an embodiment of the present application provides an operator fusion device for a neural network, the device comprising:
- a transmission unit used to obtain a neural network model
- a processing unit used to determine a computational graph corresponding to the neural network model, where the computational graph is used to describe a connection relationship between multiple operators in the neural network model; each of the multiple operators is used to perform at least one computational operation;
- the processing unit is further used to determine at least two subgraphs to be fused in the computation graph, wherein any of the at least two subgraphs to be fused includes at least one operator;
- the processing unit is further configured to, when the first subgraph to be fused satisfies a fusion condition, fuse at least two operators included in the first subgraph to be fused into one operator;
- the first sub-graph to be fused is any one of the at least two sub-graphs to be fused
- the fusion conditions include: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is related to the memory size of the set chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- an embodiment of the present application provides a computer device, including a processor and a memory;
- a memory for storing computer program instructions
- the processor executes the computer program instructions in the memory to perform the method provided in any one of the aforementioned aspects or any possible implementation manner of any one of the aspects.
- an embodiment of the present application further provides a computer-readable storage medium in which a software program is stored; when the software program is read and executed by one or more processors, it can implement the method provided by any design of any of the foregoing aspects.
- the present application provides a computer program product including computer instructions which, when executed by a computing device, cause the computing device to perform the method provided in any one of the aforementioned aspects or in any possible implementation of any one of the aspects.
- the computer program product may be a software installation package, and when it is necessary to use the method provided in any one of the aforementioned aspects or in any possible implementation of any one of the aspects, the computer program product may be downloaded and executed on a computing device.
- the present application also provides a computer chip, which is connected to a memory, and the chip is used to read and execute a software program stored in the memory to execute a method provided in any of the foregoing aspects or any possible implementation of any of the aspects.
- FIG1 is a schematic diagram of a calculation graph provided in an embodiment of the present application.
- FIG2 is a flow chart of an operator fusion method for a neural network provided in an embodiment of the present application.
- FIG3 is a flow chart of an operator fusion method for a neural network provided in an embodiment of the present application.
- FIG4 is a flow chart of a method for calculating the amount of resources required for a first subgraph to be fused to run on a set chip provided in an embodiment of the present application;
- FIG5 is a schematic diagram of the structure of a subgraph to be fused provided in an embodiment of the present application.
- FIG6 is a schematic diagram of an operator fusion process provided in an embodiment of the present application.
- FIG7 is a schematic diagram of splitting operators to be fused provided in an embodiment of the present application.
- FIG8 is a schematic diagram of the structure of an operator fusion device for a neural network provided in an embodiment of the present application.
- FIG9 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.
- graphics card memory is a component used to store graphics information to be processed. It stores rendering data that has been processed by, or is about to be fetched by, the graphics card chip. In the scenario of neural network model training, it stores training samples; in the scenario of neural network model inference or application, it stores the data to be processed.
- the memory reuse capability supported by the chip's computing instructions describes the reuse of instruction memory and mainly includes two types: instruction memory reusable and instruction memory non-reusable.
- parallelism refers to the maximum number of instructions or data executed in parallel.
- instruction parallelism In the instruction pipeline, executing multiple instructions at the same time is called instruction parallelism.
- bandwidth refers to the amount of data transmitted per unit time by the operators included in the computational graph.
- bandwidth represents the parallelism of data transfer. Specifically, it refers to the amount of data transferred from memory 1 to memory 2 per unit time. For example, if data is transferred from a hard disk to a memory stick or from a memory stick to a hard disk, bandwidth refers to the data transfer rate between the hard disk and the memory stick.
- computing power refers to the number of floating-point operations per second (FLOPS) that the operators included in the computational graph can perform.
- computing power represents the degree of parallelism of data calculation.
- Computing power refers to the number of floating-point operations that an AI chip can complete per second, and is an indicator of hardware computing speed. It is often used to estimate the execution performance of computers, especially in the field of scientific computing that uses a large number of floating-point operations.
- the computational graph is presented as a directed graph, which defines the flow of data, the calculation of data, and the interdependence between various calculations.
- Figure 1 is a schematic diagram of the structure of a computational graph provided by the present application.
- the computational graph of the AI model consists of operators and edges. Among them, each operator is used to perform at least one computational operation. If the computational graph is in the form of a tree graph, a computational operation can be represented by a computational node, and each computational operation is used to represent the applied mathematical operation (operation).
- the starting point of the data input can be used as an input node, and the end point of the data output or the end point of reading/writing a persistent variable can be used as an output node.
- the operator is the basic computing unit of the AI model.
- the edge is used to represent the input/output relationship between operators or nodes, and the edge can transmit a multidimensional data array whose size can be dynamically adjusted, wherein the multidimensional data array whose size can be dynamically adjusted is a tensor.
- the data structure of the tensor can be used to represent the data in the model, that is, a tensor can correspond to an n-dimensional array or list, where n is an integer greater than or equal to zero.
- Tensors have two attributes: dimension and rank.
- tensors can circulate between operators in the computational graph.
- neural network compilers, or deep learning compilers, are specialized compilers in the field of machine learning. They are used to deploy neural network training or inference on AI chips.
- an operator may also be referred to as a computing task, an operation (OP), a computing layer, etc.
- a data dimension may also be referred to as a dimension, a shape, etc.
- the AI model in the embodiment of the present application may be a neural network model, etc.
- the AI model can be independently deployed on one or more computing devices.
- each computing device runs all functional operators of the AI model, and each computing device can independently execute the functions of the AI model.
- it can be combined with a load balancing module to evenly share the request volume among multiple computing devices.
- disaster recovery can be achieved.
- the AI models on other computing devices can continue to provide services as usual.
- the various functional operators of an AI model can be distributed and deployed on multiple computing devices, and the multiple computing devices can collaboratively run the AI model according to the data dependency relationship of the AI model.
- the computing device can be an AI chip (for example, a central processing unit (CPU), graphics processing unit (GPU), embedded neural network processing unit (NPU), field programmable gate array (FPGA), or application-specific integrated circuit (ASIC) chip), a graphics card, a physical server, etc.
- the AI development framework is a tool library that allows AI developers to quickly develop AI models.
- the AI development framework encapsulates a variety of callable operators and also contains the tools required for AI model development, training, and deployment. During the construction, training, and reasoning of AI models, the encapsulated operators in the AI framework can be called through the application program interface (API), and then combined with some simple driver codes to complete the corresponding operations.
- the AI development framework in the industry is usually open source.
- Typical AI development frameworks used for developing deep learning models also known as deep learning frameworks, include PaddlePaddle, Tensorflow, Caffe, Theano, MXNet, Torch, and PyTorch. Developers can install the AI development framework locally and then develop AI models locally, or use the AI development framework on online platforms (for example, online open source framework platforms, public cloud AI basic development platforms, etc.) to develop AI models.
- Model architecture adjustment is a method to fundamentally optimize the performance of the model. It mainly involves adjusting the algorithm structure of the AI model, such as changing the operator type in the AI model, changing the connection relationship between layers of the AI model, etc.
- the deep learning framework is the first layer in the entire deep learning ecosystem. In TensorFlow and MXNet, neural network calculations are further split into various common operators for tensor data.
- the deep learning framework needs to concretize the deep learning tasks expressed by the computational graph structure mapped by the neural network into instructions and data that can be executed by AI chips.
- the deep learning framework uses operators as specific elements to implement computing tasks, and provides a kernel function (Kernel) for each operator to be executed on the AI chip.
- the deep learning framework schedules the execution of the kernel function corresponding to each operator in the computational graph to complete the calculation of the entire neural network.
- the operators in the computational graph mapped from the neural network are implemented on the AI chip through kernel functions in an "off-chip storage → on-chip computing → off-chip storage" mode; that is, the input and output data of the operators in the neural network are stored in global storage, and the kernel function needs to read the input data from global storage, complete the calculation, and store the results back to global storage.
- the kernel functions of two or more consecutive operators in the computational graph corresponding to the neural network are merged into a new kernel function, so that the computing tasks corresponding to these operators require only one scheduling overhead, and a large amount of data transmission from external memory (DRAM) to on-chip memory and from on-chip memory back to external memory can be eliminated. For example, in the ResNet-18 neural network, if all operators could be fused together, 99.6% of data transmission could be eliminated.
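- To make the saving concrete, here is a minimal NumPy sketch (an illustration of kernel fusion in general, not the patented method; the function names are hypothetical). The unfused pipeline materializes its intermediate result in a global buffer, while the fused version computes the same output in one pass:

```python
import numpy as np

def unfused(x, b):
    # Two separate "kernels": each reads its input from global memory
    # and writes its full result back to global memory.
    t = x + b                   # kernel 1: intermediate t round-trips off-chip
    return np.maximum(t, 0.0)   # kernel 2: reads t back in

def fused(x, b):
    # One merged kernel: the intermediate sum never leaves on-chip
    # registers/cache, saving one round trip to external memory.
    return np.maximum(x + b, 0.0)

x = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)
assert np.allclose(unfused(x, b), fused(x, b))
```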
- operator fusion is mainly divided into two categories: manual fusion and automatic fusion.
- Manual fusion: identify the operators to be fused, write a new custom operator that fuses the corresponding operators together, and register the corresponding fusion rules in the framework.
- manual fusion only supports the fusion of specific operators in specific topology scenarios, and the manual adaptation workload is large.
- an embodiment of the present application provides an operator fusion method.
- the calculation graph is divided into multiple subgraphs to be fused, and the operators are fused based on the subgraphs to be fused;
- the hardware information of the AI chip is considered, so that after the operator fusion, not only the data transmission is reduced, but also the data scale of the intermediate data of the fused operator can be matched with the actual data scale stored in the on-chip memory, and the power consumption overhead required for the on-chip memory is also controlled within a reasonable range.
- Such a design optimizes the neural network, improves execution efficiency, and improves the performance of network training and inference.
- FIG2 is a flowchart of an operator fusion method for a neural network proposed by the present application, which is applied to a fusion device, which may be an electronic device or a server.
- the neural network model may be stored on the set chip or in the electronic device.
- the method includes at least the following steps:
- the set chip can be an AI chip.
- the electronic device obtains a neural network model and determines a calculation graph corresponding to the neural network model.
- the neural network models that are widely used include BP neural network, Hopfield network, ART network and Kohonen network.
- the neural network model obtained can be any one of them, or it can be other neural network models.
- the computational graph is used to describe the connection relationship between multiple operators in the neural network model, and each of the multiple operators is used to perform at least one computing operation.
- the computation graph includes 2 operators, operator 1 includes 3 computation operations, and operator 2 includes 3 computation operations.
- if the obtained neural network model is a commonly used model, its computational graph can be looked up directly among the computational graphs of commonly used models; if the obtained neural network model is a customized neural network model obtained by modifying a commonly used model, a method for extracting a computational graph from a neural network model can be applied to obtain the computational graph of the customized model.
- S202 The electronic device determines at least two subgraphs to be fused in the computation graph.
- the computation graph is divided into at least two subgraphs to be fused, wherein any of the at least two subgraphs to be fused includes at least one operator.
- the methods of determining at least two subgraphs to be fused in the computation graph may include the following:
- the input of a convolution-type computing operation is usually an add-type computing operation, so the operators to which the convolution-type computing operation and the add-type computing operation belong can be divided into one subgraph to be fused.
- any one of the above three methods can be applied to divide the computational graph into at least two subgraphs to be fused; a sketch of the pattern-based grouping follows.
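- A minimal sketch of the pattern-based grouping described above, assuming a hypothetical `Operator` structure in which each operator records the upstream operators feeding it (the other partitioning methods are not detailed here):

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    op_type: str
    inputs: list = field(default_factory=list)  # upstream operators

def partition(operators):
    """Group a convolution-type operator with the add-type operator
    feeding it into one subgraph to be fused; every other operator
    becomes a single-operator subgraph."""
    grouped, subgraphs = set(), []
    for op in operators:
        if op.name in grouped:
            continue
        group = [op] + [p for p in op.inputs
                        if op.op_type == "conv" and p.op_type == "add"]
        grouped.update(o.name for o in group)
        subgraphs.append(group)
    return subgraphs

add = Operator("add0", "add")
conv = Operator("conv0", "conv", [add])
relu = Operator("relu0", "relu", [conv])
print([[o.name for o in g] for g in partition([relu, conv, add])])
# [['relu0'], ['conv0', 'add0']]
```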
- the first subgraph to be fused is any one of the at least two subgraphs to be fused.
- the fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold.
- the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is related to the memory size of the set chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- the first subgraph to be fused includes only one operator, the first subgraph to be fused is a single operator subgraph and operator fusion cannot be performed. Therefore, one of the fusion conditions is that the first subgraph to be fused includes at least two operators. In addition, another fusion condition is that the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold.
- the utilization threshold is preset, for example, 0.8.
- the utilization of the amount of resources required for the first subgraph to be fused to run on the set chip is the ratio of the amount of resources required for the first subgraph to be fused to run on the set chip to the maximum amount of resources of the set chip.
- the maximum amount of resources of the set chip can be obtained from the factory parameters of the set chip, or calculated according to the factory parameters.
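- Putting the two fusion conditions together, a minimal sketch (all names are hypothetical; the required-resource figure comes from the estimation procedure described below, and the chip maximum from its factory parameters):

```python
def satisfies_fusion_condition(num_operators: int,
                               required: float,
                               chip_max: float,
                               threshold: float = 0.8) -> bool:
    # Condition 1: the subgraph to be fused must contain >= 2 operators.
    # Condition 2: the utilization of the resources it needs on the set
    # chip (required / maximum) must exceed the utilization threshold.
    if num_operators < 2:
        return False
    return required / chip_max > threshold

# e.g. a 3-operator subgraph needing 90 GFLOPS on a 100-GFLOPS chip:
assert satisfies_fusion_condition(3, 90.0, 100.0)
```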
- the electronic device sends the neural network model obtained after the operator fusion to the set chip.
- since the neural network model runs on the set chip, the neural network model obtained after operator fusion is sent to the set chip.
- the neural network model running instruction can be used for neural network model training or neural network model inference. After the chip obtains the running instruction of the neural network model, it can process the training data or the inference data.
- the fusion condition includes that the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold.
- the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip can be calculated first.
- the amount of resources includes computing power and/or bandwidth.
- the amount of resources required for the first subgraph to be fused to run on the set chip is determined according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in at least two operators in the first subgraph to be fused.
- the memory size of the set chip usually refers to the on-chip memory.
- the memory reuse capability supported by the operation instructions of the set chip usually includes two cases: reusable and non-reusable. If the computational graph is in the form of a tree, a computing operation can be represented by a computing node, and the total number of computing operations included in the at least two operators of the first subgraph to be fused is the total number of computing nodes.
- the electronic device receives hardware information from the set chip. The hardware information includes the memory reuse capability supported by the set chip's operation instructions and the memory size of the set chip.
- the electronic device determines the number of memory shares required to run the first subgraph to be fused when the memory reuse capability supported by the set chip's operation instructions is reusable and the data types generated by the computing operations included in the at least two operators of the first subgraph to be fused are the same. The number of memory shares is the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output computing operations included in those operators.
- an output computing operation is one whose computing result is used as input by fewer than a second quantity threshold of computing operations. For example, if the second quantity threshold is 2, the number of computing operations that use the output computing operation's result as input is 0 or 1.
- the total number p of computing operations included in the at least two operators of the first subgraph to be fused and the number q of output computing operations included in those operators are determined. If the number of computing operations taking an output computing operation's result as input is 0 or 1, the node where that computing operation is located can be reused, so the difference p - q is the number of memory shares F required to run the at least two operators of the first subgraph to be fused.
- for example, the total number of computing operations included in the at least two operators of the first subgraph to be fused is 3: memory 1 stores data b, memory 2 stores data c, and the computing type of the computing operation is a sum operation. With reuse, data c is added to data b, and the sum, data a, can be written back into the memory holding data b.
- when memory reuse is not supported, the node where a computing operation is located cannot be reused, so the total number p of computing operations included in the at least two operators of the first subgraph to be fused is determined to be the number of memory shares required to run those operators.
- for example, the total number p of computing operations included in the at least two operators of the first subgraph to be fused is 3: memory 1 stores data b, memory 2 stores data c, and the computing type of the computing operation is a sum operation. Data b, data c, and the summed data a each require their own memory, so the number of memory shares required to run the first subgraph to be fused is also 3.
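- The two cases reduce to a short sketch (a hypothetical helper; p and q are the counts defined above):

```python
def memory_shares(total_ops: int, output_ops: int, reusable: bool) -> int:
    """Number of memory shares f needed to run the subgraph to be fused.

    If the chip's instructions support memory reuse and every operation
    produces the same data type, an output operation's result can
    overwrite an input's memory, so f = p - q; otherwise every
    operation's result needs its own share, so f = p.
    """
    return total_ops - output_ops if reusable else total_ops

# The example above: p = 3 operations, q = 1 output operation.
print(memory_shares(3, 1, reusable=True))   # 2 (a reuses b's memory)
print(memory_shares(3, 1, reusable=False))  # 3 (b, c, a each need memory)
```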
- FIG. 5 is a schematic diagram of the first subgraph to be fused.
- 51 is an output-type computing operation whose calculation result is used as input by 0 computing operations; 52 is an output-type computing operation whose calculation result is used as input by 1 computing operation; 53 is an output-type computing operation whose calculation result is used as input by 1 computing operation.
- FIG5 includes addition operation, subtraction operation, multiplication operation, division operation, square root operation, back propagation operation and data conversion operation, etc.
- the form of each operation is described with an example:
- An example of an addition operation is add_14float32[1,2,1,1,16], where add is the calculation type of the calculation operation, 14 is the array dimension of the calculation operation, float32 is the data type of the calculation operation, and [1,2,1,1,16] is the data tensor of the calculation operation.
- An example of a subtraction operation is sub_3float32[1,2,1,1,16], where sub is the calculation type of the calculation operation, 3 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor.
- An example of a multiplication operation is mul_1float32[1,2,1,1,16], where mul is the calculation type of the calculation operation, 1 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor.
- An example of a division operation is div_6float32[1,2,1,1,16], where div is the calculation type of the calculation operation, 6 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor.
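- For readability, here is a hypothetical parser for descriptors in this format (the field layout is inferred from the examples above, not specified by the application):

```python
import re

def parse_op(descriptor: str):
    """Split a descriptor such as 'add_14float32[1,2,1,1,16]' into the
    calculation type, array dimension, data type, and data tensor shape."""
    m = re.fullmatch(r"([a-z]+)_(\d+)(\w+?)\[([\d,]+)\]", descriptor)
    if m is None:
        raise ValueError(f"unrecognized descriptor: {descriptor}")
    calc_type, dim, dtype, shape = m.groups()
    return calc_type, int(dim), dtype, [int(s) for s in shape.split(",")]

print(parse_op("add_14float32[1,2,1,1,16]"))
# ('add', 14, 'float32', [1, 2, 1, 1, 16])
```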
- the electronic device determines the running rate corresponding to each computing operation included in the first subgraph to be fused according to the memory size of the set chip, the number of memory shares required to run the first subgraph to be fused, and the running time corresponding to the type of each computing operation included in the first subgraph to be fused.
- different computing operations usually have different corresponding running rates.
- the third computing operation is any computing operation included in the at least two operators in the first subgraph to be fused.
- one memory share is used to store the calculation result of one computing operation. Therefore, the memory size M of the set chip is divided by the number of memory shares f required to run the first subgraph to be fused, and the resulting ratio is the memory size required for the third computing operation to run on the chip.
- that is, m = M / f, where the third computing operation is any computing operation included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the set chip, M is the memory size of the set chip, and f is the number of memory shares required to run the first subgraph to be fused.
- if the first subgraph to be fused includes S computing operations, the number of running rates determined is also S.
- the ratio of the memory size required for the third computing operation to run on the chip to the time required to execute the third computing operation may be determined as the running rate corresponding to the third computing operation, that is, V = m / t.
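- In code form, a minimal sketch of the two ratios just described (m = M / f, then V = m / t; the units in the example are assumptions for illustration):

```python
def running_rate(chip_memory: float, shares: int, op_time: float) -> float:
    """Running rate V of one computing operation: m = M / f is the memory
    available to the operation, and V = m / t where t is the time needed
    to execute the operation."""
    m = chip_memory / shares  # memory share available to this operation
    return m / op_time

# e.g. M = 1 MiB of on-chip memory, f = 2 shares, t = 10 microseconds:
v = running_rate(1 * 1024 * 1024, 2, 10e-6)  # bytes per second
```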
- the electronic device converts the running rate corresponding to each computing operation included in the first subgraph to be fused into the appropriate units, obtaining the amount of resources required for each computing operation included in the first subgraph to be fused to run on the set chip.
- the amount of resources required for the fourth computing operation to run on the set chip includes bandwidth
- the fourth computing operation is any computing operation included in the at least two operators in the first sub-graph to be fused. According to the conversion relationship between the running rate and the bandwidth, the running rate of the fourth computing operation is converted into the bandwidth required for the fourth computing operation to run on the set chip.
- the amount of resources required for the fifth computing operation to run on the set chip includes computing power
- the fifth computing operation is any one of the computing operations included in at least two operators in the first sub-graph to be fused. According to the conversion relationship between the running rate and the computing power, the running rate of the fifth computing operation is converted into the computing power required for the fifth computing operation to run on the set chip.
- S404 The electronic device uses the minimum amount of resources among the amounts of resources required for each computing operation included in the first subgraph to be fused to run on the set chip as the amount of resources required for the first subgraph to be fused to run on the set chip.
- the minimum amount of resources required for each computing operation to run on the set chip is the amount of resources required for the first subgraph to be fused to run on the set chip.
- the bandwidth BW0 and computing power F0 required for the first subgraph to be fused to run on the set chip can be obtained.
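- A sketch of this last step, assuming the chip-specific rate-to-bandwidth and rate-to-computing-power conversion relationships are supplied as functions (the factors below are placeholders, not real chip parameters):

```python
def subgraph_resources(rates, to_bandwidth, to_compute):
    """Convert each operation's running rate into the bandwidth and
    computing power it needs on the set chip, then take the minimum:
    only one operation executes at a time, so the smallest per-operation
    requirement is taken as the requirement of the whole subgraph."""
    bw0 = min(to_bandwidth(v) for v in rates)
    f0 = min(to_compute(v) for v in rates)
    return bw0, f0

bw0, f0 = subgraph_resources(
    [1e9, 2e9, 5e8],
    to_bandwidth=lambda v: v,      # placeholder: bytes/s -> bytes/s
    to_compute=lambda v: v / 4,    # placeholder: bytes/s -> FLOPS
)
```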
- when the first subgraph to be fused does not meet the fusion condition, it means that if the at least two operators included in the first subgraph to be fused were fused, the set chip's performance would not reach a better state when running the resulting neural network model. In this case, the first subgraph to be fused is split into the second subgraph to be fused and the third subgraph to be fused, and the fusion condition is then applied to judge the second subgraph to be fused and the third subgraph to be fused respectively.
- for the fusion condition and the fusion process, refer to the description of the first subgraph to be fused; they are not elaborated here.
- the first subgraph to be fused includes four operators. If the four operators are fused, the fusion condition is not met. Then the subgraph is divided to obtain the second subgraph to be fused and the third subgraph to be fused, wherein the second subgraph to be fused includes operator 1, operator 2 and operator 3; the third subgraph to be fused includes operator 4. If the second subgraph to be fused meets the fusion condition, operator 1, operator 2 and operator 3 of the second subgraph to be fused are fused.
- the third subgraph to be fused is a single operator subgraph, and operator fusion cannot be performed.
- FIG 7 is a schematic diagram of the splitting of the subgraphs to be fused.
- the first output-type computing operation is any output-type computing operation of the first subgraph to be fused
- the first output-type computing operation satisfies that the number of computing operations that use the operation result of the first output-type computing operation as input is less than the first quantity threshold.
- the first quantity threshold may be 1, and the number of computing operations using the calculation result of the first output type computing operation as input is 0, that is, there is no computing operation using the calculation result of the first output type computing operation as input.
- the first subgraph to be fused is split into the second subgraph to be fused and the third subgraph to be fused.
- the subgraph with the second computing operation as the root node in the first subgraph to be fused is taken as the second subgraph to be fused, and the subgraphs other than the subgraph with the second computing operation as the root node in the first subgraph to be fused are taken as the third subgraph to be fused.
- the second computing operation meets the splitting condition if the computing type of the second computing operation is a set type (such as a back propagation type) or the computing type of the second computing operation is the same as the computing type of the computing operation of the parent node of the second computing operation.
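- A minimal sketch of the split, assuming a tree-shaped subgraph of hypothetical `Node` objects and using `"backprop"` as a stand-in for the set type (the application names back propagation as one example):

```python
class Node:
    """A computing operation in a tree-shaped subgraph to be fused."""
    def __init__(self, op_type, inputs=()):
        self.op_type = op_type
        self.inputs = list(inputs)  # operations whose results feed this one
        self.parent = None
        for child in self.inputs:
            child.parent = self

def split(output_op, set_types=frozenset({"backprop"})):
    """Reverse search from an output-type computing operation: walk toward
    the inputs and stop at the first operation whose type is a set type or
    equals its parent's type. The subtree rooted there is the second
    subgraph to be fused; the remainder is the third."""
    stack = list(output_op.inputs)
    while stack:
        op = stack.pop()
        if op.op_type in set_types or op.op_type == op.parent.op_type:
            op.parent.inputs.remove(op)  # detach subtree from the remainder
            return op, output_op
        stack.extend(op.inputs)
    return None, output_op  # no split point found

# Example: the inner 'add' has the same type as its parent, so the
# split happens there.
inner = Node("add", [Node("mul"), Node("sub")])
root = Node("add", [inner, Node("div")])
second, third = split(root)
print(second.op_type)                      # add
print([c.op_type for c in third.inputs])   # ['div']
```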
- the embodiment of the present application also provides an operator fusion device for a neural network, which is used to execute the operator fusion method for a neural network in the method embodiment shown above.
- the operator fusion device for a neural network includes a transmission unit 801 and a processing unit 802.
- a transmission unit 801 is used to obtain a neural network model
- a processing unit 802 is used to determine a computation graph corresponding to the neural network model, where the computation graph is used to describe the connection relationship between multiple operators in the neural network model; each of the multiple operators is used to perform at least one computation operation;
- the processing unit 802 is further configured to determine at least two subgraphs to be fused in the computation graph, wherein any of the at least two subgraphs to be fused includes at least one operator;
- the processing unit 802 is further configured to, when the first subgraph to be fused meets a fusion condition, fuse at least two operators included in the first subgraph to be fused into one operator;
- the first sub-graph to be fused is any one of the at least two sub-graphs to be fused
- the fusion conditions include: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is related to the memory size of the set chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- processing unit 802 is further configured to:
- the processing unit 802 is specifically configured to:
- the first output type computing operation is any output type computing operation of the first subgraph to be fused, and the first output type computing operation satisfies that the number of computing operations that use the operation result of the first output type computing operation as input is less than a first quantity threshold;
- the subgraph with the second calculation operation as the root node in the first subgraph to be fused is used as the second subgraph to be fused, and the subgraphs in the first subgraph to be fused except the subgraph with the second calculation operation as the root node are used as the third subgraph to be fused;
- the second computing operation satisfies the splitting condition if the computing type of the second computing operation is a set type or the computing type of the second computing operation is the same as the computing type of the computing operation of the parent node of the second computing operation.
- processing unit 802 is further configured to:
- the amount of resources required for the first subgraph to be fused to run on the set chip is determined according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in at least two operators in the first subgraph to be fused.
- the processing unit 802 is specifically configured to:
- determine the number of memory shares required to run the first subgraph to be fused, where the number of memory shares is the difference between the total number of calculation operations included in the at least two operators of the first subgraph to be fused and the number of output calculation operations included in those operators; an output calculation operation is one whose calculation result is used as input by fewer than the second quantity threshold of calculation operations;
- the minimum amount of resources among the amounts of resources required for each computing operation included in the first subgraph to be fused to run on the set chip is used as the amount of resources required for the first subgraph to be fused to run on the set chip.
- the processing unit 802 is specifically configured to:
- the third computing operation is any computing operation included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the set chip, M is the memory size of the set chip, and f is the number of memory shares required to run the first subgraph to be fused;
- V is the running rate of the third computing operation
- t is the time required to execute the third computing operation
- the amount of resources includes bandwidth and/or computing power
- the amount of resources required for the fourth computing operation to run on the set chip includes bandwidth
- the fourth computing operation is any computing operation included in the at least two operators in the first sub-graph to be fused
- the processing unit 802 is specifically used to: convert the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the set chip according to the conversion relationship between the running rate and the bandwidth; or,
- the amount of resources required for the fifth computing operation to run on the set chip includes computing power.
- the fifth computing operation is any one of the computing operations included in at least two operators in the first sub-graph to be fused.
- the processing unit 802 is specifically used to: convert the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the set chip according to the conversion relationship between the running rate and the computing power.
- the division of modules in the embodiments of the present application is schematic and is only a logical function division. There may be other division methods in actual implementation.
- the functional modules in the embodiments of the present application may be integrated into a processing module, or each module may exist physically separately, or two or more modules may be integrated into one module.
- the above-mentioned integrated modules may be implemented in the form of hardware or in the form of software functional modules.
- the above method can be implemented in whole or in part by software, hardware, firmware or any other combination.
- the above method can be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- when the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
- the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions can be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.).
- the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media.
- the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium.
- the semiconductor medium can be a solid state drive (SSD).
- the device 900 shown in FIG. 9 includes at least one processor 901 , a memory 902 , and optionally, a communication interface 903 .
- the memory 902 may be a volatile memory, such as a random access memory; the memory may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD), or the memory 902 may be any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
- the memory 902 may be a combination of the above memories.
- connection medium between the processor 901 and the memory 902 is not limited in the embodiment of the present application.
- the processor 901 may be a CPU, or other general-purpose processors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field programmable gate arrays (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, artificial intelligence chips, chips on chips, etc.
- a general-purpose processor may be a microprocessor or any conventional processor, etc.
- an independent data transceiver module may also be provided, such as a communication interface 903, for transmitting and receiving data; when the processor 901 communicates with other devices, data may be transmitted through the communication interface 903.
- when the computing device takes the form shown in FIG. 9 , the processor 901 in FIG. 9 can call the computer-executable instructions stored in the memory 902 so that the computing device can execute the operator fusion method for a neural network in any of the above method embodiments.
- the functions/implementation processes of the transmission unit 801 and the processing unit 802 in FIG. 8 can be implemented by the processor 901 in FIG. 9 calling the computer-executable instructions stored in the memory 902.
- alternatively, the function/implementation process of the processing unit 802 in FIG. 8 can be implemented by the processor 901 in FIG. 9 calling the computer-executable instructions stored in the memory 902, and the transmission function/implementation process of the transmission unit 801 in FIG. 8 can be implemented by the communication interface 903 in FIG. 9.
- the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) that contain computer-usable program code.
- these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
- these computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The embodiments of the present application provide an operator fusion method for a neural network and a related apparatus. The method comprises: obtaining a neural network model and determining a computational graph corresponding to the neural network model, the computational graph describing the connection relationships between a plurality of operators in the neural network model, each operator performing at least one computing operation; in order to improve fusion efficiency, determining at least two subgraphs to be fused within the computational graph; and, when a first subgraph to be fused comprises at least two operators and the utilization rate of the amount of resources required for the first subgraph to be fused to run on a set chip is greater than a utilization rate threshold, fusing the at least two operators included in the first subgraph to be fused into a single operator. The utilization rate required by the first subgraph to be fused is related to the memory size of the set chip and to the total number of computing operations included in the first subgraph to be fused; considering the memory size of the chip allows a fused operator to fully utilize the resources of the chip, thereby further improving the operation speed of the neural network model.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 9, 2022, with application number 202211584001.6 and entitled "An operator fusion method for a neural network and a related apparatus", the entire contents of which are incorporated herein by reference.
The present application relates to the field of artificial intelligence technology, and in particular to an operator fusion method for a neural network and a related apparatus.
By defining an artificial intelligence (AI) model and solving the model's parameters (which can be called model training), a unique piece of computational logic can be determined. After conversion, this computational logic can be applied to inference computation (which can also be called model inference or model use). This computational logic can be represented by a graph, and that graph is the computational graph.
A computational graph usually contains many operators. During model training or inference, each operator incurs memory-access overhead, scheduling overhead and execution overhead. An excessive number of operators in the computational graph therefore seriously affects execution efficiency. Usually, the number of operators can be reduced by operator fusion.
In the related art there are two approaches, manual fusion and automatic fusion. In either approach, operators are classified according to their characteristics and then fused according to fusion rules. However, neither approach can guarantee the runtime performance of the neural network model obtained from the fused operators.
Summary of the invention
The present application provides an operator fusion method for a neural network and a related apparatus, which improve the chip's resource utilization and the neural network model's operation speed when the model runs on the chip, thereby improving the performance of the neural network in training or inference.
In a first aspect, an embodiment of the present application provides an operator fusion method for a neural network, in which:
First, a neural network model is obtained, and a computational graph corresponding to the neural network model is determined; the computational graph describes the connection relationships between multiple operators in the neural network model, and each of the multiple operators performs at least one computing operation. Second, at least two subgraphs to be fused are determined in the computational graph, where each of the at least two subgraphs to be fused includes at least one operator. Finally, when a first subgraph to be fused includes at least two operators and the utilization rate of the amount of resources required for the first subgraph to be fused to run on a set chip is greater than a utilization rate threshold, the at least two operators included in the first subgraph to be fused are fused into one operator, the first subgraph to be fused being any one of the at least two subgraphs to be fused. The utilization rate required by the first subgraph to be fused is related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused.
Through the above method, operators are fused on the basis of subgraphs to be fused: the computational graph is first divided into at least two subgraphs to be fused (any one of which can be called the first subgraph to be fused), and when the first subgraph to be fused includes at least two operators and the utilization rate of the amount of resources it requires to run on the set chip is greater than the utilization rate threshold, the at least two operators it includes are fused into one operator. Since this utilization rate is related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused, the fusion process takes the memory of the set chip into account, which improves the chip's resource utilization and the neural network model's operation speed when the model runs on the chip, and thus improves the performance of the neural network in training or inference.
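For illustration only, the following Python sketch summarizes the flow of the first aspect under assumed names; the classes and helpers (Subgraph, Chip, operator_fusion) are hypothetical and are not defined by this application:

```python
from dataclasses import dataclass

@dataclass
class Subgraph:
    operators: list   # the operators contained in this subgraph to be fused
    required: float   # amount of resources it needs to run on the set chip

@dataclass
class Chip:
    max_resources: float  # maximum resource amount, from the chip's factory specifications

def meets_fusion_condition(sg: Subgraph, chip: Chip, threshold: float = 0.8) -> bool:
    # Both parts of the fusion condition: at least two operators, and the
    # utilization rate of the required resource amount above the threshold.
    return len(sg.operators) >= 2 and sg.required / chip.max_resources > threshold

def operator_fusion(subgraphs, chip):
    # Fuse every candidate subgraph that satisfies the fusion condition;
    # non-qualifying subgraphs are left unchanged.
    result = []
    for sg in subgraphs:
        if meets_fusion_condition(sg, chip):
            result.append(Subgraph(operators=["fused"], required=sg.required))
        else:
            result.append(sg)
    return result

print(operator_fusion([Subgraph(["op1", "op2"], 90.0)], Chip(max_resources=100.0)))
```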
In some exemplary embodiments, the operator fusion method for a neural network further includes: when the first subgraph to be fused does not meet the fusion condition, splitting the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused; when the second subgraph to be fused meets the fusion condition, fusing the at least two operators included in the second subgraph to be fused into one operator; and/or, when the third subgraph to be fused meets the fusion condition, fusing the at least two operators included in the third subgraph to be fused into one operator.
Through the above method, if the first subgraph to be fused does not meet the fusion condition, this indicates that fusing the operators included in the first subgraph to be fused at this point would leave the chip running the resulting neural network model with poor performance and low resource utilization. In this case, the first subgraph to be fused is split into a second subgraph to be fused and a third subgraph to be fused, and the operators included in each of them are fused separately, subject to the fusion condition, so as to improve the resource utilization of the chip running the neural network model obtained after operator fusion and thus the operation speed of the neural network model.
In some exemplary embodiments, splitting the first subgraph to be fused into the second subgraph to be fused and the third subgraph to be fused includes: starting from a first output-type computing operation of the first subgraph to be fused, traversing the computing operations of the first subgraph to be fused by reverse search, where the first output-type computing operation is any output-type computing operation of the first subgraph to be fused and satisfies the condition that the number of computing operations taking its operation result as input is less than a first quantity threshold; and, when the currently traversed second computing operation meets the splitting condition, taking the subgraph of the first subgraph to be fused rooted at the second computing operation as the second subgraph to be fused, and taking the remainder of the first subgraph to be fused as the third subgraph to be fused. The second computing operation meets the splitting condition when its computing type is a set type, or when its computing type is the same as the computing type of the computing operation that is its parent node.
Through the above method, when the first subgraph to be fused is split, the types of the computing operations it includes are considered. When the computing type of the second computing operation is a set type, or is the same as the computing type of the computing operation that is its parent node, the second and third subgraphs to be fused are obtained. The fusion condition is then used to determine whether the operators in the second and third subgraphs to be fused can be fused, further improving the resource utilization of the chip running the neural network model obtained after operator fusion, and thus the operation speed of the neural network model.
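A minimal Python sketch of this split, assuming a simple node structure (the Op class and its fields op_type, inputs and parent are illustrative names, not defined by this application):

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    op_type: str
    inputs: list = field(default_factory=list)  # operations whose results feed this one
    parent: "Op" = None                         # the operation consuming this one's result

def find_split_root(output_op: Op, set_types: set):
    # Reverse search from an output-type computing operation: return the first
    # operation whose computing type is a set type or equals its parent's type.
    # The subgraph rooted at the returned operation becomes the second subgraph
    # to be fused; the remainder forms the third subgraph to be fused.
    stack = list(output_op.inputs)
    while stack:
        op = stack.pop()
        if op.op_type in set_types or (op.parent and op.op_type == op.parent.op_type):
            return op
        stack.extend(op.inputs)
    return None

# e.g. an add fed by another add: the inner add meets the splitting condition
root = Op("add")
child = Op("add", parent=root)
root.inputs = [child]
print(find_split_root(root, set_types={"conv"}).op_type)  # add
```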
In some exemplary embodiments, the operator fusion method for a neural network further includes: determining the amount of resources required for the first subgraph to be fused to run on the set chip according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
The above method considers not only the total number of computing operations included in the at least two operators of the first subgraph to be fused, a property of the neural network model itself, but also the memory size of the chip and the memory reuse capability supported by the set chip's operation instructions. Computing the resource utilization from the resource amount obtained in this way fully accounts for the chip's capabilities, improves the chip's resource utilization, and thus improves the operation speed of the neural network model.
In some exemplary embodiments, determining the amount of resources required for the first subgraph to be fused to run on the set chip according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in the at least two operators of the first subgraph to be fused includes: when the computing operations included in the at least two operators produce data of the same type, determining the number of memory copies required to run the first subgraph to be fused, the number of memory copies being the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output-type computing operations included in those operators, where an output-type computing operation is one whose operation result is taken as input by a number of computing operations smaller than a second quantity threshold; determining, according to the memory size of the set chip, the number of memory copies required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused; performing unit conversion on the running rate corresponding to each computing operation to obtain the amount of resources required for that computing operation to run on the set chip; and taking the smallest of these resource amounts as the amount of resources required for the first subgraph to be fused to run on the set chip.
Through the above method, when the number of memory copies required to run the at least two operators of the first subgraph to be fused is calculated, both cases of the memory reuse capability supported by the set chip's operation instructions, reusable and non-reusable, are considered. Since reuse is only effective when the computing operations all produce data of the same type, in the reusable case the number of memory copies required is the difference between the total number of computing operations included in the at least two operators and the number of output-type computing operations among them. Since the running rate of a computing operation varies with the amount of resources, the running rate can be determined from the number of memory copies and the duration of the computing operation, and the amount of resources each computing operation requires on the set chip can then be determined. In addition, only one computing operation can execute at any given moment, so the smallest resource amount among the computing operations is taken as the amount of resources required for the first subgraph to be fused to run on the set chip.
In some exemplary embodiments, the memory size required for a third computing operation to run on the set chip satisfies m = M/f, where the third computing operation is any computing operation included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the set chip, M is the memory size of the set chip, and f is the number of memory copies required to run the first subgraph to be fused; and the running rate of the third computing operation satisfies V = m/t, where V is the running rate of the third computing operation and t is the duration required to execute the third computing operation.
Through the above method, the memory size required for the third computing operation to run on the chip is obtained, and dividing it by the duration required to execute the third computing operation yields the running rate of the third computing operation, from which the amount of resources each computing operation requires on the set chip can be determined.
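A worked numeric example of m = M/f and V = m/t, with hypothetical values chosen purely for illustration (none of these numbers come from the application):

```python
M = 2 * 1024 * 1024  # memory size of the set chip: 2 MiB of on-chip memory (assumed)
f = 2                # memory copies required to run the first subgraph to be fused
t = 5e-6             # duration of the third computing operation: 5 microseconds (assumed)

m = M / f            # memory available to the operation: 1 MiB
V = m / t            # running rate: about 2.1e11 bytes per second
print(m, V)
```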
In some exemplary embodiments, the amount of resources includes bandwidth and/or computing power, and converting each running rate into the amount of resources required for the corresponding computing operation to run on the set chip includes: when the amount of resources required for a fourth computing operation to run on the set chip includes bandwidth, where the fourth computing operation is any computing operation included in the at least two operators of the first subgraph to be fused, converting the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the set chip according to the conversion relationship between running rate and bandwidth; or, when the amount of resources required for a fifth computing operation to run on the set chip includes computing power, where the fifth computing operation is any computing operation included in the at least two operators of the first subgraph to be fused, converting the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the set chip according to the conversion relationship between running rate and computing power.
Through the above method, since conversion relationships exist between running rate and bandwidth and between running rate and computing power, the bandwidth and/or computing power required for a computing operation to run on the set chip can be determined from the respective conversion relationship, and the bandwidth and/or computing power required for the first subgraph to be fused to run on the set chip can in turn be determined from the bandwidth and/or computing power required by each computing operation.
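The application states only that such conversion relationships exist without fixing their form; the sketch below assumes simple linear conversions for illustration:

```python
def rate_to_bandwidth(v_bytes_per_s: float) -> float:
    # For a data-movement operation, the running rate in bytes per second can
    # arguably be read directly as the bandwidth the operation requires.
    return v_bytes_per_s

def rate_to_compute(v_bytes_per_s: float, bytes_per_element: int,
                    flops_per_element: float) -> float:
    # Assumed conversion: elements processed per second times floating-point
    # operations per element yields the required computing power in FLOPS.
    return (v_bytes_per_s / bytes_per_element) * flops_per_element

print(rate_to_compute(2.1e11, bytes_per_element=4, flops_per_element=2.0))
```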
In a second aspect, an embodiment of the present application provides an operator fusion apparatus for a neural network, the apparatus comprising:
a transmission unit, configured to obtain a neural network model;
a processing unit, configured to determine a computational graph corresponding to the neural network model, the computational graph describing the connection relationships between multiple operators in the neural network model, each of the multiple operators performing at least one computing operation;
the processing unit is further configured to determine at least two subgraphs to be fused in the computational graph, where each of the at least two subgraphs to be fused includes at least one operator;
the processing unit is further configured to, when a first subgraph to be fused meets a fusion condition, fuse the at least two operators included in the first subgraph to be fused into one operator;
the first subgraph to be fused is any one of the at least two subgraphs to be fused;
the fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on a set chip is greater than a utilization rate threshold, this utilization rate being related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused.
In a third aspect, an embodiment of the present application provides a computer device, comprising a processor and a memory;
the memory is configured to store computer program instructions;
the processor executes the computer program instructions in the memory to perform the method provided in any of the foregoing aspects or in any possible implementation of any of the aspects.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a software program which, when read and executed by one or more processors, implements the method provided by any design of any of the foregoing aspects.
In a fifth aspect, the present application provides a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to perform the method provided in any of the foregoing aspects or in any possible implementation of any of the aspects. The computer program product may be a software installation package; when the method provided in any of the foregoing aspects or in any possible implementation thereof needs to be used, the computer program product may be downloaded and executed on a computing device.
In a sixth aspect, the present application further provides a computer chip connected to a memory, the chip being configured to read and execute a software program stored in the memory to perform the method provided in any of the foregoing aspects or in any possible implementation of any of the aspects.
FIG. 1 is a schematic diagram of a computational graph provided in an embodiment of the present application;
FIG. 2 is a flowchart of an operator fusion method for a neural network provided in an embodiment of the present application;
FIG. 3 is a flowchart of an operator fusion method for a neural network provided in an embodiment of the present application;
FIG. 4 is a flowchart of a method for calculating the amount of resources required for a first subgraph to be fused to run on a set chip, provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of the structure of a subgraph to be fused provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of an operator fusion process provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of the splitting of operators to be fused provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of the structure of an operator fusion apparatus for a neural network provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the embodiments of the present application are further described in detail below with reference to the accompanying drawings. In the description of the embodiments of the present application, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature.
To facilitate understanding, exemplary descriptions of concepts related to the present application are provided for reference.
(1) Video memory
Also known as graphics card memory, video memory is the component used to store the graphics information to be processed, holding rendering data that the graphics chip has processed or is about to fetch. In neural network model training scenarios it stores the training samples; in neural network model inference or application scenarios it stores the data to be processed.
(2) Memory reuse capability
The memory reuse capability supported by the chip's operation instructions describes whether instruction memory can be reused, and mainly covers two cases: instruction memory is reusable, and instruction memory is not reusable.
(3) Parallelism
In computer architecture, parallelism refers to the maximum number of instructions or data items executed in parallel. In an instruction pipeline, executing multiple instructions at the same time is called instruction parallelism.
(4) Bandwidth
For a computational graph, bandwidth refers to the amount of data transmitted per unit time by the operators included in the computational graph.
For an AI chip, bandwidth characterizes the parallelism of data movement. Specifically, it is the amount of data transferred from memory 1 to memory 2 per unit time. For example, if data is transferred from a hard disk to a memory module, or from a memory module to a hard disk, the bandwidth is the data transfer rate between the hard disk and the memory module.
(5) Computing power
For a computational graph, computing power refers to the number of floating-point operations per second (FLOPS) that the operators included in the computational graph can perform.
For an AI chip, computing power characterizes the parallelism of data computation. It is the number of floating-point operations the AI chip can complete per second, a measure of hardware computing speed. It is often used to estimate a computer's execution performance, especially in scientific computing, which involves large numbers of floating-point operations.
(6) Computational graph
By defining an artificial intelligence (AI) model and solving the model's parameters (which can be called model training), a unique piece of computational logic can be determined. After conversion, this computational logic can be applied to inference computation (which can also be called model inference or model use). This computational logic can be represented by a graph, and that graph is the computational graph.
A computational graph is a directed graph that defines how data flows, how data is computed, and the interdependencies between the various computations. FIG. 1 is a schematic diagram of the structure of a computational graph provided by the present application. The computational graph of an AI model consists of operators and edges. Each operator performs at least one computing operation; if the computational graph takes the form of a tree, a computing operation can be represented by a computing node, and each computing operation represents a mathematical operation being applied. In addition, when an operator applies a mathematical operation, the starting point of the data input can serve as an input node, and the end point of the data output, or the end point for reading/writing persistent variables, can serve as an output node; the operator is the basic computing unit of the AI model. Edges represent input/output relationships between operators or nodes, and an edge can carry a multi-dimensional data array whose size can be adjusted dynamically, i.e., a tensor. The tensor data structure can be used to represent the data in the model: a tensor can correspond to an n-dimensional array or list, where n is an integer greater than or equal to zero. A tensor has two attributes, dimension and rank. In addition, tensors can flow between the operators of the computational graph.
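A minimal sketch of this structure in Python (the class and field names are illustrative assumptions, not part of any framework named here):

```python
from dataclasses import dataclass, field

@dataclass
class Tensor:
    shape: tuple            # a dynamically sizable multi-dimensional array; rank = len(shape)
    dtype: str = "float32"

@dataclass
class Operator:
    name: str
    compute_ops: list = field(default_factory=list)  # each operator performs >= 1 computing operation
    inputs: list = field(default_factory=list)       # incoming edges (tensors)
    outputs: list = field(default_factory=list)      # outgoing edges (tensors)

# A two-operator graph in the spirit of FIG. 1: the tensor produced by op1
# flows along an edge into op2.
t = Tensor(shape=(1, 2, 1, 1, 16))
op1 = Operator("op1", compute_ops=["add"], outputs=[t])
op2 = Operator("op2", compute_ops=["mul"], inputs=[t])
```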
(7) Compiler
A neural network compiler or deep learning compiler, for example, is a domain-specific compiler in the field of machine learning, used to deploy the training or inference of a neural network onto an AI chip.
In the embodiments of the present application, an operator may also be referred to as a computing task, an operation (OP), a computing layer, etc., and a data dimension may also be referred to as a dimension, a shape, etc. In addition, the AI model in the embodiments of the present application may be a neural network model, for example.
In the deployment phase of an AI model, the AI model can be deployed independently on one or more computing devices. When it is deployed on multiple computing devices, each computing device runs all of the AI model's functional operators and can execute the AI model's functions independently. On the one hand, a load-balancing module can then be used so that the multiple computing devices share the request load evenly; on the other hand, disaster tolerance is achieved: when one computing device fails, the AI models on the other computing devices can continue to provide services as usual.
Alternatively, the functional operators of one AI model can be deployed in a distributed manner across multiple computing devices, which run the AI model cooperatively according to the AI model's data dependencies. It should be understood that a computing device may be an AI chip (for example, a chip including a central processing unit (CPU), a graphics processing unit (GPU), an embedded neural-network processing unit (NPU), a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)), a graphics card, a physical server, etc.
An AI development framework is a tool library that enables AI developers to develop AI models quickly. An AI development framework encapsulates a variety of callable operators and also contains the tools required for AI model development, training and deployment. During AI model construction, training and inference, the operators encapsulated in the AI framework can be called through application program interface (API) calls and combined with some simple driver code to complete the corresponding operations.
AI development frameworks in the industry are usually open source. Typical AI development frameworks for developing deep learning models, also called deep learning frameworks, include PaddlePaddle, TensorFlow, Caffe, Theano, MXNet, Torch and PyTorch. Developers can install an AI development framework locally and develop AI models locally, or use an AI development framework on an online platform (for example, an online open-source framework platform or a public cloud AI basic development platform) to develop AI models.
Model architecture adjustment is a way of fundamentally optimizing a model's performance. It mainly adjusts the algorithmic structure of the AI model, for example by changing the operator types in the AI model or changing the connections between the AI model's layers.
The deep learning framework is the first layer of the entire deep learning ecosystem. In TensorFlow and MXNet, neural network computation is further split into various common operators oriented toward tensor data, and the deep learning framework must concretize the deep learning task expressed by the computational-graph structure onto which the neural network is mapped into instructions and data that the AI chip can execute. In this process, the deep learning framework uses operators as the concrete elements that carry out computing tasks, providing each operator with a kernel function (kernel) to be executed on the AI chip. Following the computational graph, the deep learning framework schedules the execution of the kernel function corresponding to each operator in the graph to complete the computation of the entire neural network.
The operators in the computational graph onto which the neural network is mapped are implemented on the AI chip through kernel functions, following an "off-chip storage → on-chip computation → off-chip storage" pattern: the input and output data of the operators in the neural network are stored in global storage, and a kernel function must read its input data from global storage, complete the computation, and store the result back into global storage. This causes two problems. First, each operator's memory accesses for input and output data cannot be avoided by optimization within the operator. Second, each operator incurs launch overhead, and even more so for heterogeneous computing devices beyond the AI chip. To solve these problems, the kernel functions of two or more consecutive operators in the computational graph corresponding to the neural network are merged into a new kernel function, so that the computing tasks corresponding to these operators require only one scheduling overhead. A large amount of data transmission from external memory (DRAM) to on-chip memory, and from on-chip memory back to external memory, can thus be eliminated. For example, in the ResNet-18 neural network, if all operators could be fused together, data transmission could be reduced by 99.6%.
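The saving can be estimated with a rough accounting sketch (an assumption-level model for illustration, not a figure from this application): each intermediate tensor of a fused kernel chain no longer needs one write to external memory and one read back.

```python
def eliminated_dram_traffic(intermediate_tensor_bytes):
    # Each intermediate tensor saves one write to DRAM plus one read back
    # into on-chip memory once its producer and consumer are fused.
    return sum(2 * n for n in intermediate_tensor_bytes)

# e.g. fusing a three-operator chain with two 1 MiB intermediate tensors
print(eliminated_dram_traffic([1 << 20, 1 << 20]))  # 4194304 bytes avoided
```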
However, it is difficult to fuse all the operators of a real neural network together. Among the reasons is the practical mismatch between the size of on-chip memory and the scale of data the neural network processes: the area overhead of an AI chip cannot be too large, so the area overhead of its on-chip memory is correspondingly limited, and the power consumption required by the on-chip memory must also remain within a reasonable range. These constraints limit the scale of data that can be stored on the AI chip. Consequently, if all the operators of a neural network were fused together, the scale of the fused operators' intermediate data would not match the scale of data the on-chip memory can actually hold.
In the related art, operator fusion falls mainly into two categories: manual fusion and automatic fusion.
(1) Manual fusion: the operators to be fused are identified and fused together into a new custom operator, and the corresponding fusion rules must be registered in the framework. However, manual fusion only supports the fusion of specific operators in specific topology scenarios, and the manual adaptation workload is large.
(2) Automatic fusion: fusion between specific operators, such as fusing a convolution operator with a ReLU operator, or fusion between operators of a specific type, such as element-wise operators. The compiler can automatically fuse the corresponding operators into one operator during graph optimization. In practice, however, fusing all operators together turns out not to give the best performance.
On this basis, an embodiment of the present application provides an operator fusion method. On the one hand, the computational graph is divided into multiple subgraphs to be fused, and operators are fused on the basis of these subgraphs; on the other hand, the hardware information of the AI chip is considered during fusion, so that after fusion not only is data transmission reduced, but the scale of the fused operators' intermediate data also matches the scale of data the on-chip memory actually stores, and the power consumption required by the on-chip memory is kept within a reasonable range. This design optimizes the neural network, improves execution efficiency, and improves the performance of network training and inference.
The embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 2 is a flowchart of an exemplary operator fusion method for a neural network proposed by the present application. The method is applied to a fusion device, which may be, for example, an electronic device or a server. The neural network model may be stored on the set chip or in the electronic device.
Taking as an example the case where the neural network model is stored on the set chip, the fusion device is an electronic device, and the computing device running the neural network model is the set chip, the method includes at least the following steps:
S200: the set chip sends the neural network model to the electronic device.
The set chip may be an AI chip.
S201: the electronic device obtains the neural network model and determines the computational graph corresponding to the neural network model.
Widely used neural network models include the BP neural network, the Hopfield network, the ART network and the Kohonen network. In the embodiments of the present application, the obtained neural network model may be any one of these, or another neural network model. The computational graph describes the connection relationships between multiple operators in the neural network model, each of which performs a computing operation.
In a specific example, referring to FIG. 3, the computational graph includes 2 operators: operator 1 includes 3 computing operations, and operator 2 includes 3 computing operations.
If the obtained neural network model is a commonly used model, its computational graph can be looked up directly among the computational graphs corresponding to commonly used models. If the obtained neural network model is a custom model derived by modifying a commonly used model, its computational graph can be obtained by applying a method that abstracts a neural network model into a computational graph.
S202: the electronic device determines at least two subgraphs to be fused in the computational graph.
In the embodiments of the present application, unlike the related art, in which operators are fused directly operator by operator according to operator type, the computational graph is divided into at least two subgraphs to be fused, where each of the at least two subgraphs to be fused includes at least one operator.
Exemplarily, the at least two subgraphs to be fused in the computational graph may be determined in the following ways (a sketch of way (1) follows the list):
(1) By the number of operators:
For example, 10 connected operators are taken as one subgraph to be fused.
(2) By the types of common computing operations:
For example, the computing operation that takes a convolution-type computing operation's result as input is usually of the add type, so the operators to which the convolution-type and add-type computing operations belong can be placed in the same subgraph to be fused.
(3) By the inputs and outputs and by the amount of data the computing operations can process.
In the embodiments of the present application, the computational graph may be divided into at least two subgraphs to be fused in any of, but not limited to, the above three ways.
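A sketch of way (1), under the assumption that the operators are available in topological order (the function name, group size and list representation are illustrative only):

```python
def partition_by_operator_count(operators_in_topo_order, group_size=10):
    # Cut the operator sequence into candidate subgraphs of a fixed size
    # (10 connected operators, per the example above). Ways (2) and (3)
    # would instead cut on computing-operation types or on data volume.
    return [operators_in_topo_order[i:i + group_size]
            for i in range(0, len(operators_in_topo_order), group_size)]

print(partition_by_operator_count(list(range(25))))  # three candidate subgraphs
```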
S203: when a first subgraph to be fused meets the fusion condition, the electronic device fuses the at least two operators included in the first subgraph to be fused into one operator.
The first subgraph to be fused is any one of the at least two subgraphs to be fused. The fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold. This utilization rate is related to the memory size of the set chip and to the total number of computing operations included in the at least two operators of the first subgraph to be fused.
If the first subgraph to be fused includes only one operator, it is a single-operator subgraph and operator fusion cannot be performed. Therefore, one part of the fusion condition is that the first subgraph to be fused includes at least two operators; the other part is that the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold.
The utilization rate threshold is preset, for example to 0.8. The utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is the ratio of that resource amount to the maximum resource amount of the set chip, and the maximum resource amount of the set chip can be obtained from, or calculated from, the chip's factory specifications.
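A numeric illustration of this check, with hypothetical figures (none of these values come from the application):

```python
required = 90e9   # resources the first subgraph to be fused needs on the set chip, e.g. 90 GB/s
chip_max = 100e9  # maximum resource amount from the set chip's factory specifications
threshold = 0.8   # the preset utilization rate threshold

utilization = required / chip_max  # 0.9
print(utilization > threshold)     # True: the subgraph's operators may be fused
```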
S204: the electronic device sends the neural network model obtained after operator fusion to the set chip.
Since the neural network model runs on the set chip, the neural network model obtained after operator fusion is sent to the set chip.
S205: the set chip obtains a neural network model running instruction and runs the operator-fused neural network model.
The neural network model running instruction may be used for neural network model training or for neural network model inference. After obtaining the running instruction for the neural network model, the set chip can process the training data or the inference data.
Based on the above embodiments, the fusion condition includes that the utilization rate of the amount of resources required for the first subgraph to be fused to run on the set chip is greater than the utilization rate threshold. To calculate this utilization rate, the amount of resources required for the first subgraph to be fused to run on the set chip can be calculated first.
Referring to FIG. 4, the process of calculating the amount of resources required for the first subgraph to be fused to run on the set chip is described below.
The amount of resources includes computing power and/or bandwidth.
The amount of resources required for the first subgraph to be fused to run on the set chip is determined according to the memory size of the set chip, the memory reuse capability supported by the operation instructions of the set chip, and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
The memory size of the set chip usually refers to the on-chip memory. The memory reuse capability supported by the operation instructions of the set chip usually covers two cases: reusable and non-reusable. If the computational graph takes the form of a tree, a computing operation can be represented by a computing node, and the total number of computing operations included in the at least two operators of the first subgraph to be fused is then the total number of computing nodes.
S400: the electronic device receives hardware information from the set chip.
The hardware information includes the memory reuse capability supported by the operation instructions of the set chip and the memory size of the set chip.
S401,电子设备在设定芯片的运算指令所支持的内存复用能力为可复用,且第一待融合子图中至少两个算子所包括的计算操作产生的数据类型相同时,确定运行第一待融合子图所需的内存份数。S401, the electronic device determines the number of memory copies required to run the first sub-graph to be fused when the memory reuse capability supported by the chip's operation instructions is set to be reusable and the data types generated by the calculation operations included in at least two operators in the first sub-graph to be fused are the same.
其中,内存份数为第一待融合子图中至少两个算子所包括的计算操作的总数量与第一待融合子图中至少两个算子所包括的输出型计算操作的数量的差。输出型计算操作满足以输出型计算操作的运算结果作为输入的计算操作的数量小于第二数量阈值。例如,数量阈值为2,也就是说,以输出型计算操作的运算结果作为输入的计算操作的数量为0或者1。The number of memory copies is the difference between the total number of computing operations included in at least two operators in the first subgraph to be fused and the number of output computing operations included in at least two operators in the first subgraph to be fused. The output computing operation satisfies that the number of computing operations that use the computing results of the output computing operations as input is less than the second quantity threshold. For example, the quantity threshold is 2, that is, the number of computing operations that use the computing results of the output computing operations as input is 0 or 1.
通常情况下,第一待融合子图中至少两个算子所包括的计算操作产生的数据类型相同时,比如均为int8类型,设定芯片的运算指令所支持的内存复用能力为可用的情况是有效的。也就是说,如果第一待融合子图中至少两个算子所包括的计算操作产生的数据类型不相同,也无法利用设定芯片的运算指令所支持的内存复用能力。Generally, when the data types generated by the calculation operations included in at least two operators in the first subgraph to be fused are the same, for example, both are int8 types, it is effective to set the memory reuse capability supported by the chip's operation instructions to be available. In other words, if the data types generated by the calculation operations included in at least two operators in the first subgraph to be fused are different, the memory reuse capability supported by the chip's operation instructions cannot be used.
在这种情况中,确定第一待融合子图中至少两个算子所包括的计算操作的总数量p,第一待融合子图中至少两个算子所包括的输出型计算操作的数量q。以输出型计算操作的运算结果作为输入的计算操作的数量为0或者1,则计算操作所在的节点可复用,因此,差值p-q为运行第一待融合子图中至少两个算子所需的内存分数F。In this case, the total number p of computing operations included in the at least two operators in the first subgraph to be fused and the number q of output computing operations included in the at least two operators in the first subgraph to be fused are determined. If the number of computing operations with the operation results of the output computing operations as input is 0 or 1, the nodes where the computing operations are located can be reused, so the difference p-q is the memory fraction F required to run the at least two operators in the first subgraph to be fused.
In a concrete example, take the computing operation a = b + c. The total number of computing operations included in the at least two operators of the first subgraph to be fused is 3; memory share 1 stores data b, memory share 2 stores data c, and the computation type of the operation is summation. In the reusable case, data c can be added onto data b in place to form the sum a; that is, data b and data a can share one memory share, and no additional share needs to be set aside to store a. In this example, the required number of memory shares is therefore 3 - 1 = 2.
If the number of computing operations that take an output-type computing operation's result as input is greater than 1, the node holding that computing operation cannot be reused, and p is then the number of memory shares required to run the at least two operators of the first subgraph to be fused. Likewise, if the memory reuse capability supported by the designated chip's operation instructions is non-reusable, the total number p of computing operations included in the at least two operators of the first subgraph to be fused is determined as the number of memory shares required to run them.
In another concrete example, again with a = b + c, the total number p of computing operations included in the at least two operators of the first subgraph to be fused is 3; memory share 1 stores data b, memory share 2 stores data c, and the computation type is summation. In the non-reusable case, data b, data c, and the sum a each require their own memory share, so the number of memory shares required to run the operators of the first subgraph to be fused is also 3.
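The two cases above can be condensed into one hedged sketch that reuses the OpNode/Subgraph structures defined earlier. The rule f = p - q (reusable, uniform data type) versus f = p (otherwise) and the threshold of 2 follow the text; identifying output-type operations purely by consumer count is an assumption read off the definition above:

```python
SECOND_QUANTITY_THRESHOLD = 2  # per the example above: 0 or 1 downstream consumers

def memory_shares(sg: Subgraph, instructions_support_reuse: bool) -> int:
    p = sg.total_ops()
    same_dtype = len({n.dtype for n in sg.nodes}) == 1
    if not instructions_support_reuse or not same_dtype:
        return p  # reuse cannot be exploited: one memory share per operation
    # q: output-type computing operations, i.e. operations whose results are
    # consumed by fewer than SECOND_QUANTITY_THRESHOLD downstream operations.
    q = sum(1 for n in sg.nodes if len(n.consumers) < SECOND_QUANTITY_THRESHOLD)
    return p - q  # f = p - q
```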
Refer to Figure 5, a schematic diagram of the first subgraph to be fused. Here, 51 is an output-type computing operation whose result is taken as input by 0 computing operations; 52 is an output-type computing operation whose result is taken as input by 1 computing operation; and 53 is likewise an output-type computing operation whose result is taken as input by 1 computing operation.
In addition, Figure 5 includes addition, subtraction, multiplication, division, square-root, back-propagation, and data-conversion computing operations, among others. The form of each computing operation is illustrated below with an example:
An addition computing operation is, for example, add_14float32[1,2,1,1,16], where add is the computation type, 14 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A subtraction computing operation is, for example, sub_3float32[1,2,1,1,16], where sub is the computation type, 3 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A multiplication computing operation is, for example, mul_1float32[1,2,1,1,16], where mul is the computation type, 1 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A division computing operation is, for example, div_6float32[1,2,1,1,16], where div is the computation type, 6 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A square-root computing operation is, for example, sqrt_5float32[1,2,1,1,16], where sqrt is the computation type, 5 is the array dimension, float32 is the data type, and [1,2,1,1,16] is the data tensor of the computing operation.
A back-propagation computing operation is, for example, broadcast_tenser_1float32[256,2,112,112,16], where broadcast_tenser is the computation type, 1 is the array dimension, float32 is the data type, and [256,2,112,112,16] is the data tensor of the computing operation.
A data-conversion computing operation is, for example, cast_0float32[256,2,112,112,16], where cast is the computation type, 0 is the array dimension, float32 is the data type, and [256,2,112,112,16] is the data tensor of the computing operation.
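As a reading aid only, the descriptor grammar can be inferred from the examples above as "&lt;type&gt;_&lt;array dimension&gt;&lt;dtype&gt;[&lt;data tensor&gt;]"; this layout is an assumption, not defined by the patent. A small parser splits such strings into their four fields:

```python
import re

# Assumed layout: "<type>_<array dimension><dtype>[<data tensor>]".
DESCRIPTOR = re.compile(r"^([a-z_]+?)_(\d+)([a-z]+\d+)\[([\d,]+)\]$")

def parse_op_descriptor(desc: str) -> dict:
    m = DESCRIPTOR.match(desc)
    if m is None:
        raise ValueError(f"unrecognized descriptor: {desc}")
    op_type, dim, dtype, tensor = m.groups()
    return {
        "type": op_type,                               # computation type, e.g. "add"
        "dimension": int(dim),                         # array dimension field
        "dtype": dtype,                                # data type, e.g. "float32"
        "tensor": [int(x) for x in tensor.split(",")]  # data tensor
    }

print(parse_op_descriptor("add_14float32[1,2,1,1,16]"))
# -> {'type': 'add', 'dimension': 14, 'dtype': 'float32', 'tensor': [1, 2, 1, 1, 16]}
```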
S402: The electronic device determines, according to the memory size of the designated chip, the number of memory shares required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused when run on the designated chip.
In some possible designs, different computing operations usually have different running rates. Taking a third computing operation as an example, the calculation of its running rate is described below; the third computing operation is any one of the computing operations included in the at least two operators of the first subgraph to be fused.
(1) The ratio of the memory size of the designated chip to the number of memory shares required to run the operators of the first subgraph to be fused is determined as the memory size required for the third computing operation to run on the chip.
One memory share stores the result of one computing operation; therefore, dividing the memory size M of the designated chip by the number of memory shares f required to run the operators of the first subgraph to be fused gives the memory size required for the third computing operation to run on the chip.
Thus the memory size required for the third computing operation to run on the designated chip satisfies m = M/f, where the third computing operation is any computing operation included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the designated chip, M is the memory size of the designated chip, and f is the number of memory shares required to run the first subgraph to be fused.
(2) The ratio of the memory size required for the third computing operation to run on the chip to the duration required to execute the third computing operation is determined as the running rate corresponding to the third computing operation.
Since different computing operations take different durations to execute, if there are S computing operations, S running rates are determined. For the third computing operation, for example, the ratio of the memory size it requires to run on the chip to the duration required to execute it can be determined as its running rate.
For example, the running rate of the third computing operation satisfies V = m/t, where V is the running rate of the third computing operation and t is the duration required to execute it.
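Steps (1) and (2) amount to two one-line formulas. A sketch follows, where M, f, and t are assumed inputs in consistent units (for example, bytes and seconds):

```python
def op_memory(M: float, f: int) -> float:
    """Memory available to one computing operation on the chip: m = M / f."""
    return M / f

def op_running_rate(M: float, f: int, t: float) -> float:
    """Running rate of a computing operation: V = m / t."""
    return op_memory(M, f) / t

# e.g. a 2 MiB on-chip memory, f = 2 memory shares, an operation taking 1 microsecond:
print(op_running_rate(2 * 1024 * 1024, 2, 1e-6))  # ~1.05e12 (bytes per second)
```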
S403: The electronic device performs unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused, obtaining the amount of resources required for each such computing operation to run on the designated chip.
Here, the amount of resources required for a fourth computing operation to run on the designated chip includes bandwidth; the fourth computing operation is any one of the computing operations included in the at least two operators of the first subgraph to be fused. According to the conversion relationship between running rate and bandwidth, the running rate of the fourth computing operation is converted into the bandwidth it requires to run on the designated chip.
For example, the conversion relationship between running rate and bandwidth is BW = h*k, where k is the running rate and BW is the bandwidth.
Alternatively, the amount of resources required for a fifth computing operation to run on the designated chip includes computing power; the fifth computing operation is any one of the computing operations included in the at least two operators of the first subgraph to be fused. According to the conversion relationship between running rate and computing power, the running rate of the fifth computing operation is converted into the computing power it requires to run on the designated chip.
For example, the conversion relationship between running rate and computing power is F = s*k, where k is the running rate and F is the computing power.
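Both conversions are linear in the running rate. A sketch, where the coefficients h and s are assumed chip-specific constants supplied by the platform (the text does not define them beyond the two formulas):

```python
def rate_to_bandwidth(k: float, h: float) -> float:
    """Convert running rate k to required bandwidth: BW = h * k."""
    return h * k

def rate_to_compute_power(k: float, s: float) -> float:
    """Convert running rate k to required computing power: F = s * k."""
    return s * k
```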
S404: The electronic device takes the minimum amount of resources among those required for each computing operation included in the first subgraph to be fused to run on the designated chip as the amount of resources required for the first subgraph to be fused to run on the designated chip.
Since only one computing operation executes at any given moment, the minimum of the amounts of resources required by the individual computing operations to run on the designated chip is the amount of resources required for the first subgraph to be fused to run on the designated chip. With this design, the bandwidth BW0 and computing power F0 required for the first subgraph to be fused to run on the designated chip are obtained.
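Putting S403 and S404 together, a sketch (rates is the list of per-operation running rates from S402; h and s are the assumed conversion coefficients from the previous sketch):

```python
from typing import Iterable, Tuple

def subgraph_resources(rates: Iterable[float], h: float, s: float) -> Tuple[float, float]:
    rates = list(rates)
    # Only one computing operation runs at a time, so the subgraph's
    # requirement is the minimum over its operations.
    bw0 = min(rate_to_bandwidth(k, h) for k in rates)
    f0 = min(rate_to_compute_power(k, s) for k in rates)
    return bw0, f0  # bandwidth BW0 and computing power F0
```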
In another possible design, if the first subgraph to be fused does not meet the fusion condition, this indicates that, if the at least two operators included in the first subgraph to be fused were fused, the designated chip's performance would not reach a favorable state when the resulting neural network model runs on it. In that case, the first subgraph to be fused is split into a second subgraph to be fused and a third subgraph to be fused, and the fusion condition is then applied to each of them in turn. When the second subgraph to be fused meets the fusion condition, the at least two operators it includes are fused into one operator; when the third subgraph to be fused meets the fusion condition, the at least two operators it includes are fused into one operator. For the fusion condition and the fusion process, refer to the description of the first subgraph to be fused, which is not repeated here.
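The overall flow can be summarized as a recursive fuse-or-split driver. This is a sketch under stated assumptions: meets_fusion_condition and split are supplied callables standing in for the utilization check and the reverse-search split described below, and the actual fusion of each returned subgraph's operators is left to the caller:

```python
from typing import Callable, List, Tuple

def fuse_or_split(
    sg: Subgraph,
    meets_fusion_condition: Callable[[Subgraph], bool],
    split: Callable[[Subgraph], Tuple[Subgraph, Subgraph]],
) -> List[Subgraph]:
    """Return the list of subgraphs whose operators should each be fused."""
    if sg.total_ops() < 2:
        return [sg]  # too small to fuse (the condition requires at least two operators)
    if meets_fusion_condition(sg):
        return [sg]  # fuse all of this subgraph's operators into one operator
    second, third = split(sg)  # second and third subgraphs to be fused
    return (fuse_or_split(second, meets_fusion_condition, split)
            + fuse_or_split(third, meets_fusion_condition, split))
```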
Referring to Figure 6, a schematic diagram of an operator fusion process is shown. In this example, the first subgraph to be fused includes four operators; fusing all four does not meet the fusion condition. The subgraph is then split, yielding a second subgraph to be fused that includes operator 1, operator 2, and operator 3, and a third subgraph to be fused that includes operator 4. The second subgraph to be fused meets the fusion condition, so its operator 1, operator 2, and operator 3 are fused. The third subgraph to be fused is a single-operator subgraph, so no operator fusion can be performed on it.
Refer to Figure 7, a schematic diagram of splitting the subgraph to be fused. During splitting, starting from a first output-type computing operation of the first subgraph to be fused (such as 71), the computing operations of the first subgraph to be fused are traversed by reverse search. The first output-type computing operation is any output-type computing operation of the first subgraph to be fused, and it satisfies the condition that the number of computing operations taking its operation result as input is less than a first quantity threshold.
The first quantity threshold may be 1, in which case the number of computing operations that take the first output-type computing operation's result as input is 0; that is, no computing operation takes the result of the first output-type computing operation as input.
When the type of a second computing operation reached during traversal meets the splitting condition, the first subgraph to be fused is split into the second subgraph to be fused and the third subgraph to be fused: the subgraph of the first subgraph to be fused rooted at the second computing operation becomes the second subgraph to be fused, and the remainder of the first subgraph to be fused becomes the third subgraph to be fused. The second computing operation meets the splitting condition when its computation type is a set type (for example, the back-propagation type) or when its computation type is the same as that of the computing operation that is its parent node.
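A sketch of the split-point search under these assumptions: the subgraph is a tree using the OpNode structure defined earlier, each node's single consumer is its parent node, and SET_TYPES is a hypothetical placeholder for the patent's "set type" (for example, the back-propagation type). This illustrates the stated rule, not the patent's authoritative algorithm:

```python
from typing import Optional

SET_TYPES = {"broadcast_tenser"}  # hypothetical placeholder for the set type

def find_second_op(first_output_op: OpNode) -> Optional[OpNode]:
    """Reverse search from a first output-type computing operation."""
    stack = list(first_output_op.inputs)
    while stack:
        op = stack.pop()
        parent = op.consumers[0] if op.consumers else None  # in a tree: one parent
        if op.op_type in SET_TYPES or (parent is not None
                                       and op.op_type == parent.op_type):
            # Splitting condition met: the subtree rooted at op becomes the
            # second subgraph to be fused; the remainder, the third.
            return op
        stack.extend(op.inputs)
    return None  # no split point found along this traversal
```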
Based on the same inventive concept as the method embodiments, an embodiment of this application further provides an operator fusion apparatus for a neural network, configured to perform the operator fusion method for a neural network in the method embodiments shown above; for the related features, refer to those method embodiments, which are not repeated here. As shown in Figure 8, the operator fusion apparatus for a neural network includes a transmission unit 801 and a processing unit 802.
The transmission unit 801 is configured to obtain a neural network model.

The processing unit 802 is configured to determine a computation graph corresponding to the neural network model, where the computation graph describes the connection relationships among multiple operators in the neural network model, and each of the multiple operators performs at least one computing operation.

The processing unit 802 is further configured to determine at least two subgraphs to be fused in the computation graph, where any one of the at least two subgraphs to be fused includes at least one operator.

The processing unit 802 is further configured to, when a first subgraph to be fused meets the fusion condition, fuse the at least two operators included in the first subgraph to be fused into one operator.

The first subgraph to be fused is any one of the at least two subgraphs to be fused.

The fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization of the amount of resources required for the first subgraph to be fused to run on the designated chip is greater than a utilization threshold, where that utilization is related to the memory size of the designated chip and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
In some exemplary implementations, the processing unit 802 is further configured to:

when the first subgraph to be fused does not meet the fusion condition, split the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused;

when the second subgraph to be fused meets the fusion condition, fuse the at least two operators included in the second subgraph to be fused into one operator; and/or,

when the third subgraph to be fused meets the fusion condition, fuse the at least two operators included in the third subgraph to be fused into one operator.
In some exemplary implementations, the processing unit 802 is specifically configured to:

starting from a first output-type computing operation of the first subgraph to be fused, traverse the computing operations of the first subgraph to be fused by reverse search, where the first output-type computing operation is any output-type computing operation of the first subgraph to be fused and satisfies that the number of computing operations taking its operation result as input is less than a first quantity threshold;

when a second computing operation currently reached in the traversal meets the splitting condition, take the subgraph of the first subgraph to be fused rooted at the second computing operation as the second subgraph to be fused, and take the subgraphs of the first subgraph to be fused other than the subgraph rooted at the second computing operation as the third subgraph to be fused;

where the second computing operation meets the splitting condition when its computation type is a set type or its computation type is the same as that of the computing operation that is its parent node.
In some exemplary implementations, the processing unit 802 is further configured to:

determine the amount of resources required for the first subgraph to be fused to run on the designated chip according to the memory size of the designated chip, the memory reuse capability supported by the designated chip's operation instructions, and the total number of computing operations included in the at least two operators of the first subgraph to be fused.
In some exemplary implementations, the processing unit 802 is specifically configured to:

when the memory reuse capability supported by the designated chip's operation instructions is reusable and the computing operations included in the at least two operators of the first subgraph to be fused produce data of the same type, determine the number of memory shares required to run the first subgraph to be fused, the number of memory shares being the difference between the total number of computing operations included in the at least two operators of the first subgraph to be fused and the number of output-type computing operations among them, where an output-type computing operation is one whose operation result is taken as input by fewer computing operations than a second quantity threshold;

determine, according to the memory size of the designated chip, the number of memory shares required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused when run on the designated chip;

perform unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused to obtain the amount of resources required for each such computing operation to run on the designated chip;

take the minimum amount of resources among those required for each computing operation included in the first subgraph to be fused to run on the designated chip as the amount of resources required for the first subgraph to be fused to run on the designated chip.
In some exemplary implementations, the processing unit 802 is specifically configured such that the memory size required for a third computing operation to run on the designated chip satisfies m = M/f;

where the third computing operation is any one of the computing operations included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the designated chip, M is the memory size of the designated chip, and f is the number of memory shares required to run the first subgraph to be fused;

the running rate of the third computing operation satisfies V = m/t;

where V is the running rate of the third computing operation and t is the duration required to execute the third computing operation.
In some exemplary implementations, the amount of resources includes bandwidth and/or computing power.

The amount of resources required for a fourth computing operation to run on the designated chip includes bandwidth, the fourth computing operation being any one of the computing operations included in the at least two operators of the first subgraph to be fused; the processing unit 802 is specifically configured to convert, according to the conversion relationship between running rate and bandwidth, the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the designated chip. Alternatively,

the amount of resources required for a fifth computing operation to run on the designated chip includes computing power, the fifth computing operation being any one of the computing operations included in the at least two operators of the first subgraph to be fused; the processing unit 802 is specifically configured to convert, according to the conversion relationship between running rate and computing power, the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the designated chip.
It should be noted that the division into modules in the embodiments of this application is schematic and is merely a division by logical function; other divisions are possible in actual implementations. The functional modules in the embodiments of this application may be integrated into one processing module, may each exist physically on their own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
The above methods may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented by software, they may be realized wholly or partly in the form of a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center containing one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
In a simple embodiment, those skilled in the art will appreciate that the electronic device or server in the embodiments may take the form shown in Figure 9.
The apparatus 900 shown in Figure 9 includes at least one processor 901 and a memory 902, and optionally may further include a communication interface 903.
The memory 902 may be a volatile memory, such as a random access memory; it may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 902 may be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 902 may also be a combination of the above memories.
The embodiments of this application do not limit the specific connection medium between the processor 901 and the memory 902.
The processor 901 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, an artificial intelligence chip, a system on a chip, or the like. A general-purpose processor may be a microprocessor or any conventional processor. In the apparatus of Figure 9, an independent data transceiver module, such as the communication interface 903, may also be provided for sending and receiving data; when communicating with other devices, the processor 901 may transmit data through the communication interface 903.
In one possible application scenario, the computing device takes the form shown in Figure 9, and the processor 901 in Figure 9 may invoke the computer-executable instructions stored in the memory 902 so that the computing device can perform the operator fusion method for a neural network in any of the above method embodiments.
Specifically, the functions/implementation processes of the transmission unit 801 and the processing unit 802 of Figure 8 may both be implemented by the processor 901 in Figure 9 invoking the computer-executable instructions stored in the memory 902. Alternatively, the functions/implementation processes of the processing unit of Figure 8 may be implemented by the processor 901 in Figure 9 invoking the computer-executable instructions stored in the memory 902, while the transmission functions/implementation processes of Figure 8 may be implemented through the communication interface 903 in Figure 9.
Those skilled in the art will appreciate that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and variations to this application without departing from its scope. Thus, if these modifications and variations of this application fall within the scope of the claims of this application and their equivalent technologies, this application is intended to encompass them as well.
Claims (17)
- An operator fusion method for a neural network, characterized by comprising:
obtaining a neural network model, and determining a computation graph corresponding to the neural network model, wherein the computation graph is used to describe connection relationships among multiple operators in the neural network model, and each of the multiple operators is used to perform at least one computing operation;
determining at least two subgraphs to be fused in the computation graph, wherein any one of the at least two subgraphs to be fused includes at least one operator; and
when a first subgraph to be fused meets a fusion condition, fusing at least two operators included in the first subgraph to be fused into one operator;
wherein the first subgraph to be fused is any one of the at least two subgraphs to be fused; and
the fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization of the amount of resources required for the first subgraph to be fused to run on a designated chip is greater than a utilization threshold, the utilization of the amount of resources required for the first subgraph to be fused to run on the designated chip being related to the memory size of the designated chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- The method according to claim 1, characterized in that the method further comprises:
when the first subgraph to be fused does not meet the fusion condition, splitting the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused;
when the second subgraph to be fused meets the fusion condition, fusing at least two operators included in the second subgraph to be fused into one operator; and/or,
when the third subgraph to be fused meets the fusion condition, fusing at least two operators included in the third subgraph to be fused into one operator.
- The method according to claim 2, characterized in that splitting the first subgraph to be fused into the second subgraph to be fused and the third subgraph to be fused comprises:
starting from a first output-type computing operation of the first subgraph to be fused, traversing the computing operations of the first subgraph to be fused by reverse search, wherein the first output-type computing operation is any output-type computing operation of the first subgraph to be fused and satisfies that the number of computing operations taking the operation result of the first output-type computing operation as input is less than a first quantity threshold;
when a second computing operation currently reached in the traversal meets a splitting condition, taking the subgraph of the first subgraph to be fused rooted at the second computing operation as the second subgraph to be fused, and taking the subgraphs of the first subgraph to be fused other than the subgraph rooted at the second computing operation as the third subgraph to be fused;
wherein the second computing operation meets the splitting condition when the computation type of the second computing operation is a set type or the computation type of the second computing operation is the same as that of the computing operation that is the parent node of the second computing operation.
- The method according to any one of claims 1 to 3, characterized in that the method further comprises:
determining the amount of resources required for the first subgraph to be fused to run on the designated chip according to the memory size of the designated chip, the memory reuse capability supported by the operation instructions of the designated chip, and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- The method according to claim 4, characterized in that determining the amount of resources required for the first subgraph to be fused to run on the designated chip according to the memory size of the designated chip, the memory reuse capability supported by the operation instructions of the designated chip, and the total number of computing operations included in the at least two operators in the first subgraph to be fused comprises:
when the memory reuse capability supported by the operation instructions of the designated chip is reusable and the computing operations included in the at least two operators in the first subgraph to be fused produce data of the same type, determining the number of memory shares required to run the first subgraph to be fused, the number of memory shares being the difference between the total number of computing operations included in the at least two operators in the first subgraph to be fused and the number of output-type computing operations included in the at least two operators in the first subgraph to be fused, wherein an output-type computing operation satisfies that the number of computing operations taking the operation result of the output-type computing operation as input is less than a second quantity threshold;
determining, according to the memory size of the designated chip, the number of memory shares required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused when the designated chip runs the first subgraph to be fused;
performing unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused to obtain the amount of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip; and
taking the minimum amount of resources among the amounts of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip as the amount of resources required for the first subgraph to be fused to run on the designated chip.
- The method according to claim 5, characterized in that the memory size required for a third computing operation to run on the designated chip satisfies m = M/f;
wherein the third computing operation is any one of the computing operations included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the designated chip, M is the memory size of the designated chip, and f is the number of memory shares required to run the first subgraph to be fused;
the running rate of the third computing operation satisfies V = m/t;
wherein V is the running rate of the third computing operation, and t is the duration required to execute the third computing operation.
- The method according to claim 5 or 6, characterized in that the amount of resources includes bandwidth and/or computing power;
and performing unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused to obtain the amount of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip comprises:
the amount of resources required for a fourth computing operation to run on the designated chip including bandwidth, the fourth computing operation being any one of the computing operations included in the at least two operators in the first subgraph to be fused, converting, according to the conversion relationship between running rate and bandwidth, the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the designated chip; or,
the amount of resources required for a fifth computing operation to run on the designated chip including computing power, the fifth computing operation being any one of the computing operations included in the at least two operators in the first subgraph to be fused, converting, according to the conversion relationship between running rate and computing power, the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the designated chip.
- An operator fusion apparatus for a neural network, characterized by comprising:
a transmission unit, configured to obtain a neural network model; and
a processing unit, configured to determine a computation graph corresponding to the neural network model, wherein the computation graph is used to describe connection relationships among multiple operators in the neural network model, and each of the multiple operators is used to perform at least one computing operation;
the processing unit being further configured to determine at least two subgraphs to be fused in the computation graph, wherein any one of the at least two subgraphs to be fused includes at least one operator;
the processing unit being further configured to, when a first subgraph to be fused meets a fusion condition, fuse at least two operators included in the first subgraph to be fused into one operator;
wherein the first subgraph to be fused is any one of the at least two subgraphs to be fused; and
the fusion condition includes: the first subgraph to be fused includes at least two operators, and the utilization of the amount of resources required for the first subgraph to be fused to run on a designated chip is greater than a utilization threshold, the utilization of the amount of resources required for the first subgraph to be fused to run on the designated chip being related to the memory size of the designated chip and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- The operator fusion apparatus according to claim 8, characterized in that the processing unit is further configured to:
when the first subgraph to be fused does not meet the fusion condition, split the first subgraph to be fused into a second subgraph to be fused and a third subgraph to be fused;
when the second subgraph to be fused meets the fusion condition, fuse at least two operators included in the second subgraph to be fused into one operator; and/or,
when the third subgraph to be fused meets the fusion condition, fuse at least two operators included in the third subgraph to be fused into one operator.
- The operator fusion apparatus according to claim 9, characterized in that the processing unit is specifically configured to:
starting from a first output-type computing operation of the first subgraph to be fused, traverse the computing operations of the first subgraph to be fused by reverse search, wherein the first output-type computing operation is any output-type computing operation of the first subgraph to be fused and satisfies that the number of computing operations taking the operation result of the first output-type computing operation as input is less than a first quantity threshold;
when a second computing operation currently reached in the traversal meets a splitting condition, take the subgraph of the first subgraph to be fused rooted at the second computing operation as the second subgraph to be fused, and take the subgraphs of the first subgraph to be fused other than the subgraph rooted at the second computing operation as the third subgraph to be fused;
wherein the second computing operation meets the splitting condition when the computation type of the second computing operation is a set type or the computation type of the second computing operation is the same as that of the computing operation that is the parent node of the second computing operation.
- The operator fusion apparatus according to any one of claims 8 to 10, characterized in that the processing unit is further configured to:
determine the amount of resources required for the first subgraph to be fused to run on the designated chip according to the memory size of the designated chip, the memory reuse capability supported by the operation instructions of the designated chip, and the total number of computing operations included in the at least two operators in the first subgraph to be fused.
- The operator fusion apparatus according to claim 11, characterized in that the processing unit is specifically configured to:
when the memory reuse capability supported by the operation instructions of the designated chip is reusable and the computing operations included in the at least two operators in the first subgraph to be fused produce data of the same type, determine the number of memory shares required to run the first subgraph to be fused, the number of memory shares being the difference between the total number of computing operations included in the at least two operators in the first subgraph to be fused and the number of output-type computing operations included in the at least two operators in the first subgraph to be fused, wherein the number of computing operations taking the operation result of an output-type computing operation as input is less than a second quantity threshold;
determine, according to the memory size of the designated chip, the number of memory shares required to run the first subgraph to be fused, and the running duration corresponding to the type of each computing operation included in the first subgraph to be fused, the running rate corresponding to each computing operation included in the first subgraph to be fused when the designated chip runs the first subgraph to be fused;
perform unit conversion on the running rate corresponding to each computing operation included in the first subgraph to be fused to obtain the amount of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip; and
take the minimum amount of resources among the amounts of resources required for each computing operation included in the first subgraph to be fused to run on the designated chip as the amount of resources required for the first subgraph to be fused to run on the designated chip.
- The operator fusion apparatus according to claim 12, characterized in that the memory size required for a third computing operation to run on the designated chip satisfies m = M/f;
wherein the third computing operation is any one of the computing operations included in the first subgraph to be fused, m is the memory size required for the third computing operation to run on the designated chip, M is the memory size of the designated chip, and f is the number of memory shares required to run the first subgraph to be fused;
the running rate of the third computing operation satisfies V = m/t;
wherein V is the running rate of the third computing operation, and t is the duration required to execute the third computing operation.
- The operator fusion apparatus according to claim 12 or 13, characterized in that the amount of resources includes bandwidth and/or computing power;
the amount of resources required for a fourth computing operation to run on the designated chip includes bandwidth, the fourth computing operation being any one of the computing operations included in the at least two operators in the first subgraph to be fused, and the processing unit is specifically configured to: convert, according to the conversion relationship between running rate and bandwidth, the running rate of the fourth computing operation into the bandwidth required for the fourth computing operation to run on the designated chip; or,
the amount of resources required for a fifth computing operation to run on the designated chip includes computing power, the fifth computing operation being any one of the computing operations included in the at least two operators in the first subgraph to be fused, and the processing unit is specifically configured to: convert, according to the conversion relationship between running rate and computing power, the running rate of the fifth computing operation into the computing power required for the fifth computing operation to run on the designated chip.
- An operator fusion device, comprising a processor and a memory, wherein the memory is configured to store computer program instructions, and the processor invokes the computer program instructions in the memory to perform the method according to any one of claims 1 to 7.
- A computer program product comprising instructions which, when run by a computing device, cause the computing device to perform the method according to any one of claims 1 to 7.
- A computer-readable storage medium comprising computer program instructions which, when executed by a computing device, cause the computing device to perform the method according to any one of claims 1 to 7.
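The memory-share counting rule in claim 12 above can be illustrated with a small sketch. All names here (`Op`, `count_memory_shares`, the default consumer-count threshold) are illustrative assumptions, not identifiers from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """One computing operation in a subgraph to be fused (illustrative)."""
    name: str
    consumers: list = field(default_factory=list)  # ops that read this op's result

def count_memory_shares(ops, second_threshold=1):
    """Memory shares f = total operations - output-type operations.

    Hedged reading of the claim: an operation is 'output-type' when the
    number of operations consuming its result is below `second_threshold`.
    """
    output_ops = [op for op in ops if len(op.consumers) < second_threshold]
    return len(ops) - len(output_ops)

# Tiny chain a -> b -> c, where only c's result leaves the subgraph.
a, b, c = Op("a"), Op("b"), Op("c")
a.consumers, b.consumers = [b], [c]      # c has no consumers inside the subgraph
print(count_memory_shares([a, b, c]))    # 3 ops - 1 output-type op -> 2 shares
```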
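The two formulas in claim 13 compose directly: each operation gets an equal slice m = M/f of the chip memory, and its running rate follows from how long the operation takes. A minimal numeric sketch, with made-up values for the chip memory, the share count, and the execution time:

```python
def per_op_memory(chip_memory_bytes: float, memory_shares: int) -> float:
    """m = M / f: the memory slice available to any single operation."""
    return chip_memory_bytes / memory_shares

def running_rate(m_bytes: float, t_seconds: float) -> float:
    """V = m / t: bytes handled per second by one operation."""
    return m_bytes / t_seconds

# Hypothetical numbers: 2 MiB of on-chip memory, f = 2 shares, t = 50 us.
m = per_op_memory(2 * 1024 * 1024, 2)  # -> 1 MiB per operation
V = running_rate(m, 50e-6)             # -> ~2.1e10 bytes/s
print(m, V)
```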
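Claim 14's unit conversion and claim 12's minimum rule can be read together as sketched below. The assumption that a running rate maps one-to-one to a bandwidth demand, and scales to computing power by a FLOPs-per-byte factor, is mine for illustration; the claims only state that a conversion relationship exists:

```python
def rate_to_bandwidth(rate_bytes_per_s: float) -> float:
    """Bandwidth view: the running rate is taken directly as a bandwidth demand."""
    return rate_bytes_per_s

def rate_to_compute(rate_bytes_per_s: float, flops_per_byte: float) -> float:
    """Computing-power view: scale the byte rate by an assumed FLOPs-per-byte factor."""
    return rate_bytes_per_s * flops_per_byte

# Per-operation running rates in bytes/s (hypothetical values).
rates = [2.1e10, 1.0e10, 3.5e10]

# Convert every operation's rate, then take the minimum over all operations
# as the whole subgraph's requirement (the minimum rule of claim 12).
subgraph_bandwidth = min(rate_to_bandwidth(r) for r in rates)   # -> 1.0e10 bytes/s
subgraph_compute = min(rate_to_compute(r, 4.0) for r in rates)  # -> 4.0e10 FLOP/s
print(subgraph_bandwidth, subgraph_compute)
```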
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211584001.6A CN118171683A (en) | 2022-12-09 | 2022-12-09 | Operator fusion method and related device for neural network |
CN202211584001.6 | 2022-12-09 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024120050A1 (en) | 2024-06-13 |
Family
ID=91347334
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/127261 WO2024120050A1 (en) | Operator fusion method used for neural network, and related apparatus | 2022-12-09 | 2023-10-27 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118171683A (en) |
WO (1) | WO2024120050A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110490309A (en) * | 2019-08-14 | 2019-11-22 | Beijing Zhongke Cambricon Technology Co., Ltd. | Operator fusion method for a neural network and related product |
CN111260019A (en) * | 2020-02-18 | 2020-06-09 | Shenzhen Corerain Technologies Co., Ltd. | Data processing method, device and equipment of neural network model and storage medium |
US20210182036A1 (en) * | 2019-12-12 | 2021-06-17 | Huawei Technologies Co., Ltd. | Hardware platform specific operator fusion in machine learning |
CN114239669A (en) * | 2021-04-14 | 2022-03-25 | Wuxi Jiangnan Institute of Computing Technology | Data multiplexing method based on operator fusion on heterogeneous many-core architecture |
CN114970814A (en) * | 2022-05-17 | 2022-08-30 | Beijing Lynxi Technology Co., Ltd. | Processing method and processing device of neural network computation graph |
- 2022-12-09: Application CN202211584001.6A filed in China; published as CN118171683A (status: pending)
- 2023-10-27: International application PCT/CN2023/127261 filed; published as WO2024120050A1 (status: unknown)
Also Published As
Publication number | Publication date |
---|---|
CN118171683A (en) | 2024-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11018979B2 (en) | System and method for network slicing for service-oriented networks | |
Lin et al. | | Modeling and optimization of performance and cost of serverless applications | |
US9244735B2 (en) | Managing resource allocation or configuration parameters of a model building component to build analytic models to increase the utility of data analysis applications | |
CN110046704B (en) | Deep network acceleration method, device, equipment and storage medium based on data stream | |
EP4268077A1 (en) | Methods, systems, articles of manufacture and apparatus to optimize resources in edge networks | |
US11025561B2 (en) | Systems and methods for computing infrastructure resource allocation | |
CN110537193A (en) | The quick calculating of convolutional neural networks | |
CN114915629A (en) | Information processing method, device, system, electronic equipment and storage medium | |
CN111522640A (en) | Parallel execution method and equipment of computational graph | |
CN116501503B (en) | Architecture mapping method and device for load task, computer equipment and medium | |
JP2023535168A (en) | Run-time environment determination for software containers | |
Kaya et al. | | Seamless computation offloading for mobile applications using an online learning algorithm | |
WO2023222047A1 (en) | Processing method and processing unit for neural network computing graph, and device and medium | |
WO2024120050A1 (en) | Operator fusion method used for neural network, and related apparatus | |
CN115469931B (en) | Instruction optimization method, device, system, equipment and medium of loop program | |
CN113535346A (en) | Method, device and equipment for adjusting number of threads and computer storage medium | |
US20230229514A1 (en) | Intelligent orchestration of classic-quantum computational graphs | |
CN116048759A (en) | Data processing method, device, computer and storage medium for data stream | |
Guo et al. | | Hierarchical design space exploration for distributed CNN inference at the edge | |
CN116701091A (en) | Method, electronic device and computer program product for deriving logs | |
CN118034695A (en) | Calculation map compiling method, compiling device, calculating device and storage medium | |
CN115361332A (en) | Processing method and device for fault-tolerant routing, processor and electronic equipment | |
Qiu et al. | | Virtual network function deployment algorithm based on graph convolution deep reinforcement learning | |
Liang et al. | | HPA: hierarchical placement algorithm for multi-cloud microservices applications | |
US11252061B1 (en) | Distributed computation of graph embeddings for network traffic in virtual networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23899631; Country of ref document: EP; Kind code of ref document: A1 |