WO2023029589A1 - Neural network compilation method and apparatus, device, and storage medium

Info

Publication number: WO2023029589A1
Application number: PCT/CN2022/093058 (CN2022093058W)
Authority: WO (WIPO/PCT)
Prior art keywords: target, neural network, sequence, topological, data
Other languages: French (fr), Chinese (zh)
Inventors: 勾志宏, 胡英俊, 徐宁仪, 曹雨
Original assignee: 上海商汤智能科技有限公司
Application filed by 上海商汤智能科技有限公司
Publication of WO2023029589A1 (en)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

Embodiments of the present disclosure provide a neural network compilation method and apparatus, a device, and a storage medium. In an example of the method, after the computation graph of a neural network to be compiled is determined, a target topological sequence is selected from among the multiple topological sequences of the computation graph, and the neural network is then compiled based on the target topological sequence to obtain machine instructions executable by a target chip. By first screening out the target topological sequence that the target chip executes most efficiently and then compiling the neural network, the computing power of the target chip can be exploited to the greatest extent.

Description

Method, apparatus, device, and storage medium for neural network compilation

Cross-reference statement

This application claims priority to Chinese patent application No. 202111013533.X, filed with the China Patent Office on August 31, 2021, the entire contents of which are incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of artificial intelligence, and in particular to a method, apparatus, device, and storage medium for compilation.

Background

After a neural network has been trained, it needs to be deployed on a target terminal to serve various requirements in different application scenarios. In the related art, when a neural network is deployed on a target terminal, a deep learning compiler generally parses the neural network into a fixed topological sequence according to preset parsing rules, so that during inference the target terminal operates on the network's input data in the operator execution order represented by that fixed topological sequence. Although a deep learning compiler can, during back-end optimization, fuse operators to some extent or adjust the execution order of some of them, these optimizations are still based on the topological sequence originally fed to the compiler. Consequently, when a neural network is deployed on different chips, the execution order of its operators is essentially fixed. However, the computation graph of a neural network usually contains many topological sequences, and for different types of chips, the topological sequence that the network executes most efficiently may differ. As a result, parsing a neural network into one fixed topological sequence according to preset rules may fail to fully exploit a chip's computing power, wasting resources and reducing the network's inference efficiency.

Summary

The present disclosure provides a method, apparatus, device, and storage medium for compilation.

According to a first aspect of the embodiments of the present disclosure, a compilation method is provided. The method includes: determining a computation graph corresponding to a neural network to be compiled, where nodes in the computation graph represent operators in the neural network and edges in the computation graph represent data flows in the neural network; determining a target topological sequence from among multiple topological sequences of the computation graph, where each topological sequence represents a specific execution order of the operators in the neural network; and generating, based on the target topological sequence, machine instructions corresponding to the neural network, so that a target chip executes the machine instructions.
In some embodiments, the target topological sequence is determined based on the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each topological sequence.

In some embodiments, the target chip includes at least two kinds of computing units, and the at least two kinds of computing units are capable of performing different types of operations on input data in parallel.

In some embodiments, determining the target topological sequence from the multiple topological sequences of the computation graph includes: dividing the computation graph into multiple subgraphs, where each subgraph includes at least two sub-topological sequences; for each subgraph, determining a target sub-topological sequence from the at least two sub-topological sequences of the subgraph, where the target sub-topological sequence is determined based on the operation duration for which the target chip operates on the input data in the operator execution orders represented by the at least two sub-topological sequences; and obtaining the target topological sequence based on the target sub-topological sequence of each subgraph.

In some embodiments, dividing the computation graph into multiple subgraphs includes: determining multiple key nodes from among the nodes of the computation graph, where each key node is a convergence point of at least two paths in the computation graph; and dividing the computation graph into multiple subgraphs based on the multiple key nodes.

In some embodiments, splitting the computation graph into multiple subgraphs based on the multiple key nodes includes: forming one subgraph from at least two adjacent key nodes together with the nodes and edges located between the at least two key nodes.

In some embodiments, after the computation graph is split into multiple subgraphs, the method further includes: determining a target subgraph whose number of nodes is less than a preset number; and merging the target subgraph with a neighboring subgraph of the target subgraph.

In some embodiments, determining the machine instructions corresponding to the neural network based on the target topological sequence includes: determining the machine instructions corresponding to each target sub-topological sequence; and linking the machine instructions corresponding to each target sub-topological sequence according to the data flow in the computation graph to obtain the machine instructions corresponding to the neural network.

In some embodiments, the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each topological sequence is determined in one of the following ways: for each topological sequence, determining the machine instructions with which the target chip operates on the input data in the operator execution order represented by the topological sequence, and determining the operation duration based on the time the target chip takes to execute the machine instructions; or, for each topological sequence, determining the operation duration using a preset cost model, where the cost model is used to estimate the operation duration corresponding to the topological sequence from the hardware parameters of the target chip and the operator execution order represented by the topological sequence.

In some embodiments, using a preset cost model to determine the operation duration for which the target chip operates on the input data in the operator execution order represented by the topological sequence includes: determining the machine instructions with which the target chip operates on the input data in the operator execution order represented by each topological sequence; and determining the operation duration based on the preset cost model and the machine instructions.

In some embodiments, determining the computation graph corresponding to the neural network to be compiled includes: parsing the neural network to obtain an original computation graph corresponding to the neural network; and adjusting the operators in the original computation graph according to the memory size of the target chip and the amount of operation data corresponding to each operator in the original computation graph, so as to update the computation graph.

In some embodiments, adjusting the operators in the original computation graph according to the memory size of the target chip and the amount of operation data corresponding to each operator in the original computation graph includes: for each target operator in the original computation graph whose corresponding operation data exceeds a preset threshold, adding to the original computation graph at least one additional operator of the same type as the target operator, so that the operation data is split into multiple pieces that are then processed separately by the target operator and the newly added additional operator(s), where the preset threshold is determined based on the memory size of the target chip; and adjusting the original computation graph based on the newly added additional operator(s) so as to update the computation graph.

In some embodiments, splitting the operation data into multiple pieces includes: determining, based on the type of the target operator and the hardware performance parameters of the target chip, a split dimension along which the operation data is to be split; and splitting the data along the split dimension to obtain multiple pieces of data.

In some embodiments, the operation data includes image data, and the split dimension includes one or more of the following: a frame-number dimension of the image data, a channel dimension of the image data, a width dimension of the image data, and a height dimension of the image data.
According to a second aspect of the embodiments of the present disclosure, a compilation apparatus is provided. The apparatus includes: a computation graph determination module configured to determine a computation graph corresponding to a neural network to be compiled, where nodes in the computation graph represent operators in the neural network and edges in the computation graph represent data flows in the neural network; a screening module configured to determine a target topological sequence from among multiple topological sequences of the computation graph, where each topological sequence represents a specific execution order of the operators in the neural network; and a compilation module configured to determine, based on the target topological sequence, machine instructions corresponding to the neural network, so that a target chip executes the machine instructions.

According to a third aspect of the embodiments of the present disclosure, an electronic device is provided. The electronic device includes a processor, a memory, and computer instructions stored in the memory and executable by the processor; when executing the computer instructions, the processor can implement the method mentioned in the first aspect above.

According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. Computer instructions are stored on the storage medium, and the method mentioned in the first aspect above is implemented when the computer instructions are executed.

In the embodiments of the present disclosure, the computation graph of a neural network to be compiled can be determined; a target topological sequence that the target chip executes more efficiently, i.e., with a shorter operation duration, is screened out from among the multiple topological sequences of the computation graph; and the neural network is then compiled based on the target topological sequence into machine instructions for the target chip to execute. By first screening out a target topological sequence that the target chip executes efficiently and then compiling the neural network, the computing power of the target chip can be exploited to the greatest extent, improving the target chip's processing efficiency during inference based on the neural network.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.

Brief description of the drawings

The accompanying drawings here illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
FIG. 1 is a schematic diagram of a computation graph according to an embodiment of the present disclosure.

FIG. 2 is a flowchart of a neural network compilation method according to an embodiment of the present disclosure.

FIG. 3 is a schematic diagram of a computation graph according to an embodiment of the present disclosure.

FIG. 4 is a schematic diagram of the key nodes of a computation graph and of dividing the computation graph according to an embodiment of the present disclosure.

FIG. 5 is a schematic diagram of adding operators in a computation graph to adjust the computation graph according to an embodiment of the present disclosure.

FIG. 6 is a schematic diagram of the logical structure of a neural network compilation apparatus according to an embodiment of the present disclosure.

FIG. 7 is a schematic diagram of the logical structure of an electronic device according to an embodiment of the present disclosure.
Detailed description

Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.

The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The singular forms "a", "said", and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used herein refers to and includes any or all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of multiple items, or any combination of at least two of multiple items.

It should be understood that although the terms first, second, third, and so on may be used in the present disclosure to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".

In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present disclosure, and to make the above objects, features, and advantages of the embodiments of the present disclosure more apparent and comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.

After a neural network has been trained, it needs to be deployed on a specific target terminal so that it can be put to use. Compared with the training stage, inference at the application stage places higher demands on performance, which requires the deep learning compiler to exploit the hardware's computing power as fully as possible. When a neural network is deployed to an inference chip on a target terminal, the neural network is usually converted into a computation graph by a deep learning inference framework; the computation graph is then optimized, the neural network is compiled into binary instructions based on the optimized computation graph, and the binary instructions are executed on the target terminal to complete the neural network's inference process. Through compilation and optimization, complex neural networks can be applied even on target terminals with limited computing power, such as mobile terminals.
FIG. 1 is a schematic diagram of a computation graph; a computation graph is a directed acyclic graph. Nodes in the graph represent operators in the neural network: for example, Relu denotes activation, Conv denotes convolution, MatMul denotes matrix multiplication, Mean denotes averaging, Add denotes addition, Apxy denotes vector summation, Sub denotes subtraction, and Softmax denotes activation. The edges in the graph indicate the flow of the neural network's input data. Usually, a computation graph may include multiple topological sequences, each of which represents a specific execution order of the operators in the graph. For example, in the computation graph shown in FIG. 1, Relu-Conv1-MatMul-Mean-Add1-Conv2-Apxy-Sub-Add2-Softmax is one topological sequence, while Relu-Conv2-Apxy-Sub-Add2-Conv1-MatMul-Mean-Add1-Softmax is another.
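As an illustration of how one DAG admits many operator orders, the following Python sketch (illustrative only, not part of the disclosure; the graph contents merely echo the FIG. 1 example) enumerates every topological sequence of a small computation graph by backtracking over zero-in-degree nodes:

```python
def all_topological_orders(graph):
    """Enumerate every topological order of a DAG given as {node: [successors]}."""
    indegree = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1

    order, results = [], []

    def backtrack():
        if len(order) == len(graph):
            results.append(list(order))
            return
        for n in graph:
            # A node is schedulable once all of its predecessors have run.
            if indegree[n] == 0 and n not in order:
                for s in graph[n]:
                    indegree[s] -= 1
                order.append(n)
                backtrack()
                order.pop()
                for s in graph[n]:
                    indegree[s] += 1

    backtrack()
    return results

# Simplified fragment of the FIG. 1 graph: two independent branches after Relu.
g = {
    "Relu": ["Conv1", "Conv2"],
    "Conv1": ["MatMul"], "MatMul": [],
    "Conv2": ["Apxy"], "Apxy": [],
}
for seq in all_topological_orders(g):
    print(seq)  # six valid interleavings of the Conv1 and Conv2 branches
```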
In the related art, when a neural network is deployed to a target terminal, it is usually parsed into a fixed topological sequence according to predetermined parsing rules, and the neural network is then compiled into binary instructions based on that fixed sequence for the chip of the target terminal to execute. That is, during inference, the chip on the target terminal operates on the input data in the operator execution order represented by the fixed topological sequence: the execution order of the network's operators on the chip is fixed. However, the computation graph of a neural network may include multiple topological sequences, and for different chips, different operator execution orders may execute with different efficiency. For example, some chips include multiple kinds of computing units, and different kinds of computing units may be used to execute different types of operators; convolution and matrix multiplication, say, may be executed by a dedicated computing unit, while other types of operators are executed by another computing unit. Different operator execution orders (i.e., different topological sequences) may affect the degree of parallelism among the computing units on the chip. As a result, a fixed topological sequence may leave some computing units on the chip idle for a long time (for example, waiting for the computation results of other operators), which in turn degrades the chip's overall inference efficiency.

To address the above problems, the embodiments of the present disclosure provide a compilation method that can, based on the hardware performance of the target chip that will run the neural network, screen out from the topological sequences included in the neural network's computation graph a target topological sequence that executes relatively efficiently on the target chip, obtain the machine instructions corresponding to the neural network based on that target topological sequence, and output them to the target chip for execution, thereby exploiting the target chip's computing power to the greatest extent and improving the efficiency with which the target chip performs inference with the neural network.

The compilation method provided in the embodiments of the present disclosure can be used in deep learning compilation and optimization tools such as deep learning compilers or the toolchains of AI chips. A deep learning compilation and optimization tool can optimize a neural network and compile it into machine-recognizable binary instructions for output to the chip on the target terminal for execution.
Specifically, as shown in FIG. 2, the neural network compilation method provided by the embodiments of the present disclosure may include the following steps:

S202: Determine a computation graph corresponding to the neural network to be compiled, where nodes in the computation graph represent operators in the neural network and edges in the computation graph represent data flows in the neural network.

S204: Determine a target topological sequence from among multiple topological sequences of the computation graph, where each topological sequence represents a specific execution order of the operators in the neural network.

S206: Determine, based on the target topological sequence, machine instructions corresponding to the neural network, so that the target chip executes the machine instructions.
In step S202, the neural network to be compiled may be parsed to obtain its computation graph. For example, the Caffe model file corresponding to the neural network may be parsed to determine the computation graph. The neural network to be compiled may be of various kinds, such as a convolutional neural network. The computation graph of a neural network represents the entire computation flow of data from the network's input to its output and is a directed acyclic graph. A node in the computation graph may represent one kind of operator in the neural network, such as convolution, matrix multiplication, activation, addition, division, or averaging, and the direction of the arrows on the edges may represent the data flow in the neural network.

In step S204, after the computation graph of the neural network has been determined, a target topological sequence may be determined from among the multiple topological sequences of the graph. Generally speaking, a computation graph may include multiple topological sequences, each of which represents a specific execution order of the operators in the neural network. Taking the relatively simple computation graph shown in FIG. 3 as an example, the topological sequences it includes are Conv-MatMul-Mean-Softmax and Conv-Mean-MatMul-Softmax; the operators execute in different orders in the different sequences. Different topological sequences may execute with different efficiency on the target chip, so a target topological sequence that the target chip executes more efficiently can be selected from the multiple sequences. For example, the hardware performance parameters of the target chip, such as the kinds and number of computing units in the target chip, the computing power of the computing units, or the memory size, may be used to determine a topological sequence whose execution efficiency is relatively high (e.g., above a certain threshold) as the target topological sequence. Alternatively, based on the time the target chip takes to operate on the neural network's input data in a topological sequence's order and obtain the final output, a topological sequence whose operation duration is less than a preset duration may be determined as the target topological sequence. Any approach that can select a target topological sequence with relatively high execution efficiency from multiple topological sequences is applicable; the embodiments of the present disclosure impose no limitation.

The target chip may be any chip that performs inference with a neural network, such as a CPU, a GPU, various AI chips, or other chips with neural-network inference capability; the embodiments of the present disclosure impose no limitation on this. In step S206, after the target topological sequence has been determined, the neural network may be compiled based on the target topological sequence to obtain machine instructions, and the machine instructions are then input to the target chip so that the target chip can complete the neural network's inference process by executing them.

By determining the computation graph of the neural network to be compiled, screening out from the graph's multiple topological sequences a target topological sequence that the target chip executes efficiently, and then compiling the neural network based on the target topological sequence into machine instructions for the target chip to execute, that is, by first selecting a better target topological sequence according to the target chip's hardware capability and then compiling, the target chip's computing power can be exploited to the greatest extent, improving the target chip's processing efficiency during inference based on the neural network.
In some embodiments, the target topological sequence may be determined from the multiple topological sequences according to the time the target chip takes to operate on the neural network's input data in the operator execution order represented by each sequence. The operation duration reflects the sequence's execution efficiency on the target chip: the shorter the duration, the higher the efficiency. A topological sequence whose operation duration satisfies a certain condition may therefore be selected from the multiple sequences as the target topological sequence; for example, the sequence with the shortest operation duration may be selected. Of course, in some scenarios, traversing every topological sequence to find the one with the shortest operation duration may cost too much time and computing resources. It may therefore suffice to screen out one topological sequence whose operation duration is less than a preset duration, ensuring that the target chip's execution efficiency is reasonably good.
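A minimal sketch of this selection step is given below (illustrative only; the function and parameter names are hypothetical). It supports both strategies described above: exhaustively keeping the fastest sequence, or returning the first sequence that beats a preset duration to avoid traversing everything:

```python
from typing import Callable, Iterable, List, Optional

def pick_target_sequence(
    sequences: Iterable[List[str]],
    measure: Callable[[List[str]], float],  # runtime on the target chip, or a cost-model estimate
    time_budget: Optional[float] = None,    # preset duration; None means exhaustive search
) -> Optional[List[str]]:
    best, best_t = None, float("inf")
    for seq in sequences:
        t = measure(seq)
        if time_budget is not None and t < time_budget:
            return seq  # "good enough" sequence found; stop early
        if t < best_t:
            best, best_t = seq, t
    return best
```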
In some embodiments, the target chip includes at least two kinds of computing units that can perform different types of operations on the neural network's input data in parallel, i.e., that can execute different types of operators in the network in parallel. General-purpose chips such as CPUs and GPUs have only a single computing unit, so there is no situation in which multiple computing units execute different types of operators in parallel; for such chips, the operation durations of different topological sequences may differ little. For a target chip that includes at least two kinds of computing units, however, since those units can execute different types of operators in parallel, different operator execution orders (i.e., different topological sequences) may significantly affect the parallelism among the computing units. Some topological sequences may leave certain computing units idle for long stretches, seriously wasting computing resources and reducing the target chip's processing efficiency. The compilation method provided by the embodiments of the present disclosure is therefore mainly intended to improve the processing efficiency of target chips with at least two kinds of computing units; CPUs, GPUs, and the like have their own corresponding optimization means.

When determining the target topological sequence from the multiple topological sequences of the computation graph, one approach is to directly traverse all topological sequences in the graph, determine the time the target chip takes to operate on the input data in the operator execution order represented by each sequence, and then screen out the target topological sequence based on the determined durations. This approach may be suitable for computation graphs with a simple structure and few topological sequences; for graphs with a more complex structure, traversing all of their topological sequences is cumbersome.

Therefore, in some embodiments, the computation graph may first be divided into multiple subgraphs; a better or optimal target sub-topological sequence is determined for each subgraph, and the target sub-topological sequences are then linked together to obtain the target topological sequence of the entire computation graph. When partitioning, each resulting subgraph must include at least two sub-topological sequences; then, for each subgraph, a target sub-topological sequence can be determined from its at least two sequences, for example based on the time the target chip takes to operate on the input data in the operator execution orders represented by the subgraph's sub-topological sequences. The target sub-topological sequence may be the sub-topological sequence with the shortest operation duration among the subgraph's sequences, or any sub-topological sequence whose operation duration is less than a preset duration, as long as the target chip processes data with relatively high efficiency when executing operators in the order of that target sub-topological sequence. After the target sub-topological sequence of each subgraph is obtained, the sequences can be linked according to the data flow in the computation graph to yield the target topological sequence of the entire graph. By dividing the computation graph into multiple subgraphs, screening out each subgraph's better or optimal target sub-topological sequence, and then linking them into a better or optimal target topological sequence for the whole graph, a complex computation graph can be simplified, easing the screening of better or optimal topological sequences.

Of course, when a subgraph has too many combinations of sub-topological sequences, random selection may be used: a specified number of candidate sub-topological sequences are drawn at random, and a better or optimal target sub-topological sequence is selected from among the candidates. Alternatively, features of the subgraph itself may be summarized and extracted, and methods such as machine learning or Monte Carlo search may be used to optimize the process of finding the optimal sub-topological sequence, improving the processing efficiency of screening the target sub-topological sequence.
In some embodiments, when dividing the computation graph into multiple subgraphs, key nodes may first be determined from the graph's nodes, and the graph may then be divided into subgraphs based on those key nodes. A key node is a convergence point of at least two paths in the computation graph, i.e., a node in the graph at which branches of two or more paths exist. For a node in the computation graph, having two or more branches means that multiple execution orders are possible when passing through it. For example, in FIG. 4, the operator nodes with a gray background are key nodes; as can be seen from FIG. 4, for each key node, at least two paths lead out of the node, or at least two paths converge on it. After the key nodes are determined, the computation graph can be divided into multiple subgraphs based on them; for example, in some embodiments, at least two adjacent key nodes together with the nodes and edges between them may form one subgraph.
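Under the assumption that the computation graph is stored as an adjacency map {node: [successors]}, key-node detection reduces to a degree check, as in the following sketch (illustrative only):

```python
def key_nodes(graph):
    """A node is a key node when at least two paths leave it (fan-out >= 2)
    or at least two paths converge on it (fan-in >= 2)."""
    indegree = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1
    return {n for n in graph if len(graph[n]) >= 2 or indegree[n] >= 2}
```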
For example, in the portion enclosed by dashed box 401 in FIG. 4, two adjacent key nodes and the nodes and edges between them form one subgraph. When the nodes and edges between each pair of adjacent key nodes form a subgraph, each subgraph contains only two key nodes (its start and end), so each subgraph also contains relatively few topological sequences. Of course, several neighboring key nodes together with the nodes and edges between them may also form one subgraph, as in the portion enclosed by dashed box 402 in FIG. 4. In that case each subgraph may contain more than two key nodes, so each subgraph has more topological sequences, but there are fewer subgraphs.
In some embodiments, when the computation graph is divided by treating the portion between each pair of adjacent key nodes as one subgraph, the number of resulting subgraphs may be relatively large, and subsequently compiling and linking the subgraphs takes longer. A target subgraph whose node count is less than a preset number may be determined from the resulting subgraphs and then merged with a neighboring subgraph. For example, the target subgraph may be merged with the subgraph immediately before or after it, reducing the number of subgraphs and saving subsequent compilation and linking time.
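A sketch of this merging pass is shown below (illustrative only; it assumes subgraphs are ordered along the data flow, expose a `nodes` collection, and that a `merge` helper exists; the disclosure equally allows merging with the following subgraph):

```python
def merge_small_subgraphs(subgraphs, min_nodes, merge):
    out = []
    for sg in subgraphs:
        if out and len(sg.nodes) < min_nodes:
            out[-1] = merge(out[-1], sg)  # fold the undersized subgraph into its predecessor
        else:
            out.append(sg)
    return out
```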
In some embodiments, when determining the machine instructions corresponding to the neural network according to the target topological sequence, the machine instructions corresponding to each subgraph's target sub-topological sequence may be determined first, and the machine instructions corresponding to each target sub-topological sequence are then linked according to the data flow indicated in the computation graph to obtain the machine instructions corresponding to the whole neural network. For example, the binary machine code corresponding to each subgraph's target sub-topological sequence may be determined, and a linker may then combine the binary machine code of the different subgraphs according to the data flow indicated in the computation graph, assembling the separately compiled neural-network subgraphs into one complete neural network.

In some embodiments, when determining the time the target chip takes to operate on the input data in the operator execution order represented by each topological sequence, the machine instructions with which the target chip operates on the input data in the order represented by that sequence may be determined for each sequence; the target chip then executes those machine instructions, and the time it takes to execute them is the operation duration for which the target chip operates on the input data in that sequence's operator execution order. This approach requires actually executing each topological sequence's machine instructions once on the target chip; the resulting operation durations are accurate, but the approach requires the target chip's participation and is cumbersome and time-consuming.
In some embodiments, the operation duration corresponding to each topological sequence may instead be estimated with a cost model. Specifically, a cost model may be built in advance that estimates, from the target chip's hardware parameters and the topological sequence, the time the target chip takes to operate on the input data in the operator execution order represented by each sequence. The time the target chip spends executing machine instructions consists mainly of the time to read instructions from the storage device, the time the chip's computing units spend computing on the input data, and some waiting time during the operation (which can generally be neglected). The instruction-read time can be determined from the transmission bandwidth of the target chip's port and the amount of data to be transferred; the compute time can be determined from the amount of data to be computed and the computing power of the target chip's computing units. The cost model's logic is therefore as follows: from the target chip's hardware performance parameters (e.g., which kinds of computing units the chip includes, how many of each kind, each unit's computing power, and the data transmission bandwidth of the interface through which the chip reads data), the size of the input data, and the operator execution order indicated by the topological sequence (e.g., convolve first and then add, or add first and then convolve), it estimates the operation duration. By using the cost model to simulate how each sequence's machine instructions would run on the real target chip, the operation duration corresponding to each topological sequence can be estimated without running those machine instructions on real hardware.
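The following sketch illustrates the shape of such a cost model (illustrative only; all parameters and lookups are assumptions, not figures from the disclosure). It adds an instruction/data-transfer term to a compute term in which each kind of computing unit works through its own operators; a real model would also charge for dependency stalls between units, which is precisely where the operator order matters:

```python
from dataclasses import dataclass

@dataclass
class ChipSpec:
    bandwidth_bytes_per_s: float  # transfer bandwidth of the chip's data port
    unit_flops: dict              # per-unit throughput, e.g. {"MPU": 2e12, "VPU": 5e11}

def estimate_duration(seq, op_bytes, op_flops, op_unit, chip: ChipSpec):
    """seq: operator names in execution order; op_*: per-operator lookup tables."""
    # Time to move instructions and data through the chip's port.
    transfer = sum(op_bytes[op] for op in seq) / chip.bandwidth_bytes_per_s
    # Idealized parallelism: the busiest computing unit dominates compute time.
    busy = {}
    for op in seq:
        unit = op_unit[op]  # which computing unit executes this operator
        busy[unit] = busy.get(unit, 0.0) + op_flops[op] / chip.unit_flops[unit]
    return transfer + max(busy.values(), default=0.0)
```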
In some embodiments, when determining, based on a preset cost model, the time the target chip takes to operate on the input data in the operator execution order represented by each topological sequence, the machine instructions with which the chip operates on the input data in that order may first be determined for each sequence, and the operation duration is then determined from the preset cost model and those machine instructions. Of course, since compiling the neural network into binary instructions from a topological sequence is itself time-consuming, in some embodiments the constructed cost model may be optimized and improved so that it can estimate the operation duration directly from the topological sequence, eliminating the step of compiling the neural network based on the topological sequence and saving the time that step would cost.

In some embodiments, when determining the computation graph corresponding to the neural network to be compiled, the neural network may be parsed to obtain an original computation graph corresponding to the neural network. For some operators in the original graph, the target chip's memory may be too small to hold the operation data corresponding to the operator, so the operator's computation cannot be completed in one pass. An operator's operation data includes the various tensors associated with it, such as its input data, model parameters, and output data. In this case, the operators in the original computation graph may be adjusted according to the target chip's memory size and the amount of operation data corresponding to each operator in the original graph, yielding the neural network's final computation graph.
In some embodiments, when adjusting the operators in the original computation graph according to the target chip's memory size and the amount of operation data corresponding to each operator so as to obtain the neural network's final computation graph, the following may be done for each operator in the original graph. First, the amount of operation data corresponding to the operator is determined; if that amount exceeds a preset threshold then, as shown in FIG. 5, one or more operators of the same type as that operator may be added to the original graph, so that the operation data can be split into multiple pieces, each processed by one of the resulting operators, ensuring that no operator's operation data exceeds the target chip's memory. The preset threshold is determined based on the target chip's memory size; for example, it may be the memory size itself, or the memory size minus a certain buffer amount. The above operation is repeated for every operator in the original computation graph, and the original graph is then adjusted based on the newly added operators to obtain the final computation graph.
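A sketch of this memory-driven rewrite is given below (illustrative only; `graph.nodes`, `graph.attach`, `clone_op`, and `split_tensor` are hypothetical helpers standing in for whatever graph API the compiler uses):

```python
import math

def split_oversized_ops(graph, data_bytes, threshold_bytes, clone_op, split_tensor):
    for op in list(graph.nodes):            # copy: the loop mutates the graph
        size = data_bytes(op)               # input tensors + parameters + output tensors
        if size <= threshold_bytes:
            continue
        n_parts = math.ceil(size / threshold_bytes)
        pieces = split_tensor(op, n_parts)  # split along a chosen dimension (see below)
        graph.attach(op, pieces[0])
        for piece in pieces[1:]:
            clone = clone_op(op)            # new node of the same operator type
            graph.attach(clone, piece)      # rewire edges so the clone handles this piece
    return graph
```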
In some embodiments, when splitting each operator's operation data into multiple pieces, the split dimension along which to split the operation data may be determined according to the operator's type and the target chip's hardware performance parameters, and the operation data is then split along the determined dimension to obtain multiple pieces of data. For example, suppose the input data is 10 frames of 100-channel images; the 10 frames of 100-channel images may be split into 2 pieces of data, each being 5 frames of 100-channel images, or the data may instead be split along the channel dimension into 2 pieces of data, each being 10 frames of 50-channel images. Which split to use can be determined by the operator's type and the target chip's hardware performance parameters: different operators compute differently, so the splits applicable to them differ; and for different target-chip memory sizes, an adapted split must likewise be chosen so that the split data satisfies the target chip's memory limit. For example, for operators such as conv, fc (full connection), and depthwise (depthwise separable convolution), the data is split preferentially along the frame-number dimension of the image data; if that dimension cannot be split, or the result after splitting still exceeds the target chip's memory limit, splitting can continue along the channel dimension.
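The frame-then-channel preference can be captured in a small lookup, as in this sketch (illustrative only; the operator names come from the text above, while the NCHW shape convention and 4-byte elements are assumptions):

```python
PREFERRED_DIMS = {
    "conv": ["N", "C"],       # frame-number dimension first, then channels
    "fc": ["N", "C"],
    "depthwise": ["N", "C"],
}

def choose_split_dim(op_type, shape, threshold_bytes, elem_bytes=4):
    """shape: {"N": frames, "C": channels, "H": height, "W": width}."""
    total = elem_bytes
    for extent in shape.values():
        total *= extent
    for dim in PREFERRED_DIMS.get(op_type, ["N", "C", "H", "W"]):
        # Splitting fully along `dim` yields pieces of size total / shape[dim];
        # accept the first preferred dimension whose finest split fits memory.
        if shape[dim] > 1 and total / shape[dim] <= threshold_bytes:
            return dim
    return None  # no single dimension suffices; split along several in practice
```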
In some embodiments, the operation data may be image data, and the split dimension for splitting the image data includes one or more of the following: the frame dimension of the image data, the channel dimension of the image data, the width dimension of the image data, or the height dimension of the image data.
To further explain the neural network compilation method provided by the embodiments of the present disclosure, it is described below with reference to a specific embodiment.
In the related art, when a neural network is deployed to a target terminal, the neural network is usually parsed into a fixed topological sequence according to preset parsing rules, the neural network is compiled into binary instructions based on that fixed topological sequence, and the compiled binary instructions are output to the target chip of the target terminal for execution. For general-purpose processors such as CPUs or GPUs, which have only a single compute unit, the main concern is handling thread-level and instruction-level parallelism, so this approach has little impact on processing efficiency. However, some AI chips or domain-specific accelerators generally include multiple kinds of compute units, and different types of operators may execute on different types of compute units. For example, some AI chips include a DAU unit for memory access, an MPU unit dedicated to convolution and similar operations, and a VPU unit for vector computation, and these compute units can execute concurrently. In this case, the execution order of the operators in the neural network (corresponding to the topological sequence) affects the degree of parallelism across the compute units: directly using a fixed topological sequence may leave some compute units idle for long stretches (because of data dependencies, for example), degrading overall inference efficiency.
Based on this, the present embodiment provides a neural network compilation method that specifically includes the following steps:
1. Updating the computation graph of the neural network based on the data volume of the operation data corresponding to each operator in the neural network and the memory size of the target chip.
First, the Caffe file of the neural network can be converted into a computation graph. Given the limited memory of the target chip, for each operator of the Caffe model, if the space occupied by the operator's operation data (i.e., its input and output tensors) exceeds a set size (which may be configured based on the memory size of the target chip), the operation data of the operator is split into multiple pieces of data, and one or more identical operators are added to the computation graph, so that each piece of split data corresponds to one operator and the operation data of each operator can be computed on the target chip in a single pass. The same operation is performed for every operator in the computation graph until no operator in the graph requires more memory than the preset size to run on its own, yielding the updated computation graph.
When splitting an operator's operation data, the split dimension can be determined according to the operator type and the memory size of the target chip, and the operation data is split along that dimension. For example, for operators such as conv, fc, and depthwise, splitting is preferentially performed along the image frame dimension; if the frame dimension cannot be split, or the result still exceeds the memory limit of the target chip after splitting, splitting can continue along the image channel dimension.
For the remaining operator types, splitting should avoid, where possible, the dimensions involved in the operator's reduction, choosing instead the dimension with the lowest implementation complexity after splitting. For example, for the resize and pooling operators, which involve reduction over the image height and width dimensions, it is preferable to split the operation data along the image frame dimension or the image channel dimension; for the transpose operator, which operates on the image channel and width dimensions, the operation data can be split along the image frame and height dimensions.
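A minimal sketch of such a preference rule, assuming NCHW dimension names; the mapping below merely encodes the rules of thumb above and is not taken verbatim from the disclosure:

```python
# Dimensions each operator type reduces over or rearranges (NCHW names assumed);
# splitting avoids these and prefers the remaining dimensions in order.
REDUCTION_DIMS = {
    "resize":    {"H", "W"},
    "pooling":   {"H", "W"},
    "transpose": {"C", "W"},
}

def candidate_split_dims(op_type: str) -> list:
    """Preference order for split dimensions: frames, then channels, height,
    width, skipping any dimension the operator reduces over."""
    avoid = REDUCTION_DIMS.get(op_type, set())
    return [d for d in ("N", "C", "H", "W") if d not in avoid]

print(candidate_split_dims("pooling"))    # ['N', 'C']
print(candidate_split_dims("transpose"))  # ['N', 'H']
```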
2. Dividing the updated computation graph into multiple subgraphs
All paths from the input node to the output node of the updated computation graph are traversed and recorded, and the intersection of all paths yields the key nodes of the updated computation graph; a key node is a convergence point of two paths in the graph. Two adjacent key nodes, together with the nodes and edges located between them, form a subgraph. Subgraphs consisting of fewer than a preset number of nodes are then merged into their preceding subgraphs, splitting the entire computation graph of the neural network into multiple subgraphs. The main purpose of merging subgraphs is to reduce the number of subgraphs and save subsequent compilation and linking time.
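The key-node computation can be sketched as follows; the adjacency-dict representation and the brute-force path enumeration are illustrative assumptions (a production compiler would likely use something closer to dominator analysis), and the merging of small subgraphs is omitted:

```python
from itertools import product

def key_nodes(graph: dict, inputs: list, outputs: list) -> set:
    """Return the nodes shared by every input->output path (the key nodes).

    `graph` maps each node to a list of successors; a DAG is assumed. The
    exponential path enumeration is only viable for small graphs -- this is
    a sketch, not the production algorithm.
    """
    def all_paths(src, dst, seen=()):
        seen = seen + (src,)
        if src == dst:
            yield set(seen)
        for nxt in graph.get(src, []):
            yield from all_paths(nxt, dst, seen)

    paths = [p for s, t in product(inputs, outputs) for p in all_paths(s, t)]
    return set.intersection(*paths) if paths else set()

# Example: a diamond a->{b,c}->d followed by d->e.
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(key_nodes(g, ["a"], ["e"]))  # {'a', 'd', 'e'} -- the convergence points
```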
3. Optimizing the topological sequence of each subgraph
Because the subgraphs execute sequentially, once multiple subgraphs have been obtained by splitting, it is only necessary to optimize the execution time of each subgraph to minimize the overall running time of the neural network.
To simulate how long the machine instructions corresponding to each topological sequence of a subgraph would take to execute on the target chip, this embodiment constructs a cost model. The cost model reads the binary instruction stream compiled by the toolchain's compiler for each topological sequence and produces an estimated execution time by simulating the execution process of the target chip. The execution logic of the cost model is briefly described below.
The time the target chip spends executing a binary instruction stream consists mainly of the following parts (a toy cost model combining them is sketched after this list):
(1) The time T1 for reading the binary instructions from memory through the data-reading interface of the target chip.
T1 depends mainly on the data transmission bandwidth of the chip's data-reading interface and on the amount of data to be read. The cost model can therefore estimate T1 from performance parameters of the data-reading interface (e.g., its data transmission bandwidth) and the amount of data to be read (e.g., the input data of the neural network).
(2) The time T2 for the compute units to operate on the data that has been read.
T2 depends mainly on the performance parameters of the compute units in the target chip (e.g., the types of compute units, their number, and their computing power) and on the topological sequence of the neural network (i.e., the execution order of the operators). The cost model can therefore estimate T2 from the performance parameters of the chip's compute units and the neural network's topological sequence.
(3) Some waiting time during actual execution. Because this part is usually short, it can be neglected.
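A toy version of such a cost model, assuming a simple `(unit_type, op_count)` schedule encoding and treating concurrent units as fully overlapping; none of these modeling choices are taken from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class ChipParams:
    read_bandwidth: float   # bytes/s of the chip's data-reading interface
    unit_throughput: dict   # ops/s per compute-unit type, e.g. {"MPU": 2e12}

def estimate_time(chip: ChipParams, input_bytes: int, schedule: list) -> float:
    """Toy cost model: T ~= T1 + T2, with waiting time neglected.

    `schedule` encodes the operator execution order as (unit_type, op_count)
    pairs -- an illustrative encoding, not the patent's instruction format.
    """
    t1 = input_bytes / chip.read_bandwidth          # data/instruction read time
    # T2: accumulate ops per unit type; concurrent units overlap, so take
    # the maximum busy time across unit types as a crude parallelism model.
    busy: dict = {}
    for unit, ops in schedule:
        busy[unit] = busy.get(unit, 0.0) + ops / chip.unit_throughput[unit]
    t2 = max(busy.values(), default=0.0)
    return t1 + t2
```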
Based on the above cost model, obtaining the optimal topological sequence of a subgraph under the given cost model mainly includes the following steps (a search sketch follows this list):
(1) Traversing all of the subgraph's possible topological sequences (there may be one or more);
(2) Compiling each topological sequence with the toolchain to generate the corresponding binary instruction file, and obtaining the subgraph's execution time under that topological sequence from the cost model;
(3) Selecting the topological sequence with the shortest execution time as the subgraph's optimal topological sequence under the given cost model and toolchain version.
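A hedged sketch of this enumerate-compile-evaluate loop; `compile_fn` stands in for the toolchain compiler and `cost_fn` for the cost model, and both are assumed interfaces rather than real APIs:

```python
import itertools

def all_topological_sequences(succ: dict):
    """Yield every topological order of a DAG given as node -> successor list."""
    indeg = {n: 0 for n in succ}
    for outs in succ.values():
        for m in outs:
            indeg[m] += 1

    def backtrack(order):
        ready = [n for n, d in indeg.items() if d == 0 and n not in order]
        if not ready:
            if len(order) == len(succ):
                yield list(order)
            return
        for n in ready:
            for m in succ[n]:
                indeg[m] -= 1
            order.append(n)
            yield from backtrack(order)
            order.pop()
            for m in succ[n]:
                indeg[m] += 1

    yield from backtrack([])

def best_sequence(succ: dict, compile_fn, cost_fn, limit=None):
    """Pick the topological order with the lowest estimated execution time;
    `limit` optionally caps how many candidate sequences are evaluated."""
    seqs = all_topological_sequences(succ)
    if limit is not None:
        seqs = itertools.islice(seqs, limit)
    return min(seqs, key=lambda s: cost_fn(compile_fn(s)))
```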
Of course, since the compilation process is relatively time-consuming, the running time of the machine instructions corresponding to a topological sequence on the target chip can also be simulated directly from the topological sequence itself. In addition, when a subgraph has too many possible topological sequences, a random-selection method can be used to pick a specified number of candidate sequences from the topological sequences for optimization; alternatively, features of the subgraph itself can be summarized and extracted, and methods such as machine learning or Monte Carlo techniques can be used to speed up the search.
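One way such random selection might look, reusing `all_topological_sequences` from the previous sketch; the reservoir-sampling strategy and the enumeration cap are assumptions, not the disclosed method:

```python
import random

def sample_candidate_sequences(succ: dict, k: int, max_enumerate: int = 10_000):
    """Keep k randomly chosen candidate orders when the sequence space
    is too large to search exhaustively."""
    reservoir = []
    for i, seq in enumerate(all_topological_sequences(succ)):
        if i >= max_enumerate:
            break  # cap enumeration so sampling itself stays cheap
        if len(reservoir) < k:
            reservoir.append(seq)
        else:
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = seq
    return reservoir
```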
4. After the optimal topological sequences of the different subgraphs have been obtained, a linker can combine the binary machine code corresponding to the different subgraphs according to the original data flow, assembling the separately compiled neural network subgraphs into a complete neural network model.
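At its simplest, this linking step amounts to ordered concatenation; the sketch below shows only the ordering, and a real linker would also patch addresses and inter-subgraph buffers:

```python
def link_subgraphs(compiled: dict, dataflow_order: list) -> bytes:
    """Concatenate per-subgraph machine code following the original data flow.

    `compiled` maps subgraph id -> machine-code bytes (a hypothetical layout).
    """
    return b"".join(compiled[sg] for sg in dataflow_order)

# e.g. link_subgraphs({"sg0": b"\x01", "sg1": b"\x02"}, ["sg0", "sg1"])
```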
The neural network compilation method provided by this embodiment takes into account the limited memory space of the target chip: the operation data of each operator in the computation graph is partitioned at a finer granularity, and the number of operators in the graph is updated accordingly, so that each operator can execute on the target chip in a single pass. The updated computation graph is split at the key nodes into multiple subgraphs, the optimal topological sequence of each subgraph is determined, and the optimal sequences of the different subgraphs are linked to obtain the optimal topological sequence of the entire computation graph. Compiling the neural network based on this optimal topological sequence yields machine instructions that are output to the target chip for execution, which, for different target chips, exploits the parallelism between different compute units to a greater extent and improves the processing efficiency of the target chip. In addition, the present application can also estimate the target chip's execution time through the constructed cost model, completing the topological-sequence search without running on real hardware.
Corresponding to the method provided by the embodiments of the present disclosure, an embodiment of the present disclosure further provides a neural network compilation apparatus. As shown in FIG. 6, the apparatus 60 includes:
a computation graph determination module 61, configured to determine a computation graph corresponding to the neural network to be compiled, where nodes in the computation graph represent operators in the neural network and edges in the computation graph represent the flow of the neural network's input data;
a screening module 62, configured to determine a target topological sequence from multiple topological sequences of the computation graph, where each topological sequence represents a specific execution order of the operators in the neural network; and
a compilation module 63, configured to determine machine instructions corresponding to the neural network based on the target topological sequence, so that a target chip executes the machine instructions.
In some embodiments, the target topological sequence is determined from the multiple topological sequences based on the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each topological sequence.
In some embodiments, the target chip includes at least two kinds of compute units, and the at least two kinds of compute units can perform different types of operations on input data in parallel.
In some embodiments, when determining the target topological sequence from the multiple topological sequences of the computation graph, the screening module is specifically configured to: divide the computation graph into multiple subgraphs, each subgraph including at least two sub-topological sequences; for each subgraph, determine a target sub-topological sequence from the at least two sub-topological sequences of the subgraph, the target sub-topological sequence being determined based on the operation duration for which the target chip operates on the input data in the operator execution orders represented by the at least two sub-topological sequences; and obtain the target topological sequence based on the target sub-topological sequence of each subgraph.
In some embodiments, when dividing the computation graph into multiple subgraphs, the screening module is specifically configured to: determine multiple key nodes from the nodes of the computation graph, each key node being a convergence point of at least two paths in the computation graph; and divide the computation graph into multiple subgraphs based on the multiple key nodes.
In some embodiments, splitting the computation graph into multiple subgraphs based on the multiple key nodes includes: forming a subgraph from at least two adjacent key nodes together with the nodes and edges located between the at least two key nodes.
In some embodiments, after the computation graph is split into multiple subgraphs, the screening module is further configured to: determine target subgraphs whose number of nodes is less than a preset number; and merge each target subgraph with a neighboring subgraph of that target subgraph.
In some embodiments, when determining the machine instructions corresponding to the neural network based on the target topological sequence, the compilation module is specifically configured to: determine the machine instructions corresponding to each target sub-topological sequence; and link the machine instructions corresponding to each target sub-topological sequence according to the data flow in the computation graph to obtain the machine instructions corresponding to the neural network.
In some embodiments, the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each topological sequence is determined as follows: for each topological sequence, determining the machine instructions corresponding to the target chip operating on the input data in the operator execution order represented by the topological sequence, and determining the operation duration based on the time the target chip takes to execute the machine instructions; or, for each topological sequence, determining, with a preset cost model, the operation duration for which the target chip operates on the input data in the operator execution order represented by the topological sequence, where the cost model is used to estimate the operation duration corresponding to the topological sequence from the hardware parameters of the target chip and the operator execution order represented by the topological sequence.
In some embodiments, determining, with the preset cost model, the operation duration for which the target chip operates on the input data in the operator execution order represented by the topological sequence includes: determining the machine instructions corresponding to the target chip operating on the input data in the operator execution order represented by each topological sequence; and determining the operation duration based on the preset cost model and the machine instructions.
In some embodiments, when determining the computation graph corresponding to the neural network to be compiled, the computation graph determination module is specifically configured to: parse the neural network to obtain the original computation graph corresponding to the neural network; and adjust the operators in the original computation graph according to the memory size of the target chip and the data volume of the operation data corresponding to each operator in the original computation graph, so as to update the computation graph.
In some embodiments, when adjusting the operators in the original computation graph according to the memory size of the target chip and the data volume of the operation data corresponding to each operator in the original computation graph so as to update the computation graph, the computation graph determination module is specifically configured to:
for each target operator in the original computation graph whose corresponding operation data has a data volume greater than a preset threshold, add to the original computation graph at least one additional operator of the same type as the target operator, so that the operation data is split into multiple pieces of data that are operated on separately by the target operator and the added additional operator(s), where the preset threshold is determined based on the memory size of the target chip; and
adjust the original computation graph based on the added additional operator(s), so as to update the computation graph.
In some embodiments, when splitting the operation data into multiple pieces of data, the computation graph determination module is specifically configured to: determine a split dimension for splitting the operation data based on the type of the target operator and the hardware performance parameters of the target chip; and split the data along the split dimension to obtain multiple pieces of data.
In some embodiments, the operation data includes image data, and the split dimension includes one or more of the following: the frame dimension of the image data, the channel dimension of the image data, the width dimension of the image data, and the height dimension of the image data.
In addition, an embodiment of the present disclosure further provides an electronic device. As shown in FIG. 7, the electronic device includes a processor 71, a memory 72, and computer instructions stored in the memory 72 and executable by the processor 71; when the processor 71 executes the computer instructions, the methods described in the foregoing embodiments can be implemented.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method described in any of the foregoing embodiments is implemented.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media exclude transitory media, such as modulated data signals and carrier waves.
From the description of the foregoing implementations, those skilled in the art can clearly understand that the embodiments of the present disclosure can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the parts contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present disclosure or in certain parts thereof.
The systems, apparatuses, modules, or units described in the foregoing embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments of the present disclosure are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiment is described relatively simply because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment. The apparatus embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separate, and when implementing the solutions of the embodiments of the present disclosure, the functions of the modules may be realized in the same piece of software and/or hardware or distributed across multiple pieces. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The foregoing is only a specific implementation of the embodiments of the present disclosure. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the embodiments of the present disclosure, and these improvements and refinements shall also fall within the protection scope of the embodiments of the present disclosure.

Claims (18)

  1. A compilation method, characterized in that the method comprises:
    determining a computation graph corresponding to a neural network to be compiled, wherein nodes in the computation graph represent operators in the neural network, and edges in the computation graph represent data flows in the neural network;
    determining a target topological sequence from a plurality of topological sequences of the computation graph, wherein each of the topological sequences represents a specific execution order of the operators in the neural network; and
    generating, based on the target topological sequence, machine instructions corresponding to the neural network, so that a target chip executes the machine instructions.
  2. The method according to claim 1, wherein determining the target topological sequence from the plurality of topological sequences of the computation graph comprises:
    determining the target topological sequence from the plurality of topological sequences based on an operation duration for which the target chip operates on input data of the neural network in the operator execution order represented by each of the topological sequences.
  3. The method according to claim 2, wherein the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each of the topological sequences is determined in the following manner:
    for each of the topological sequences,
    determining machine instructions corresponding to the target chip operating on the input data of the neural network in the operator execution order represented by the topological sequence; and
    determining the operation duration based on a duration for which the target chip executes the machine instructions.
  4. The method according to claim 2, wherein the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by each of the topological sequences is determined in the following manner:
    for each of the topological sequences,
    determining, by using a preset cost model, the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by the topological sequence,
    wherein the cost model is used to estimate the operation duration corresponding to the topological sequence according to hardware parameters of the target chip and the operator execution order represented by the topological sequence.
  5. The method according to claim 4, wherein determining, by using the preset cost model, the operation duration for which the target chip operates on the input data of the neural network in the operator execution order represented by the topological sequence comprises:
    determining machine instructions corresponding to the target chip operating on the input data of the neural network in the operator execution order represented by the topological sequence; and
    determining the operation duration based on the cost model and the machine instructions.
  6. The method according to any one of claims 1-5, wherein
    the target chip comprises at least two kinds of compute units, and
    the at least two kinds of compute units are capable of performing different types of operations on input data in parallel.
  7. The method according to any one of claims 1-6, wherein determining the target topological sequence from the plurality of topological sequences of the computation graph comprises:
    dividing the computation graph into a plurality of subgraphs, wherein each subgraph comprises at least two sub-topological sequences;
    for each subgraph, determining a target sub-topological sequence from the at least two sub-topological sequences of the subgraph; and
    obtaining the target topological sequence based on the target sub-topological sequence of each of the subgraphs.
  8. The method according to claim 7, wherein dividing the computation graph into the plurality of subgraphs comprises:
    determining a plurality of key nodes from a plurality of nodes of the computation graph, wherein each of the key nodes is a convergence point of at least two paths in the computation graph; and
    dividing the computation graph into the plurality of subgraphs based on the plurality of key nodes.
  9. The method according to claim 8, wherein splitting the computation graph into the plurality of subgraphs based on the plurality of key nodes comprises:
    forming a subgraph from at least two adjacent key nodes together with the nodes and edges located between the at least two key nodes.
  10. The method according to claim 9, further comprising, after splitting the computation graph into the plurality of subgraphs:
    determining a target subgraph whose number of nodes is less than a preset number; and
    merging the target subgraph with a neighboring subgraph of the target subgraph.
  11. The method according to any one of claims 7-10, wherein determining, based on the target topological sequence, the machine instructions corresponding to the neural network comprises:
    determining machine instructions corresponding to each of the target sub-topological sequences; and
    linking the machine instructions corresponding to each of the target sub-topological sequences according to the data flow in the computation graph to obtain the machine instructions corresponding to the neural network.
  12. The method according to any one of claims 1-11, wherein determining the computation graph corresponding to the neural network to be compiled comprises:
    parsing the neural network to obtain an original computation graph corresponding to the neural network; and
    adjusting operators in the original computation graph according to a memory size of the target chip and a data volume of operation data corresponding to each operator in the original computation graph, so as to update the computation graph.
  13. The method according to claim 12, wherein adjusting the operators in the original computation graph according to the memory size of the target chip and the data volume of the operation data corresponding to each operator in the original computation graph, so as to update the computation graph, comprises:
    for each target operator in the original computation graph whose corresponding operation data has a data volume greater than a preset threshold, adding to the original computation graph at least one additional operator of the same type as the target operator, so that the operation data is split into multiple pieces of data that are operated on separately by the target operator and the added additional operator, wherein the preset threshold is determined based on the memory size of the target chip; and
    adjusting the original computation graph based on the added additional operator, so as to update the computation graph.
  14. The method according to claim 13, wherein splitting the operation data into multiple pieces of data comprises:
    determining, based on a type of the target operator and hardware performance parameters of the target chip, a split dimension for splitting the operation data; and
    splitting the data along the split dimension to obtain multiple pieces of data.
  15. The method according to claim 14, wherein
    the operation data comprises image data, and
    the split dimension comprises one or more of the following: a frame dimension of the image data, a channel dimension of the image data, a width dimension of the image data, and a height dimension of the image data.
  16. A compilation apparatus, characterized in that the apparatus comprises:
    a computation graph determination module, configured to determine a computation graph corresponding to a neural network to be compiled, wherein nodes in the computation graph represent operators in the neural network, and edges in the computation graph represent data flows in the neural network;
    a screening module, configured to determine a target topological sequence from a plurality of topological sequences of the computation graph, wherein each of the topological sequences represents a specific execution order of the operators in the neural network; and
    a compilation module, configured to determine, based on the target topological sequence, machine instructions corresponding to the neural network, so that a target chip executes the machine instructions.
  17. An electronic device, characterized in that the electronic device comprises a processor, a memory, and computer instructions stored in the memory and executable by the processor, wherein when the processor executes the computer instructions, the compilation method according to any one of claims 1-15 is implemented.
  18. A computer-readable storage medium, characterized in that computer instructions are stored on the storage medium, and when the computer instructions are executed, the compilation method according to any one of claims 1-15 is implemented.
PCT/CN2022/093058 2021-08-31 2022-05-16 Neural network compilation method and apparatus, device, and storage medium WO2023029589A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111013533.XA CN113703775B (en) 2021-08-31 2021-08-31 Compiling method, compiling device, compiling equipment and storage medium
CN202111013533.X 2021-08-31

Publications (1)

Publication Number Publication Date
WO2023029589A1 true WO2023029589A1 (en) 2023-03-09

Family

ID=78658087

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093058 WO2023029589A1 (en) 2021-08-31 2022-05-16 Neural network compilation method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN113703775B (en)
WO (1) WO2023029589A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116126346A (en) * 2023-04-04 2023-05-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium
CN116415103A (en) * 2023-06-09 2023-07-11 之江实验室 Data processing method, device, storage medium and electronic equipment
CN116991564A (en) * 2023-09-28 2023-11-03 之江实验室 Operator internal parallel acceleration method for heterogeneous dual-core MCU

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113703775B (en) * 2021-08-31 2023-11-28 上海阵量智能科技有限公司 Compiling method, compiling device, compiling equipment and storage medium
CN114139684A (en) * 2021-12-02 2022-03-04 脸萌有限公司 Graph neural network generation method, device, system, medium, and electronic apparatus
CN114996008B (en) * 2022-05-30 2024-05-03 上海壁仞科技股份有限公司 AI calculation graph multi-back-end cooperative calculation method and device
CN115081598B (en) * 2022-08-23 2022-12-06 北京灵汐科技有限公司 Operator processing method and device, electronic equipment and computer readable storage medium
CN117170686B (en) * 2023-11-03 2024-03-12 深圳鲲云信息科技有限公司 Method and computing device for neural network compilation optimization
CN117492766A (en) * 2023-12-27 2024-02-02 深圳市九天睿芯科技有限公司 Compiling method, compiler, neural network accelerator, chip and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190391796A1 (en) * 2019-06-28 2019-12-26 Intel Corporation Control of scheduling dependencies by a neural network compiler
CN110764744A (en) * 2018-07-25 2020-02-07 赛灵思公司 Intermediate representation generation method and device for neural network computation
CN111338635A (en) * 2020-02-20 2020-06-26 腾讯科技(深圳)有限公司 Graph compiling method, device and equipment for calculation graph and storage medium
CN112711422A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Optimization method and system for neural network compiling
CN113703775A (en) * 2021-08-31 2021-11-26 上海阵量智能科技有限公司 Compiling method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10586173B2 (en) * 2016-01-27 2020-03-10 Bonsai AI, Inc. Searchable database of trained artificial intelligence objects that can be reused, reconfigured, and recomposed, into one or more subsequent artificial intelligence models
US11544539B2 (en) * 2016-09-29 2023-01-03 Tsinghua University Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
JP7074777B2 (en) * 2017-11-20 2022-05-24 シャンハイ カンブリコン インフォメーション テクノロジー カンパニー リミテッド Tasks Parallel processing methods, appliances, systems, storage media and computer equipment
CN111860816A (en) * 2020-07-08 2020-10-30 Oppo广东移动通信有限公司 Compiling method, device, equipment and storage medium of neural network model
CN112598121A (en) * 2020-12-21 2021-04-02 北京时代民芯科技有限公司 Efficient operator optimization method for deep learning compiler

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110764744A (en) * 2018-07-25 2020-02-07 赛灵思公司 Intermediate representation generation method and device for neural network computation
US20190391796A1 (en) * 2019-06-28 2019-12-26 Intel Corporation Control of scheduling dependencies by a neural network compiler
CN111338635A (en) * 2020-02-20 2020-06-26 腾讯科技(深圳)有限公司 Graph compiling method, device and equipment for calculation graph and storage medium
CN112711422A (en) * 2020-12-31 2021-04-27 北京清微智能科技有限公司 Optimization method and system for neural network compiling
CN113703775A (en) * 2021-08-31 2021-11-26 上海阵量智能科技有限公司 Compiling method, device, equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116126346A (en) * 2023-04-04 2023-05-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium
CN116126346B (en) * 2023-04-04 2023-06-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium
CN116415103A (en) * 2023-06-09 2023-07-11 之江实验室 Data processing method, device, storage medium and electronic equipment
CN116415103B (en) * 2023-06-09 2023-09-05 之江实验室 Data processing method, device, storage medium and electronic equipment
CN116991564A (en) * 2023-09-28 2023-11-03 之江实验室 Operator internal parallel acceleration method for heterogeneous dual-core MCU
CN116991564B (en) * 2023-09-28 2024-01-09 之江实验室 Operator internal parallel acceleration method for heterogeneous dual-core MCU

Also Published As

Publication number Publication date
CN113703775A (en) 2021-11-26
CN113703775B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
WO2023029589A1 (en) Neural network compilation method and apparatus, device, and storage medium
US10372429B2 (en) Method and system for generating accelerator program
Heo et al. Real-time object detection system with multi-path neural networks
US11500959B2 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
AU2014203218B2 (en) Memory configuration for inter-processor communication in an MPSoC
CN110689116B (en) Neural network pruning method and device, computer equipment and storage medium
CN110147236A (en) Code compiling method and device
US20160139901A1 (en) Systems, methods, and computer programs for performing runtime auto parallelization of application code
US20230334292A1 (en) Node fusion method for computational graph and device
Zhang et al. Flow faster: Efficient decision algorithms for probabilistic simulations
CN112817730A (en) Deep neural network service batch processing scheduling method and system and GPU
CN115423082A (en) Automatic optimization method for depth model calculation graph related to hardware characteristics
CN115794393A (en) Method, device, server and storage medium for executing business model
US20200118027A1 (en) Learning method, learning apparatus, and recording medium having stored therein learning program
CN110929850A (en) Deep learning operator automatic optimization system and method based on Shenwei processor
CN114398080A (en) Data processing method, device and equipment and computer storage medium
Ara et al. Scalable analysis for multi-scale dataflow models
Kress et al. Comparing time-to-solution for in situ visualization paradigms at scale
CN116974868A (en) Chip power consumption estimation device, method, electronic equipment and storage medium
US11514218B1 (en) System and method for performing static timing analysis of electronic circuit designs using a tag-based approach
US11327733B2 (en) Method of using multidimensional blockification to optimize computer program and device thereof
Feng et al. Cutting down training memory by re-fowarding
Pang et al. Toward the Predictability of Dynamic Real-Time DNN Inference
Yin et al. Exact memory-and communication-aware scheduling of dnns on pipelined edge tpus
CN108564135B (en) Method for constructing framework program and realizing high-performance computing program running time prediction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862731

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE