CN110764744A - Intermediate representation generation method and device for neural network computation


Info

Publication number
CN110764744A
Authority
CN
China
Prior art keywords
intermediate representation
graph
computing
neural network
calculation
Prior art date
Legal status
Granted
Application number
CN201810829863.8A
Other languages
Chinese (zh)
Other versions
CN110764744B (en)
Inventor
隋凌志
刘鑫
王雨顺
Current Assignee
Xilinx Inc
Original Assignee
Xilinx Inc
Priority date
Filing date
Publication date
Application filed by Xilinx Inc
Priority to CN201810829863.8A
Publication of CN110764744A
Application granted
Publication of CN110764744B
Legal status: Active
Anticipated expiration


Classifications

    • G06F 8/37 — Creation or generation of source code: compiler construction; parser generation
    • G06F 8/41 — Transformation of program code: compilation
    • G06N 3/08 — Computing arrangements based on biological models: neural networks; learning methods

Abstract

The present disclosure provides an intermediate representation generation method and apparatus for neural network computation. The method comprises: parsing an input model file to obtain topological structure information of the neural network; and using feature-map information and computing-operation information in the topological structure information as nodes and edges, respectively, to generate a first intermediate representation in graph form. By introducing a first IR whose nodes are feature maps and whose edges are computing operations, subsequent graph optimization and scheduling are made convenient. Preferably, the generation method of the invention can also produce subsequent IRs, so that the algorithm is transformed and described by IRs of different granularities and forms; a compiler based on the invention can therefore be readily adapted to various front-end frameworks and back-end hardware implementations and can optimize instructions efficiently and accurately.

Description

Intermediate representation generation method and device for neural network computation
Technical Field
The invention relates to the field of deep learning, in particular to an intermediate representation generation method and device for neural network computation.
Background
Neural networks have become a research hotspot in the field of image recognition in recent years. Trained neural network models can be used for image classification, object recognition, saliency detection and other tasks. In recent years the computation scale and complexity of neural network models have kept growing, and traditional CPU platforms can no longer meet practical requirements. Designing neural network accelerators on heterogeneous computing platforms such as FPGAs and GPUs has therefore become a new research hotspot. Compared with GPU platforms, FPGAs and ASICs can achieve a higher ratio of computation to energy consumption, while their flexibility and customizability better suit the rapid evolution of neural network algorithms.
A compiler's workflow generally consists of several distinct task phases, so its structure can be divided into three parts: a front end, an optimizer and a back end. To pass information between phases, the compiler needs a complete description of the target program. Almost all compilers therefore require some form of intermediate representation to model the target algorithm and facilitate its analysis, transformation and optimization.
Neural network compilation converts neural network algorithms from different deep learning frameworks into a general computation graph, optimizes and reconstructs that graph, and maps the optimized graph into executable instructions and machine code for a hardware platform, thereby completing the compilation of the algorithm for that platform. Because the many deep learning frameworks differ in the underlying computation libraries, computation graph forms, code styles and so on that they use, the precision and speed of their results differ greatly; in addition, ever more heterogeneous hardware platforms are being developed beyond general-purpose processors. If M front-end deep learning frameworks each had to be optimized separately and mapped onto N back-end hardware platforms, the workload would be O(M × N) and would risk a combinatorial explosion.
For this reason, an intermediate representation generation scheme that is flexibly compatible with various front ends and back ends is required.
Disclosure of Invention
To solve at least one of the above problems, the invention provides a compiler architecture that, by pairing the modules of the architecture with intermediate representations of different granularities and attributes, can handle a variety of deep learning frameworks and back-end hardware platforms with high extensibility and compatibility while providing efficient code optimization.
According to an aspect of the present invention, there is provided an intermediate representation generation method for neural network computation, comprising: parsing an input model file to obtain topological structure information of the neural network; and using feature-map information and computing-operation information in the topological structure information as nodes and edges, respectively, to generate a first intermediate representation in graph form. By introducing an IR whose nodes are feature maps and whose edges are computing operations, subsequent graph optimization and scheduling are made convenient.
The first intermediate representation may further comprise node attributes and edge attributes. The node attributes comprise at least one of: dimension information of the feature map and its height, width and channel information. The computing operation represented by an edge comprises at least one of: convolution, pooling, dimension transformation, element-wise addition (eltwise), deconvolution, rearrangement, nonlinearity, batch normalization (BatchNorm) and scaling (Scale). The edge attributes comprise parameters of the computing operation, including at least one of: convolution kernel size, padding (pad), stride, grouping and dilation.
The method may further comprise: performing graph optimization on the first intermediate representation to generate a second intermediate representation in graph form. Specifically, computing operations may be merged to obtain a second intermediate representation in the form of a hypergraph whose nodes are feature maps and whose edges are the merged groups of computing operations.
Merging the computing operations may include at least one of: removing operations that are unnecessary or have no effect on the computation result; fusing a plurality of adjacent computing operations; and decomposing a computing operation so that the decomposed operations can be fused with preceding or subsequent operations or otherwise made processable.
Merging the computing operations may include: setting subgraph templates for which computing operations can be merged, obtaining at least one subgraph matching scheme of the computation graph of the first intermediate representation, and reconstructing the computation graph into the merged second intermediate representation based on the subgraph matching scheme. The subgraph templates may be determined based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed.
Merging the computing operations may further include: when multiple candidate merging schemes exist, adding, between the input and output nodes of each candidate scheme in the first intermediate representation, an edge whose weight corresponds to that scheme's execution cost, and solving for the optimal merging scheme as a shortest-path problem between nodes.
The second intermediate representation may be expressed in a domain-specific language (DSL) designed on the basis of the Scheme language.
Thus, by introducing the second intermediate representation, graph-level optimization of the first IR is carried out.
The intermediate representation generation method of the present invention may further include: performing scheduling optimization on the second intermediate representation to obtain a fine-grained third intermediate representation. Specifically, scheduling optimization is performed on the second intermediate representation based on attribute information of the hardware platform that will execute the compiled instruction code, to obtain a third intermediate representation indicating a blocked (tiled) execution scheme for feature maps and/or weights; preferably, a third intermediate representation is obtained that further indicates, based on the hardware platform's attribute information, the instruction dependencies among the instructions of that blocked execution scheme.
The third intermediate representation may be expressed in a language in which each computing operation is written as a nested loop.
The method may further comprise: compiling the third intermediate representation into instruction code for execution on a hardware platform. This facilitates code optimization based on hardware attributes.
The hardware platform may include at least one of: a neural network-specific computing platform implemented with an FPGA or ASIC; a neural network-specific computing platform implemented with a GPU; and a general-purpose computing platform.
According to another aspect of the present invention, there is provided an intermediate representation generation apparatus for neural network computation, comprising: a parsing unit for parsing an input model file to obtain topological structure information of the neural network; and a first intermediate representation generating unit for generating a first intermediate representation in graph form using feature-map information and computing-operation information in the topological structure information as nodes and edges, respectively.
The first intermediate representation further comprises node attributes and edge attributes. The node attributes comprise at least one of: dimension information of the feature map and its height, width and channel information. The computing operation represented by an edge comprises at least one of: convolution, pooling, dimension transformation, element-wise addition (eltwise), deconvolution, rearrangement, nonlinearity, batch normalization (BatchNorm) and scaling (Scale). The edge attributes comprise parameters of the computing operation, including at least one of: convolution kernel size, padding (pad), stride, grouping and dilation.
The apparatus may further include: a second intermediate representation generating unit for performing graph optimization on the first intermediate representation to generate a second intermediate representation in the form of a graph.
The second intermediate representation generating unit may further comprise a computing-operation merging unit for merging computing operations to obtain a second intermediate representation in the form of a hypergraph whose nodes are feature maps and whose edges are the merged groups of computing operations.
The computing-operation merging unit is configured to perform at least one of: removing operations that are unnecessary or have no effect on the computation result; fusing a plurality of adjacent computing operations; and decomposing a computing operation so that the decomposed operations can be fused with preceding or subsequent operations or otherwise made processable.
The computing-operation merging unit is further configured to: set subgraph templates for which computing operations can be merged, obtain at least one subgraph matching scheme of the computation graph of the first intermediate representation, and reconstruct the computation graph into the merged second intermediate representation based on the subgraph matching scheme. The subgraph templates are determined based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed.
The computing-operation merging unit is further configured to: when multiple candidate merging schemes exist, add, between the input and output nodes of each candidate scheme in the first intermediate representation, an edge whose weight corresponds to that scheme's execution cost, and solve for the optimal merging scheme as a shortest-path problem between nodes.
The second intermediate representation is expressed in a domain-specific language (DSL) designed on the basis of the Scheme language.
The apparatus may further include a third intermediate representation generating unit for performing scheduling optimization on the second intermediate representation to obtain a fine-grained third intermediate representation. The third intermediate representation generating unit may be configured to perform scheduling optimization on the second intermediate representation based on attribute information of the hardware platform that will execute the compiled instruction code, to obtain a third intermediate representation indicating a blocked execution scheme for feature maps and/or weights, and may further be configured to obtain, based on that attribute information, a third intermediate representation indicating the instruction dependencies among the instructions of the blocked execution scheme.
The third intermediate representation is expressed in a language in which each computing operation is written as a nested loop.
The apparatus may further include: a compiling unit for compiling the third intermediate representation into instruction code for execution on a hardware platform.
According to yet another aspect of the invention, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of the above.
According to still another aspect of the invention, there is provided a non-transitory machine-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform any of the methods described above.
The first IR decouples the compiler from the deep learning framework; its special structure, in which nodes represent feature maps and edges represent computing operations, facilitates subsequent memory optimization. The second IR takes the form of a hypergraph, and introducing subgraph templates and cost edges greatly improves the efficiency, accuracy and hardware specificity of graph optimization. The third IR is preferably expressed in a nested-loop language, which greatly improves scheduling-optimization efficiency while fully accounting for the characteristics of the back-end hardware. By transforming and describing the algorithm with IRs of different granularities and forms, a compiler based on the invention can be readily adapted to various front-end frameworks and back-end hardware implementations and can optimize instructions efficiently and accurately.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows the series of layers, run in order, that make up a typical CNN.
Fig. 2 shows a compilation diagram of an existing neural network compiler.
Fig. 3A-3B illustrate typical network computation graph structures of existing CNN networks.
FIG. 4 shows a flow diagram of an intermediate representation generation method according to one embodiment of the invention.
Fig. 5 shows an example of the conversion of the conventional computation graph into the first IR of the present invention.
Fig. 6 shows a flow diagram of an intermediate representation generation method according to another embodiment of the invention.
Fig. 7 shows how computing-operation mergings are represented in the conventional computation graph of Fig. 5 and in the first IR of the present invention.
Fig. 8 shows a schematic diagram of an intermediate representation generating device according to an embodiment of the invention.
Fig. 9 shows a schematic diagram of an intermediate representation generating device according to another embodiment of the invention.
FIG. 10 shows a schematic diagram of a compiler architecture according to one embodiment of the present invention.
FIG. 11 is a schematic structural diagram of a computing device that can be used to implement the above-described intermediate representation generation method according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Artificial intelligence has developed rapidly in recent years, with good application results in image classification, detection, video and speech processing and other fields, and it still has great prospects. Neural networks are the core of artificial intelligence applications, and the deep learning neural network algorithm is one of the most common neural network models. The workload of neural networks is both computation- and data-intensive. The multiply-accumulate operations required for neural network computation are typically on the order of GOPs; for example, the object-detection network SSD requires about 120 G operations. The parameters required for computation are typically on the order of megabytes to hundreds of megabytes; for example, the classification network VGG has about 480 MBytes of parameters.
Common artificial neural networks (ANNs) include deep neural networks (DNNs), recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The CNN, one kind of artificial neural network, has become a research hotspot in speech analysis and image recognition. Its weight-sharing structure is closer to a biological neural network, reducing the complexity of the network model and the number of weights. This advantage is more pronounced when the network input is a multi-dimensional image: the image can be fed directly into the network, avoiding the complex feature extraction and data reconstruction of traditional recognition algorithms. A convolutional network is a multi-layer perceptron specifically designed to recognize two-dimensional shapes, and its structure is highly invariant to translation, scaling, tilt and other deformations. Some background on convolutional neural networks is given below, with reference to the accompanying drawings.
CNN basic concept
As shown in fig. 1, a typical CNN consists of a series of layers that run in order.
The parameters of a CNN model are called "weights". The first layer of a CNN reads an input image and outputs a series of feature maps. Subsequent layers read the feature maps generated by the previous layer and output new feature maps. A final classifier outputs the probability that the input image belongs to each class. The CONV layer (convolutional layer) and the FC layer (fully connected layer) are the two basic layer types in a CNN, and the CONV layer is usually followed by a pooling layer.
In the present application, for one CNN layer, $f_j^{in}$ denotes the $j$-th input feature map, $f_i^{out}$ denotes the $i$-th output feature map, and $b_i$ denotes the bias term of the $i$-th output map.
For the CONV layer, $n_{in}$ and $n_{out}$ denote the numbers of input and output feature maps, respectively.
For the FC layer, $n_{in}$ and $n_{out}$ denote the lengths of the input and output feature vectors, respectively.
Definition of the CONV layer (convolutional layer): the CONV layer takes a series of feature maps as input and produces output feature maps by convolution with convolution kernels.
A nonlinear layer, i.e. a nonlinear activation function, is usually attached to the CONV layer and applied to each element of the output feature map. The activation function used is typically the ReLU function, so this layer is also commonly called the ReLU layer.
The CONV layer can be represented by expression (1):
$f_i^{out} = \sum_{j=1}^{n_{in}} f_j^{in} \otimes g_{i,j} + b_i \quad (1)$
where $g_{i,j}$ is the convolution kernel applied to the $j$-th input feature map and the $i$-th output feature map.
Definition of the FC layer (fully connected layer): the FC layer applies a linear transformation to the input feature vector:
$f^{out} = W f^{in} + b \quad (2)$
w is an integer nout×ninTransform matrix, b is a bias term. It is worth noting that for the FC layer, what is input is not a combination of several two-dimensional feature maps, but one feature direction. Thus, in expression 2, the parameter ninAnd noutIn effect corresponding to the length of the input and output feature vectors.
Pooling layer: usually attached to the CONV layer, it outputs the maximum or average value of each sub-area of each feature map. Max pooling can be represented by expression (3):
$f_i^{out}(x, y) = \max_{0 \le m, n < p} f_i^{in}(x \cdot p + m,\ y \cdot p + n) \quad (3)$
where $p$ is the pooling kernel size. This nonlinear "down-sampling" not only reduces the feature map size and the amount of computation for the next layer, but also provides a degree of translation invariance. In the forward inference process a CNN can thus be used, for example, for image classification.
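For concreteness, the three layer types above can be written out directly as loops. The following NumPy sketch is illustrative only: valid convolution without padding or stride is assumed, and the function and variable names are not from the patent.

```python
import numpy as np

def conv_layer(fin, g, b):
    """fin: (n_in, H, W) input feature maps; g: (n_out, n_in, k, k) kernels; b: (n_out,) bias."""
    n_out, n_in, k, _ = g.shape
    H, W = fin.shape[1] - k + 1, fin.shape[2] - k + 1
    fout = np.zeros((n_out, H, W))
    for i in range(n_out):            # expression (1): f_out_i = sum_j f_in_j (*) g_ij + b_i
        for j in range(n_in):
            for y in range(H):
                for x in range(W):
                    fout[i, y, x] += np.sum(fin[j, y:y+k, x:x+k] * g[i, j])
        fout[i] += b[i]
    return fout

def fc_layer(fin, W, b):
    """expression (2): f_out = W @ f_in + b, with W of shape (n_out, n_in)."""
    return W @ fin + b

def max_pool(fin, p):
    """expression (3): maximum over each p x p sub-area of every feature map."""
    n, H, W = fin.shape
    cropped = fin[:, :H - H % p, :W - W % p]
    return cropped.reshape(n, H // p, p, W // p, p).max(axis=(2, 4))
```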
Deep learning framework
A deep learning framework provides building blocks for the design, training and validation of neural networks through a high-level programming interface. In other words, the deep learning framework provides the means by which specific neural network algorithms (e.g., the network structure shown in Fig. 1) are implemented.
With the development of deep learning and neural network algorithms, many deep learning frameworks aimed at researchers and developers have emerged, such as Caffe, TensorFlow, MxNet and PyTorch. Developers can use the DSLs and APIs of these frameworks to design different computation graph models for specific tasks such as face recognition, image detection and speech recognition.
The many deep learning frameworks differ greatly, including in the underlying computation libraries, computation graph forms and code styles they use, which results in large differences in both the accuracy and the speed of their results.
For example, TensorFlow, MxNet and others represent a neural network within the framework as a computation graph, whereas Caffe does not use a graph form but instead builds dependencies between each computing operation and its blobs. Moreover, these deep learning frameworks use IRs of different granularities: TensorFlow splits a convolution into several nodes (Padding, Dot, BiasAdd) plus several constant nodes (Pad, Weights, Bias), while in Caffe it is a single convolution node.
Whatever deep learning framework a neural network is trained with, its instructions and parameters must be compiled by a compiler into machine code that the back-end hardware processor can execute. If M front-end deep learning frameworks each had to be optimized separately and mapped onto N back-end hardware platforms, the workload would be O(M × N) and would risk a combinatorial explosion. The compiler therefore needs to abstract a framework-independent intermediate representation (IR) that can express all the information of the algorithm and facilitate its subsequent optimization by the compiler. In other words, to decouple from the deep learning frameworks, a computation graph structure corresponding to the neural network processor needs to be constructed: the neural network algorithm from the various deep learning platforms is converted into a general computation graph, the graph is optimized and reconstructed, and the optimized graph is mapped into instructions and machine code for the hardware platform, completing the compilation of the algorithm for that platform.
Compilation of neural networks
To deploy a deep neural network after training, a compiler is required to compile the neural network algorithm into a binary instruction stream that the computing platform can execute. Unlike applications developed in high-level languages such as C++ or Java, neural network algorithms have their own unique syntax and structure. In view of this, high-performance computing platforms dedicated to neural network computation and corresponding neural network compilers have emerged. For example, the Deep Neural Network Compiler (DNNC) can compile neural network algorithms into an optimized instruction stream for the DPU (Deep-learning Processing Unit) platform. By analyzing the topological structure of the neural network, the compiler constructs the intermediate representation (IR) of its internal computation graph together with the control-flow and data-flow information in that IR, and then applies various compilation optimization and transformation techniques on the IR, effectively reducing the system's memory bandwidth and power requirements while improving DPU computing performance. Fig. 2 shows the compilation flow of an existing neural network compiler. As shown in Fig. 2, a specialized neural network algorithm (e.g., a pruned CNN) can be fed into a neural network compiler comprising a compilation front end, an optimizer and an instruction generator, which generates binary instruction code for a neural network computing platform (e.g., a DPU).
Herein, "compilation" refers to the process of generating low-level object code executing on a computing platform from a representation described by a high-level formalization method using a compiler. Since only binary instruction codes are involved in the processing of a hardware computing platform, a compiler is required to convert the familiar high-level language description into computer-readable low-level binary code. Unlike source program code described using high-level programming languages such as C/C + +, a neural network needs to be represented by a specialized model that describes neural network algorithms. The neural network algorithm includes a topology of the neural network algorithm and parameters of the neural network algorithm. In contrast, the formal description of the neural network topology requires much less memory than the massive number of neural network algorithm parameters.
A neural network compiler, as shown in fig. 2, typically includes a compilation front-end, an optimizer, and an instruction generator. The neural network compiling front end is used for analyzing the input neural network algorithm. The analysis may include parsing and reconstruction of the network topology. The optimizer is used for optimizing the neural network algorithm and/or the intermediate representation generated by the previous analysis, and the optimization object of the optimizer is a calculation graph representation equivalent to the neural network algorithm. Optimization operations may perform various semantically equivalent transformations on the computational graph for subsequent more efficient generation of object code. Finally, the instruction generator may generate efficient object code based on the optimized computation graph. The object code may then be input into a back-end, in particular a neural network processor as described below, to perform a corresponding neural network inference calculation.
Basic concept of neural network processor
Because of the huge parameter scale and computation amount of convolutional neural networks and the requirements for platform stability and a high ratio of computation to energy consumption, a conventional CPU cannot meet the computation requirements of neural networks, and designing accelerators on heterogeneous computing platforms such as FPGAs, GPUs and ASICs has become a new research hotspot. Compared with a GPU platform, an FPGA achieves higher energy efficiency thanks to its low power consumption, and its fast iteration and hardware reconfigurability suit rapid algorithm development. Furthermore, AI chips implemented as customized ASICs, as processor chips designed specifically for deep learning, are deeply customized and optimized for deep neural networks in terms of speed, power consumption and cost, improving further over FPGAs and GPUs.
The compiler architecture of the present invention, while useful in general purpose computing platforms (i.e., host or CPU only computing platforms), is more suitable for use in neural network specific processors that are specifically designed to perform neural network computations. It will be understood by those skilled in the art that the term "neural network dedicated processor" as used in the present application may also be referred to simply as "neural network processor" or "NN processor". Since deep learning is currently one of the most popular technology classes in neural network technology, the neural network dedicated processor may be implemented as a deep learning dedicated processor or a deep learning processor. However, it will be appreciated by those skilled in the art that neural networks have various branches of technology, such as DNN and CNN (where DNN is from a depth perspective and CNN is named from a convolution perspective, which are not mutually exclusive), and thus the neural network specific processor may also be implemented as a deep neural network specific processor or a deep neural network processor (DNN processor or CNN processor). That is, neural network computing implementation techniques involving "deep learning processors" or "deep neural network processors" in heterogeneous computing platforms are also within the scope of the present invention.
The DPU (Deep-learning Processing Unit) is a general acceleration platform for neural network algorithms in artificial intelligence; it exploits the high parallelism and low power consumption of FPGAs to perform inference based on convolutional neural networks (CNNs). Herein, the DPU can be regarded as one concrete implementation of the "deep learning processor", "deep neural network processor" or "neural network processor" discussed above. The binary instruction code compiled by the compiler architecture of the present invention can be executed by a DPU implemented on an FPGA, but it will be understood by those skilled in the art that the compiler architecture of the invention can also be adapted to various other back ends, such as neural network processors that use GPU hardware for neural network inference, and ASIC chips, e.g. dedicated AI chips, that are deeply customized and optimized for neural network computation.
Basic concept of network computation graph
To decouple from the deep learning frameworks, a computation graph structure corresponding to the neural network processor needs to be constructed: the neural network algorithm from the various deep learning platforms is converted into a general computation graph, the graph is optimized and reconstructed, and the optimized graph is mapped into instructions and machine code for the hardware platform, completing the compilation of the algorithm. Given the constraints of different hardware platforms (on-chip storage, bandwidth, computing resources, hardware design, instruction-set bit width) and of the different deep learning platforms (diverse computing operations, dimension transformations, changing parameters of computing operations), finding the optimal way to execute the computation graph during the algorithm-to-instruction mapping, in other words, making the compiled instructions execute efficiently on the hardware platform, is a significant problem that the computing platform must solve.
Figs. 3A-3B illustrate typical network computation graph structures of existing CNN networks. Fig. 3A shows the basic structure of a Vgg network. As shown in Fig. 3A, when executing the most basic CONV (convolution), ReLU (nonlinearity, the activation function typically being ReLU) and POOL (pooling) operations, this unbranched computation graph requires repeated data transfers between DDR and the on-chip cache (e.g., implemented as BRAM, i.e. block RAM) during hardware execution, because the feature maps to be loaded are usually larger than the on-chip cache capacity. Fig. 3B shows the basic structure of a ResNet network. As shown in Fig. 3B, the branched computation graph additionally contains an eltwise layer, which adds and merges multiple convolutional layers, and a CONCAT layer, which concatenates the data of each input layer along the channel dimension into a new layer. Likewise, this network computation graph still requires repeated data movement between DDR and BRAM when implemented on back-end hardware. It should be understood that "Vgg" and "ResNet" are popular CNN architectures listed here to illustrate the principles of the invention, not to limit it.
Intermediate representation generation scheme of the present invention
Existing compilers are used to map each framework's computation graph into hardware instructions for general-purpose processors, e.g. CUDA for GPUs and LLVM for CPUs. In the instruction-mapping process, different front-end deep learning frameworks and different back-end hardware have optimizations at different levels, including coarse-grained optimization oriented to the computation graph, fine-grained optimization oriented to operators (operations), memory-management optimization and so on.
The many deep learning frameworks differ greatly, in underlying computation libraries, computation graph forms, code styles and so on, which leads to large differences in the accuracy and speed of their results. If M front-end deep learning frameworks each had to be optimized separately and mapped onto N back-end hardware platforms, the workload would be O(M × N) and would risk a combinatorial explosion. To be compatible with a variety of front ends and back ends, the neural network compiler needs to abstract a framework-independent intermediate representation (IR) that can express all the information of the algorithm and facilitate its subsequent optimization by the compiler.
Google's XLA is a back-end compiler framework for TensorFlow. The DMLC team's NNVM and TVM compilers perform graph-level and operator-level optimization respectively, adopt ahead-of-time (AOT) compilation, borrow Halide's IR and scheduling approach, and support most front-end deep learning frameworks on the market. DLVM, from the University of Illinois at Urbana-Champaign, adopts a more traditional compiler-optimization approach, uses the Swift language as its DSL, has complete control flow, and is a side-effect-free representation. These compiler frameworks support different front-end deep learning frameworks to varying degrees and create their own IRs; after optimizing on those IRs, they convert them into LLVM IR or map them to different hardware instructions using libraries such as OpenCL and CUDA. However, these compiler frameworks are all oriented to general-purpose processors. For FPGA or ASIC implementations, hardware resources and designs differ greatly, and so do the optimization approaches. Drawing on compiler design and optimization methods for general-purpose processors, the invention shifts the difficulty of FPGA hardware design and optimization up to the compiler level.
Neural network algorithms are data-driven and contain only a small amount of control logic, so it is important to design an appropriate intermediate representation with sufficient expressive power for neural networks. The design of the intermediate representation must fully consider the nature of the algorithm, the underlying hardware, the compiler and other factors.
To this end, the present invention proposes a new intermediate representation structure that, by using feature maps as nodes and computing operations as edges, provides a compilation scheme better suited to subsequent efficient graph optimization and memory optimization. Further, by using IRs of different granularities and attributes, a variety of front-end deep learning frameworks can be supported, instruction code can be generated for a dedicated neural network processor (e.g., a DPU) implemented in an FPGA or ASIC, and other hardware platforms such as CPU and GPU general-purpose processors can also be conveniently supported.
FIG. 4 shows a flow diagram of an intermediate representation generation method according to one embodiment of the invention. In step S410, the input model file is parsed to acquire topology information of the neural network. In step S420, feature graph information and calculation operation information in the topology information are used as nodes and edges, respectively, to generate a first intermediate representation in the form of a graph.
Specifically, step S410 may use parsing sub-modules each corresponding to one type of model file, each parsing sub-module being used to parse a corresponding type of model file. For example, a corresponding Caffe parser and TensorFlow parser can be used to parse model files obtained via the Caffe and TensorFlow deep learning frameworks, respectively. For other or new deep learning frameworks, a parser for the model can be added accordingly.
Through parsing step S410, neural network models developed on different deep learning frameworks can be parsed into a framework-independent IR, i.e. the first IR of the present invention, thereby decoupling the deep learning framework from the compiler's optimization methods and uniformly converting the computation graph forms of the various deep learning frameworks, which differ in granularity, into the fixed-granularity computation graph form of the invention (i.e. the first IR). Using the Python scripting language, the parsing of models and the conversion into the IR can be implemented conveniently.
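As an illustration of the per-framework parser sub-modules described above, the following Python sketch shows one possible dispatch structure; the class names, file suffixes and returned fields are assumptions for illustration, not the patent's actual implementation.

```python
import os

class Parser:
    """Base class: each sub-parser turns one framework's model file into
    framework-independent topology information (layers, shapes, parameters)."""
    def parse(self, model_path):
        raise NotImplementedError

class CaffeParser(Parser):
    def parse(self, model_path):
        # a real implementation would read the prototxt/caffemodel here
        return {"framework": "caffe", "path": model_path, "layers": []}

class TensorFlowParser(Parser):
    def parse(self, model_path):
        # a real implementation would read a frozen GraphDef here
        return {"framework": "tensorflow", "path": model_path, "layers": []}

# registry keyed by model-file suffix; supporting a new framework only adds one entry
PARSERS = {".prototxt": CaffeParser(), ".pb": TensorFlowParser()}

def parse_topology(model_path):
    suffix = os.path.splitext(model_path)[1]
    return PARSERS[suffix].parse(model_path)
```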
In the present invention, the first IR is generated as a computation graph whose nodes represent feature maps and whose edges represent computing operations, with both nodes and edges carrying attributes. The node attributes may include dimension information and/or height, width and channel information of the feature map. The computing operation represented by an edge includes at least one of: convolution, pooling, dimension transformation, element-wise addition (eltwise), deconvolution, rearrangement, nonlinearity, batch normalization (BatchNorm) and scaling (Scale). The edge attributes include parameters of the computing operation, including at least one of: convolution kernel size, padding (pad), stride, grouping and dilation.
Fig. 5 shows an example of converting a conventional computation graph into the first IR of the present invention. Compared with the conventional computation graph on the left of the figure, the first IR on the right is an IR in graph form: each node represents a FeatureMap (feature map), which can also be understood as a multidimensional tensor, and each edge represents an operator, including but not limited to convolution, pooling, dimension transformation, element-wise addition, deconvolution, normalization and nonlinearity. This is a coarse-grained IR, and each edge can be viewed as a multi-loop computing operation on a multidimensional tensor. For example, for the finer-grained computation graph of TensorFlow, a Pad or BiasAdd adjacent to a Conv2D (two-dimensional convolution) can be fused into the edge representing the Conv2D, and all constant nodes can be fused into the attributes of the edge of the corresponding operator, thereby constructing a directed acyclic graph. In typical deep learning frameworks and compilers, computing operations are the nodes of the computation graph and the edges represent dependencies between operations. The first IR of the invention instead makes the FeatureMap the node and records its dimension information and dimension order, which better supports the compiler's subsequent memory optimization (e.g., based on the third IR), including memory reuse and All-Bank optimization, i.e., when the FeatureMap is small, intermediate results between different computing operations can all be kept in on-chip storage without repeated interaction with external storage.
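The structure of such a first IR can be pictured with a small data model like the one below; the field names and example shapes are illustrative assumptions rather than the patent's actual classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class FeatureMapNode:
    """Node of the first IR: a feature map (multidimensional tensor) with its dimension attributes."""
    name: str
    shape: Tuple[int, ...]        # e.g. (N, C, H, W)
    layout: str = "NCHW"          # dimension order, recorded for later memory optimization

@dataclass
class OpEdge:
    """Edge of the first IR: a computing operation with its parameters as edge attributes."""
    op_type: str                  # conv / pool / eltwise / deconv / batchnorm / scale / ...
    src: str                      # input feature-map node
    dst: str                      # output feature-map node
    attrs: Dict[str, int] = field(default_factory=dict)   # kernel, pad, stride, group, dilation, ...

@dataclass
class GraphIR:
    nodes: Dict[str, FeatureMapNode] = field(default_factory=dict)
    edges: List[OpEdge] = field(default_factory=list)

# Example: a Conv2D whose Pad / BiasAdd / constant nodes are folded into one edge's attributes
ir = GraphIR()
ir.nodes["fm0"] = FeatureMapNode("fm0", (1, 3, 224, 224))
ir.nodes["fm1"] = FeatureMapNode("fm1", (1, 64, 112, 112))
ir.edges.append(OpEdge("conv", "fm0", "fm1",
                       {"kernel": 7, "pad": 3, "stride": 2, "group": 1, "dilation": 1}))
```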
Fig. 6 shows a flow diagram of an intermediate representation generation method according to another embodiment of the invention. Similar to fig. 4, fig. 6 also includes a parsing step S610 and a first IR generating step S620, but fig. 6 further includes a multi-stage optimization of the IR to generate second and third IR.
In step S630, graph optimization is performed on the first intermediate representation to generate a second intermediate representation in graph form. In particular, computing operations may be merged to obtain a second intermediate representation in the form of a hypergraph whose nodes are feature maps and whose edges are the merged groups of computing operations. Merging the computing operations includes at least one of: removing operations that are unnecessary or have no effect on the computation result; fusing a plurality of adjacent computing operations; and decomposing a computing operation so that the decomposed operations can be fused with preceding or subsequent operations or otherwise made processable.
In particular, pruning operations may remove operations that are not needed or have no effect on the computational results.
The fusion operation fuses a plurality of adjacent computing operations. For example, a computing operation may be fused into the preceding or following memory-access operation and/or computing operation. In one case, a dimension transformation can be fused into the load or store that precedes or follows it. In another case, an operation can be absorbed into the preceding computing operation through an operation transformation. For example, the FLATTEN layer "flattens", i.e. one-dimensionalizes, a single convolved feature map; and in a branched network (e.g. GoogLeNet), with multi-layer concatenation, the outputs of several upper layers feed the input of a CONCAT layer, whose operation is to concatenate the data of each input layer along the channel dimension into a new layer and pass it to the next layer. Both FLATTEN and CONCAT are specialized data-rearrangement and dimension-transformation operations, and they can be omitted by suitably specifying how data is stored and/or read. In addition, BatchNorm (BN) and Scale can merge their operations and parameters directly into the preceding convolutional layer. Fusing computing operations also reduces the number of data interactions between the hardware platform and external memory when the instruction code is executed. The fusion described above may be point-wise fusion (i.e., fusion over individual computation results). For example, for the ReLU or Leaky ReLU activation layer following a convolution, without fusion the convolution result computed from a large number of input parameters would have to be stored to DDR, then the ReLU kernel launched and all ReLU results computed; fusion saves the kernel-launch time. Beyond point-wise fusion, non-point-wise operations can also be fused as far as possible by designing heuristic algorithms, e.g. Conv + Pool and Conv + Eltwise, so that on-/off-chip data transfers are avoided, bandwidth pressure is reduced, and the Pool or Eltwise computation is hidden in the data load or store process, improving computation efficiency. In one embodiment, to complete the CONV, ReLU and POOL operations on an input feature map, operation fusion lets the ReLU act directly on the data kept in the on-chip cache (e.g., BRAM) after the CONV, with only the necessary on-chip caching before the POOL, which greatly improves overall efficiency compared with writing data off-chip and reading it back after each of the CONV, ReLU and POOL operations.
The decomposition operation decomposes a computing operation so that the decomposed operations can be fused with preceding or following operations, or so that the operation becomes processable at all. In other words, one goal of decomposition is to fuse more effectively. An operation may be broken into branches so that fusion can be performed within a single branch. For example, pooling after concatenation (concat) is equivalent to concatenation after pooling, so the pooling operation can be decomposed and merged into the convolutions that precede the concatenation. On the other hand, some computing operations that cannot be handled directly become processable after decomposition; for example, a group-wise convolution can be decomposed into several ordinary convolutions that can be processed.
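A much-simplified sketch of a pairwise fusion pass over such a feature-map/operation graph is given below; the fusible pairs and the dictionary-based edge format are assumptions for illustration, and a real pass would also respect the hardware constraints discussed next.

```python
# fusible (producer, consumer) pairs; assumed for illustration, the real set is hardware-dependent
FUSIBLE_PAIRS = {("conv", "relu"), ("conv", "batchnorm"), ("conv", "eltwise"), ("conv", "pool")}

def fuse_adjacent_ops(edges):
    """edges: list of dicts {"op", "src", "dst", "attrs"} keyed by feature-map names.
    Merge an edge into its predecessor when the pair is fusible and the intermediate feature
    map has no other consumer, so the intermediate result never has to leave the chip."""
    consumers = {}
    for e in edges:
        consumers.setdefault(e["src"], []).append(e)

    fused, removed = [], set()
    for e in edges:
        nexts = consumers.get(e["dst"], [])
        if len(nexts) == 1 and (e["op"], nexts[0]["op"]) in FUSIBLE_PAIRS:
            n = nexts[0]
            fused.append({"op": e["op"] + "+" + n["op"], "src": e["src"], "dst": n["dst"],
                          "attrs": {**e["attrs"], **n["attrs"]}})
            removed.update({id(e), id(n)})
    return fused + [e for e in edges if id(e) not in removed]

graph = [{"op": "conv", "src": "fm0", "dst": "fm1", "attrs": {"kernel": 3, "pad": 1, "stride": 1}},
         {"op": "relu", "src": "fm1", "dst": "fm2", "attrs": {}}]
print(fuse_adjacent_ops(graph))   # single fused conv+relu edge fm0 -> fm2
```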
When performing the above optimization on the first IR, some merging schemes will inevitably achieve better computation efficiency than others. Merging is constrained by the storage resources, bandwidth and computing resources of the hardware platform, and a merged execution is not always better than executing the operations separately.
In view of this, the operation merging above can be implemented more efficiently by introducing subgraph templates. Merging the computing operations may then include: setting subgraph templates for which computing operations can be merged, obtaining at least one subgraph matching scheme of the computation graph of the first intermediate representation, and reconstructing the computation graph into the merged second intermediate representation based on that matching scheme. Preferably, the subgraph templates are determined from attribute information of the hardware platform on which the compiled instruction code will execute. Specifically, fusible subgraph templates are first set by subgraph isomorphism matching, then all possible fusion schemes are found in the computation graph, and the execution time of each fusion scheme is obtained from a fitted cost formula or from a simulator, so as to decide which fusion scheme is preferable.
When determining the overall merging scheme, introducing cost edges facilitates finding the optimal one. Merging the computing operations may then include: when multiple candidate merging schemes exist, adding, between the input and output FeatureMap nodes of each candidate scheme in the first intermediate representation, an edge whose weight corresponds to that scheme's execution cost, and solving for the optimal merging scheme as a shortest-path problem between nodes. In other words, in one embodiment, the optimal fusion scheme can be obtained by introducing cost edges into the computation graph of the first IR. For example, an edge can be added between the input and output FeatureMap nodes of each potential merging scheme, with the merging cost computed from the fitted model and the non-merged cost computed as well; the cost mainly comprises computation time, power consumption and similar information and can be stored as one of the edge's attributes. The coarse-grained compilation optimization problem is thereby converted into a graph optimization problem of finding the shortest path through the neural network from input to output.
Fig. 7 shows how computing-operation mergings are represented in the conventional computation graph of Fig. 5 and in the first IR of the present invention. On the left of Fig. 7, dashed boxes 1-5 mark five potential merging schemes in the conventional computation graph. After conversion into the first-IR computation graph of the invention, shown on the right of Fig. 7, those mergings become the corresponding five edges 1-5. The optimal merging scheme can thus be selected by finding the shortest path from input to output.
It should be understood that the subgraph templates and cost edges described above can be used in combination: for example, the subgraph templates first determine the candidate merging schemes, and the cost edges then simplify the search for the optimal one.
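Once cost edges are in place, choosing the merging plan is an ordinary shortest-path computation. The following sketch uses Dijkstra's algorithm on an illustrative cost graph; the numeric costs and node names are invented for the example.

```python
import heapq

def shortest_path_cost(cost_edges, source, sink):
    """cost_edges: {node: [(next_node, cost, scheme_label), ...]}.
    Each edge is one candidate merging scheme (or the non-merged baseline) with its fitted
    execution cost; the shortest path from input to output picks the overall merging plan."""
    best = {source: (0.0, [])}
    heap = [(0.0, source, [])]
    while heap:
        cost, node, plan = heapq.heappop(heap)
        if node == sink:
            return cost, plan
        if cost > best.get(node, (float("inf"), None))[0]:
            continue                      # stale queue entry
        for nxt, c, label in cost_edges.get(node, []):
            nc = cost + c
            if nc < best.get(nxt, (float("inf"), None))[0]:
                best[nxt] = (nc, plan + [label])
                heapq.heappush(heap, (nc, nxt, plan + [label]))
    return float("inf"), []

# fm0 -> fm2 either as two separate ops (3.0 + 2.0) or as one fused "conv+relu" super-edge (4.5)
edges = {"fm0": [("fm1", 3.0, "conv"), ("fm2", 4.5, "conv+relu")],
         "fm1": [("fm2", 2.0, "relu")]}
print(shortest_path_cost(edges, "fm0", "fm2"))   # fused path (4.5) wins over 5.0
```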
After the above optimization, the first IR of the invention yields the second IR. The second IR is a hypergraph whose nodes represent feature maps and whose edges represent combinations of computing operations. These combined operations result from the preceding coarse-grained optimization of the first IR, so the second IR can also be understood as an even coarser-grained IR. The combined operations come from All-Bank optimization, horizontal and vertical merging of computing operations, hardware-related heuristic merging methods, merging of dimension transformations, merging of equivalent transformations such as Conv + BN + Scale, and the like.
In one embodiment, the second IR is expressed in a domain-specific language (DSL) designed on the basis of the Scheme language. For example, the DSL can represent the IR at this level using Scheme, a dialect of Lisp, making full use of Scheme's simplicity and extending the IR's representation and parsing with its powerful language facilities (syntactic sugar).
Using a DSL for the hypergraph IR greatly facilitates the subsequent scheduling optimization based on the third IR. First, it is not limited to specific network structures, computation parameters or computation scales; second, the DSL can be used to test the hardware platform; and third, with such a DSL, tasks can be generated directly, without a corresponding network structure, to perform the cost measurements needed to fit the cost function used in graph optimization.
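The patent does not give the DSL's concrete syntax; purely as a hypothetical illustration, a hypergraph edge of the second IR could be rendered as a Scheme-style S-expression along the following lines.

```python
def edge_to_sexpr(op, src, dst, attrs):
    """Render one hypergraph edge of the second IR as a Scheme-style S-expression string.
    The concrete DSL keywords here are placeholders, not the patent's actual syntax."""
    attr_str = " ".join(f"({k} {v})" for k, v in sorted(attrs.items()))
    return f"({op} (input {src}) (output {dst}) {attr_str})"

# e.g. a fused conv+relu super-edge produced by the graph optimization above
print(edge_to_sexpr("conv+relu", "fm0", "fm2", {"kernel": 3, "pad": 1, "stride": 1}))
# -> (conv+relu (input fm0) (output fm2) (kernel 3) (pad 1) (stride 1))
```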
Optimization at the computation-graph level is mostly hardware-independent. One hardware-related aspect is the choice of subgraph templates, which can be adjusted to the design of the underlying hardware computation modules and to changes in the instruction-mapping strategy; the other is that when the cost of a fusion scheme is measured or calculated, the fusion cost changes with the actual measurements or with the fitting method.
In step S640, scheduling optimization is performed on the second intermediate representation to obtain a third intermediate representation with fine granularity.
The second IR is still a coarse-grained IR in hypergraph form, whose computation consists of multi-loop computing operations over multidimensional tensors.
Whether for a general-purpose processor or a dedicated accelerator, such multi-loop operations leave a large optimization space, in computation speed, memory footprint and so on. Because of the on-chip storage, computing-resource and bandwidth limits of the different hardware platforms carrying a DPU, the bit widths of different DPU instruction sets, and the different network topologies, computation parameters and parameter scales, various constraints must be satisfied when tiling the multi-loop computing operations; for example, when mapping a LOAD instruction, the amount of data loaded cannot exceed the size of the on-chip buffer allocated to it. Once these constraints are satisfied, the dependencies between instructions must be resolved. These include, for example, that all the data required by a computation must be loaded on-chip before the computing operation can run, and that when writing data to an on-chip cache address, the data previously at that address must not still be needed by other not-yet-executed instructions.
Thus, the optimization of step S640 that produces the third IR is a scheduling optimization based on the hardware platform, which determines a blocked execution scheme for the feature maps and/or weights and further determines the instruction dependencies among the execution instructions. Preferably, step S640 includes performing scheduling optimization on the second intermediate representation based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed, to obtain a third intermediate representation indicating a blocked execution scheme for the feature maps and/or weights, and further obtaining, based on the attribute information of the hardware platform, a third intermediate representation indicating the instruction dependencies among the execution instructions of that blocked execution scheme.
Preferably, the third IR is represented in a language that writes each computing operation as multiple nested loops. With an IR representation similar to Halide, each computing operation can be written as a multiple loop, so that during scheduling optimization a blocking can be determined under which the multiple loop occupies the least memory and achieves the best computational efficiency. The blocking must fully account for on-chip computing resources, storage resources, and bandwidth, and is determined in combination with the parameters of the specific neural network structure. In a preferred embodiment, the third IR makes it possible to generate an automated scheduling policy whose execution efficiency matches that of hand-written instructions.
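As a simplified illustration of such a blocking decision, the sketch below expresses the working set of a convolution tile as a function of an output-row tile size and picks the largest tile that fits an assumed on-chip buffer budget. The buffer size, data-type width, and cost model are invented for the example; real DPU constraints are hardware specific.

```python
# Hedged sketch: choose an output-row tile for a conv layer so that the input
# tile, the weights, and the output tile together fit an assumed on-chip buffer.

def bytes_for_tile(tile_rows, in_c, out_c, k, stride, out_w, dtype_bytes=1):
    # Rows of input needed to produce `tile_rows` rows of output.
    in_rows = (tile_rows - 1) * stride + k
    in_w = (out_w - 1) * stride + k
    input_bytes = in_rows * in_w * in_c * dtype_bytes
    weight_bytes = out_c * in_c * k * k * dtype_bytes
    output_bytes = tile_rows * out_w * out_c * dtype_bytes
    return input_bytes + weight_bytes + output_bytes

def pick_tile(out_h, out_w, in_c, out_c, k, stride, buffer_bytes):
    """Largest row tile whose working set fits the assumed on-chip buffer."""
    for tile_rows in range(out_h, 0, -1):
        if bytes_for_tile(tile_rows, in_c, out_c, k, stride, out_w) <= buffer_bytes:
            return tile_rows
    raise ValueError("even a single output row does not fit on chip")

# Example: 3x3 conv, 64 -> 128 channels, 56x56 output, 512 KiB buffer (assumed),
# 8-bit fixed-point data.
tile = pick_tile(out_h=56, out_w=56, in_c=64, out_c=128, k=3, stride=1,
                 buffer_bytes=512 * 1024)
print("output rows per LOAD/CONV/SAVE round:", tile)
```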
Subsequently, in step S650, the third intermediate representation is compiled into instruction code for execution on the hardware platform. For example, the blocked execution scheme and/or the instruction dependencies described above may be mapped to the hardware platform's instruction code by encoding the third IR.
The instruction code (also referred to as the fourth IR in the present invention) consists of concrete instructions for the hardware platform and embodies the mapping of the optimal blocking found during scheduling optimization onto those instructions. The back end of the invention mainly targets FPGA and ASIC designs of the DPU as well as embedded CPUs such as ARM. In one embodiment, the hardware platforms to which the compiler architecture of the present invention is adapted include at least one of: a dedicated neural network computing platform implemented on an FPGA or ASIC; a dedicated neural network computing platform implemented on a GPU; and a general-purpose computing platform such as an embedded CPU.
Here, the instruction set used by the DPU may be a coarse-grained instruction set, yet it is still the finest-grained IR compared with the previous three. The instructions act directly on various storage locations, including DDR, on-chip RAM, and registers. The instruction set can be roughly divided into memory management, computation control, logic control, and so on. For example, the LOAD instruction is responsible for carrying data from DDR onto the chip; the SAVE instruction stores data from the chip back to DDR; the CONV instruction is responsible for sending on-chip data to each convolution computing unit for computation and then performing some point-wise computations such as bias addition and ReLU, thereby greatly reducing the bandwidth pressure between the chip and DDR; the END instruction is responsible for writing an end signal to a designated register to indicate that the DPU has finished, so that subsequent computing operations can continue. In this way, the strategy obtained after scheduling optimization can be quickly mapped to the instructions of the DPU instruction set and translated into machine code that can be executed directly on the hardware, completing the whole compilation process.
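The sketch below illustrates, in simplified form, how a tiled schedule of this kind could be lowered to a LOAD/CONV/SAVE/END-style instruction stream with explicit dependencies (each CONV waits for its LOAD, each SAVE for its CONV). The operand encoding and the dependency representation are illustrative assumptions, not the actual DPU instruction set.

```python
# Hedged sketch: emit a coarse-grained instruction stream per tile, recording
# which earlier instruction each one depends on.

def emit_instructions(num_tiles):
    instrs = []  # list of instruction dicts: id, opcode, operands, dependency
    def emit(opcode, operands, depends_on=None):
        instrs.append({"id": len(instrs), "op": opcode,
                       "args": operands, "dep": depends_on})
        return len(instrs) - 1

    for t in range(num_tiles):
        load_id = emit("LOAD", {"tile": t})                      # DDR -> on-chip buffer
        conv_id = emit("CONV", {"tile": t}, depends_on=load_id)  # compute on chip
        emit("SAVE", {"tile": t}, depends_on=conv_id)            # on-chip -> DDR
    emit("END", {})                                              # signal completion
    return instrs

for ins in emit_instructions(num_tiles=2):
    print(ins)
```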
The intermediate representation generation method according to the invention has been described above in connection with figs. 4-7. Graph optimization that is independent (or nearly independent) of the back-end hardware is achieved with the coarse-grained first and second IRs, after which the fine-grained third IR is introduced to perform hardware-oriented optimization and to facilitate compilation into the final instruction code. Here, the first IR decouples the flow from the various deep learning front ends, the second IR characterizes the result of the coarse-grained graph optimization, the third IR carries the fine-grained hardware-oriented optimization, and the instruction code (also referred to as the fourth IR) is what the hardware platform finally executes.
In other embodiments, aspects of the present invention may also be implemented as an intermediate representation generation apparatus for neural network computations. Fig. 8 shows a schematic diagram of an intermediate representation generating device according to an embodiment of the invention.
As shown in fig. 8, the intermediate representation generating apparatus 800 includes a parsing unit 810 and a first intermediate representation generating unit 820. The parsing unit 810 is configured to parse the input model file to obtain topology information of the neural network. The first intermediate representation generating unit 820 is configured to use the feature map information and the calculation operation information in the topology information as nodes and edges, respectively, to generate a first intermediate representation in a graph form.
Preferably, the first intermediate representation further comprises node attributes and edge attributes. The node attributes comprise at least one of: dimension information and the height, width, and channel information of the feature map. The computing operation represented by an edge includes at least one of: convolution, pooling, dimension transformation, elementwise addition (eltwise), deconvolution, rearrangement (reorg), nonlinearity, batch normalization (BatchNorm), and scaling (Scale). The edge attributes comprise the parameters of the computing operation, including at least one of: convolution kernel size, padding (pad), stride, grouping, and dilation.
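To make the node/edge split concrete, the following sketch shows one possible in-memory form of such a first IR, with feature maps as attribute-carrying nodes and computing operations as attribute-carrying edges. The class and field names are illustrative assumptions rather than the patent's actual schema.

```python
# Hedged sketch of a first IR: feature maps as nodes, computing operations as
# edges; attribute names are illustrative.

class FirstIR:
    def __init__(self):
        self.nodes = {}   # name -> node attributes (feature-map shape etc.)
        self.edges = []   # computing operations between feature maps

    def add_feature_map(self, name, n, c, h, w):
        self.nodes[name] = {"dims": 4, "shape": (n, c, h, w)}

    def add_op(self, op_type, src, dst, **params):
        self.edges.append({"op": op_type, "from": src, "to": dst,
                           "params": params})

ir = FirstIR()
ir.add_feature_map("data",  1, 3, 224, 224)
ir.add_feature_map("conv1", 1, 64, 112, 112)
ir.add_feature_map("pool1", 1, 64, 56, 56)
ir.add_op("convolution", "data", "conv1", kernel=7, stride=2, pad=3)
ir.add_op("pooling", "conv1", "pool1", kernel=3, stride=2, pad=1)
print(len(ir.nodes), "feature maps;", len(ir.edges), "computing operations")
```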
In a preferred embodiment, the intermediate representation generating apparatus may further comprise units for generating the subsequent IRs. Fig. 9 shows a schematic diagram of an intermediate representation generating apparatus according to another embodiment of the invention. In addition to the parsing unit 910 and the first intermediate representation generating unit 920, the generating apparatus 900 of fig. 9 may further include second, third, and fourth IR generating units.
In particular, the second intermediate representation generating unit 930 may be configured to perform graph optimization on the first intermediate representation to generate the second intermediate representation in graph form. The second intermediate representation generating unit 930 may further include a computing operation merging unit configured to merge computing operations to obtain a second intermediate representation in the form of a hypergraph, with feature maps as nodes and the merged computing operations as edges.
In one embodiment, the computing operation merging unit may be configured to perform at least one of the following operations: merging a computing operation into a preceding or succeeding memory-access operation and/or computing operation; fusing a plurality of adjacent computing operations to reduce the number of data interactions between the hardware platform and external memory when the instruction code is executed; and decomposing a computing operation so that the decomposed computing operations can be fused with preceding or succeeding computing operations.
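A minimal sketch of one such merge follows, under the simplifying assumption that any chain of elementwise operations (BatchNorm, Scale, ReLU) directly following a convolution can be folded into that convolution edge; it operates on edge records shaped like those in the first-IR sketch above, and the fusion rule stands in for the patent's merging strategies rather than reproducing them.

```python
# Hedged sketch of a fusion pass: fold elementwise ops that directly follow a
# convolution into the convolution edge. Rules and op names are illustrative.

ELEMENTWISE = {"batch_norm", "scale", "relu"}

def fuse_elementwise(edges):
    fused, consumed = [], set()
    by_input = {}                      # feature-map name -> ops reading it
    for e in edges:
        by_input.setdefault(e["from"], []).append(e)

    for e in edges:
        if id(e) in consumed:
            continue
        group = dict(e, ops=[e["op"]])
        cur = e
        # Greedily absorb a chain of elementwise ops fed by this edge's output.
        while True:
            nexts = [n for n in by_input.get(cur["to"], []) if n["op"] in ELEMENTWISE]
            if e["op"] != "convolution" or len(nexts) != 1:
                break
            cur = nexts[0]
            consumed.add(id(cur))
            group["ops"].append(cur["op"])
            group["to"] = cur["to"]
        fused.append(group)
    return fused

edges = [
    {"op": "convolution", "from": "data", "to": "c1", "params": {"kernel": 3}},
    {"op": "batch_norm",  "from": "c1",   "to": "b1", "params": {}},
    {"op": "relu",        "from": "b1",   "to": "r1", "params": {}},
]
for e in fuse_elementwise(edges):
    print(e["ops"], e["from"], "->", e["to"])
# ['convolution', 'batch_norm', 'relu'] data -> r1
```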
In one embodiment, the computing operation merging unit may be further configured to: set a sub-graph template on which computing operations can be merged, acquire at least one sub-graph matching scheme of the computation graph for the first intermediate representation, and reconstruct the computation graph, based on the sub-graph matching scheme, into the second intermediate representation with merged computing operations. The sub-graph template may be determined based on attribute information of the hardware platform on which the instruction code compiled from the intermediate representation is to be executed.
In one embodiment, the computing operation merging unit may be further configured to: when a plurality of computing-operation merging schemes exist, add, between the input node and the output node of the first intermediate representation to which those merging schemes correspond, edges corresponding to the execution cost of each merging scheme, and solve for the optimal merging scheme as a shortest-path problem between the nodes.
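The cost-edge formulation above can be illustrated with a small shortest-path search: each candidate merging scheme between two feature-map nodes becomes an edge weighted by its execution cost, and Dijkstra's algorithm picks the cheapest end-to-end combination. The graph and the cost values below are invented for the example.

```python
import heapq

# Hedged sketch: candidate merging schemes are cost-weighted edges; the optimal
# combination is a shortest path. Costs are made-up numbers.

def shortest_path(graph, src, dst):
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, cost, scheme in graph.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, (u, scheme)
                heapq.heappush(heap, (nd, v))
    # Reconstruct the chosen chain of merging schemes.
    schemes, node = [], dst
    while node != src:
        node, scheme = prev[node]
        schemes.append(scheme)
    return dist[dst], list(reversed(schemes))

graph = {
    "in":  [("mid", 3.0, "conv+bn fused"), ("mid", 5.0, "conv then bn")],
    "mid": [("out", 2.0, "scale+relu fused"), ("out", 2.5, "scale then relu")],
}
cost, schemes = shortest_path(graph, "in", "out")
print(cost, schemes)   # 5.0 ['conv+bn fused', 'scale+relu fused']
```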
The third intermediate representation generating unit 940 may be configured to perform scheduling optimization on the second intermediate representation to obtain a fine-grained third intermediate representation. In particular, the third intermediate representation generating unit 940 may be configured to perform scheduling optimization on the second intermediate representation based on attribute information of the hardware platform that is to execute the instruction code compiled from the intermediate representation, so as to obtain a third intermediate representation indicating a blocked execution scheme for the feature maps and/or weights.
In one embodiment, the third intermediate representation generating unit 940 may be further configured to acquire, based on the attribute information of the hardware platform, a third intermediate representation indicating the instruction dependencies among the execution instructions of the blocked execution scheme for the feature maps and/or weights.
Preferably, the intermediate representation generating means may further comprise a fourth intermediate representation generating unit 950 (also referred to as compiling unit) for compiling said third intermediate representation into instruction code for execution on a hardware platform.
It is to be understood that the above-mentioned intermediate representation generating apparatus of the present invention may be implemented as a compiler architecture, and fig. 10 shows a schematic diagram of the compiler architecture according to an embodiment of the present invention. It should be understood that while fig. 10 shows the various modules described in detail, the example of fig. 10 may be viewed as a superposition of several preferred embodiments, which may be performed separately or in partial combination in different embodiments.
Compiler architecture 1000 can include a computation graph building module 1010, a computation graph optimization module 1020, and an instruction generation module 1030. Further, the computation graph building module 1010 may include a model file parsing module 1011 and a computation graph generation module 1012. The model file parsing module 1011 and the computation graph generation module 1012 may correspond to the parsing unit and the first intermediate representation generating unit described above, respectively.
The model file parsing module 1011 may include parsing sub-modules each corresponding to one type of model file, each parsing sub-module being used to parse the corresponding type of model file. As shown, the model file parsing module 1011 may include a Caffe parser and a TensorFlow parser for parsing model files produced by the Caffe and TensorFlow deep learning frameworks, respectively. The model file parsing module 1011 may also include parsers for other deep learning frameworks (shown as "other" in the figure). If support for a new deep learning framework needs to be added later, only one parser for that framework's models needs to be added; most of the subsequent optimization stages are framework independent, which improves the extensibility and compatibility of the compiler architecture.
Via the model file parsing module 1011, neural network models developed on different deep learning frameworks can be parsed into a framework-independent IR, the first IR of the present invention. By exploiting the features of the Python scripting language, the parsing of the model and the conversion into the IR can be implemented conveniently. The computation graph generation module 1012 can thus generate the first IR based on the parsing results of the corresponding parsing sub-modules.
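One way such a front end is commonly organised is a parser registry that dispatches on the framework name and returns a framework-independent topology description; the sketch below is a minimal version with placeholder parser bodies, not real Caffe or TensorFlow parsers.

```python
# Hedged sketch: one parser per deep-learning framework, all returning the same
# framework-independent topology description. The bodies are placeholders.

PARSERS = {}

def register_parser(framework):
    def wrap(fn):
        PARSERS[framework] = fn
        return fn
    return wrap

@register_parser("caffe")
def parse_caffe(model_file):
    # Placeholder: a real implementation would parse the prototxt/caffemodel.
    return {"feature_maps": ["data", "conv1"],
            "ops": [("convolution", "data", "conv1")]}

@register_parser("tensorflow")
def parse_tensorflow(model_file):
    # Placeholder: a real implementation would walk the GraphDef.
    return {"feature_maps": ["input", "conv2d"],
            "ops": [("convolution", "input", "conv2d")]}

def parse_model(framework, model_file):
    if framework not in PARSERS:
        raise ValueError(f"no parser registered for {framework}")
    return PARSERS[framework](model_file)

topology = parse_model("caffe", "example_model.prototxt")
print(topology["ops"])
```

With this structure, supporting a new framework only requires registering one additional parser; everything downstream of the first IR is unchanged.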
After the first IR is constructed, the first IR file (i.e., the "computation graph file" in fig. 10) may optionally be exported from the computation graph generation module 1012. The first IR (i.e., the "computation graph" in fig. 10) is then sent to the computation graph optimization module 1020, which can optimize it to obtain the second IR. There are many ways to optimize the computation graph. The computation graph optimization module 1020 may correspond to the second intermediate representation generating unit described above. In one embodiment, as shown in fig. 10, the computation graph optimization module 1020 may include a pruning module 1021, a fusion module 1022, and a decomposition module 1023.
The pruning module 1021 may be used to remove operations that are unnecessary or have no effect on the computation results. The fusion module 1022 may fuse computing operations, for example to reduce the number of data interactions between the hardware platform and external memory when the instruction code is executed. The decomposition module 1023 can decompose a computing operation so that the decomposed operations can be fused with preceding or succeeding operations, or otherwise processed.
The computational graph optimization module 1020 may also implement efficient optimization of the computational graph by introducing sub-graph templates and cost edges as previously described.
After the optimization in graph form is completed, a second IR file (i.e., the "optimized computation graph file" in fig. 10) may optionally be exported from the computation graph optimization module 1020. The second IR (i.e., the "optimized computation graph" in fig. 10) is then fed into the instruction generation module 1030 to complete the mapping of the optimized second IR to hardware platform (e.g., DPU) instructions. The instruction generation module 1030 may in turn include functionally separate modules, such as a scheduling optimization module that performs scheduling optimization on the second IR to generate the third IR, and an instruction code generation module that generates the hardware platform's instruction code from the third IR, corresponding respectively to the third and fourth intermediate representation generating units described above.
Likewise, the instruction generation module 1030 may map the blocked execution scheme and/or instruction dependencies described above to the hardware platform's instruction code by encoding the third IR.
In one embodiment, the compiler architecture of the present invention may further include a neural network forwarding module configured to provide a standard answer against which the result of the instruction code executed on the hardware platform is compared. Preferably, the neural network forwarding module may use at least part of the deep learning framework for which the model file was built to compute the standard answer; for example, the software computation is performed directly with the deep learning framework used for neural network training, and the standard answer is obtained for comparison. However, since the neural network algorithm introduces differences, due to fixed-point quantization and the like, once it is deployed on the hardware platform, the neural network forwarding module may modify at least a portion of the deep learning framework it uses, based on the hardware platform, to obtain the standard answer for the model file as it will be executed on that platform.
Because each deep learning framework uses a different underlying floating-point library, floating-point results differ across frameworks due to operation order, truncation methods, and other factors. Therefore, even for the same network structure, floating-point results differ considerably between deep learning frameworks; for fixed-point computation the difference shrinks but still exists. In addition, the computation parameters of different deep learning frameworks differ, for example in the handling of pad, the data layout order, the computation of average pooling, and the transformation used by Reorg. As the module that provides the standard answer, the forwarding module needs to eliminate this variability between frameworks. Alternatively, and more conservatively, the forwarding module may include operator implementations from the mainstream deep learning frameworks, modified to a certain extent so that the computing operations are completely consistent with the behavior of the back-end hardware, including shift operations, boundary-extension rules, and the like.
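A minimal version of the comparison such a forwarding module performs might quantize the reference framework's floating-point output the same way the hardware does and accept small fixed-point discrepancies. The 8-bit symmetric quantization and the one-quantum tolerance below are assumptions for illustration.

```python
# Hedged sketch: compare hardware-style fixed-point results against a
# floating-point reference. Quantization scheme and tolerance are illustrative.

def quantize(values, scale, bits=8):
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(lo, min(hi, round(v / scale))) for v in values]

def compare(hardware_fixed, reference_float, scale, max_diff=1):
    """Pass if every fixed-point value is within `max_diff` quanta of the
    quantized reference (absorbs rounding-order differences between frameworks)."""
    ref_fixed = quantize(reference_float, scale)
    return all(abs(h - r) <= max_diff for h, r in zip(hardware_fixed, ref_fixed))

reference = [0.12, -0.40, 0.95, 0.03]          # framework forward pass (float)
hardware = quantize(reference, scale=1 / 64)    # stand-in for DPU output
print(compare(hardware, reference, scale=1 / 64))   # True
```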
The compiler architecture of the present invention can also be regarded as a compiler in the broad sense: for the different stages of compilation, the algorithm is transformed and described using IRs of different granularities and different forms. The main purpose of these IRs is to reduce the optimization burden on the compiler developer as much as possible while generating the most efficient instructions, and to reduce the difficulty of use for the user as much as possible, so that the user need not attend to the implementation details of the algorithm and a truly end-to-end flow from algorithm design to hardware deployment is achieved.
FIG. 11 is a schematic structural diagram of a computing device that can be used to implement the above-described intermediate representation generation method according to an embodiment of the present invention.
Referring to fig. 11, computing device 1100 includes memory 1110 and processor 1120.
The processor 1120 may be a multi-core processor or may include multiple processors. In some embodiments, the processor 1120 may comprise a general-purpose host processor and one or more special-purpose coprocessors such as a graphics processing unit (GPU) or a digital signal processor (DSP). The memory 1110 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions needed by the processor 1120 or other modules of the computer. The permanent storage may be a readable and writable storage device, and may be a non-volatile device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage; in other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable volatile memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. In addition, the memory 1110 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) as well as magnetic and/or optical disks. In some embodiments, the memory 1110 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 1110 has stored thereon executable code which, when processed by the processor 1120, may cause the processor 1120 to perform the intermediate representation generation method described above.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the steps defined in the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (31)

1. An intermediate representation generation method for neural network computation, comprising:
analyzing the input model file to acquire topological structure information of the neural network;
and using the feature map information and the computing operation information in the topological structure information as nodes and edges, respectively, to generate a first intermediate representation in graph form.
2. The method of claim 1, wherein the first intermediate representation further comprises node attributes and edge attributes, the node attributes comprising at least one of:
dimension information and the height, width, and channel information of the feature map;
the computing operation represented by an edge comprises at least one of:
convolution, pooling, dimension transformation, elementwise addition (eltwise), deconvolution, rearrangement, nonlinearity, batch normalization (BatchNorm), and scaling (scale); and
the edge attributes comprise parameters of the computing operation, including at least one of:
convolution kernel size, padding (pad), stride, grouping, and dilation.
3. The method of claim 1, further comprising:
graph optimization is performed on the first intermediate representation to generate a second intermediate representation in the form of a graph.
4. The method of claim 3, wherein graph optimizing the first intermediate representation to generate a second intermediate representation in graph form comprises:
merging the computing operations to obtain a second intermediate representation in the form of a hypergraph, with feature maps as nodes and the merged computing operations as edges.
5. The method of claim 4, wherein merging the computing operations comprises at least one of:
removing operations that are unnecessary or have no effect on the computation result;
fusing a plurality of adjacent computing operations; and
decomposing a computing operation so that the decomposed computing operations can be fused with preceding or succeeding computing operations or otherwise processed.
6. The method of claim 4, wherein merging the computing operations comprises:
setting a sub-graph template on which computing operations can be merged, acquiring at least one sub-graph matching scheme of the computation graph for the first intermediate representation, and reconstructing the computation graph, based on the sub-graph matching scheme, into the second intermediate representation with merged computing operations.
7. The method of claim 6, wherein the sub-graph template is determined based on attribute information of a hardware platform on which instruction code compiled from the intermediate representation is to be executed.
8. The method of claim 4, wherein merging the computing operations comprises:
when a plurality of computing-operation merging schemes exist, adding, between the input node and the output node of the first intermediate representation to which the merging schemes correspond, edges corresponding to the execution cost of each merging scheme, and solving for an optimal merging scheme as a shortest-path problem between the nodes.
9. The method of claim 3, wherein the second intermediate representation is represented by a Domain-Specific Language (DSL) designed on the basis of the Scheme language.
10. The method of claim 3, further comprising:
and performing scheduling optimization on the second intermediate representation to obtain a third intermediate representation with fine granularity.
11. The method of claim 10, wherein performing scheduling optimization on the second intermediate representation to obtain a third intermediate representation of fine granularity comprises:
performing scheduling optimization on the second intermediate representation based on attribute information of a hardware platform on which instruction code compiled from the intermediate representation is to be executed, to obtain a third intermediate representation indicating a blocked execution scheme for feature maps and/or weights.
12. The method of claim 11, wherein performing scheduling optimization on the second intermediate representation, based on the hardware platform that is to execute the instruction code compiled from the intermediate representation, to obtain the third intermediate representation indicating the blocked execution scheme for the feature maps and/or weights comprises:
acquiring, based on the attribute information of the hardware platform, a third intermediate representation indicating instruction dependencies among the execution instructions of the blocked execution scheme for the feature maps and/or weights.
13. The method of claim 10, wherein the third intermediate representation is represented by a language that writes each computing operation as a multiple loop.
14. The method of claim 10, further comprising:
compiling the third intermediate representation into instruction code for execution on a hardware platform.
15. The method of claim 10, wherein the hardware platform comprises at least one of:
a dedicated neural network computing platform implemented on an FPGA or ASIC;
a dedicated neural network computing platform implemented on a GPU; and
a general purpose computing platform.
16. An intermediate representation generation apparatus for neural network computation, comprising:
a parsing unit configured to parse an input model file to acquire topological structure information of a neural network; and
a first intermediate representation generating unit configured to use the feature map information and the computing operation information in the topological structure information as nodes and edges, respectively, to generate a first intermediate representation in graph form.
17. The apparatus of claim 16, wherein the first intermediate representation further comprises node attributes and edge attributes, the node attributes comprising at least one of:
dimension information and the height, width, and channel information of the feature map;
the computing operation represented by an edge comprises at least one of:
convolution, pooling, dimension transformation, elementwise addition (eltwise), deconvolution, rearrangement, nonlinearity, batch normalization (BatchNorm), and scaling (scale); and
the edge attributes comprise parameters of the computing operation, including at least one of:
convolution kernel size, padding (pad), stride, grouping, and dilation.
18. The apparatus of claim 16, further comprising:
a second intermediate representation generating unit for performing graph optimization on the first intermediate representation to generate a second intermediate representation in the form of a graph.
19. The apparatus of claim 18, wherein the second intermediate representation generating unit further comprises:
a computing operation merging unit configured to merge computing operations to obtain a second intermediate representation in the form of a hypergraph, with feature maps as nodes and the merged computing operations as edges.
20. The apparatus of claim 19, wherein the computing operation merging unit is configured to perform at least one of:
removing operations that are unnecessary or have no effect on the computation result;
fusing a plurality of adjacent computing operations; and
decomposing a computing operation so that the decomposed computing operations can be fused with preceding or succeeding computing operations or otherwise processed.
21. The apparatus of claim 19, wherein the computing operation merging unit is further configured to:
set a sub-graph template on which computing operations can be merged, acquire at least one sub-graph matching scheme of the computation graph for the first intermediate representation, and reconstruct the computation graph, based on the sub-graph matching scheme, into the second intermediate representation with merged computing operations.
22. The apparatus of claim 21, wherein the sub-graph template is determined based on attribute information of a hardware platform on which instruction code compiled from the intermediate representation is to be executed.
23. The apparatus of claim 19, wherein the computing operation merging unit is further configured to:
when a plurality of computing-operation merging schemes exist, add, between the input node and the output node of the first intermediate representation to which the merging schemes correspond, edges corresponding to the execution cost of each merging scheme, and solve for an optimal merging scheme as a shortest-path problem between the nodes.
24. The apparatus of claim 18, wherein the second intermediate representation is represented by a Domain-Specific Language (DSL) designed on the basis of the Scheme language.
25. The apparatus of claim 18, further comprising:
and a third intermediate representation generating unit, configured to perform scheduling optimization on the second intermediate representation to obtain a fine-grained third intermediate representation.
26. The apparatus of claim 25, wherein the third intermediate representation generating unit is configured to:
perform scheduling optimization on the second intermediate representation based on attribute information of a hardware platform on which instruction code compiled from the intermediate representation is to be executed, to obtain a third intermediate representation indicating a blocked execution scheme for feature maps and/or weights.
27. The apparatus of claim 26, wherein the third intermediate representation generating unit is further configured to:
acquire, based on the attribute information of the hardware platform, a third intermediate representation indicating instruction dependencies among the execution instructions of the blocked execution scheme for the feature maps and/or weights.
28. The apparatus of claim 25, wherein the third intermediate representation is represented by a language that programs each computing operation as a multiple loop.
29. The apparatus of claim 25, further comprising:
a compiling unit for compiling the third intermediate representation into instruction code for execution on a hardware platform.
30. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-15.
31. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-15.



GR01 Patent grant