Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only some embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises," and the like, means that the element or item preceding the word covers the elements or items listed after the word and their equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are used merely to indicate relative positional relationships; when the absolute position of the object being described changes, the relative positional relationships may change accordingly.
To keep the following description of the embodiments of the present disclosure clear and concise, a detailed description of some known functions and components has been omitted from the present disclosure.
Due to the limitation of Moore's law, the conventional approach of increasing a processor's clock frequency now yields only marginal improvements in chip computing power. With the development of diversified computing, a variety of different processing units, such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), a GPU (Graphics Processing Unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), and the like, are being introduced into an increasing number of scenarios to accelerate computing, and heterogeneous computing is thus performed.
Chip-level (SoC) heterogeneous computing addresses this computational challenge by performing heterogeneous computation across circuits with different process technologies and different architectures. Technically, a heterogeneous multi-core architecture is usually adopted to build a system on chip so as to improve the computing power of the whole system on chip. A heterogeneous system on chip utilizes cooperative computing and mutual acceleration among processing units with different types of instruction sets and architectures, thereby breaking through the development bottleneck of a single processing unit.
FIG. 1 is a schematic architecture diagram of a heterogeneous system on a chip. For example, the system-on-chip may be used in complex deep learning application scenarios.
As shown in FIG. 1, one possible heterogeneous system on chip 100 is composed of a central processing unit 110, a neural network processing unit 120, a digital signal processing unit 130, a first memory 140, a second memory 150, and the necessary communication links 160.
Devices such as computers, cell phones, and embedded devices typically have heterogeneous architectures with at least one central processing unit 110 and other heterogeneous processing units. For example, such devices are typically hosted by the central processing unit 110, which dispatches specialized computing tasks to the other processing units. The digital signal processing unit 130 is characterized by high processing speed, high flexibility, and strong specialization, and has advantages over the central processing unit 110 in high-speed operation scenarios and digital processing. The neural network processing unit 120 imitates human neurons and synapses at the circuit level and directly processes large-scale neurons and synapses with a deep learning instruction set, where one instruction completes the processing of a group of neurons; compared with the central processing unit 110 and the digital signal processing unit 130, the neural network processing unit 120 integrates storage and computation through synaptic weights, thereby improving operational efficiency, and is generally used to execute instructions related to deep learning. The first memory 140 is used to store intermediate operation results and serves as a cache for the second memory 150, generally having a faster access rate than the second memory 150. The communication link 160 is used to transfer data and control commands among the central processing unit 110, the neural network processing unit 120, the digital signal processing unit 130, the first memory 140, and the second memory 150.
In addition, there may be internal memory areas inside the central processing unit 110, the neural network processing unit 120, and the digital signal processing unit 130. In general, the access rate of these internal memory areas is faster than that of the first memory 140, and the access rate of the first memory 140 is faster than that of the second memory 150.
Operations in neural network models, such as weighted summation, convolution, and activation functions, are organized into a graph structure composed of nodes and edges, commonly referred to as a computational graph. The computational graph is a directed graph: nodes in the computational graph represent specific operations or variables in the neural network, edges in the computational graph represent the directions of data flow, and the computational graph thus represents a computational process as a graph. For example, a variable may provide its value to an operation, and the operation may output its computed result and provide it to other operations; in this way, each node in the graph defines a function of its variables, which may also be referred to as an operator. The values entering and exiting a node are called tensors, which have different dimensions and include scalars, vectors, matrices, and higher-order tensors; for example, a 0-dimensional tensor is also called a scalar.
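To make the node-and-edge structure concrete, the following minimal Python sketch (illustrative only, not part of the disclosure; the operation names are assumptions) models a computational graph in which variable nodes and operation nodes form a directed graph and values flow along the edges:

```python
# Minimal computational-graph sketch: nodes are operations or variables,
# edges carry tensor values from inputs to the node that consumes them.
class Node:
    def __init__(self, op, inputs=()):
        self.op = op              # operation name, or "var" for a variable node
        self.inputs = list(inputs)

    def evaluate(self, env):
        """Evaluate this node; `env` maps variable nodes to their values."""
        if self.op == "var":
            return env[id(self)]
        vals = [n.evaluate(env) for n in self.inputs]
        if self.op == "add":
            return vals[0] + vals[1]
        if self.op == "mul":
            return vals[0] * vals[1]
        raise ValueError(f"unknown operator: {self.op}")

# Build y = (a + b) * a as a small directed graph and evaluate it.
a = Node("var")
b = Node("var")
y = Node("mul", [Node("add", [a, b]), a])
result = y.evaluate({id(a): 2.0, id(b): 3.0})  # (2 + 3) * 2 = 10.0
```

Each `Node` here plays the role of an operator: it defines a function of the values arriving on its incoming edges, matching the description above.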
Fig. 2 is a flow chart of compiling a computation graph.
As shown in FIG. 2, the conversion of the computation graph 210 into execution instructions in the heterogeneous system-on-chip 100 requires processing by the compiler 200.
As shown in FIG. 2, first, in step 210, the compiler 200 constructs an intermediate representation from the computation graph; thereafter, in step 220, the compiler 200 optimizes the constructed intermediate representation to obtain an executable file, which contains executable instructions, or a sequence of executable instructions, for the target hardware platform (e.g., the heterogeneous system-on-chip 100).
For example, an Intermediate Representation (IR) is a data structure used by a compiler for analysis and transformation, between receiving a source program for semantic analysis and outputting an executable program.
The heterogeneous system on chip 100 is composed of a plurality of different processing units, and designing a unified compilation framework that supports different hardware back-end platforms is a major challenge that the rapid development of deep learning algorithms and hardware poses to deep learning compiler design.
Currently, existing compilation frameworks attempt to address this problem of generality. For example, the TensorFlow platform provided by Google compiles computational graphs via XLA (Accelerated Linear Algebra) to generate a graph-based HLO (High Level Optimizer) IR, which resembles a high-level language. After optimizing the HLO IR, it generates an LLVM (Low Level Virtual Machine) IR, which is a low-level representation similar to a RISC instruction set; the LLVM IR is then compiled into assembly language for the various hardware platforms (i.e., the different processing units of the heterogeneous system on chip).
However, the following three problems exist in this compiling method:
1. Since HLO IR is an intermediate representation based on a computational graph, it differs in structure and level of abstraction from the instructions supported by hardware; the conversion overhead from HLO IR to LLVM IR is therefore relatively large, and collaborative optimization is difficult.
2. HLO IR defines common atomic-granularity operators, such as operators that perform basic computations like addition, subtraction, dot product, broadcast, and maximum, which is a notable feature of HLO IR. This approach conveniently represents the calling relations within computational graphs and implements complex computations, for the characteristics of different hardware, by combining the basic operators supported by HLO. However, when this approach is applied to a heterogeneous system on chip for neural network computing, some processing units can directly support complex operators, or operators not supported by HLO IR; in such cases, the atomic-granularity operators defined by HLO IR cannot fully exploit the computing power of those processing units, and instead increase the conversion overhead.
3. Heterogeneous systems on chip are composed of different processing units that require different forms of intermediate representation, such as requiring conversion of HLO IR to LLVM IR adapted to the different processing units, which is inefficient and increases compiler development costs.
At least one embodiment of the present disclosure provides a compiling method, a compiling apparatus, an electronic device, and a non-transitory computer-readable storage medium. The compiling method comprises the following steps: acquiring a first intermediate representation corresponding to an object to be compiled, wherein the first intermediate representation is an intermediate representation based on a graph; performing multi-stage conversion and optimization on the first intermediate representation to obtain a second intermediate representation; and obtaining executable instructions corresponding to each processing unit according to the second intermediate representation, wherein the multi-stage conversion and optimization comprises operator fusion of the first intermediate representation and addition of hardware platform characteristic information related to various processing units.
The compiling method provides a unified multi-level intermediate representation definition method. To address the diversity of hardware platforms, the graph-based intermediate representation is gradually optimized and converted into a low-level intermediate representation related to the hardware platform, which reduces the conversion span between the graph-based first intermediate representation and the executable instructions supported by the processing units, reduces the conversion overhead from the computation graph to hardware execution instructions, and reduces the cost of constructing a domain-specific compiler.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments.
Fig. 3 is a schematic flow chart of a compiling method according to at least one embodiment of the disclosure.
For example, the compiling method is applicable to heterogeneous devices, e.g., the heterogeneous devices are heterogeneous systems on a chip, e.g., the heterogeneous devices may include a plurality of different kinds of processing units. For example, the plurality of different kinds of processing units includes at least two of a central processing unit, a graphics processing unit, a neural network processing unit, and a digital signal processing unit.
For example, one possible architecture of the heterogeneous device is the architecture shown in FIG. 1; it should be noted, however, that the heterogeneous system on chip shown in FIG. 1 is only one possible architecture, and the disclosure is not limited thereto. In the present disclosure, the heterogeneous device is composed of a plurality of heterogeneous processing units, where heterogeneous processing units are processing units having different types of instruction sets or architectures. For example, the heterogeneous device may also adopt an architecture of a central processing unit + a graphics processing unit, or an architecture of a central processing unit + a neural network processing unit + a digital signal processing unit, and the like. The number of each processing unit in each architecture may be more than one, and the present disclosure does not specifically limit the type and number of the processing units.
For example, as shown in FIG. 3, the compiling method provided by the embodiment of the disclosure includes steps S10 to S30.
In step S10, a first intermediate representation corresponding to the object to be compiled is obtained.
For example, the first intermediate representation is a graph-based intermediate representation.
For example, step S10 may include: acquiring operation information of an object to be compiled, wherein the operation information is in a calculation graph form of the object to be compiled; and converting the object to be compiled into a first intermediate representation according to the operation information.
For example, the object to be compiled is used to perform neural network computations. For example, the object to be compiled may be an algorithm program for performing neural network operations, the algorithm program being written in any feasible high-level language, such as Python, and the like.
For example, according to the neural network operations to be performed by the object to be compiled, such as weighted summation, convolution, and activation functions, a graph structure composed of nodes and edges is organized, and this graph structure is the operation information in the form of a computation graph.
For example, the operational information may be processed using an existing compilation framework to generate an intermediate representation. For example, converting the object to be compiled into the first intermediate representation according to the operation information may include: compiling the operation information through an accelerated linear algebra compiler to obtain a high-order optimized intermediate representation; the higher order optimized intermediate representation is taken as the first intermediate representation.
For example, the operational information is compiled into HLO IR by XLA, which is taken as the first intermediate representation here.
It should be noted that, those skilled in the art may also use other compiling frameworks and compiling methods to obtain the first intermediate representation here, and the obtained first intermediate representation may be a graph-based intermediate representation, or the first intermediate representation may be an intermediate representation in a higher-order and higher-level language.
Then, on this basis, to address the diversity of hardware platforms, the first intermediate representation is optimized and converted step by step into intermediate representations related to the hardware platform.
In step S20, the first intermediate representation is subjected to multi-level conversion and optimization to obtain a second intermediate representation.
For example, the multi-level transformation and optimization includes operator fusion of the first intermediate representation and addition of hardware platform characterization information associated with the various processing units.
For example, the input of the conversion and optimization of each stage is an intermediate representation, the output is also an intermediate representation, and the intermediate representation of the output is subjected to operator fusion relative to the intermediate representation of the input or is added with some hardware platform characteristic information.
For example, when the first intermediate representation is a graph-based intermediate representation, the first intermediate representation includes a plurality of operation operators, which are operators of atomic granularity, such as operators that implement basic operations like addition, subtraction, dot product, and maximum.
A hardware processing unit may directly support operation operators, and may also directly support more complex operators formed by combining a plurality of operation operators; for example, the neural network processing unit directly supports an activation function operator. In this case, operator fusion needs to be performed on the operation operators: one or more operation operators that implement a complex operator's function are combined into an operator form directly supported by the processing unit, so that the computing power of each processing unit in the heterogeneous device is utilized to the maximum and the conversion span from the graph-based intermediate representation to the lower-level intermediate representation is reduced.
In addition to operator fusion, the present disclosure also adds hardware platform characteristic information related to the various processing units through at least one level of conversion and optimization. For example, this hardware platform characteristic information provides the hardware operation information required for a processing unit to execute each operator, such as the shape of input and output data, the data storage location, and the data precision. In addition, the hardware platform characteristic information may also provide hardware optimization information, including conventional optimizations such as dead code elimination, constant folding, and common subexpression elimination, and may also enable more specific custom optimizations according to the specific needs of each processing unit in combination with its hardware characteristics.
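To illustrate two of the conventional optimizations named above, the following Python sketch (the three-address IR format here is a hypothetical toy, not the disclosure's representation) applies constant folding followed by dead code elimination to a small instruction list:

```python
# Toy IR: each instruction is (destination, opcode, args).
def constant_fold(instrs):
    """Replace operations whose inputs are all known constants with constants."""
    consts, out = {}, []
    for dest, op, args in instrs:
        vals = [consts.get(a) for a in args]
        if op == "const":
            consts[dest] = args[0]
            out.append((dest, op, args))
        elif op == "add" and all(v is not None for v in vals):
            consts[dest] = sum(vals)                 # fold add of two constants
            out.append((dest, "const", [consts[dest]]))
        else:
            out.append((dest, op, args))
    return out

def dead_code_elim(instrs, live_outputs):
    """Drop instructions whose results are never used, scanning backwards."""
    live, out = set(live_outputs), []
    for dest, op, args in reversed(instrs):
        if dest in live:
            out.append((dest, op, args))
            if op != "const":
                live.update(args)                    # inputs become live too
    return list(reversed(out))

prog = [("x", "const", [2]), ("y", "const", [3]),
        ("z", "add", ["x", "y"]),        # folds to const 5
        ("unused", "add", ["x", "x"])]   # result never used: eliminated
optimized = dead_code_elim(constant_fold(prog), live_outputs=["z"])
```

After both passes, only the single constant defining the live output `z` remains, which is the effect these conventional optimizations aim for regardless of the IR they run on.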
For example, the multi-stage conversion and optimization may include two or more stages of conversion and optimization. For example, the first intermediate representation is subjected to N stages of conversion and optimization, where N is a positive integer greater than or equal to 2: optimization and operator fusion common to the processing units are performed in the first-stage conversion, and the required hardware platform characteristic information is added in a targeted manner, in combination with the hardware characteristics, in the subsequent N-1 stages of conversion and optimization. In this way, the first-stage conversion and optimization are decoupled from the hardware characteristics and focus on conversion and optimization at the level of operator combination and operator definition, while descriptions directly related to the hardware characteristics are added step by step in the subsequent N-1 stages of conversion. This decouples the computation graph from the hardware platform environment in which it is finally executed, and avoids completing both the operator-level work and the strongly hardware-related work in the first-stage conversion, which improves the extensibility of the compiling method and reduces the cost of constructing a domain-specific compiler.
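The N-stage structure described above can be sketched as a pass pipeline in which each stage consumes and produces an intermediate representation; the pass names, the "npu" unit, and the "f16" precision below are illustrative assumptions, not the disclosure's actual stages:

```python
# Stage 1: hardware-independent work (operator fusion / common optimization).
def level1_fuse(ir):
    return {**ir, "ops": ["fused_" + op for op in ir["ops"]]}

# Stage 2..N: hardware-dependent work, attaching platform characteristic
# information (execution unit, data precision, ...) to each operator.
def level2_annotate(ir):
    ops = [{"op": op, "unit": "npu", "dtype": "f16"} for op in ir["ops"]]
    return {**ir, "ops": ops}

def lower(ir, passes):
    """Apply the N conversion/optimization levels in order; each level's
    input and output are both intermediate representations."""
    for p in passes:
        ir = p(ir)
    return ir

second_ir = lower({"ops": ["matmul", "relu"]}, [level1_fuse, level2_annotate])
```

Because stage 1 never inspects hardware details, swapping in a different back end only requires replacing the later annotation stages, which is the decoupling the paragraph above argues for.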
For example, FIG. 4 is a schematic flowchart of step S20 provided in at least one embodiment of the present disclosure. As shown in FIG. 4, step S20 includes at least steps S201 to S203.
In step S201, at least one reference operator is obtained.
For example, each reference operator can be directly supported by at least one processing unit.
For example, statistics may be performed on neural network models of multiple structures in advance, and common operators that may be needed in calculation of various neural network models, such as an activation function operator, a matrix convolution calculation operator, a pooling operator, and the like, may be determined. And, from these statistically derived operators, one or more operators that can be directly supported by at least one processing unit in the heterogeneous device are determined as reference operators, for example, a neural network processing unit can directly support the operation of an activation function operator, which can be taken as a reference operator.
For example, the reference operators may also include an operator that is not supported by the first intermediate representation. For example, suppose the first intermediate representation does not directly support a matrix convolution calculation operator and instead implements matrix convolution by combining a plurality of operation operators, but the neural network processing unit can directly support matrix convolution calculation; then the reference operators may also include a matrix convolution calculation operator.
For example, the reference operator needs to be an operator that can be directly supported by at least one processing unit, and the reference operator may be changed according to the change of different types of processing units included in the heterogeneous device, that is, different heterogeneous devices may correspond to different reference operators.
For example, a reference operator in the present disclosure may be an operator with a granularity larger than that of an operation operator, e.g., an operator whose function is implemented by combining a plurality of operation operators. Alternatively, the reference operators may also include operators with the same granularity as operation operators; for example, for basic operations such as addition and subtraction that are directly supported by a processing unit, the reference operators may also include operators performing those basic operations.
In step S202, a plurality of operation operators are converted according to at least one reference operator to obtain an operator fusion intermediate representation.
For example, step S202 may include: determining, from the plurality of operation operators, at least one operation operator that performs the function of a target reference operator, where the target reference operator is any one of the at least one reference operator; and converting the at least one operation operator into the expression form of the target reference operator to obtain the operator fusion intermediate representation.
For example, the first intermediate representation includes a plurality of operation operators, and these operation operators can, in combination, perform the functions of an activation function operator and a matrix convolution calculation operator. For example, A operation operators that perform the function of the activation function operator are converted into the expression form of the activation function operator, and B operation operators that perform the function of the matrix convolution calculation operator are converted into the expression form of the matrix convolution calculation operator, thereby obtaining the operator fusion intermediate representation, where A and B are positive integers, and the activation function operator and the matrix convolution calculation operator are reference operators.
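The conversion of operation-operator sequences into reference-operator expression forms can be sketched as a pattern-rewriting step over the operator stream; the pattern table and the "ehlo." operator names below are illustrative assumptions only, not the disclosure's actual patterns:

```python
# Map a sequence of atomic operation operators to the reference operator
# whose function that sequence implements (patterns are hypothetical).
FUSION_PATTERNS = {
    ("broadcast_zero", "maximum"): "ehlo.relu",  # max(x, 0) -> activation op
    ("mul", "reduce_sum"): "ehlo.dot",           # elementwise mul + sum -> dot
}

def fuse(op_sequence):
    """Greedily replace matching operator windows with reference operators."""
    ops, out, i = list(op_sequence), [], 0
    while i < len(ops):
        for pattern, ref_op in FUSION_PATTERNS.items():
            if tuple(ops[i:i + len(pattern)]) == pattern:
                out.append(ref_op)       # fold the whole window into one op
                i += len(pattern)
                break
        else:
            out.append(ops[i])           # no pattern matched: keep as-is
            i += 1
    return out

fused = fuse(["broadcast_zero", "maximum", "add"])
```

A real implementation would match over the graph IR rather than a flat list, but the effect is the same: several atomic-granularity operators collapse into one reference operator directly supported by a processing unit.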
For example, converting at least one operator into an expression form of a target reference operator to obtain an operator fused intermediate representation may include: determining a first namespace; and converting at least one operation operator into an expression form of a target reference operator in the first namespace to obtain an operator fusion intermediate representation.
For example, determining the first namespace may include: obtaining a first name space based on a multi-level intermediate representation architecture; at least one reference operator is defined in a first namespace.
For example, Multi-Level Intermediate Representation (MLIR) is intended to provide a multi-level IR infrastructure, which uses modular and extensible features to solve the problem of interaction between IRs and compiles, in a highly consistent way, into the assembly language of a specific hardware platform. MLIR is an infrastructure rather than a specific IR or IR definition method; it emphasizes continuously optimizing and converting computational graphs through multiple levels of IR to finally generate hardware-executable code, and users need to customize specific IRs according to the conditions of their heterogeneous SoC.
Here, we use the Dialect class provided by MLIR, defined in MLIR's TableGen syntax, as a base class or template. A class in TableGen is similar in meaning to a class in the C++ language, so Dialect can be used as a template or base class from which subclasses are derived. For example, we define the subclass EHLO_Dialect, derived from Dialect, to construct a first namespace. The Dialect template is responsible for defining various operations and analyses, while also being extensible; for example, the first namespace defined by EHLO_Dialect derived from Dialect is called "ehlo".
It should be noted that we here name the derived subclass EHLO_Dialect and the first namespace "ehlo", but this disclosure does not limit the names of derived subclasses or namespaces.
For example, defining at least one reference operator in the first namespace may include: defining, in the first namespace, the description of each reference operator, the operation parameters and parameter types of each reference operator, and the operation result and result type of each reference operator.
For example, in the first namespace ehlo, basic parameters such as the data types used to describe the edges of the computation graph (i.e., tensor data) are defined, and the main structure of each reference operator is defined as well. For example, each reference operator is derived from the class Op, the primary structure provided by MLIR for defining operators, and the derived reference operator is specifically described by specifying the operator name ("mnemonic") and the operator constraints ("traits").
For example, each reference operator defines, in the first namespace, an annotation ("summary"), a description ("description"), the operation parameters and operation parameter types required for the operation ("arguments"), and the operation results and operation result types ("results"). For specific illustrations of the reference operators and the first namespace, reference may be made to the descriptions of FIGS. 6B to 6D later; details are not repeated here.
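As an illustrative sketch only (the operator name, traits, and type constraints here are hypothetical, and the disclosure's actual definitions may differ), a dialect and one reference operator with its "summary", "description", "arguments", and "results" fields might be declared in MLIR's TableGen (ODS) syntax roughly as follows:

```tablegen
// Hypothetical sketch of a first-namespace dialect and one reference
// operator, modeled on MLIR ODS conventions; not the disclosure's code.
def EHLO_Dialect : Dialect {
  let name = "ehlo";                 // the first namespace
  let cppNamespace = "::ehlo";
}

// Base class for operators in this dialect: mnemonic + traits.
class EHLO_Op<string mnemonic, list<Trait> traits = []>
    : Op<EHLO_Dialect, mnemonic, traits>;

def EHLO_ReluOp : EHLO_Op<"relu", [Pure]> {
  let summary = "element-wise ReLU activation";
  let description = [{ Applies max(x, 0) element-wise to the input tensor. }];
  let arguments = (ins AnyTensor:$input);    // operation parameters and types
  let results   = (outs AnyTensor:$output);  // operation result and type
}
```

MLIR's TableGen tooling generates the corresponding C++ operator classes from such definitions, so each reference operator only needs to be declared once in the namespace.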
For example, the entire first intermediate representation is traversed, and all operation operators are converted into the reference operators defined in the first namespace, i.e., replaced with the expressions of the reference operators defined in the first namespace. Thus, the first intermediate representation is fully converted into an expression defined in the first namespace, resulting in the operator fusion intermediate representation.
For example, in the operator fusion intermediate representation, a complex calculation implemented by a plurality of operation operators may be represented by one reference operator. In addition to the operation operators that combine to complete complex computations, operation operators directly supported by some hardware, such as the maximum value operator directly supported by the neural network processing unit, are also converted in the operator fusion intermediate representation into the expressions of the corresponding reference operators in the first namespace.
Furthermore, in the first namespace, optimizations common to several individual processing units may also be defined, such as constant folding, dead code elimination, common subexpression elimination, and so on.
The present disclosure provides a general operator fusion conversion and definition mode, in which fusion and optimization that are general with respect to the individual hardware units are performed, and small-granularity operation operators in the first intermediate representation are combined into complex operators that can be directly supported by the processing units. This approach is highly extensible and reduces the conversion span from the graph-based first intermediate representation to the executable instructions supported by the processing units.
For example, after the first-level conversion and optimization is completed in step S202, the subsequent N-1-level conversion and optimization is completed in step S203, and the required hardware platform characteristic information is added in a targeted manner in combination with the hardware characteristics.
In step S203, at least one stage of conversion and optimization is performed on the operator fusion intermediate representation according to the hardware platform characteristics of each processing unit, and hardware platform characteristic information related to a plurality of processing units is added stage by stage to obtain a second intermediate representation.
For example, step S203 may include: and adding a description directly related to the hardware platform characteristic of each processing unit in the operator fusion intermediate representation through at least one stage of conversion and optimization to obtain a second intermediate representation, wherein the second intermediate representation contains the hardware platform characteristic information of each processing unit.
For example, in the at least one stage of conversion and optimization, optimizations more specific to the hardware characteristics of the respective processing units are mainly performed. For example, the number of conversion stages can be designed according to the requirements of the actual conversion and optimization; each stage further optimizes and converts, with respect to the hardware, the intermediate representation output by the previous stage. Multi-stage conversion and optimization can reduce the conversion span from the operator fusion intermediate representation to the executable instructions supported by the hardware platform, thereby reducing the conversion overhead.
For example, adding, through at least one stage of conversion and optimization, a description directly related to the hardware platform characteristics of each processing unit in the operator fusion intermediate representation to obtain a second intermediate representation may include: determining at least one second namespace in one-to-one correspondence with the at least one level of translation and optimization; step-by-step converting a plurality of first intermediate operators included in the operator fusion intermediate representation into expression forms in corresponding second namespaces, wherein the expression form of each first intermediate operator in the corresponding second namespaces contains hardware platform characteristic information of a processing unit executing each first intermediate operator; and taking the intermediate representation obtained by the last stage of conversion and optimization as a second intermediate representation.
For example, according to the hardware characteristics of each processing unit, such as the supported data precision, the storage locations of input and output data during an operator's execution, the stride information of a convolution operator, and the pooling type of a pooling operator (maximum pooling or average pooling), corresponding hardware platform characteristic information is added to each first intermediate operator to give it a more specific expression.
For example, the first intermediate operator is an expression form of the aforementioned reference operator in the first namespace.
For example, determining at least one second namespace in one-to-one correspondence with the at least one stage of conversion and optimization may include: obtaining the at least one second namespace based on the multi-level intermediate representation architecture; and defining the expression forms respectively corresponding to the plurality of first intermediate operators in each second namespace, wherein the hardware platform characteristic information of the processing unit executing each first intermediate operator is added to the corresponding expression form of that first intermediate operator in each second namespace.
Here, subclasses are likewise derived, using the Dialect class provided by MLIR as a base class or template, through the TableGen syntax of MLIR. For example, a subclass EIR_Dialect derived from Dialect is defined to construct a second namespace. The Dialect template is responsible for defining various operations and analyses, and is also extensible; e.g., the namespace defined by EIR_Dialect derived from Dialect is called "EIR".
For example, when there are multiple second namespaces, different second namespaces are constructed from different derived subclasses, all derived from Dialect. For example, different second namespaces can have different names. For example, in the different second namespaces, different hardware optimizations may be performed according to the different hardware platform characteristic information defined therein.
It should be noted that, herein, the derived subclass is named EIR_Dialect and the corresponding second namespace is named "EIR", but this disclosure does not limit the names of the derived subclasses or the names of the namespaces.
For example, defining the expression forms respectively corresponding to the plurality of first intermediate operators in each second namespace may include: for each first intermediate operator, determining the processing unit executing the first intermediate operator; adding an execution unit identifier to the first intermediate operator according to the processing unit; determining hardware platform characteristic information of the processing unit, wherein the hardware platform characteristic information comprises hardware operation information or hardware optimization information, the hardware operation information comprises one or more of the data shape, data precision, and data storage location corresponding to the processing unit, and the hardware optimization information comprises one or more of hardware storage analysis, hardware power consumption minimization, data blocking, and hardware storage quantization; and defining, in each second namespace according to the hardware platform characteristic information of the processing unit, the first intermediate operator and its hardware platform characteristic information, wherein different hardware platform characteristic information is defined in different second namespaces.
For example, for a first intermediate operator among the plurality of first intermediate operators, if the first intermediate operator can be executed on multiple processing units, the processing unit executing it is determined by analysis in terms of execution efficiency, resource consumption, resource allocation, and the like. For example, suppose the first intermediate operator is an activation function operator that is directly supported by both the graphics processing unit and the neural network processing unit, and the neural network processing unit has higher execution efficiency; it is then determined that the processing unit corresponding to the activation function operator is the neural network processing unit, and a corresponding execution unit identifier is added to the activation function operator to mark that it is executed by the neural network processing unit.
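A minimal sketch of this selection, assuming a simple cost model in which execution efficiency dominates and lower resource consumption breaks ties; the scoring scheme and the numeric values are illustrative assumptions only:

```python
def select_unit(candidates):
    """candidates: list of (unit_name, exec_efficiency, resource_cost).
    Prefer the highest execution efficiency; break ties by lower cost."""
    return max(candidates, key=lambda c: (c[1], -c[2]))[0]

# The activation function operator is supported by both the GPU and the
# neural network processing unit (NPU); the NPU executes it faster.
unit = select_unit([("GPU", 0.6, 2.0), ("NPU", 0.9, 1.0)])
# An execution unit identifier for `unit` would then be attached to the operator.
```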
For example, after the processing unit executing the first intermediate operator is determined, the hardware operation information required by the processing unit when running the operator and the hardware optimization information required for hardware-specific optimization can be determined.
For example, the hardware operation information provides the hardware parameters required when the processing unit executes the operator, such as the data shapes of the input data and output data (including tensor dimensions, data types, and the like), and may further include data precision, data storage locations, and the like.
For example, the hardware optimization information includes general optimizations such as constant folding, dead code elimination, common subexpression elimination, and the like, and may further include custom optimization information more closely tied to the hardware characteristics, such as hardware storage analysis, hardware storage quantization, data blocking, and the like, which can greatly reduce data movement and resource waste. In addition, the hardware optimization information may also include memory limitations, hardware power consumption minimization, etc., according to the characteristics of the processing unit.
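As one concrete instance of the general optimizations named above, a toy constant-folding pass over a tiny expression IR might look as follows; the tuple-based IR encoding is an assumption for illustration only:

```python
def constant_fold(expr):
    """expr: ('const', v), ('var', name), or (op, lhs, rhs), op in {'+', '*'}."""
    if expr[0] in ("const", "var"):
        return expr
    op, lhs, rhs = expr
    lhs, rhs = constant_fold(lhs), constant_fold(rhs)
    if lhs[0] == "const" and rhs[0] == "const":
        # Both operands are known at compile time: evaluate now.
        value = lhs[1] + rhs[1] if op == "+" else lhs[1] * rhs[1]
        return ("const", value)
    return (op, lhs, rhs)

folded = constant_fold(("*", ("const", 3), ("+", ("const", 2), ("const", 5))))
# folded == ("const", 21)
```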
The hardware optimization information and the hardware operation information are directly related to the processing unit, and different processing units can have different hardware platform characteristic information. If a special conversion needs to be performed for a particular hardware platform, such as a digital processing unit, for example assembling the format of an operator's input/output parameters according to a protocol and assigning an ID (identification number) to the operator, a new stage of conversion may be added to add that hardware platform characteristic information. By analogy, more stages of conversion and optimization can be added according to specific requirements, so the compiling method provided by the disclosure has good extensibility and flexibility: when hardware optimization for a processing unit needs to be added, a new stage of conversion can be introduced, or the corresponding hardware platform characteristic information can be added into a namespace.
For example, the hardware platform characteristic information added at each stage is different, i.e., different hardware optimizations are performed in the conversion and optimization at different stages.
For example, in the second namespace eir in which the operator fusion intermediate representation is converted and optimized, storage level attributes are defined that describe the edges (i.e., tensor data) of the computation graph and indicate the storage locations of the operators' data in the heterogeneous device. In addition, the main structure of each intermediate conversion operator is also defined in the second namespace eir; an intermediate conversion operator refers to the corresponding operator expression of a first intermediate operator in the second namespace eir. For example, each intermediate conversion operator is derived from the class Op, the primary structure provided by MLIR for defining operators, with the operator name "mnemonic" and the operator constraints "traits" specified for the derived intermediate conversion operator.
For example, a data precision conversion module may also be defined in the second namespace eir for converting data of one precision type into data of another precision type. The data precision conversion module comprises three attributes, truncate (data truncation), scale, and offset; data of a specified precision type can be obtained by specifying these three attributes during precision conversion.
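The role of the three attributes can be sketched with a hypothetical affine-quantization formula; the disclosure does not give the exact hardware formula, so the mapping below is an assumption for illustration:

```python
def convert_precision(x, scale, offset, truncate):
    """Map a 32-bit float to a `truncate`-bit signed integer (assumed formula)."""
    q = round(x * scale + offset)
    lo, hi = -(1 << (truncate - 1)), (1 << (truncate - 1)) - 1
    return max(lo, min(hi, q))   # saturate to the truncated range

q = convert_precision(0.5, scale=127.0, offset=0.0, truncate=8)
```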
Similarly, various other modules that implement hardware optimization can be defined in the second namespace eir to add the corresponding hardware optimization information.
For specific illustration of the intermediate conversion operator and the second namespace, reference may be made to the descriptions in fig. 7A to 7C, which are described later, and details are not repeated here.
For example, the present disclosure provides a general definition and expression manner that can uniformly describe the hardware characteristics of different processing units; different processing units only need to be given different specific parameters. That is, a general expression form is defined in the second namespace, and the parameters corresponding to a processing unit are written into the second namespace when it is specifically instantiated.
For example, the whole operator fusion intermediate representation is traversed, and for each stage of conversion and optimization, the intermediate conversion operators (or first intermediate operators) in the intermediate representation output by the previous stage are converted into the expression forms defined in the corresponding second namespace, with the hardware platform characteristic information added to those expression forms. Thus, the operator fusion intermediate representation is progressively converted into the second intermediate representation; the second intermediate representation is closer to the underlying hardware, and each operator in it is assigned specific hardware parameters.
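The stage-by-stage lowering loop can be sketched as follows. This is a hypothetical model in which each stage owns a conversion table from an operator name to its expression form in that stage's namespace, merging in the hardware information the stage adds; all names are illustrative:

```python
def lower(ops, stages):
    """ops: list of {'op': name, 'hw': dict};
    stages: list of dicts mapping name -> (new_name, added_hw_info)."""
    for table in stages:                       # one pass per conversion stage
        ops = [{"op": table[o["op"]][0],
                "hw": {**o["hw"], **table[o["op"]][1]}} for o in ops]
    return ops

stage1 = {"relu": ("ehlo.relu", {"unit": "NPU"})}
stage2 = {"ehlo.relu": ("eir.relu", {"in_loc": "GLOBAL", "out_loc": "LOCAL"})}
second_ir = lower([{"op": "relu", "hw": {}}], [stage1, stage2])
```

Each pass both renames the operator into the next namespace and accumulates hardware parameters, mirroring the progressive lowering described above.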
For example, in the compiling method provided by at least one embodiment of the present disclosure, each stage of conversion and optimization is implemented through the multi-level intermediate representation architecture (MLIR). For example, as previously described, the derived classes EHLO_Dialect and EIR_Dialect are defined to construct the first namespace and the second namespace. The Dialect base class of MLIR places all intermediate representations in the same namespace (e.g., EHLO_Dialect places all intermediate representations in the first namespace MLIR::EHLO, and EIR_Dialect places all intermediate representations in the second namespace MLIR::EIR) and defines a corresponding production rule for each intermediate representation. That is, EHLO_Dialect defines how an operator is converted from its expression form in the first intermediate representation to the expression form of a reference operator in the operator fusion intermediate representation, and EIR_Dialect defines how a first intermediate operator is converted from its expression form in the operator fusion intermediate representation to its expression form in the second namespace; the filling-in of certain information is also done in this process or in a subsequent optimization pass, thereby completing the conversion and optimization between different levels. During compilation, the conversion and optimization are completed by traversing the operation information (i.e., the computation graph) operator by operator using the base class Dialect.
The present disclosure thus provides a general conversion and definition method oriented to hardware optimization: in the at least one stage of conversion and optimization, optimizations strongly related to each hardware unit are executed, the hardware operation information required for operator execution is added step by step, custom hardware optimizations are performed, and the higher-order first intermediate representation is lowered step by step toward the underlying hardware, which reduces the conversion overhead from the computation graph to hardware execution instructions and reduces the cost of constructing a domain-specific compiler.
For example, after obtaining the hardware-related intermediate representation, that is, the second intermediate representation, the compiler generates the executable instructions according to the target processing unit corresponding to each second intermediate operator in the second intermediate representation.
In step S30, according to the second intermediate representation, executable instructions corresponding to each processing unit are obtained.
For example, in some embodiments, the executable instructions corresponding to the respective processing units are derived directly from the second intermediate representation.
For example, the second intermediate representation comprises a plurality of second intermediate operators, and step S30 may comprise: determining the processing unit corresponding to each second intermediate operator according to the second intermediate representation, wherein each second intermediate operator is executed on its corresponding processing unit and the second intermediate representation comprises an execution unit identifier corresponding to each second intermediate operator; and extracting, from the second intermediate representation according to the instruction set type and rules of each processing unit, the hardware platform characteristic information of the at least one second intermediate operator executed on each processing unit, to generate the executable instructions corresponding to each processing unit.
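A minimal sketch of this grouping step, assuming operators already carry execution unit identifiers and their hardware parameters; the per-unit instruction encoding is abstracted to a string, since the real instruction set rules are hardware-specific:

```python
from collections import defaultdict

def codegen(second_ir):
    """second_ir: list of {'op': name, 'unit': id, 'hw': dict}."""
    streams = defaultdict(list)
    for op in second_ir:
        # Real codegen would extract hardware parameters according to the
        # unit's instruction set type and rules; a string stands in here.
        streams[op["unit"]].append(f"{op['op']}({op['hw']})")
    return dict(streams)

streams = codegen([
    {"op": "eir.relu", "unit": "NPU", "hw": {"in": "GLOBAL"}},
    {"op": "gemm", "unit": "GPU", "hw": {}},
])
```

One instruction stream is produced per processing unit, ready to be assembled into the final executable.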
For example, in the second intermediate representation, the execution unit identifier corresponding to each second intermediate operator is added through step S20, and the execution unit identifier marks the processing unit corresponding to each second intermediate operator.
For example, according to the instruction set type and rules of each processing unit, the relevant hardware parameters of one or more second intermediate operators executed on each processing unit are extracted from the second intermediate representation, executable instructions corresponding to each processing unit are generated, and finally, a binary executable file targeting the heterogeneous device is obtained through compilation.
For example, in other embodiments, some operators need to run on a general-purpose platform such as a CPU; the second intermediate representation can then be converted into the low-level virtual machine intermediate representation (LLVM IR), thereby connecting the second intermediate representation to the existing LLVM infrastructure, and the executable instructions can be obtained using an existing compiler.
For example, step S30 may include: converting the second intermediate representation into the low-level virtual machine intermediate representation; and generating the executable instructions corresponding to each processing unit according to the low-level virtual machine intermediate representation.
Fig. 5 is a processing flow diagram of a compiling method according to an embodiment of the disclosure.
As shown in fig. 5, first, operation information 310 in the form of a computation graph is obtained, and the operation information 310 is compiled by an accelerated linear algebra compiler to obtain HLO IR, which is used as a first intermediate representation 320. For example, the specific procedure may refer to the description of step S10.
Thereafter, two-stage conversion and optimization 330 is performed on the first intermediate representation 320, resulting in a second intermediate representation 340. For example, the two-stage conversion and optimization includes a first-stage conversion and optimization 331 and a second-stage conversion and optimization 332. As previously mentioned, the first namespace defined by EHLO_Dialect derived from Dialect is called "EHLO", which in FIG. 5 represents the name of the first namespace corresponding to the first-stage conversion and optimization 331; the second namespace defined by EIR_Dialect derived from Dialect is called "EIR", which represents the name of the second namespace corresponding to the second-stage conversion and optimization 332. For example, the specific procedure may refer to the description of step S20.
Thereafter, the executable instructions 350 for each type of processing unit are obtained based on the second intermediate representation 340. For example, the specific procedure may refer to the description of step S30.
For example, the compiling method shown in fig. 5 includes two stages of conversion and optimization; in practice, more stages of conversion and optimization may of course be included, connected after the second-stage conversion and optimization 332, with corresponding hardware platform characteristic information added as needed.
For example, the following takes an activation function RELU in the operation information (computation graph) and a heterogeneous device with the structure shown in fig. 1 as an example, and specifically describes the two-stage conversion and optimization of the activation function operator in fig. 5.
The activation function is defined as shown in equation (1) below; as an activation layer of a neural network, it defines the nonlinear output of a neuron after a linear transformation: relu(x) = max(x, 0) (1). As can be seen from equation (1), for input data x, the activation function takes the larger value of x and 0.
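A plain-Python illustration of the activation function of equation (1), applied elementwise to an input tensor:

```python
def relu(xs):
    """Elementwise relu(x) = max(x, 0)."""
    return [max(x, 0.0) for x in xs]

out = relu([-1.5, 0.0, 2.5, -0.1])
# out == [0.0, 0.0, 2.5, 0.0]
```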
Fig. 6A is a schematic diagram of a first intermediate representation provided by an embodiment of the present disclosure. It should be noted that fig. 6A only shows the portion related to the activation function, and the first intermediate representation may also include more content.
Block 400 in FIG. 6A is a computation (computation) in the first intermediate representation, equivalent to the concept of a function in C language.
As shown in fig. 6A, "ENTRY" 410 is an ENTRY of a function, and the calculation beginning with "ENTRY" 410 is similar to a main function (main function) in the C language. "ROOT" 460 is the return flag of the function, corresponding to return in the C language.
In fig. 6A, each line is an instruction constituting the calculation, and the parameter beginning with% indicates a temporary variable for storing the operator operation result.
For example, in row 420, a tensor Arg _0.1 whose dimension is 1 × 4 and whose data type is a 32-bit floating point type is defined, and Arg _0.1 is input data of the activation function.
In line 430, a constant operator is used to define a constant.2 with a value of 0, the constant type being a 32-bit floating point type.
In line 440, a tensor broadcast.3 with dimensions 1 × 4, data type 32-bit floating point, and a value of 0 for each element in the tensor is constructed using the broadcast operator broadcast.
In line 450, a maximum comparison is performed between the input data Arg _0.1 and broadcast.3 using the maximum operator maximum, and the larger of the two is assigned to the parameter maximum.4.
In line 460, the parameter maximum.4 is type-converted and assigned to the parameter tuple.5.
Thus, in the first intermediate representation, the calculation 400 completes the calculation function of the activation function RELU, and uses the operator constant, the operator broadcast, and the operator maximum.
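The three instructions can be mirrored in plain Python to show that the constant/broadcast/maximum sequence indeed computes RELU; the input values are illustrative:

```python
arg_0_1 = [-1.0, 0.5, -2.0, 3.0]            # 1x4 input tensor Arg_0.1
constant_2 = 0.0                            # operator constant: scalar 0
broadcast_3 = [constant_2] * len(arg_0_1)   # operator broadcast: 0-filled tensor
maximum_4 = [max(a, b) for a, b in zip(arg_0_1, broadcast_3)]  # operator maximum
```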
Fig. 6B is a schematic diagram illustrating a description of a first namespace according to an embodiment of the disclosure.
For example, the namespace defined by EHLO_Dialect derived from Dialect is called "EHLO", thereby determining the first namespace EHLO.
As shown in FIG. 6B, section 510 defines a derived class EHLO_Dialect derived from the Dialect class provided by MLIR, and a first namespace "MLIR::EHLO".
Section 520 defines the data types used in the first namespace. For example, EHLO_SInt represents a signed integer, EHLO_UInt an unsigned integer, EHLO_Int any integer type, and EHLO_Tensor a tensor type inherited from the different data types.
Section 530 defines a class EHLO_Op derived from the base class Op provided by MLIR; the derived class EHLO_Op is described in detail by specifying the Op name "mnemonic" and the Op constraints "traits".
Of course, it should be noted that fig. 6B shows a partial description related to the first namespace, and further contents may be defined in the first namespace, which is not limited in this disclosure.
Fig. 6C is a schematic diagram illustrating a description of a reference operator according to an embodiment of the disclosure. For example, the reference operator, denoted EHLO_ReluOp, is the operator defined in the first namespace EHLO for performing the activation function calculation.
For example, neural network processing units directly support activation function operators.
Specifically, EHLO_ReluOp (610 in fig. 6C) is derived from EHLO_Op; the specialization list 620 specifies the name of the reference operator as "Relu", and the constraint "[NoSideEffect]" specifies that the reference operator is side-effect-free. The summary of the reference operator EHLO_ReluOp is given by "summary" 630, and a more specific description of the reference operator EHLO_ReluOp is given by "description" 640. The parameters and operation results required by the reference operator EHLO_ReluOp are described by "arguments" 650 and "results" 660. "ins" and "outs" in "arguments" 650 and "results" 660 represent inputs and outputs, respectively; the types of the input and output parameters are the tensor type EHLO_Tensor defined in the first namespace.
Having defined the reference operator EHLO_ReluOp in the first namespace EHLO, the 3 operators in the calculation 400 shown in fig. 6A may be converted into the expression form of the reference operator EHLO_ReluOp.
Fig. 6D is a schematic diagram of an operator fusion intermediate representation provided by an embodiment of the present disclosure.
It should be noted that fig. 6D only shows the part related to the activation function, and the operator fusion intermediate representation may also include more contents.
As shown in FIG. 6D, applying the definitions of the portions of FIGS. 6B and 6C described above, the computation 400 in FIG. 6A is converted into an expression 700 in the operator fusion intermediate representation. The basic operators (constant, broadcast, and maximum) in the first intermediate representation are converted into an activation function operator that is directly supported by the neural network processing unit.
In this description, the expression form of the reference operator is "ehlo.Relu" in line 710; the input and output parameters required by the reference operator are both tensors with dimensions 1x1x4x4 and a 32-bit floating-point element type.
Line 720 is used to complete the data type conversion, assign the activate function calculation result to parameter 1, and return in line 730.
Thus, through the conversions in figs. 6A-6D, the first stage of conversion and optimization is completed, converting the three atomic-granularity operators (constant, broadcast, maximum) in the first intermediate representation into the expression form of the customized reference operator in the first namespace.
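The first-stage fusion can be sketched as a peephole rewrite that recognizes the three-operator pattern and replaces it with a single reference operator; the list-of-tuples IR encoding below is a hypothetical stand-in for the MLIR rewrite machinery:

```python
def fuse_relu(ops):
    """Rewrite constant(0) -> broadcast -> maximum into a single relu op."""
    out, i = [], 0
    while i < len(ops):
        window = [o[0] for o in ops[i:i + 3]]
        if window == ["constant", "broadcast", "maximum"] and ops[i][1] == 0:
            out.append(("ehlo.relu",))   # fused reference operator
            i += 3
        else:
            out.append(ops[i])
            i += 1
    return out

fused = fuse_relu([("constant", 0), ("broadcast",), ("maximum",)])
# fused == [("ehlo.relu",)]
```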
For example, after traversing all operation operators in the first intermediate representation and performing conversion by referring to the above process, an operator fused intermediate representation is obtained.
Fig. 7A is a schematic diagram illustrating a description of a second namespace according to an embodiment of the disclosure.
For example, the namespace defined by EIR_Dialect derived from Dialect is called "EIR", thereby determining the second namespace eir.
As shown in FIG. 7A, section 810 defines a derived class EIR_Dialect derived from the Dialect class provided by MLIR.
Section 820 defines the Tensor type Tensor and its important attribute: the storage level StorageLevel. The storage level describes the storage location of a Tensor in the heterogeneous device and is divided into 4 types:
0: ON_THE_FLY: the tensor is directly computed and transferred among the modules within the processing unit, and is not temporarily stored in any storage area;
1: GLOBAL: the internal storage area of the neural network processing unit, digital processing unit, or central processing unit;
2: LOCAL: the first memory 140 in the heterogeneous device 100;
3: SYSTEM: the second memory 150 in the heterogeneous device 100.
The Tensor type Tensor provided in the second namespace eir describes the storage location of each Tensor, facilitating the addition of the instructions needed to access memory when the compiler generates the hardware executable.
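The four storage levels described above can be modeled as an enumeration. This is a Python sketch for illustration only; the actual StorageLevel attribute is defined in TableGen within the second namespace eir:

```python
from enum import IntEnum

class StorageLevel(IntEnum):
    ON_THE_FLY = 0   # streamed between modules in the unit, never stored
    GLOBAL = 1       # internal storage area of the NPU/DPU/CPU
    LOCAL = 2        # first memory 140 of heterogeneous device 100
    SYSTEM = 3       # second memory 150 of heterogeneous device 100

level = StorageLevel.LOCAL   # e.g. an operator result kept in first memory 140
```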
Section 830 defines a tensor type AnyLevelTensor that can be stored in any storage location.
Fig. 7B is a schematic diagram illustrating a description of a second namespace according to an embodiment of the disclosure.
For example, as shown in FIG. 7B, section 910 defines class BaseOp, which is the base class for all operators in the second namespace eir, and defines the generic properties of Op.
For example, since the neural network processing unit 120 in the heterogeneous device 100 adopted in this embodiment is provided with a data precision conversion module that allows data of one precision type to be converted into another precision type, a data precision conversion module ENPU_PrecisionCvtAttrs is defined in the second namespace eir. As shown in fig. 7B, at 920, the data precision conversion module ENPU_PrecisionCvtAttrs includes the three attributes truncate (data truncation), scale, and offset.
For example, section 930 is the description of the activation function operator ActivationOp directly supported by the neural network processing unit 120. For example, when the neural network processing unit 120 executes an activation function operator, the precision of the operator's input tensor and output tensor must be specified, so the activation function operator ActivationOp contains two attributes of the ENPU_PrecisionCvtAttrs type, "in_cvt" and "out_cvt".
For example, section 940 describes ReluOp, the corresponding expression form of the reference operator EHLO_ReluOp in the second namespace eir. ReluOp is derived from ActivationOp and inherits all of its definitions, so ReluOp receives a tensor of the AnyLevelTensor type and returns a tensor of the AnyLevelTensor type. Moreover, ReluOp contains the two ENPU_PrecisionCvtAttrs-type attributes "in_cvt" and "out_cvt".
For example, fig. 7C is a schematic diagram of a second intermediate representation provided by an embodiment of the present disclosure. For example, in the second intermediate representation, the reference operator EHLO_ReluOp is converted into the Relu operator, whose expression form is shown at 1000 in fig. 7C. For example, the Relu operator is also a second intermediate operator.
As shown in FIG. 7C, applying the definitions of the aforementioned portions of FIG. 7A and FIG. 7B, the expression 700 in the operator fused intermediate representation in FIG. 6D is converted to an expression form 1000 in the second namespace eir.
For example, "arg0" represents the input parameter of the Relu operator. Section 1010 represents the type of the Relu operator's input tensor; for example, the input tensor has dimensions 1x1x4x4 and a 32-bit floating-point element type, and GLOBAL in the input tensor type indicates that the operational data of the Relu operator needs to be read from the internal storage area of the neural network processing unit. Section 1020 represents the type of the Relu operator's output tensor; e.g., the output tensor has dimensions 1x1x4x4 and a 32-bit floating-point element type, and LOCAL in the output tensor type indicates that the operation result of the Relu operator will remain in the first memory 140.
In this description, the expression form of the Relu operator is "eir.relu" in line 1030; "arg0" represents the input parameter of the Relu operator, and "%0" represents the output parameter of the Relu operator.
Section 1040 describes the data precision of the input tensor and the output tensor when the Relu operator operates; the data precision is determined by the three parameters offset, scale, and truncate.
Section 1050 describes the operation result "%0" returned by the Relu operator.
Likewise, the second stage of conversion and optimization is accomplished through the conversions in FIGS. 7A-7C, converting the reference operator in the operator fusion intermediate representation into its corresponding expression form in the second namespace eir. All reference operators in the operator fusion intermediate representation are traversed and converted with reference to the above process to obtain the second intermediate representation.
From the description of figs. 5-7C, it can be seen that for the activation function in the computation graph, in the present embodiment, first, the three basic operators in the first intermediate representation that implement the activation function are fused, according to the hardware characteristics of the neural network processing unit, into the reference operator EHLO_ReluOp directly supported by the neural network processing unit in the first namespace EHLO; then, in the second namespace eir, combining hardware characteristics of the neural network processing unit such as data storage location and precision conversion, the reference operator EHLO_ReluOp is converted into the Relu operator, giving the operator a more specific expression.
The above discussion takes only the activation function as an example, describing the process of gradually converting the activation function in the computation graph from the first intermediate representation to the operator fusion intermediate representation and then to the second intermediate representation. From the above, similar conversion, optimization, and definition methods may be employed for converting the first intermediate representation for all processing units, e.g., digital processing units, central processing units, etc.
Corresponding to the compiling method, at least one embodiment of the disclosure further provides a compiling apparatus. Fig. 8 is a schematic block diagram of a compiling apparatus according to at least one embodiment of the present disclosure.
For example, as shown in fig. 8, the compiling apparatus 800 includes: an acquisition unit 801, a conversion unit 802, and a generation unit 803.
For example, the compiling apparatus 800 is configured to compile a graph-based first intermediate representation into executable instructions corresponding to each type of processing unit.
For example, the compiling apparatus 800 is suitable for heterogeneous apparatuses, for example, heterogeneous apparatuses include various different kinds of processing units. For specific definition of the heterogeneous device, reference may be made to the foregoing compiling method, which is not described herein again.
For example, the obtaining unit 801 is configured to obtain a first intermediate representation corresponding to the object to be compiled, where the first intermediate representation is an intermediate representation based on a graph.
For example, the conversion unit 802 is configured to perform multi-level conversion and optimization on the first intermediate representation, resulting in a second intermediate representation. For example, the multi-level conversion and optimization includes operator fusion of the first intermediate representation and the addition of hardware platform characteristic information associated with the various processing units. For example, the multi-level conversion and optimization also includes conventional optimization and conversion such as common subexpression elimination, dead code elimination, tensor shape inference, and so on.
For example, the generation unit 803 is configured to derive the executable instructions corresponding to each processing unit according to the second intermediate representation. For example, the executable instructions are configured to run on the heterogeneous device.
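The generation step can be viewed as selecting a backend per processing-unit type and emitting instructions from the second intermediate representation. The sketch below is a hypothetical illustration; the backend names, instruction strings, and the per-node `target` field are assumptions for this example only.

```python
# Hypothetical per-target backends emitting instruction strings.
def codegen_cpu(node):
    return f"cpu.{node['op']}"

def codegen_dsp(node):
    return f"dsp.{node['op']}"

BACKENDS = {"cpu": codegen_cpu, "dsp": codegen_dsp}

def generate(second_ir):
    """Emit one instruction per IR node, using the backend that matches
    the node's target processing unit."""
    return [BACKENDS[node["target"]](node) for node in second_ir]

ir = [{"op": "conv_relu", "target": "dsp"},
      {"op": "softmax", "target": "cpu"}]
print(generate(ir))  # -> ['dsp.conv_relu', 'cpu.softmax']
```

Because each node already carries its hardware platform information from the conversion step, the generation step reduces to a per-node dispatch to the matching backend.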
For example, the acquisition unit 801, the conversion unit 802, and the generation unit 803 may include code and programs stored in a memory; a processor may execute the code and programs to implement some or all of the functions of the acquisition unit 801, the conversion unit 802, and the generation unit 803 described above. For example, the acquisition unit 801, the conversion unit 802, and the generation unit 803 may be dedicated hardware devices for implementing some or all of the functions described above. For example, the acquisition unit 801, the conversion unit 802, and the generation unit 803 may be one circuit board or a combination of a plurality of circuit boards for realizing the functions described above. In the embodiments of the present disclosure, the one circuit board or the combination of a plurality of circuit boards may include: (1) one or more processors; (2) one or more non-transitory memories connected to the processors; and (3) firmware stored in the memories and executable by the processors.
It should be noted that the acquisition unit 801 is configured to implement step S10 shown in fig. 3, the conversion unit 802 is configured to implement step S20 shown in fig. 3, and the generation unit 803 is configured to implement step S30 shown in fig. 3. Thus, for the specific description of the acquisition unit 801, reference may be made to the description of step S10 shown in fig. 3 in the embodiment of the compiling method; for the specific description of the conversion unit 802, reference may be made to the description of step S20; and for the specific description of the generation unit 803, reference may be made to the description of step S30. In addition, the compiling apparatus can achieve technical effects similar to those of the compiling method described above, and details are not repeated here.
At least one embodiment of the present disclosure further provides an electronic device, and fig. 9 is a schematic block diagram of the electronic device provided in at least one embodiment of the present disclosure.
For example, as shown in fig. 9, the electronic device includes a processor 901, a communication interface 902, a memory 903, and a communication bus 904. The processor 901, the communication interface 902, and the memory 903 communicate with one another via the communication bus 904; alternatively, components such as the processor 901, the communication interface 902, and the memory 903 may communicate with one another via a network connection. The type and function of the network are not limited in the present disclosure.
For example, the memory 903 is configured to store computer-executable instructions non-transitorily. The processor 901 is configured to execute the computer-executable instructions; when executed by the processor 901, the computer-executable instructions implement the compiling method according to any of the above embodiments. For the specific implementation and related explanations of each step of the compiling method, reference may be made to the above embodiments of the compiling method, which are not repeated here.
For example, the processor 901 executes the program stored in the memory 903 to implement the compiling method in the same manner as described in the foregoing embodiments of the compiling method, which is not repeated here.
For example, the communication bus 904 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, among others. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
For example, the communication interface 902 is configured to enable communication between the electronic device and other devices.
For example, the processor 901 and the memory 903 may be located on a server side (or cloud side).
For example, the processor 901 may control other components in the electronic device to perform desired functions. The processor 901 may be a Central Processing Unit (CPU), a Network Processor (NP), etc., and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The Central Processing Unit (CPU) may be of an X86 or ARM architecture, etc.
For example, the memory 903 may comprise any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory, and the like. Non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable Compact Disc Read Only Memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-executable instructions may be stored on the computer-readable storage medium, and the processor 901 may execute the computer-executable instructions to implement various functions of the electronic device. Various application programs, various data, and the like may also be stored in the storage medium.
For example, for the detailed description of the process of performing compilation by the electronic device, reference may be made to the related description in the embodiment of the compilation method, and repeated descriptions are omitted here.
Fig. 10 is a schematic diagram of a non-transitory computer-readable storage medium according to at least one embodiment of the present disclosure. For example, as shown in fig. 10, one or more computer-executable instructions 1002 may be stored non-transitorily on a storage medium 1001. For example, the computer-executable instructions 1002, when executed by a processor, may perform one or more steps of the compiling method described above.
For example, the storage medium 1001 may be applied to the electronic device and/or the compiling apparatus 800 described above. For example, the storage medium 1001 may include the memory 903 in the electronic device.
For example, for the description of the storage medium 1001, reference may be made to the description of the memory in the embodiment of the electronic device, and repeated descriptions are omitted.
For the present disclosure, there are also the following points to be explained:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure, and other structures may refer to common designs.
(2) For clarity, thicknesses and dimensions of layers or structures may be exaggerated in the drawings used to describe the embodiments of the present disclosure. It will be understood that when an element such as a layer, film, region, or substrate is referred to as being "on" or "under" another element, it can be "directly on" or "directly under" the other element, or intervening elements may be present.
(3) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.