WO2022253075A1 - Compilation method and related apparatus - Google Patents

Compilation method and related apparatus

Info

Publication number
WO2022253075A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
matrix
statement
interface
memory
Prior art date
Application number
PCT/CN2022/094998
Other languages
English (en)
French (fr)
Inventor
李姗妮
聂旺
刘超
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022253075A1 publication Critical patent/WO2022253075A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/447 Target code generation
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing

Definitions

  • the present application relates to the field of computer technology, in particular to a compiling method and related devices.
  • Artificial intelligence is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence is the branch of computer science that attempts to understand the nature of intelligence and to produce a new class of intelligent machines that respond in ways similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • the Warp-level Matrix Multiply and Accumulate (WMMA) interface is called in the generated execution code, so that the matrix multiplication operation is performed through the WMMA interface.
  • the present application provides a compiling method: in the process of compiling the operator description, an interface containing specific instruction codes is called in the generated target code, so as to specify that instances executed in parallel in the same stage correspond to data in different banks. This ensures that these instances will not access data in the same bank at the same time, thereby avoiding bank conflicts in shared memory and effectively improving the efficiency of the operation.
  • the first aspect of the present application provides a compiling method, which can be applied to a terminal with a compiling function.
  • the method includes: the terminal obtains an operator description, where the operator description includes a description of a matrix multiplication operation; that is, the operator description defines the matrix multiplication operation that needs to be performed during code execution and the data required to perform the matrix multiplication operation.
  • the terminal parses the operator description to obtain the target code.
  • the terminal may first parse the operator description to obtain an intermediate representation. The role of the intermediate representation is to make the structure of the operator description to be compiled logically simpler and clearer, so that the optimization of the final target code is easier to realize. Then, the terminal parses the intermediate representation to generate the object code.
  • the target code is the code generated by the terminal after compiling the operator description.
  • the operator description is written based on a high-level language
  • the object code is a language between a high-level language and a machine language.
  • Object code can be further converted into executable binary machine code.
  • the target code calls a first interface
  • the first interface is used to indicate a plurality of first mapping relationships
  • the first mapping relationship is a mapping relationship between an instance and first data
  • the instance is used to process the first data corresponding to the instance
  • the first data is data involved in the matrix multiplication operation, wherein multiple instances executed in parallel in the same stage respectively have the first mapping relationship with first data located in different memory banks.
  • multiple instances that collectively perform a matrix multiplication operation are divided into stages to process the data participating in the matrix multiplication operation.
  • the first data corresponding to each of the multiple instances is located in a different bank; that is, there is no first data that simultaneously has the first mapping relationship with any two or more of the multiple instances.
  • one instance may correspond to the first data in multiple banks.
  • the first data in a bank may also correspond to multiple instances that are not executed in parallel in the same stage.
  • instances can be threads or hyperthreads.
  • the input information, output information, and calculation information of an operator can be understood as a description of an operator, referred to as an operator description.
  • the operator description may also include other operator-related information.
  • the input information may include the number of matrices involved in the operation, the data size of the matrices, the data type of the matrices, and the data arrangement of the matrices
  • the output information may include the number of matrices output, the data size of the matrices, the data type of the matrices, and the data arrangement of the matrices
  • the calculation information includes the type of operation, such as matrix multiplication.
  • under the instruction of the first interface, an instance only processes the first data that has a mapping relationship with it. That is to say, multiple instances executed in parallel in the same stage are respectively used to process first data located in different banks. In this way, instances executed in the same stage will not access first data in the same bank, thereby effectively avoiding bank conflicts in the shared memory and ensuring the efficiency of the operation.
  • the manner in which the target code calls the first interface may be: the target code includes a statement calling a library file and a statement calling an interface in the library file (ie, the above-mentioned first interface). Based on the statement of calling the library file, the calling of the library file can be realized. After calling the library file, the interface in the library file can be called based on the statement calling the interface in the library file.
  • the target code may also invoke the first interface by means of a static link library, a dynamic link library, or an inline library, as illustrated in the sketch below.
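  • For illustration only, the following is a minimal sketch (with hypothetical names such as first_interface::mma_16x16x16) of how generated target code might contain a statement calling a library file and a statement calling the interface in that library file; the interface body shown is a scalar placeholder rather than the actual instruction code.

      #include <cuda_fp16.h>

      // Hypothetical "library file": the first interface exposed as a device function.
      namespace first_interface {
      __device__ void mma_16x16x16(float* c, const half* a, const half* b) {
          // In real target code this body would be PTX instruction code (asm(...));
          // a scalar multiply-accumulate stands in so the sketch stays self-contained.
          for (int k = 0; k < 16; ++k) {
              c[0] += __half2float(a[k]) * __half2float(b[k]);
          }
      }
      }  // namespace first_interface

      // Generated target code: the #include above corresponds to the statement calling
      // the library file, and the call below to the statement calling the first interface.
      __global__ void generated_kernel(const half* A, const half* B, float* C) {
          first_interface::mma_16x16x16(C, A, B);
      }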
  • the terminal parses the operator description to generate the object code, including: parsing the operator description to obtain an intermediate representation; and replacing the first statement in the intermediate representation with a first interface statement to obtain the target code, where the first statement is used to instruct execution of the matrix multiplication operation, the first interface statement is used to call the first interface, and the first interface is used to execute the matrix multiplication operation.
  • the first statement indicating to perform matrix multiplication in the intermediate representation may be an expression of matrix multiplication.
  • the terminal can determine the matrix multiplication operation in the intermediate representation and information about the matrix multiplication operation by matching the expression of the matrix multiplication operation in the intermediate representation. Then, the terminal replaces the expression of the matrix multiplication operation with the first interface statement calling the first interface, so as to realize the generation of the object code calling the first interface.
  • the first interface statement includes matrix multiplication operation information, so that the matrix multiplication operation can be realized based on the matrix multiplication operation information when the target code is executed.
  • the first interface includes a Parallel Thread eXecution (PTX) instruction code.
  • the information of the matrix multiplication operation includes the first data, the arrangement of the first data (for example, row_major or col_major), and the data type of the first data (for example, float16, float32, or another data type).
  • the terminal parses the operator description to generate the target code, including: the terminal parses the operator description to obtain an intermediate representation; the terminal replaces the second statement in the intermediate representation with a second interface statement to obtain the target code.
  • the second statement is used to indicate moving the first data to the local memory or moving the first data out of the local memory.
  • the first data involved in the matrix multiplication usually needs to be moved to the local memory to implement the subsequent matrix multiplication.
  • the second interface statement is used to call the first interface, and the first interface is also used to move the first data.
  • based on the first interface, it is possible to move the first data from the global memory or the shared memory to the local memory, so as to perform the matrix multiplication based on the first data in the local memory; based on the first interface, it is also possible, after the matrix multiplication operation is completed, to move the first data in the local memory to the global memory or the shared memory, so as to free up space in the local memory.
  • the first interface is called during the process of moving data.
  • the first interface further specifies the mapping relationship between each instance and the data, which ensures that multiple instances executed in the same stage will not access data in the same bank at the same time during the data movement process. This avoids bank conflicts, ensures the efficiency of data movement, and improves the efficiency of the operation.
  • the terminal parses the operator description to generate the target code, including: the terminal parses the operator description to obtain an intermediate representation; the terminal replaces the third statement in the intermediate representation with a third interface statement to obtain the target code, where the third statement is used to indicate execution of a fusion operation, the input of the fusion operation includes the output of the matrix multiplication operation, the third interface statement is used to call the second interface, and the second interface is used to execute the fusion operation.
  • the fusion operation refers to an operation that combines multiple operators (for example, an operator of a matrix multiplication operation), that is, it can implement a combination operation of an output of a matrix multiplication operation and other operators.
  • the fusion operation may be an operation that performs an element-by-element operation based on an output of a matrix multiplication operation.
  • the fusion operation may include, for example, at least one of the following operations: addition, subtraction, multiplication, division, division with the result rounded down (floor division), modulo operation, and modulo operation with the result rounded down (floor modulo).
  • taking addition as an example, the process of the fusion operation is: the elements at the same position in the two matrices participating in the fusion operation are added one by one, and the result of the fusion operation is finally obtained.
  • the fusion of multiple operators can be realized, the operation efficiency of the terminal is improved, and the resource utilization rate of the terminal is improved.
  • the first interface called by the target code is also used to indicate obtaining the logical storage structure of the first data and the data type of the first data, and determining the size of the data loading pointer according to the logical storage structure and the data type.
  • the first data refers to the data involved in the matrix multiplication operation.
  • the first data may actually be a matrix data, and the matrix data includes multiple elements.
  • the logical storage structure of the first data refers to the logical structural form of the first data stored in the memory.
  • the data type of the first data is used to indicate the data amount of a single element in the first data (the first data being matrix data).
  • the size of the data loading pointer is used to indicate the amount of data loaded by an instance at a time. For example, when the size of the data loading pointer is 128 bits, it indicates that the amount of data loaded by the instance in a single load is 128 bits.
  • the method further includes:
  • based on the operator description, the terminal generates a parameter for matrix partitioning, where the parameter for matrix partitioning is used to indicate the manner of matrix partitioning. Then, the terminal performs a partitioning operation on the target matrix according to the parameter for matrix partitioning to obtain a division result of the target matrix, where the target matrix is a matrix participating in the matrix multiplication operation. Finally, the terminal adds a data movement statement to the intermediate representation according to the division result of the target matrix, and the data movement statement is used to instruct moving the data of the target matrix in memory.
  • the data of the target matrix is the data in the target matrix participating in the matrix multiplication operation.
  • the data movement statement may be used to indicate to move the data of the target matrix from the global memory to the shared memory, or to move the data of the target matrix from the shared memory to the local memory.
  • a multi-level memory promotion mechanism is designed in this embodiment: according to the size of the matrix blocks, the data is promoted from the global memory to the shared memory, which has a high data read/write speed, and then promoted to the local memory, which has an even higher data read/write speed.
  • the terminal moves the data corresponding to the outer matrix from the global memory to the shared memory in advance, and then moves the data corresponding to the inner matrix within the outer matrix from the shared memory to the local memory, so as to improve the efficiency of data loading.
  • the terminal may divide the matrix multiplication operation into multiple parts for execution during the matrix multiplication operation, so as to realize multi-instance parallel execution of the matrix multiplication operation, thereby improving operation efficiency.
  • the division result of the target matrix may include a first matrix, and the first matrix includes the second matrix.
  • the terminal adding the data movement statements to the target code may specifically include: the terminal adds the first data movement statement after the statement indicating division of the first matrix, and adds the second data movement statement after the statement indicating division of the second matrix.
  • the first data movement statement is used to indicate that the data of the first matrix is moved from the global memory to the shared memory
  • the second data movement statement is used to indicate that the data of the second matrix is moved from the shared memory to local memory.
  • the capacity of the global memory is greater than that of the shared memory, and the capacity of the shared memory is greater than that of the local memory; the data read/write speed of the local memory is greater than that of the shared memory, and the read/write speed of the shared memory is greater than that of the global memory. Moving data from the global memory to the shared memory and from the shared memory to the local memory can therefore effectively improve the data access speed, as illustrated in the sketch below.
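  • The following is a minimal CUDA sketch of this multi-level memory promotion, assuming a 16x16 block size, a thread block of 16x16 instances, and matrix dimensions that are multiples of 16; it illustrates the data flow global memory -> shared memory -> local (register) memory, not the actual generated code.

      #define TILE 16  // assumed block size of the inner (second) matrix

      __global__ void tiled_matmul(const float* A, const float* B, float* C, int K, int N) {
          __shared__ float a_tile[TILE][TILE];   // first-matrix data promoted to shared memory
          __shared__ float b_tile[TILE][TILE];

          int row = threadIdx.y, col = threadIdx.x;
          float acc = 0.0f;                      // accumulator kept in local (register) memory

          for (int k0 = 0; k0 < K; k0 += TILE) {
              // first data movement: global memory -> shared memory
              a_tile[row][col] = A[(blockIdx.y * TILE + row) * K + k0 + col];
              b_tile[row][col] = B[(k0 + row) * N + blockIdx.x * TILE + col];
              __syncthreads();

              // second data movement: shared memory -> local memory, then compute
              for (int k = 0; k < TILE; ++k) {
                  acc += a_tile[row][k] * b_tile[k][col];
              }
              __syncthreads();
          }
          C[(blockIdx.y * TILE + row) * N + blockIdx.x * TILE + col] = acc;
      }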
  • the target code further includes a second mapping relationship
  • the second mapping relationship is a mapping relationship between an instance and a data movement statement
  • the second mapping relationship is used to indicate the instance that executes the data movement statement, and the second mapping relationship is established based on the data of the instance and the data structure of the divided matrix.
  • the terminal specifies the mapping relationship between the instance and the data movement statement in the process of generating the target code, thereby ensuring a reasonable match between the instance and the data movement statement, ensuring the locality of data access, and improving the efficiency of the target code.
  • the method further includes: the terminal determines the number of warps according to the total number of instances participating in the matrix multiplication operation, where each warp includes the same number of instances.
  • each warp includes the same number of instances.
  • every 32 instances form a warp
  • a warp is the basic unit of scheduling and running.
  • the terminal establishes, in the intermediate representation, a third mapping relationship between warps and axes in the target matrix based on the number of warps and the data structure of the target matrix, and the third mapping relationship is used to indicate the warp that executes the operation on an axis of the matrix.
  • since the first interface with PTX instruction codes is called in the target code, and the uniform operation level of the first interface is the warp level, establishing a warp-level mapping for the calculation statements ensures that the mapping relationship between the multiple instances and the calculation statements is more reasonable, further optimizing the efficiency of the operation. A sketch of such a warp/axis mapping follows.
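  • A minimal sketch of a warp/axis mapping is given below; warps_along_m is an assumed tiling parameter, and the 32-instance warp size follows the description above.

      // Derive the warp count from the total number of instances and bind each warp
      // to a position on the m and n axes of the target matrix.
      __device__ void warp_axis_mapping(int warps_along_m, int* warp_m, int* warp_n) {
          const int warp_size = 32;                 // every 32 instances form a warp
          int warp_id = threadIdx.x / warp_size;    // which warp this instance belongs to
          *warp_m = warp_id % warps_along_m;        // warp position on the m axis
          *warp_n = warp_id / warps_along_m;        // warp position on the n axis
      }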
  • the second aspect of the present application provides a compilation device, including an acquisition unit and a processing unit. The acquisition unit is used to acquire an operator description of a neural network model, where the operator description includes a description of a matrix multiplication operation; the processing unit is configured to parse the operator description and generate the object code;
  • the target code calls a first interface
  • the first interface is used to indicate a plurality of first mapping relationships
  • the first mapping relationship is a mapping relationship between an instance and first data
  • the instance is used to process the first data corresponding to the instance
  • the first data is data involved in the matrix multiplication operation, wherein multiple instances executed in parallel in the same stage respectively have the first mapping relationship with first data located in different memory banks.
  • the processing unit is further configured to: parse the operator description to obtain an intermediate representation; replace the first statement in the intermediate representation with the first interface statement to obtain the target code, the first statement is used to instruct to execute the matrix multiplication operation, the first interface statement is used to call the first interface, and the first interface is used to execute the matrix multiplication operation.
  • the first interface includes a PTX instruction code.
  • the processing unit is further configured to: parse the operator description to obtain an intermediate representation; replace the second statement in the intermediate representation with a second interface statement to obtain the target code, the second statement is used to indicate to move the first data, the second interface statement is used to call the first interface, and the first interface is used to move the first data.
  • the processing unit is further configured to: parse the operator description to obtain an intermediate representation; replace the third statement in the intermediate representation with a third interface statement to obtain the target code, the third statement is used to indicate the execution of the fusion operation, the input of the fusion operation includes the output of the matrix multiplication operation, the third interface statement is used to call the second interface, and the second interface is used to execute fusion operation.
  • the fusion operation includes at least one of the following operations: addition, subtraction, multiplication, division, division with the result rounded down (floor division), modulo operation, and modulo operation with the result rounded down (floor modulo).
  • the first interface is also used to indicate obtaining the logical storage structure of the first data and the data type of the first data, and determining, according to the logical storage structure and the data type, the size of the data loading pointer, where the size of the data loading pointer is used to indicate the amount of data loaded by an instance at a time.
  • the processing unit is further configured to: generate parameters for matrix partitioning based on the operator description; perform a partitioning operation on the target matrix according to the parameters for matrix partitioning to obtain a division result of the target matrix, where the target matrix is a matrix participating in the matrix multiplication operation; and add a data movement statement to the target code according to the division result of the target matrix, where the data movement statement is used to instruct moving the data of the target matrix in memory.
  • the division result of the target matrix includes a first matrix, and the first matrix includes a second matrix;
  • the processing unit is further configured to: add the first data movement statement after the statement in the target code indicating division of the first matrix, and add the second data movement statement after the statement in the target code indicating division of the second matrix;
  • the first data movement statement is used to indicate that the data of the first matrix is moved from the global memory to the shared memory
  • the second data movement statement is used to indicate that the data of the second matrix is moved from the shared memory to local memory
  • the target code further includes a second mapping relationship
  • the second mapping relationship is a mapping relationship between an instance and a data movement statement
  • the second mapping relationship is used to indicate the instance that executes the data movement statement, and the second mapping relationship is established based on the data of the instance and the data structure of the divided matrix.
  • the target code further includes a third mapping relationship between a warp and an axis in the target matrix, where the third mapping relationship is used to indicate the warp that executes the operation on the axis, the number of warps is determined based on the total number of instances participating in the matrix multiplication operation, each warp includes the same number of instances, and the target matrix is a matrix participating in the matrix multiplication operation.
  • the third aspect of the present application provides a compiling device, which may include a processor, the processor is coupled to a memory, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method described in the above first aspect is implemented.
  • details may refer to the first aspect, which will not be repeated here.
  • a fourth aspect of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when it is run on a computer, the computer is made to execute the method described in the first aspect above.
  • a fifth aspect of the present application provides a circuit system, where the circuit system includes a processing circuit configured to execute the method described in the first aspect above.
  • the sixth aspect of the present application provides a computer program product, which, when run on a computer, causes the computer to execute the method described in the first aspect above.
  • the seventh aspect of the present application provides a chip system
  • the chip system includes a processor, configured to support the server or the device to implement the functions involved in the first aspect above, for example, sending or processing the data and/or information involved in the above method.
  • the chip system further includes a memory, and the memory is configured to store necessary program instructions and data of the server or the communication device.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic diagram of two data arrangement methods provided by the embodiment of the present application.
  • FIG. 2 is a schematic flow diagram of the matrix multiplication operation of the TVM-enabled TensorCore provided by the embodiment of the present application;
  • Fig. 3 is an architecture diagram of a MindSpore graph computing fusion feature provided by the embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a compiling method 400 provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a matrix multiplication operation provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of multi-instance data processing provided by the embodiment of the present application.
  • FIG. 7 is a schematic diagram of a calculation flow of a non-fusion operator provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a calculation flow of a fusion operator provided in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a logical storage structure of first data provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a compiling method 1000 provided in an embodiment of the present application.
  • FIG. 11 is a schematic flow diagram of scheduling and optimizing an intermediate representation based on a polyhedron model provided by an embodiment of the present application.
  • Figure 12 is a schematic diagram of the data access manner in the existing process;
  • FIG. 13 is a schematic diagram of data access arrangement based on Bank conflict avoidance Pass optimization provided by the embodiment of the present application.
  • Fig. 14 is a schematic diagram of a calculation sequence in the related art
  • FIG. 15 is a schematic diagram of a calculation sequence after data pipeline Pass optimization provided by the embodiment of the present application.
  • Figure 16 is a pseudo-code logic after adding a data pipeline Pass provided by the embodiment of the present application.
  • Fig. 17 is a schematic diagram of a PTX inline library provided by the embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of a compiling device provided by an embodiment of the present application.
  • Fig. 19 is a schematic structural diagram of the execution device provided by the embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Compilation: refers to the process of using a compiler to generate object code from a source program written in a source language.
  • Object code is a language between high-level language and machine language. Object code can be further converted into executable binary machine code. In simple terms, compilation is to convert a source program written in a high-level language into an object code that is closer to machine language.
  • Intermediate code: an internal representation of the source program, also known as the Intermediate Representation (IR).
  • the role of the intermediate representation is to make the structure of the compiler more logically clear, especially to make the optimization of the object code easier to implement.
  • the complexity of the intermediate representation is between source programming language and machine language.
  • Code optimization refers to performing multiple equivalent transformations on the program, so that starting from the transformed program, more effective target code can be generated.
  • the so-called equivalence means that the running result of the program does not change.
  • the so-called effective mainly refers to the short running time of the object code and the small storage space occupied. This transformation is called optimization.
  • Optimizing Pass is an important part of the compilation framework. The optimization pass analyzes and modifies the intermediate representation. In the process of code optimization, multiple optimization passes analyze and modify the intermediate representation, and each pass completes specific optimization work.
  • Warp: in a graphics processing unit (GPU), 32 consecutive threads form a warp, which is the basic unit of GPU scheduling and operation. Above the warp there are also the grid (Grid) and the instance block (Block). Generally speaking, a Grid includes multiple Blocks, a Block includes multiple Warps, and a Warp includes 32 threads.
  • Shared memory refers to allowing two unrelated instances to access the same logical memory. Shared memory is a very efficient way of sharing and passing data between two running instances.
  • the memory shared between different instances is usually the same piece of physical memory. Instances can connect the same piece of physical memory to their own address space, and all instances can access addresses in the shared memory. If an instance writes to shared memory, the changes immediately affect any other instance that has access to the same shared memory.
  • Bank: the shared memory is divided into 32 banks of equal size.
  • the bandwidth of each Bank can be 32 bits (bit) or 64 bits. Taking the bandwidth of each Bank as 32bit as an example, continuous 32bit data is stored in one Bank, and the next continuous 32bit data is stored in the next bank.
  • Bank conflict: when different instances access the same bank at the same time, a bank conflict occurs. If a bank conflict occurs when using shared memory, the access requests of the instances become serial, that is, queued. Bank conflicts greatly reduce memory bandwidth, resulting in a significant degradation of operating performance.
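  • As an illustration (assuming 32 banks of 32-bit width), the bank of a shared-memory word and a simple conflict case can be expressed as follows.

      // Bank of a 32-bit shared-memory word at a given byte offset.
      __device__ int bank_of(int byte_offset) {
          return (byte_offset / 4) % 32;   // 4 bytes per word, 32 banks
      }
      // Example with "float s[64]": instance t reading s[t] touches bank t % 32, so the
      // 32 instances of a warp hit 32 different banks (no conflict); if every instance
      // instead read s[(t % 2) * 32], all accesses would fall into bank 0 (a conflict).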
  • Operator fusion: by analyzing and optimizing the existing graph logic, the original operator-by-operator calculation logic is reorganized to form a fused sub-graph logic. Operator fusion can greatly reduce the overhead of gaps between operator executions while improving device resource utilization.
  • Matrix multiplication operation: for matrix A and matrix B, matrix C is obtained after matrix A and matrix B perform a matrix multiplication operation.
  • the element in row m and column n of matrix C is equal to the sum of the products of the elements in row m of matrix A and the corresponding elements in column n of matrix B.
  • let A be an m × p matrix
  • let B be a p × n matrix
  • matrix A can be expressed as [m, p]
  • matrix B can be expressed as [p, n]
  • matrix C can be expressed as [m, n].
  • the element in row i and column j in matrix C can be expressed as:
  • C[i, j] = A[i, 1] × B[1, j] + A[i, 2] × B[2, j] + … + A[i, p] × B[p, j], that is, the sum over k = 1, …, p of A[i, k] × B[k, j].
  • An m × n matrix is an array of m × n numbers arranged in m rows and n columns.
  • matrix A is represented as [m, p]
  • matrix B is represented as [p, n]
  • matrix C is represented as [m, n]
  • after the matrix multiplication operation, the p axis in matrix A and the p axis in matrix B are eliminated, and only the m axis and the n axis are retained in matrix C.
  • the eliminated axes in matrix A and matrix B can be called reduction axes, as in the reference loop below.
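  • For reference, the operation above corresponds to the following plain loop nest (host-side sketch), in which the k loop runs over the reduction axis p.

      // C[m, n] = A[m, p] * B[p, n]; the p axis is reduced away and only m and n remain.
      void matmul_reference(const float* A, const float* B, float* C, int m, int p, int n) {
          for (int i = 0; i < m; ++i) {
              for (int j = 0; j < n; ++j) {
                  float acc = 0.0f;
                  for (int k = 0; k < p; ++k) {      // reduction over the p axis
                      acc += A[i * p + k] * B[k * n + j];
                  }
                  C[i * n + j] = acc;
              }
          }
      }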
  • Batch: in the matrix multiplication operation, there may be one or more batch axes, which do not participate in the multiply-accumulate operation.
  • D[Batch1, Batch2, Batch..., M, N] = C[Batch1, Batch2, Batch..., M, N] + A[Batch1, Batch2, Batch..., M, K] * B[Batch1, Batch2, Batch..., K, N].
  • for example, matrix A is arranged in row-major (row_major) order and matrix B is arranged in column-major (col_major) order;
  • or matrix A is arranged in row_major order and matrix B is arranged in row_major order;
  • or matrix A is arranged in col_major order and matrix B is arranged in row_major order.
  • row_major and col_major are methods for storing multidimensional arrays in memory. The difference between the two orders is which elements of the array are contiguous in memory. For the row_major arrangement, consecutive elements of the rows in the array are adjacent to each other in memory; for the col_major arrangement, consecutive elements of the columns in the array are adjacent to each other in memory.
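  • The corresponding index computations (0-based, for an array with the given numbers of rows and columns) are:

      // row_major: consecutive elements of a row are adjacent in memory.
      inline int index_row_major(int i, int j, int cols) { return i * cols + j; }
      // col_major: consecutive elements of a column are adjacent in memory.
      inline int index_col_major(int i, int j, int rows) { return j * rows + i; }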
  • FIG. 1 is a schematic diagram of two data arrangement modes provided by the embodiment of the present application.
  • Float: a data type.
  • the Float data type is used to store single-precision floating-point numbers or double-precision floating-point numbers.
  • the data type float16 refers to 16-bit data
  • the data type float32 refers to 32-bit data.
  • Modulo operation: refers to finding the remainder when one integer is divided by another.
  • Integer division operation refers to the operation of taking an integer quotient when dividing an integer by another integer in integer operations, and does not consider the remainder of the operation.
  • Polyhedral modeling: a scheduling compilation optimization technique.
  • the essence of the polyhedron model is to abstract the affine loop nests in a program as polyhedra and, through geometric operations on these polyhedra, to analyze and optimize the scheduling of the program, so as to exploit the automatic parallelism of the program.
  • Pluto algorithm: applied to solve efficient polyhedron scheduling. It is essentially a cost model aimed at minimizing the amount of communicated data, and based on this cost model it successively solves for the partitioning hyperplanes required for the scheduling transformation. Given a polyhedral representation of a loop-nest sequence, the algorithm can be used to determine an efficient schedule such that read-write dependencies are satisfied.
  • Graph-computing fusion: a network performance optimization technology.
  • graph-computing fusion can optimize the logic of existing network computation graphs through automatic analysis, and, combined with the capabilities of the target hardware, perform computation-graph optimizations such as computation simplification, operator splitting and fusion, and operator-specialized compilation, so as to improve the utilization of device computing resources and achieve an overall optimization of network performance.
  • graph computing fusion has unique advantages such as multi-operator cross-boundary joint optimization, cross-layer collaboration with operator compilation, and real-time compilation of operators based on polyhedron automatic scheduling optimization.
  • the graphics card manufacturer NVIDIA introduced the TensorCore matrix multiplication unit into the Volta architecture.
  • the TensorCore unit has extremely powerful throughput.
  • the technical solutions for TensorCore mainly include the artificial intelligence compilation framework represented by Tensor Virtual Machine (TVM).
  • FIG. 2 is a schematic flowchart of the matrix multiplication operation of the TVM-enabled TensorCore provided by the embodiment of the present application.
  • the process of TVM enabling the matrix multiplication operation of TensorCore includes the following steps 201-204.
  • Step 201: parse and compile the operator description corresponding to the matrix multiplication operation, and generate an intermediate representation.
  • Step 202: the generated intermediate representation is used as the input of the TensorCore optimization pass, and the intermediate representation is optimized by the optimization pass.
  • the optimization pass performs matrix-multiplication pattern recognition and TensorCore functional-unit matching on the intermediate representation. If the pattern matches successfully, the intermediate representation is annotated.
  • Step 203: the compilation back-end code generation module parses the intermediate representation and calls the WMMA interface to perform the matrix multiplication operation.
  • WMMA is a programming interface (Application Programming Interface, API) provided by Compute Unified Device Architecture (CUDA).
  • CUDA is a general-purpose parallel computing architecture introduced by NVIDIA that enables GPUs to solve complex computing problems.
  • Step 204: finally, generate TensorCore-enabled object code based on the intermediate representation. A sketch of the WMMA interface used in this flow is given below.
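  • For reference, a single 16x16x16 tile computed through the WMMA interface looks roughly as follows (CUDA C++, assuming row-major A, column-major B, and leading dimensions passed in by the caller).

      #include <cuda_fp16.h>
      #include <mma.h>
      using namespace nvcuda;

      __global__ void wmma_tile(const half* A, const half* B, float* C,
                                int lda, int ldb, int ldc) {
          wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
          wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
          wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

          wmma::fill_fragment(c_frag, 0.0f);               // initialize the accumulator
          wmma::load_matrix_sync(a_frag, A, lda);          // warp-level load of a 16x16 A tile
          wmma::load_matrix_sync(b_frag, B, ldb);          // warp-level load of a 16x16 B tile
          wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // TensorCore multiply-accumulate
          wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
      }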
  • the embodiment of the present application provides a compiling method: in the process of compiling the operator description, an interface containing specific instruction codes is called in the generated object code, so as to specify that instances executed in parallel in the same stage correspond to data in different banks, which ensures that these instances will not access data in the same bank at the same time, thereby avoiding bank conflicts in shared memory.
  • the compiling method provided by the embodiment of the present application can be applied to the training and reasoning of network models in artificial intelligence scenarios, such as network models under the core feature of graph-computing fusion in the MindSpore framework.
  • MindSpore is an artificial intelligence computing framework in all scenarios.
  • Figure 3 is an architecture diagram of the fusion feature of the MindSpore graph calculation provided in the embodiment of the present application.
  • the MindSpore front-end representation ( 3013 ) is generated after public optimization ( 3012 ) is performed on the network model ( 3011 ) in the MindSpore front-end.
  • then, graph-layer front-end graph optimization (3014) and back-end computation optimization (3015) are performed.
  • the operator description related to the matrix multiplication operation is generated (3021), and passed into the operator layer compilation framework (3002).
  • the operator description is parsed to obtain an intermediate representation (3022).
  • the intermediate representation is parsed and the calculation library is invoked (3024), and the object code (3025) is generated for calling and running by the layer compilation framework (3001).
  • the compiling method provided by the embodiment of the present application can be applied on a terminal.
  • the terminal provided by the embodiment of the present application can be, for example, a mobile phone, a personal computer (PC), a notebook computer, a server, a tablet computer, a smart TV, a mobile internet device (MID), a wearable device, a virtual reality (VR) device, an augmented reality (AR) device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, and so on.
  • FIG. 4 is a schematic flowchart of a compiling method 400 provided in an embodiment of the present application. As shown in FIG. 4, the compiling method 400 includes the following steps 401-403.
  • step 401 an operator description of the neural network model is obtained, and the operator description includes a description of matrix multiplication operations.
  • the terminal can obtain the operator description that needs to be compiled, and the operator description can be the operator description of the neural network model.
  • the operator description may be written in a domain-specific language (Domain-Specific Language, DSL), and is used to define the operations that need to be performed during code execution.
  • the operator description includes a description of the matrix multiplication operation, that is, the operator description defines the matrix multiplication operation that needs to be performed during code execution, and the data required to perform the matrix multiplication operation.
  • Step 402 parse the operator description to obtain an intermediate representation.
  • the terminal can obtain the intermediate representation by parsing the operator description.
  • the role of the intermediate representation is to make the structure of the operator description to be compiled logically simpler and clearer, so that the optimization of the final target code is easier to realize.
  • the complexity of the intermediate representation is between the writing language of the operator description and the corresponding writing language of the object code.
  • Step 403 parsing the intermediate representation to generate object code.
  • the target code calls a first interface
  • the first interface includes instruction codes
  • the first interface includes PTX instruction codes.
  • the PTX instruction code is an assembly-level code
  • the format of the PTX instruction code is asm(), as in the sketch below.
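  • A minimal example of the asm() form (here reading the %laneid special register rather than a matrix instruction) is:

      __device__ unsigned read_lane_id() {
          unsigned lane;
          // PTX instruction code embedded in the CUDA source through asm().
          asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
          return lane;
      }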
  • the first interface is used to indicate a plurality of first mapping relationships, and the first mapping relationship is a mapping relationship between an instance and first data.
  • the instance is used to process the first data corresponding to the instance, and the first data is the data involved in the matrix multiplication operation.
  • the multiple instances executed in parallel in the same stage respectively have the first mapping relationship with the first data located in different storage banks (banks). Under the instruction of the instruction code, the instance only processes the first data that has a mapping relationship with it.
  • the instance may include threads or hyperthreads.
  • the instances mentioned in the embodiments of this application can be threads.
  • multiple instances that jointly perform a matrix multiplication operation are divided into stages to process the data participating in the matrix multiplication operation.
  • the 32 instances can be divided into multiple stages (for example, two stages or four stages) to process the data involved in the matrix multiplication operation.
  • for example, if the 32 instances are denoted t0-t31, then t0-t7 process data in the first stage, t8-t15 process data in the second stage, t16-t23 process data in the third stage, and t24-t31 process data in the fourth stage.
  • FIG. 5 is a schematic diagram of a matrix multiplication operation provided by an embodiment of the present application.
  • matrix A is a matrix with a size of 4 ⁇ 7
  • matrix B is a matrix with a size of 7 ⁇ 4
  • matrix C obtained after performing matrix multiplication operations on matrix A and matrix B is a matrix with a size of 4 ⁇ 4.
  • the data in the first row and first column of matrix C is obtained by multiplying and accumulating the data in the first row of matrix A and the first column of matrix B
  • the data in the fourth row and fourth column of matrix C is obtained by multiplying and accumulating the data in the fourth row of matrix A and the fourth column of matrix B.
  • the data of each row in matrix C is obtained based on the data of the same row in matrix A, for example, the data of the first row in matrix C is obtained based on the data of the first row in matrix A.
  • instance A is used to obtain the data in the first row and first column of matrix C
  • instance B is used to obtain the data in the first row and second column of matrix C
  • both instance A and instance B need to use the data in the first row of matrix A
  • if instance A and instance B access the data in the first row of matrix A at the same time, a data access conflict occurs, and the data access requests of instance A and instance B have to be executed in a queue, which affects the operation efficiency.
  • a similar problem arises with bank conflicts in shared memory: when different instances access the same bank in the shared memory at the same time, a bank conflict occurs, affecting computing efficiency.
  • FIG. 6 is a schematic diagram of multi-instance processing data provided by an embodiment of the present application.
  • FragmentA[16,4] represents the data in matrix A
  • t0-t31 represent 32 different instances respectively.
  • Two consecutive data in FragmentA belong to the same bank, that is, data 0 and data 1 belong to bank0, data 2 and data 3 belong to bank1...data 62 and data 63 belong to bank31.
  • a single instance handles one row of data in FragmentA.
  • instances t0-t7 belong to the instances executed in parallel in the first stage
  • instances t8-t15 belong to the instances executed in parallel in the second stage
  • instances t16-t23 belong to the instances executed in parallel in the third stage
  • instances t24-t31 belong to the instances executed in parallel in the fourth stage.
  • Instances t0-t7 correspond to the first eight rows of data in FragmentA, and each instance corresponds to a different row of data.
  • instance t0 corresponds to the first row of data
  • instance t1 corresponds to the second row of data. In this way, it can be ensured that instances t0-t7 executed in parallel in the same stage correspond to data in different banks, and bank conflicts during the execution of instances t0-t7 are avoided.
  • instances t8-t15 also correspond to the first eight rows of data in FragmentA. Since instances t0-t7 and instances t8-t15 belong to different stages, instances t0-t7 and instances t8-t15 may correspond to data in the same bank without causing a bank conflict.
  • in the binding expression, row represents the row of data and lane_id represents the instance ID.
  • in this way, the binding between row data and instance IDs is realized, that is, the mapping relationship between instances and data is indicated.
  • the index offset of the data between rows is ldm*row.
  • instance t0 and instance t8 are jointly bound to the data at address p+0 (that is, the data in the first row of FragmentA), t1 and t9 are jointly bound to the data at address p+ldm (ldm denotes the offset between the first row of data and the second row of data), and so on, until all data and instances are bound, as in the sketch below.
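  • A minimal sketch of this binding (eight instances per stage, so lane_id % 8 selects the row) is:

      #include <cuda_fp16.h>

      // Return the row address bound to the current instance: t0 and t8 get p + 0,
      // t1 and t9 get p + ldm, and so on, so instances of the same stage touch
      // different rows (and therefore different banks).
      __device__ const half* bind_row(const half* p, int ldm) {
          int lane_id = threadIdx.x % 32;   // instance ID within the warp
          int row     = lane_id % 8;        // 8 instances per stage
          return p + row * ldm;             // index offset between rows is ldm
      }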
  • the terminal can parse the operator description to obtain an intermediate representation; then, the terminal replaces the first statement in the intermediate representation with the first interface statement to obtain the target code, the The first statement is used to indicate to execute the matrix multiplication operation, the first interface statement is used to call the first interface, and the first interface is used to execute the matrix multiplication operation.
  • the first statement indicating to perform matrix multiplication in the intermediate representation may be an expression of matrix multiplication.
  • the terminal may parse the intermediate representation to obtain operation information, where the operation information includes information about the matrix multiplication operation. Based on the operation information, the terminal determines the location of the matrix multiplication operation in the intermediate representation, and replaces the statement at the location of the matrix multiplication operation (that is, the first statement above) with the first interface statement, so that The target code is obtained, and the first interface statement is used to call the first interface.
  • the terminal can determine the matrix multiplication operation in the intermediate representation and the information about the matrix multiplication operation by matching the expression of the matrix multiplication operation in the intermediate representation. Then, the terminal replaces the statement at the position of the matrix multiplication operation with the first interface statement that calls the first interface, so as to generate the target code that calls the first interface.
  • the first interface statement includes matrix multiplication operation information, so that the matrix multiplication operation can be realized based on the matrix multiplication operation information when the target code is executed.
  • the terminal parses the operator description to obtain an intermediate representation; the terminal replaces the second statement in the intermediate representation with a second interface statement to obtain the target code.
  • the second statement is used to indicate to move the first data to the local memory or to move the first data out of the local memory.
  • the first data involved in the matrix multiplication usually needs to be moved to the local memory to implement the subsequent matrix multiplication.
  • the second interface statement is used to call the first interface, and the first interface is used to move the first data.
  • the terminal parses the operator description to obtain an intermediate representation; the terminal replaces the third statement in the intermediate representation with a third interface statement to obtain the target code, and the third statement It is used to indicate the execution of a fusion operation, the input of the fusion operation includes the output of the matrix multiplication operation, and the third interface statement is used to call the second interface, and the second interface is used to execute the fusion operation.
  • the operation information obtained by parsing the intermediate representation at the terminal further includes fusion operation information, and the input of the fusion operation includes the output of the matrix multiplication operation.
  • the fusion operation refers to an operation that combines multiple operators (for example, an operator of a matrix multiplication operation), that is, it can implement a combination operation of an output of a matrix multiplication operation and other operators.
  • the process of the terminal parsing the intermediate representation and generating the target code further includes: the terminal determines, based on the operation information, the location of the fusion operation in the intermediate representation, and replaces the statement at the location of the fusion operation (that is, the third statement above) with the second interface statement to obtain the target code.
  • the second interface statement is used to call a second interface, and the second interface includes an instruction code for instructing to perform a fusion operation.
  • the fusion operation may include at least one of the following operations: addition, subtraction, multiplication, division, division with the result rounded down (floor division), modulo operation, and modulo operation with the result rounded down (floor modulo).
  • the fusion operation may specifically be element-wise addition of outputs of two different matrix multiplication operations.
  • in this way, the fusion of multiple operators can be realized, the operation efficiency of the terminal is improved, and the resource utilization of the terminal is improved; a sketch of such a fused operator is given below.
  • FIG. 7 is a schematic diagram of a calculation flow of a non-fused operator provided in the embodiment of the present application
  • Fig. 8 is a schematic diagram of a calculation flow of a fusion operator provided in the embodiment of the present application .
  • FIG. 7 shows a plurality of matrix operations without fusion
  • FIG. 8 shows the corresponding fused matrix operations.
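  • The following sketch illustrates the idea of FIG. 8 with a simple fused operator: the matrix multiplication result is combined element-by-element with another input D before being written back, instead of launching a separate element-wise kernel (assumed sizes m×p, p×n, and m×n).

      __global__ void matmul_add_fused(const float* A, const float* B, const float* D,
                                       float* C, int m, int p, int n) {
          int i = blockIdx.y * blockDim.y + threadIdx.y;
          int j = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= m || j >= n) return;

          float acc = 0.0f;
          for (int k = 0; k < p; ++k) {
              acc += A[i * p + k] * B[k * n + j];   // matrix multiplication part
          }
          C[i * n + j] = acc + D[i * n + j];        // fused element-wise addition
      }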
  • when the terminal executes the matrix multiplication operation based on the target code, the terminal often needs to first load the data involved in the matrix multiplication operation from the memory, and then execute the matrix multiplication operation based on that data. Therefore, in the process of executing the matrix multiplication operation, efficient data loading can improve the efficiency with which the terminal executes the matrix multiplication operation.
  • the instruction code in the first interface called by the target code is also used to indicate the logical storage structure for obtaining the first data and the data type, and determine the size of the data load pointer according to the logical storage structure and the data type.
  • the first data refers to the data involved in the matrix multiplication operation.
  • the logical storage structure of the first data refers to the logical structural form of the first data stored in the memory.
  • the first data may actually be a matrix data, and the matrix data includes multiple elements.
  • the data participating in the matrix multiplication in a certain matrix may be called Fragment data (ie, matrix data).
  • for example, the data participating in the matrix multiplication in matrix A is called FragmentA data.
  • the Fragment data corresponds to the set of matrix data corresponding to all instances in a warp.
  • the logical storage structure of the Fragment data can be expressed as [16,4] or [16,8], for example, that is, the logical storage structure of the Fragment data is a 16*4 matrix, or a 16*8 matrix.
  • the data type of the first data is used to indicate the data amount of elements in the first data which is matrix data.
  • the data amount of each element in the first data is constant. For example, when the data type of the first data is float16, the data amount of a single element in the first data is 16 bits; when the data type is float32, the data amount of a single element is 32 bits.
  • the data loading pointer is used to indicate the amount of data loaded by the instance at a time. For example, when the size of the data loading pointer is 128 bits, the data loading pointer indicates that the data volume of the single loading data of the instance is 128 bits.
  • FIG. 9 is a schematic diagram of a logical storage structure of first data provided by an embodiment of the present application.
  • the first data is FragmentA data
  • FragmentA data is stored in a memory area with a size of [16, 4]
  • the size of the shared memory is [128, 32].
  • in the FragmentA data, two adjacent rows of data actually differ by 32 elements in terms of memory address.
  • therefore, the size of the data loading pointer needs to be set to be less than or equal to 128 bits, as in the sketch below.
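  • A sketch of such a 128-bit load over the layout above (half-precision data, 32 elements per shared-memory row, 16-byte-aligned base address assumed) is:

      #include <cuda_fp16.h>

      __device__ float4 load_row_128bit(const half* shared_base, int row) {
          const int ldm = 32;   // elements per row of the [128, 32] shared-memory buffer
          // one 128-bit load fetches 8 consecutive half elements (8 x 16 bits) of the row
          return *reinterpret_cast<const float4*>(shared_base + row * ldm);
      }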
  • the terminal when obtaining the intermediate representation, may further perform scheduling optimization on the intermediate representation, so as to obtain an optimized intermediate representation.
  • the method further includes:
  • the terminal generates a parameter for matrix partitioning based on the intermediate representation, where the parameter for matrix partitioning is used to indicate a manner of matrix partitioning. Then, the terminal performs a block operation on the target matrix according to the parameters used for matrix block, to obtain a division result of the target matrix, where the target matrix is a matrix participating in the matrix multiplication operation. Finally, the terminal adds a data movement statement to the intermediate representation according to the division result of the target matrix, so that the target code includes the data movement statement.
  • the data movement statement is used to instruct to move the data of the target matrix in memory. For example, the data movement statement may be used to indicate to move the data of the target matrix from the global memory to the shared memory, or to move the data of the target matrix from the shared memory to the local memory.
  • a multi-level memory promotion mechanism is designed in this embodiment, that is, according to the size of the matrix block, the data is promoted from the global memory to the shared memory with a high data read and write speed, and then promoted again to the local memory with an even higher data read and write speed.
  • the terminal moves the data corresponding to the matrix from the global memory to the shared memory in advance, and then moves the data corresponding to the inner matrix in the outer matrix from the shared memory to the local memory to improve the efficiency of data loading.
  • the terminal may divide the matrix multiplication operation into multiple parts for execution during the matrix multiplication operation, so as to realize multi-instance parallel execution of the matrix multiplication operation, thereby improving operation efficiency.
  • the division result of the target matrix may include a first matrix, and the first matrix includes the second matrix.
  • the terminal adding the data movement statement to the intermediate representation may specifically include: the terminal adding the first data movement statement after the statement indicating division of the first matrix, and adding the second data movement statement after the statement indicating division of the second matrix.
  • the first data movement statement is used to indicate that the data of the first matrix is moved from the global memory to the shared memory
  • the second data movement statement is used to indicate that the data of the second matrix is moved from the shared memory to local memory.
  • the capacity of the global memory is greater than that of the shared memory, and the capacity of the shared memory is greater than that of the local memory; the data read and write speed of the local memory is greater than that of the shared memory, and the read and write speed of the shared memory is greater than that of the global memory. Moving data from global memory to shared memory and moving data from shared memory to local memory can effectively improve data access speed.
  • the terminal may divide the matrix into multiple first matrices and multiple second matrices.
  • the terminal can divide the matrix of size [768, 768] into 36 first matrices of size [128, 128]; then, the terminal can divide each first matrix of size [128, 128] into 64 second matrices of size [16, 16].
  • the terminal actually performs the matrix multiplication operation on multiple second matrices of the size [16, 16], thereby realizing the matrix multiplication operation performed on the matrix of the size [768, 768].
  • the terminal can add the first data movement statement after the statement indicating division of the first matrix, where the first data movement statement refers to moving the data corresponding to the first matrix of size [128, 128] from the global memory to the shared memory; and, the terminal adds the second data movement statement after the statement indicating division of the second matrix, where the second data movement statement refers to moving the data corresponding to the second matrix of size [16, 16] from the shared memory to the local memory.
  • the matrix can be divided into multiple levels, and the outer matrix can be further divided into multiple inner matrices, and each level corresponds to a different type of data movement statement.
  • when the terminal executes the target code, the terminal can first move the data corresponding to the outer matrix to the shared memory when dividing the matrix into outer matrices; then, when the terminal divides the outer matrix into inner matrices, the terminal moves the data corresponding to the inner matrix from the shared memory to the local memory, so that the terminal can quickly obtain the corresponding data from the local memory when executing the matrix multiplication operation corresponding to the inner matrix.
  • the terminal can also move the data to be executed for matrix multiplication from the shared memory to the local memory in advance, ensuring the efficiency of data access. That is, according to the size of the matrix block, the terminal promotes the data involved in the matrix multiplication operation from the global memory to the shared memory, and then promotes it to the local memory again.
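  • A minimal CUDA sketch of this multi-level promotion is shown below; the kernel name, tile sizes and indexing are assumptions chosen to match the [128, 32] outer tile discussed above, not the code actually generated by the compiler:

    #include <cuda_fp16.h>

    // Illustrative sketch of multi-level memory promotion: a [128, 32] outer tile of
    // matrix A is first moved from global memory into shared memory; each thread then
    // moves the slice of the inner tile it will reuse from shared memory into local
    // (register) storage before computing.
    __global__ void promoteAndCompute(const half* __restrict__ aGlobal, float* out, int lda) {
        __shared__ half aShared[128 * 32];           // outer tile promoted to shared memory

        // Stage 1: global -> shared, all threads of the block cooperate.
        for (int i = threadIdx.x; i < 128 * 32; i += blockDim.x) {
            int row = i / 32, col = i % 32;
            aShared[i] = aGlobal[row * lda + col];
        }
        __syncthreads();

        // Stage 2: shared -> local, each thread keeps the 8 elements it will reuse.
        half aLocal[8];                               // inner tile promoted to local memory
        for (int k = 0; k < 8; ++k) {
            aLocal[k] = aShared[(threadIdx.x % 128) * 32 + k];
        }

        // Dummy use of the promoted data so the example is self-contained.
        float acc = 0.f;
        for (int k = 0; k < 8; ++k) acc += __half2float(aLocal[k]);
        out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
    }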
  • in order to improve the efficiency with which instances move data during the execution of the target code, the terminal can also specify the mapping relationship between instances and data movement statements during the process of generating the target code, so as to ensure a reasonable match between instances and data movement statements and thereby guarantee the locality of data access.
  • the method further includes: the terminal establishes a second mapping relationship between instances and data movement statements in the intermediate representation based on the number of instances and the data structure of the divided matrix, so that the target code includes the second mapping relationship.
  • the second mapping relationship is used to indicate an instance of executing a data movement statement.
  • the mapping relationship between the instance and the data corresponding to the matrix is usually determined based on the matrix that needs to be calculated. That is, in the related art, the mapping relationship between the instance and the operation statement and the data transfer statement is determined based on the matrix size of the output part of the matrix multiplication operation.
  • the special feature of the matrix multiplication operation is that the size of the input matrix and the output matrix corresponding to the matrix multiplication operation may be inconsistent. For example, for a matrix multiplication operation in which the input matrix A participates, the size of the input matrix A is [128,32], and the size of the output matrix C of the matrix multiplication operation may be [32,128].
  • if the mapping relationship between the instance and the data movement statement is established based on the size of the output matrix C, the efficiency of executing the data movement statement by the instance is often low due to the unreasonable mapping relationship.
  • for example, when the mapping relationship between instances and matrices is established based on the output matrix, the mapping of instances is [32, 4].
  • in this embodiment, the warp-level calculation statement mapping can be established, so as to ensure that the mapping relationship between multiple instances and calculation statements is more reasonable and further optimize the efficiency of operations.
  • the method further includes: the terminal determines the number of warps according to the total number of instances participating in the matrix multiplication operation, where each warp includes the same number of instances.
  • every 32 instances form a warp
  • a warp is the basic unit of scheduling and running.
  • the terminal establishes a third mapping relationship between warps and axes in the target matrix in the intermediate representation based on the number of warps and the data structure of the target matrix, where the third mapping relationship is used to indicate the warp that executes the operation of the axis in the target matrix.
  • the terminal first derives the number of warps according to the total number of instances. Then, based on the number of warps, the terminal allocates as many warps as possible to the two dimensions w0 and w1.
  • the warps in the two dimensions w0 and w1 are used to bind the M-axis and N-axis of the matrix in the matrix multiplication operation, respectively.
  • w0/w1 may be represented by an instance expression.
  • the w0 index is expressed as threadIdx.x MOD (32*2) div 32; the w1 index is expressed as threadIdx.x div (32*2).
  • MOD represents the modulo operation, and div represents the integer division operation.
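  • Expressed as device code, the two warp indices described above can be computed as follows (an illustrative sketch; the factor 2 is the assumed number of warps along the w0 dimension):

    // Sketch of the w0/w1 warp index expressions: w0 is bound to the M axis and w1 to the
    // N axis of the matrix multiplication.
    __device__ void warpIndices(int* w0, int* w1) {
        *w0 = (threadIdx.x % (32 * 2)) / 32;   // threadIdx.x MOD (32*2) div 32
        *w1 = threadIdx.x / (32 * 2);          // threadIdx.x div (32*2)
    }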
  • FIG. 10 is a schematic flowchart of a compiling method 1000 provided in an embodiment of the present application.
  • the compilation method 1000 can be implemented in the AKG operator compilation and optimization framework in the MindSpore framework.
  • AKG optimizes the operators in the deep neural network and provides the automatic operator fusion function in a specific mode.
  • AKG and MindSpore graph computing fusion features work together to improve the operating efficiency of heterogeneous back-end networks.
  • the compiling method includes the following steps 1001-1005.
  • Step 1001 obtain operator description.
  • AKG can receive the operator description, which includes matrix multiplication operations and fusion operations related to matrix multiplication operations.
  • the input of the fusion operation includes the output of the matrix multiplication operation.
  • Step 1002 analyzing the operator description, generating an intermediate representation and recording operation information.
  • after receiving the operator description, AKG parses the operator description and generates an initial intermediate representation. In addition, after AKG generates the intermediate representation, AKG can analyze the calculation logic and operator fusion mode corresponding to the matrix multiplication operation based on the intermediate representation to obtain the operation information.
  • the calculation logic corresponding to the matrix multiplication operation includes the first data participating in the matrix multiplication operation, the arrangement of the first data, the data type of the first data, and the expression of the matrix multiplication operation.
  • the operator fusion mode includes the calculation logic of the fusion calculation and the position of the fusion calculation statement.
  • the calculation logic of the fusion calculation includes the data participating in the fusion operation, the arrangement of the data participating in the fusion operation, the data type of the data participating in the fusion operation, and the expression of the fusion operation.
  • Step 1003 perform scheduling optimization on the intermediate representation based on the polyhedron model.
  • AKG can perform scheduling optimization for processing software and hardware coordination based on the polyhedron compilation model.
  • AKG can adaptively generate GPU configuration parameters based on the intermediate representation and the obtained operation information.
  • the GPU configuration parameters include parameters for matrix partitioning and Grid/Block configuration parameters. Then, AKG performs matrix partitioning according to the parameters used for matrix partitioning. After the block, AKG binds the calculation statement to the data mapping based on the above-mentioned Grid/Block configuration parameters.
  • AKG performs multi-level memory upgrade on the data involved in the calculation, that is, according to the size of the matrix segmentation, the memory location of the data is upgraded from the global memory to the shared memory, and then to the local memory. At this time, AKG will match the data movement statement corresponding to the above-mentioned memory promotion with the Grid/Block parameter.
  • Step 1004 perform back-end Pass optimization on the scheduling-optimized intermediate representation.
  • in step 1004, AKG performs general-purpose optimization on the intermediate representation after scheduling optimization.
  • the pass for optimizing the intermediate representation mainly includes: shared memory bank conflict avoidance pass, loop body expansion pass, vectorized loading pass, data pipeline prefetch pass and other optimization passes.
  • all optimization passes are executed by performing pattern matching on the intermediate expression and then labeling and modifying the intermediate expression.
  • Step 1005 analyze and execute the optimized intermediate representation of the back-end Pass, and link the library based on the fusion mode to generate the object code.
  • the core process is to parse the intermediate representation, and at the same time, according to the operation information recorded in step 1002, call the PTX inline library and the Elem-Wise matrix operation library at the Fragment level, launch the corresponding API interface, and finally generate the target code.
  • the PTX inline library includes multiple interfaces, and the multiple interfaces in the PTX inline library correspond to matrix multiplication operations.
  • when the terminal executes the target code, the terminal performs the matrix multiplication calculation based on the interface in the PTX inline library called by the target code.
  • the PTX inline library may include a matrix multiplication operation interface, a data initialization interface, a data loading interface and a data storage interface.
  • the terminal can load the data involved in the matrix multiplication operation based on the data loading interface, set the initial values of all elements in the Fragment based on the data initialization interface, and perform matrix multiplication operations based on the matrix multiplication operation interface.
  • the calculated data is stored based on the data storage interface.
  • the Elem-Wise matrix operation library includes multiple interfaces, and the multiple interfaces in the Elem-Wise matrix operation library correspond to fusion calculations.
  • when the terminal executes the target code, the terminal performs the fusion calculation based on the interface in the Elem-Wise matrix operation library called by the target code.
  • the Elem-Wise matrix operation library may include an addition operation interface, a subtraction operation interface, a multiplication operation interface, and a division operation interface, which are respectively used to perform different types of fusion calculations.
  • step 1002-step 1005 will be described in detail below in combination with examples.
  • Step 1002 analyzing the operator description, generating an intermediate representation and recording operation information.
  • after parsing the operator description, the terminal generates an intermediate representation and analyzes the calculation logic and operator fusion mode corresponding to the matrix multiplication expression. Specifically, the terminal may perform matrix multiplication operation mode matching on the operator description to obtain the matched matrix multiplication operation mode. After the matrix multiplication operation mode is obtained through matching, the terminal determines the size of the matrices involved in the calculation, the data arrangement mode of the matrices involved in the calculation, the data type of the matrices involved in the calculation, and the fusion mode corresponding to the matrices.
  • the terminal performs matching of the matrix multiplication operation mode, and determines the size of the matrix involved in the calculation, the data arrangement mode of the matrix involved in the calculation, the data type of the matrix involved in the calculation, and the fusion mode corresponding to the matrix.
  • the terminal can determine the size of the matrix involved in the calculation based on the realize node in the intermediate representation and the size of the corresponding cycle axis. Taking the above code as an example, analyzing the realize compute shows that the size of the output matrix D is [768,768], and analyzing the sizes of the three axes corresponding to the loop axis for shows that the size of the input matrix A is [768,768] and the size of the input matrix B is [768,768].
  • the terminal can determine that the two input matrices input_1 and input_2 respectively correspond to a reduction axis reduce_axis by analyzing the calculation relationship of the matrices compute, input_1, and input_2 involved in the calculation.
  • the positions of the reduction axes corresponding to the input matrices input_1 and input_2 are both in the innermost axis, and the corresponding arrangement is A[M,K]&B[N,K].
  • the terminal can perform axis fusion on multiple batch axes during the process of generating the intermediate representation.
  • analyzing the calculation statement and the compute node shows that the calculation is a four-dimensional matrix multiplication operation, where the first two axes B ∈ [0,32) and b ∈ [0,12) are batch axes, and [0,32) and [0,12) are integer intervals.
  • this scheme multiplies the two ranges and fuses them into a single B.b.fused axis, with B.b.fused ∈ [0,384).
  • the index of the B.b.fused axis corresponding to the original B axis and b axis becomes an expression of the modulo operation and the DIV operation, namely compute(floordiv(B.b.fused,12), floormod(B.b.fused,12)).
  • the first two axes in matrix A are batch axes.
  • the matrix A[10,10,M,N] can be understood as multiple matrices with a matrix size of [M,N] are divided into 10 batches, and each batch of matrices also includes 10 matrices. After fusing the two batch axes in the matrix A[10,10,M,N], the matrix A[100,M,N] is obtained. At this time, the matrix A[100,M,N] can be understood as that multiple matrices with a matrix size of [M,N] are divided into 100 batches, and each batch of matrices includes only one matrix.
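  • The index recovery after batch-axis fusion can be illustrated with a small host-side sketch (illustrative only): for B in [0,32) and b in [0,12) the fused index runs over [0,384), and the original indices are recovered with an integer division and a modulo, mirroring compute(floordiv(B.b.fused,12), floormod(B.b.fused,12)).

    #include <cstdio>

    int main() {
        // Fused axis B.b.fused in [0, 384) obtained from B in [0, 32) and b in [0, 12).
        for (int fused = 0; fused < 384; ++fused) {
            int B = fused / 12;   // floordiv(B.b.fused, 12)
            int b = fused % 12;   // floormod(B.b.fused, 12)
            if (fused < 3 || fused > 380) {
                printf("fused=%d -> B=%d, b=%d\n", fused, B, b);
            }
        }
        return 0;
    }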
  • the specific execution process of step 1002 is introduced above, and the specific execution process of step 1003 will be described below.
  • Step 1003 perform scheduling optimization on the intermediate representation based on the polyhedron model.
  • FIG. 11 is a schematic flowchart of scheduling and optimizing an intermediate representation based on a polyhedron model according to an embodiment of the present application.
  • the specific process of scheduling and optimizing the intermediate representation based on the polyhedron model may include the following steps 1101-1106.
  • Step 1101 perform polyhedron scheduling optimization on the intermediate representation based on the Pluto algorithm.
  • the terminal performs polyhedron scheduling optimization on the intermediate representation based on the Pluto algorithm, which can realize effective initial loop nested scheduling optimization based on the polyhedron model.
  • Step 1102 adaptively generate configuration parameters.
  • the terminal can adaptively generate corresponding configuration parameters.
  • the terminal can configure parameters according to GPU memory utilization, data locality, and operation concurrency.
  • the configuration parameters output by the terminal may include parameters for matrix partitioning and Grid/Block configuration parameters.
  • Step 1103 perform matrix partitioning based on configuration parameters.
  • the terminal may divide all the matrices by axes according to the parameters for matrix division calculated in step 1102 to implement matrix division, so as to facilitate subsequent memory upgrade of the divided internal matrix.
  • the loops produced by the outer-layer segmentation after the matrix is divided into blocks can be effectively bound to the Grid to achieve parallel computing.
  • the operation of the terminal on the intermediate representation may specifically be as follows: the terminal analyzes the schedule node (that is, the scheduling statement instances), and divides all the axes in the matrix into multi-layer scheduling based on the matrix segmentation parameters.
  • the matrix segmentation parameter can be: the outer layer segmentation parameter is M128N128K32, that is, the sizes of the M, N, and K axes after segmentation are 128, 128, and 32, respectively;
  • the inner layer segmentation parameter is M16N16K8, that is, the sizes of the M, N, and K axes after further segmentation are 16, 16, and 8, respectively.
  • the terminal splits the schedule nodes corresponding to the three axes M, N, and K according to the outer-layer splitting parameters, forming two-layer scheduling through modulo/division operations; the terminal then splits the result again according to the inner-layer splitting parameters, likewise forming two-level scheduling by means of modulo/division operations.
  • the original one-level scheduling is thus divided into three layers, namely [M/128, N/128, K/32], [(M MOD 128)/16, (N MOD 128)/16, (K MOD 32)/8], [M MOD 16, N MOD 16, K MOD 8].
  • the three-tier scheduling can better adapt to the subsequent binding optimization with GPU hardware.
  • the outermost layer scheduling is bound to Grid
  • the middle layer scheduling is bound to Warp
  • the innermost layer is used to match the first interface for performing matrix multiplication described in the above embodiment.
  • Step 1104 establishing a mapping relationship between calculation statements and Grid/Block.
  • warp-level calculation statement mapping can be established, so as to ensure a more reasonable mapping relationship between multiple instances and calculation statements, so as to further optimize operation efficiency.
  • the terminal first derives the number of warps according to the total number of instances. Then, based on the number of warps, the terminal allocates as many warps as possible to the two dimensions w0 and w1.
  • the warps in the two dimensions w0 and w1 are used to bind the M-axis and N-axis of the matrix in the matrix multiplication operation, respectively.
  • w0/w1 may be represented by an instance expression.
  • the w0 index is expressed as threadIdx.x MOD (32*2) div 32; the w1 index is expressed as threadIdx.x div (32*2).
  • MOD represents the modulo operation, and div represents the integer division operation.
  • Step 1105 perform multi-level memory promotion.
  • the original polyhedron scheduling technology only performs a memory upgrade on the matrix once.
  • this embodiment designs a multi-level memory promotion mechanism, that is, data is promoted from the global memory to the shared memory according to the size of the matrix block, and then to the local memory for a second time.
  • the specific operation process is as follows: firstly, tags of different scheduling levels are added in the aforementioned step 1104 (for example, mark nodes are added in the code, and the logic for adding the tags is the division level of the scheduling). Then, in the process of multi-level memory promotion, the required memory and the memory level to be promoted to are calculated according to the tag; if the memory is sufficient, the corresponding memory allocation statement and data movement statement are added; otherwise, the amount of data for memory promotion is reduced until sufficient memory is available.
  • the added data movement statement is implemented by inserting a child node, namely an extension node, into the schedule node below the corresponding mark node in the intermediate representation.
  • the extension node includes the names of the input and output matrices of the data migration, that is, the index correspondence ("->" indicates the direction of data movement).
  • Step 1106 establishing a mapping relationship between the data movement statement and the instance.
  • the terminal can calculate the total number of instances corresponding to the matrix, and reallocate instances from the inner axis (memory continuous axis) to the outer according to the size of the divided matrix until the instances are mapped.
  • This method can effectively guarantee the locality of data access.
  • the terminal can perform the following three specific steps:
  • the terminal can bind input_2_shared[128,32] to Block[4,32], and the remaining cycle axis is [32,1].
  • the first column describes the data volume of the matrix multiplication operation of the specific example
  • the second column is the operation time based on the optimization method of this embodiment
  • the third column is the operation time based on the existing optimization method.
  • the fourth column is the performance gap between the second column and the third column. According to the analysis, it can be seen that this embodiment brings improvements of different degrees for different matrix multiplication operations, and the improvement ratio is greater than 70%.
  • step 1003 has been introduced above, and the specific execution process of step 1004 will be introduced below.
  • Step 1004 perform back-end Pass optimization on the scheduling-optimized intermediate representation.
  • the Pass for performing back-end optimization may specifically include a Shared Memory Bank conflict avoidance Pass, a loop body unrolling Pass, a vectorized loading Pass, a data pipeline prefetching Pass, and the like.
  • Shared Memory Bank conflict avoidance Pass: this Pass adjusts how Fragment data is stored in the shared memory in order to avoid Bank conflicts.
  • Bank Conflict Avoidance Pass can modify the intermediate representation and adjust the way Fragment data is stored, so that Bank conflicts can be eliminated during the execution of matrix multiplication operations.
  • A_Global represents global memory
  • A_shared represents shared memory
  • Fragment represents the Fragment area used to store Fragment data.
  • the size of the shared memory is [128,32]
  • the shared memory includes multiple sets of the same Bank, and each set of Banks includes 32 Banks, which are respectively B0-B31.
  • the size of the Fragment area is [16,8].
  • optimization based on Bank conflict avoidance Pass can realize Fragment data rearrangement, and store Fragment data in continuous shared memory, that is, Fragment[16,8] is stored in shared[1,128].
  • the position of the Fragment area becomes [1,128], and the size of the Fragment area does not change.
  • the data in the Fragment area belongs to different Banks, thereby achieving the purpose of eliminating Bank conflicts.
  • FIG. 13 is a schematic diagram of an optimized data access arrangement based on Bank conflict avoidance Pass provided by an embodiment of the present application.
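  • The effect of the rearrangement can be sketched as follows (an illustrative kernel, not the generated target code; float elements are used so that one element matches one 4-byte bank word): once the 128 Fragment elements are written back to back as shared[1,128], 32 consecutive instances touch 32 consecutive words and therefore 32 different Banks.

    // Illustrative sketch of the Fragment rearrangement performed by the Bank conflict
    // avoidance Pass (sizes follow the [16, 8] Fragment and the 32-bank layout above).
    __global__ void storeFragmentContiguously(const float* __restrict__ frag, float* out) {
        // Rearranged layout: the 128 Fragment elements lie back to back in shared memory,
        // so 32 consecutive threads access 32 consecutive 4-byte words, i.e. 32 banks.
        __shared__ float fragShared[128];

        if (threadIdx.x < 128) {
            fragShared[threadIdx.x] = frag[threadIdx.x];   // contiguous, conflict-free store
        }
        __syncthreads();

        if (threadIdx.x < 128) {
            out[threadIdx.x] = fragShared[threadIdx.x];    // contiguous, conflict-free load
        }
    }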
  • Loop body unrolling Pass: the loop body unrolling Pass is used to unroll and optimize the for loops in the intermediate representation, so as to avoid excessively increasing the number of instructions.
  • each layer of for loops includes multiple branches.
  • branch conflicts may occur, thereby increasing the number of instructions.
  • the loop body unrolling Pass determines whether to unroll a certain for loop in the intermediate representation by comparing three parameters in the intermediate representation with preset thresholds. If the judgment result is that a for loop needs to be unrolled, the corresponding for node in the intermediate representation is marked as an unrolled node, and in the final code generation stage the corresponding unroll instruction is generated, that is, a line of macro instruction code "#pragma unroll" is added on the line before the code corresponding to the for loop.
  • the three parameters mentioned above and their preset thresholds are specifically shown in Table 3.
  • Table 3: auto_max_step (the number of statements inside the For loop), threshold 3; auto_max_depth (the number of For layers that need to be unrolled, for nesting), threshold 8; auto_max_extent (the upper bound of the For loop), threshold 16.
  • the three parameters are auto_max_step, auto_max_depth and auto_max_extent.
  • auto_max_step indicates the number of statements in the For loop;
  • auto_max_depth indicates the number of For layers that need to be unrolled;
  • auto_max_extent indicates the upper bound of the For loop.
  • the values of the thresholds corresponding to the three parameters can be adjusted according to the actual situation, and are not limited to the values shown in Table 2, as long as it is ensured that the thresholds are greater than 0.
  • the code of the intermediate representation including multiple layers of for loops is as follows:
  • the parameter auto_max_step is the number of statements inside the For loop, that is, the two calculation statements in the fifth and seventh lines, and the value is 2;
  • the parameter auto_max_depth corresponds to the number of nested inner for loops plus the loop itself, a total of two layers, so the value is 2;
  • the parameter auto_max_extent is the maximum number of executions of the for statement, that is, cc9 ⁇ [0,2), the value is 2.
  • the three parameters corresponding to the for loop are all smaller than the limit set in the table, so the loop expansion can be performed.
  • after loop unrolling is executed on the above code, the loop body is expanded for each value of the loop variable and the for loop is removed.
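  • The original IR listing is not reproduced in this text; the following hypothetical kernel merely illustrates the transformation the Pass triggers, namely marking a small inner for loop whose parameters fall below the thresholds so that code generation emits the "#pragma unroll" macro instruction in front of it:

    // Hypothetical illustration of the loop body unrolling Pass: the inner loop has
    // 2 statements, 1 nesting level, and an extent of 2, all below the thresholds in
    // Table 3, so it is marked for unrolling and "#pragma unroll" is emitted.
    __global__ void unrolledInnerLoop(float* __restrict__ dst, const float* __restrict__ src) {
        float acc = 0.f;
    #pragma unroll
        for (int cc9 = 0; cc9 < 2; ++cc9) {         // extent 2 <= auto_max_extent
            acc += src[threadIdx.x * 2 + cc9];       // statement 1
            dst[threadIdx.x * 2 + cc9] = acc;        // statement 2
        }
    }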
  • Vectorized loading is similar to a Single Instruction Multiple Data (SIMD) instruction.
  • SIMD instructions are a set of instructions that copy multiple operands and pack them into large registers. Since one instruction can process multiple data at one time, it can reduce the number of overall instruction executions and expand bandwidth utilization.
  • the target code compiled based on the intermediate representation can instruct the terminal to use the Float128 data type format to read data, that is, the size of the data read by the terminal each time is 128 bits.
  • the cc3 axis is split with a vectorization coefficient of 8, and the split-off inner loop cc8 is not bound to a GPU instance; that is, one instance processes 8 data elements of the Float16 data type. At the same time, the corresponding For loop is marked as a vectorized node in the IR.
  • the vectorized loading Pass rewrites the intermediate representation accordingly.
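  • A hypothetical CUDA sketch of the resulting access pattern is shown below: each instance performs one 128-bit access, i.e. loads 8 Float16 elements at once, by reinterpreting the half pointers as float4 (this assumes the buffers are 16-byte aligned and their length is a multiple of 8):

    #include <cuda_fp16.h>

    // Hypothetical sketch of a vectorized load: one float4 access is 128 bits and therefore
    // carries 8 half (Float16) values, so each instance loads 8 elements with one access.
    __global__ void vectorizedLoad(half* __restrict__ dst, const half* __restrict__ src) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        const float4* src4 = reinterpret_cast<const float4*>(src);
        float4* dst4 = reinterpret_cast<float4*>(dst);
        dst4[i] = src4[i];   // single 128-bit load + single 128-bit store
    }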
  • the data pipeline prefetch Pass is used to additionally allocate part of the local memory to stage the pre-fetched data.
  • the terminal can read a part of the data in advance for subsequent calculations when executing the target code, and read subsequent data at the same time during the calculation process. In this way, data reading and calculation can be performed synchronously, saving time overhead.
  • part of the data extracted and read by the terminal is stored in the local memory requested by the data pipeline Pass.
  • FIG. 14 is a schematic diagram of a calculation sequence in the related art.
  • the calculation statement (compute) needs to wait for the input data to be stored in the shared memory (shared) before proceeding. That is, the terminal needs to store the input data in the shared memory before performing the first calculation; after the first calculation is completed, store the input data required for the second calculation in the shared memory, and then perform the second calculation .
  • FIG. 15 is a schematic diagram of a calculation sequence after data pipeline Pass optimization provided by the embodiment of the present application.
  • the data pipeline Pass additionally allocates a local memory with a faster data read and write speed, which is used to stage the pre-fetched data.
  • the terminal reads the input data required for the second calculation into the local memory in advance, thus ensuring that, after the first calculation is completed, the data in the local memory can be quickly read into the shared memory, saving data reading time.
  • FIG. 16 is a pseudo-code logic after adding data pipeline Pass provided by the embodiment of the present application.
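  • A minimal double-buffering sketch of this pipeline is shown below (the buffer size, names, and the assumption blockDim.x == 256 are illustrative, not the pseudo-code of FIG. 16): while the current tile is being accumulated from one buffer, the next tile is prefetched into the other buffer, so reading and computing overlap.

    // Illustrative double-buffering sketch of the data pipeline prefetch Pass.
    __global__ void pipelinedAccumulate(const float* __restrict__ in, float* out, int tiles) {
        __shared__ float buf[2][256];
        int tid = threadIdx.x;                       // assumes blockDim.x == 256

        buf[0][tid] = in[tid];                       // prefetch the first tile
        __syncthreads();

        float acc = 0.f;
        for (int k = 0; k < tiles; ++k) {
            int cur = k & 1, nxt = 1 - cur;
            if (k + 1 < tiles) {
                buf[nxt][tid] = in[(k + 1) * 256 + tid];   // prefetch the next tile
            }
            acc += buf[cur][tid];                          // compute on the current tile
            __syncthreads();   // the prefetched tile is complete before the next iteration uses it
        }
        out[tid] = acc;
    }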
  • step 1004 has been introduced above, and the specific execution process of step 1005 will be described below.
  • Step 1005 analyze and execute the optimized intermediate representation of the back-end Pass, and link the library based on the fusion mode to generate the object code.
  • the terminal parses the intermediate representation, and at the same time calls the above-mentioned PTX inline library and Fragment-level Elem-Wise matrix operation library according to the recorded fusion mode to launch the corresponding interface and finally generate the target code.
  • the process of parsing the intermediate representation and generating the object code includes analyzing and processing different nodes in the intermediate representation. Specifically, the terminal parses the intermediate representation, determines the specific nodes in the intermediate representation, and converts the specific nodes in the intermediate representation into corresponding code statements, so as to generate the target code corresponding to the intermediate representation.
  • nodes that may exist in the intermediate representation will be respectively introduced below: tvm_load_matrix_sync node, tvm_fill_fragment node, tvm_mma_sync node and tvm_store_matrix_sync node.
  • the tvm_load_matrix_sync node is used to instruct data to be moved into a Fragment and has 8 parameters. Exemplarily, the 8 parameters corresponding to the tvm_load_matrix_sync node are shown in Table 4.
  • the tvm_fill_fragment node is used to implement the initial assignment of the multiply-accumulate matrix and has 6 parameters. Exemplarily, the six parameters corresponding to the tvm_fill_fragment node are shown in Table 5.
  • the tvm_mma_sync node is a multiply-accumulate calculation statement with 8 parameters, and every two parameters are a group. Exemplarily, the 8 parameters corresponding to the tvm_mma_sync node are shown in Table 6.
  • the tvm_store_matrix_sync node is used to instruct data to be moved out of the Fragment and has 8 parameters. Exemplarily, the 8 parameters corresponding to the tvm_store_matrix_sync node are shown in Table 7.
  • the above-mentioned PTX inline library and Elem-Wise operation library are stored in wmma.hpp, and a reference to the header file (#include "akg_mma_lib/wmma.hpp") is added to the target code to call the API library, that is, to call the interfaces in the PTX inline library and the Elem-Wise operation library.
  • FIG. 17 is a schematic diagram of a PTX inline library provided by an embodiment of the present application.
  • the PTX inline library includes a matrix multiplication operation interface, a data initialization interface, a data loading interface and a data storage interface, wherein the data loading interface includes interfaces for loading input matrices and multiplying and accumulating matrices.
  • the implementation of the data loading interface in the object code may be: akg::wmma::load_matrix_sync.
  • the terminal can acquire the ID of the instance, that is, ThreadId, based on the function get_lane_id() of the inline PTX instruction in the target code. After determining the ID of the instance, in the subsequent process of calling the interface of the PTX inline library, specify how each instance processes data, so as to achieve fine-grained data and calculation control.
  • mapping relationship between instances and data can be established based on the same method, so as to realize data loading and storage.
  • the way to establish the mapping relationship between instances and data in the data loading interface and data storage interface can refer to the description of the embodiment corresponding to Figure 6 above, and will not be repeated here.
  • the data initialization interface is used to set the initial value of all elements in Fragment.
  • the data initialization interface can convert the constant into a data type corresponding to the Fragment, and assign a value to each element of the Fragment (that is, traverse through a for loop).
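  • A minimal sketch of such an initialization interface is given below; the Fragment structure and the function signature are assumptions made for illustration and are not the actual akg_mma_lib/wmma.hpp implementation:

    #include <cuda_fp16.h>

    // Hypothetical Fragment structure: the per-instance slice of the Fragment data.
    template <typename T, int N>
    struct Fragment {
        T x[N];
    };

    // Hypothetical initialization interface: the constant is converted to the Fragment's
    // element type and assigned to every element with a for loop, as described above.
    template <typename T, int N>
    __device__ void fill_fragment(Fragment<T, N>& frag, float value) {
        T v = static_cast<T>(value);        // convert the constant to the Fragment data type
    #pragma unroll
        for (int i = 0; i < N; ++i) {
            frag.x[i] = v;                  // traverse and assign every element
        }
    }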
  • the first column is the introduction of the data volume of the matrix multiplication operation of the specific example
  • the second column is the time-consuming execution of the matrix multiplication operation based on the existing WMMA interface
  • the third column is the time consumed when the matrix multiplication operation is performed based on the PTX inline library of this embodiment.
  • the fourth column is the performance gap between the second and third columns. It can be seen from the analysis that this embodiment has different degrees of improvement for different matrix multiplication operations, and the improvement ratio reaches nearly 50%.
  • this embodiment also provides a Fragment-level operator fusion solution based on the TensorCore computing level.
  • in addition to the non-fusion scenario of the matrix multiplication operation, this embodiment provides the Fragment-level Elem-Wise matrix operation library, which is used together with the operation information recorded in step 1002.
  • the Elem-Wise matrix operation library includes multiple operation interfaces, such as addition operation interface, subtraction operation interface, multiplication operation interface and division operation interface.
  • in the interface design of the Elem-Wise matrix operation library, for the input matrix participating in the fusion operation, the same data loading method as for the input matrix in the matrix multiplication operation can be used; that is, the above-mentioned data loading interface akg::wmma::load_matrix_sync can be called.
  • This method can store the fusion matrix as a Fragment, and also ensures that the data storage method of the fusion matrix Fragment is the same as the Fragment data storage method of the matrix multiplication operation part.
  • the interface in the ElemWise matrix operation library adopts the method of element-by-element calculation in the Fragment data structure.
  • the input matrix A, input matrix B, and output matrix C received by the addition operation interface are all Fragment structures.
  • the result of the addition of A[i] and B[i] is stored in C[i], and the final result is FragmentC.
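  • A minimal sketch of the element-by-element addition interface is given below (the Fragment structure and the signature are assumptions for illustration, not the library's actual definitions); each instance adds its own elements of FragmentA and FragmentB and stores the result in FragmentC:

    // Hypothetical Fragment structure: the per-instance slice of the Fragment data.
    template <typename T, int N>
    struct Fragment { T x[N]; };

    // Hypothetical element-wise addition interface in the spirit of the Elem-Wise matrix
    // operation library: C[i] = A[i] + B[i], computed element by element over the Fragment
    // (T can be, e.g., float, or half on architectures that support half arithmetic).
    template <typename T, int N>
    __device__ void fragment_add(Fragment<T, N>& c, const Fragment<T, N>& a,
                                 const Fragment<T, N>& b) {
    #pragma unroll
        for (int i = 0; i < N; ++i) {
            c.x[i] = a.x[i] + b.x[i];
        }
    }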
  • the terminal matches the matrix multiplication operation mode after parsing the operator description, and determines that there is a fusion operation after determining that the matrix multiplication operation is not the only operation, and records the fusion operation mode (ie fusion mode).
  • the fusion mode analysis record mainly includes the record annotation of the location of the fusion statement, the record of the calculation logic of the fusion statement, and the record of matrix-related information (data size, type, and arrangement mode) participating in the fusion statement.
  • the output matrix compute of the matrix multiplication operation participates in the fusion operation, and the type of the fusion operator is an addition operation.
  • the recorded fusion mode is, in other words, the dependency between the ElemWise calculation and the matrix multiplication operation.
  • the first column is the introduction of the data volume of the matrix multiplication operation of the specific example
  • the second column is the time consumed by the single matrix multiplication operation
  • the third column is the time consumed when the fusion operation is performed based on the Elem-Wise matrix operation library provided by this embodiment.
  • the fourth column is the performance gap between the second and third columns. It can be seen from the analysis that, for different matrix multiplication operations, the performance gap introduced by the fusion in this embodiment is less than 5%.
  • FIG. 18 is a schematic structural diagram of a compiling device provided by an embodiment of the present application.
  • a compiling device provided by the embodiment of the present application includes: an acquisition unit 1801 and a processing unit 1802; the acquisition unit 1801 is used to acquire the operator description of the neural network model, where the operator description includes a description of the matrix multiplication operation; the processing unit 1802 is configured to parse the operator description and generate the target code;
  • the target code calls a first interface
  • the first interface is used to indicate a plurality of first mapping relationships
  • the first mapping relationship is a mapping relationship between an instance and first data
  • the instance is used for Processing the first data corresponding to the instance
  • the first data is the data involved in the matrix multiplication operation, where the multiple instances executed in parallel in the same stage respectively have the first mapping relationship with the first data located in different memory banks.
  • the processing unit 1802 is further configured to: parse the operator description to obtain an intermediate representation; replace the first statement in the intermediate representation with a first interface statement to obtain the target code, where the first statement is used to instruct execution of the matrix multiplication operation, the first interface statement is used to call the first interface, and the first interface is used to execute the matrix multiplication operation.
  • the first interface includes parallel thread execution (PTX) instruction code.
  • the processing unit 1802 is further configured to: parse the operator description to obtain an intermediate representation; replace the second statement in the intermediate representation with a second interface statement to obtain the target code, where the second statement is used to instruct moving the first data, the second interface statement is used to call the first interface, and the first interface is used to move the first data.
  • the processing unit 1802 is further configured to: parse the operator description to obtain an intermediate representation; replace the third statement in the intermediate representation with a third interface statement to obtain the target code, where the third statement is used to instruct execution of the fusion operation, the input of the fusion operation includes the output of the matrix multiplication operation, the third interface statement is used to call the second interface, and the second interface is used to execute the fusion operation.
  • the fusion operation includes at least one of the following operations: addition, subtraction, multiplication, division, floor division (division with the result rounded down), modulo, and floor modulo (modulo with the result rounded down).
  • the first interface is also used to indicate obtaining the logical storage structure and the data type of the first data, and determining the size of the data loading pointer according to the logical storage structure and the data type, where the size of the data loading pointer is used to indicate the amount of data loaded by the instance at a time.
  • the processing unit 1802 is further configured to: generate parameters for matrix partitioning based on the operator description; partition the target matrix according to the parameters for matrix partitioning to obtain the division result of the target matrix, where the target matrix is a matrix participating in the matrix multiplication operation; and add a data movement statement to the target code according to the division result of the target matrix, where the data movement statement is used to instruct moving the data of the target matrix in memory.
  • the division result of the target matrix includes a first matrix, and the first matrix includes a second matrix;
  • the processing unit 1802 is further configured to: add the first data movement statement after the statement indicating division of the first matrix in the target code, and add the second data movement statement after the statement indicating division of the second matrix in the target code;
  • the first data movement statement is used to indicate that the data of the first matrix is moved from the global memory to the shared memory
  • the second data movement statement is used to indicate that the data of the second matrix is moved from the shared memory to local memory
  • the target code further includes a second mapping relationship
  • the second mapping relationship is a mapping relationship between an instance and a data movement statement
  • the second mapping relationship is used to indicate the instance that executes the data movement statement, and the second mapping relationship is established based on the number of instances and the data structure of the divided matrix.
  • the target code further includes a third mapping relationship between a warp and an axis in the target matrix; the third mapping relationship is used to indicate the warp that executes the operation of the axis in the target matrix, the number of warps is determined based on the total number of instances participating in the matrix multiplication operation, each warp includes the same number of instances, and the target matrix is a matrix participating in the matrix multiplication operation.
  • FIG. 19 is a schematic structural diagram of the execution device provided by the embodiment of the present application. The execution device 1900 may be, for example, a smart wearable device or a server, which is not limited here.
  • the data processing apparatus described in the embodiment corresponding to FIG. 19 may be deployed on the execution device 1900 to realize the data processing function in the embodiment corresponding to FIG. 19 .
  • the execution device 1900 includes: a receiver 1901, a transmitter 1902, a processor 1903, and a memory 1904 (the number of processors 1903 in the execution device 1900 may be one or more, and one processor is taken as an example in FIG. 19), where the processor 1903 may include an application processor 19031 and a communication processor 19032.
  • the receiver 1901 , the transmitter 1902 , the processor 1903 and the memory 1904 may be connected through a bus or in other ways.
  • the memory 1904 may include read-only memory and random-access memory, and provides instructions and data to the processor 1903 .
  • a part of the memory 1904 may also include a non-volatile random access memory (non-volatile random access memory, NVRAM).
  • the memory 1904 stores operating instructions executable by the processor, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
  • the processor 1903 controls the operations of the execution device.
  • various components of the execution device are coupled together through a bus system, where the bus system may include not only a data bus, but also a power bus, a control bus, and a status signal bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 1903 or implemented by the processor 1903 .
  • the processor 1903 may be an integrated circuit chip, which has a signal processing capability. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 1903 or instructions in the form of software.
  • the above-mentioned processor 1903 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 1903 may implement or execute various methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 1904, and the processor 1903 reads the information in the memory 1904, and completes the steps of the above method in combination with its hardware.
  • the receiver 1901 can be used to receive input digital or character information, and generate signal input related to performing device related settings and function control.
  • the transmitter 1902 can be used to output digital or character information through the first interface; the transmitter 1902 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1902 can also include a display device such as a display screen .
  • the embodiment of the present application also provides a computer program product, which, when running on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or enables the computer to perform the steps performed by the aforementioned training device.
  • An embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a program for signal processing, and when it is run on a computer, the computer executes the steps performed by the aforementioned executing device , or, causing the computer to perform the steps performed by the aforementioned training device.
  • the execution device or terminal device provided in the embodiment of the present application may specifically be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or circuit etc.
  • the processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the execution device executes the compiling method described in the above-mentioned embodiments.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
  • FIG. 20 is a schematic structural diagram of a chip provided by the embodiment of the present application.
  • the chip may be represented as a neural network processing unit (NPU) 2000. The NPU 2000 is mounted on the main CPU (Host CPU) as a coprocessor, and the Host CPU assigns tasks to it.
  • the core part of the NPU is the operation circuit 2003, and the operation circuit 2003 is controlled by the controller 2004 to extract matrix data in the memory and perform multiplication operations.
  • the operation circuit 2003 includes multiple processing units (Process Engine, PE).
  • arithmetic circuit 2003 is a two-dimensional systolic array.
  • the arithmetic circuit 2003 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition.
  • the arithmetic circuit 2003 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 2002, and caches it in each PE in the operation circuit.
  • the operation circuit takes the data of matrix A from the input memory 2001 and performs matrix operation with matrix B, and the obtained partial or final results of the matrix are stored in the accumulator (accumulator) 2008 .
  • the unified memory 2006 is used to store input data and output data.
  • the weight data is moved to the weight memory 2002 through the storage unit access controller (Direct Memory Access Controller, DMAC) 2005.
  • the input data is also transferred to the unified memory 2006 through the DMAC.
  • the BIU is the Bus Interface Unit, that is, the bus interface unit 2020, which is used for the interaction between the AXI bus and the DMAC and the Instruction Fetch Buffer (IFB) 2009.
  • the bus interface unit 2020 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 2009 to obtain instructions from the external memory, and is also used for the storage unit access controller 2005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to move the input data in the external memory DDR to the unified memory 2006 , to move the weight data to the weight memory 2002 , or to move the input data to the input memory 2001 .
  • the vector computing unit 2007 includes a plurality of computing processing units, and if necessary, further processes the output of the computing circuit 2003, such as vector multiplication, vector addition, exponent operation, logarithmic operation, size comparison and so on. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization (batch normalization), pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 2007 can store the vector of the processed output to unified memory 2006 .
  • the vector calculation unit 2007 can apply a linear function or a nonlinear function to the output of the operation circuit 2003, for example, performing linear interpolation on the feature plane extracted by the convolutional layer, or, for example, accumulating a vector of values to generate an activation value.
  • the vector computation unit 2007 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 2003, for example for use in a subsequent layer in a neural network.
  • An instruction fetch buffer (instruction fetch buffer) 2009 connected to the controller 2004 is used to store instructions used by the controller 2004;
  • the unified memory 2006, the input memory 2001, the weight memory 2002 and the fetch memory 2009 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned above can be a general-purpose central processing unit, microprocessor, ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between the modules indicates that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.
  • the essence of the technical solution of this application, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a mobile hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, and includes several instructions to make a computer device (which may be a personal computer, a training device, a network device, etc.) execute the methods described in the various embodiments of the present application.
  • All or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof.
  • When implemented using software, they may be implemented in whole or in part in the form of a computer program product.
  • The computer program product includes one or more computer instructions.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • Wired means include, for example, coaxial cable, optical fiber, and digital subscriber line (DSL).
  • Wireless means include, for example, infrared, radio, and microwave.
  • The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or a data center that integrates one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (Solid State Disk, SSD)), etc.


Abstract

本申请公开了一种编译方法,应用于人工智能技术领域。该方法包括:获取神经网络模型的算子描述,算子描述包括对矩阵乘法运算的描述;解析算子描述,以生成目标代码;其中,目标代码调用有第一接口,第一接口用于指示多个第一映射关系,第一映射关系是一个实例与第一数据之间的映射关系,实例用于处理与实例对应的第一数据,第一数据为参与矩阵乘法运算的数据,其中,同一阶段内并行执行的多个实例分别与位于不同存储体bank内的第一数据具有第一映射关系。本方案中,通过指定同一阶段内并行执行的实例对应于不同bank内的数据,保证所有实例均不会同时访问一个bank内的数据,从而避免了共享内存中的bank冲突,有效地提高了执行运算的效率。

Description

一种编译方法及相关装置
本申请要求于2021年06月02日提交中国专利局、申请号为202110615376.3、发明名称为“一种编译方法及相关装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及一种编译方法及相关装置。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
随着人工智能应用成熟度的提升,人工智能相关的应用被辐射到众多领域。深度学习方法,是近年来人工智能领域发展的一个关键推动力,在多种任务中取得了令人瞩目的效果。目前,基于深度学习的网络模型的规模和复杂度呈指数级地增大,尤其是应用于自动驾驶、机器人和内容推荐等热门综合场景下的网络模型。在基于深度学习的网络模型中,大部分的计算都来源于矩阵乘法(General Matrix Multiplication,GEMM)运算,因此对矩阵乘法运算的优化是至关重要的。
目前,在编译矩阵乘法运算的过程中,通过解析矩阵乘法运算的算子描述,调用了线程束层级矩阵乘积和累加(Warp-level Matrix Multiply and Accumulate,WMMA)接口来生成执行代码,以通过WMMA接口来执行矩阵乘法运算。然而,目前基于调用WMMA接口的代码执行矩阵乘法运算的效率较低,导致网络模型的运算速度较慢。
发明内容
本申请提供了一种编译方法,通过在对算子描述进行编译的过程中,在生成的目标代码中调用包括特定指令代码的接口,以指定同一阶段内并行执行的实例对应于不同bank内的数据,即可保证所有实例均不会同时访问一个bank内的数据,从而避免了共享内存中所发生的bank冲突,有效地提高了执行运算的效率。
本申请第一方面提供一种编译方法,该方法可以应用于具有编译功能的终端上。该方法包括:终端获取算子描述,所述算子描述包括对矩阵乘法运算的描述,即所述算子描述中定义了代码运行期间需要执行的矩阵乘法运算,以及执行矩阵乘法运算所需的数据。终端解析所述算子描述,得到目标代码。可选的,终端可以是先解析算子描述,得到中间表示。其中,中间表示的作用是使得待编译的算子描述的结构在逻辑上更为简单明确,从而使得对最终的目标代码的优化比较容易实现。然后,终端再解析所述中间表示,以生成目标代码。其中,目标代码是终端对算子描述进行编译后所生成的代码。一般来说,算子描述是基于高级语言编写的,目标代码则是介于高级语言和机器语言之间的语言。目标代码 能够被进一步转换为可执行的二进制机器代码。
其中,所述目标代码调用第一接口,所述第一接口用于指示多个第一映射关系,所述第一映射关系是一个实例与第一数据之间的映射关系,所述实例用于处理与所述实例对应的第一数据,所述第一数据为参与所述矩阵乘法运算的数据,其中,同一阶段内并行执行的多个实例分别与位于不同存储体bank内的第一数据具有所述第一映射关系。
在实例向量化处理数据的情况下,共同执行矩阵乘法运算的多个实例会分成多个阶段来处理参与矩阵乘法运算的数据。对于在同一个阶段内执行的多个实例来说,该多个实例中的每个实例所对应的第一数据都是位于不同的bank内,即不存在有第一数据是同时与该多个实例中的任意两个或两个以上实例具有第一映射关系的。
其中,一个实例可以对应于多个bank内的第一数据。一个bank内的第一数据也可以是对应于不在同一阶段内并行执行的多个实例。
可选的,实例可以是线程或超线程。
一般地,算子的输入信息、输出信息和计算信息可以理解为对一个算子的描述,简称算子描述。当然,一些实现下,算子描述还可以包括其他的与算子有关的信息。其中,输入信息可以包括参与运算的矩阵的数量、矩阵的数据大小、矩阵的数据类型和矩阵的数据排布方式;输出信息可以包括输出的矩阵的数量、矩阵的数据大小、矩阵的数据类型和矩阵的数据排布方式;计算信息包括运算的类型,例如矩阵乘法运算。
在所述第一接口的指示下,实例只处理与其具有映射关系的第一数据。也就是说,在同一阶段内并行执行的多个实例分别用于处理位于不同bank内的第一数据。这样,在同一阶段内执行的所有实例并不会访问同一个bank内的第一数据,从而有效避免了共享内存中所发生的bank冲突,保证了执行运算的效率。
在一种可能的实现方式中,目标代码调用第一接口的方式可以为:目标代码中包括调用库文件的语句以及调用库文件中接口的语句(即上述的第一接口)。基于调用库文件的语句,可以实现库文件的调用。在调用库文件后,即可基于调用库文件中接口的语句,调用库文件中的接口。在另外一些可能的实现方式中,目标代码也可以是通过静态链接库、动态链接库或者是内联库实现等方式来调用第一接口。
在一种可能的实现方式中,所述终端解析所述算子描述,生成目标代码,包括:解析所述算子描述,以得到中间表示;将所述中间表示中的第一语句替换为第一接口语句,以得到所述目标代码,所述第一语句用于指示执行所述矩阵乘法运算,所述第一接口语句用于调用所述第一接口,所述第一接口用于执行所述矩阵乘法运算。其中,中间表示中指示执行矩阵乘法运算的第一语句可以是矩阵乘法运算的表达式。
简单来说,终端可以通过匹配中间表示中的矩阵乘法运算的表达式,确定中间表示中的矩阵乘法运算以及矩阵乘法运算的信息。然后,终端将矩阵乘法运算的表达式替换为调用第一接口的第一接口语句,从而实现生成调用第一接口的目标代码。其中,所述第一接口语句包括有矩阵乘法运算的信息,以使得在执行目标代码时,能够基于所述矩阵乘法运算的信息实现矩阵乘法运算。
本方案中,通过将中间表示中指示执行矩阵乘法运算的语句替换为第一接口语句,以 使得在执行运算的过程中调用第一接口来实现。在第一接口指示了实例执行矩阵乘法运算的基础上,由于第一接口还进一步指定了各个实例与数据之间的映射关系,能够保证同一阶段内执行的多个实例在执行矩阵乘法运算的过程中不会同时访问同一个bank内的数据,避免了bank冲突,保证了执行矩阵乘法运算的效率,从而提高执行运算的效率。
在一种可能的实现方式中,所述第一接口包括有并行实例执行(Parallel Thread eXecution,PTX)指令代码。
在一种可能的实现方式中,所述矩阵乘法运算的信息包括所述第一数据、所述第一数据的排布方式(例如row_major排布方式或col_major排布方式)、所述第一数据的数据类型(例如float16或float32等数据类型)和所述矩阵乘法运算的表达式(例如D=C+A*B)。
在一种可能的实现方式中,所述终端解析所述算子描述,生成目标代码,包括:终端解析所述算子描述,以得到中间表示;终端将所述中间表示中的第二语句替换为第二接口语句,以得到所述目标代码。所述第二语句用于指示将所述第一数据搬移至局部内存或者将第一数据搬移出局部内存。其中,参与矩阵乘法运算的第一数据通常需要搬移至局部内存,以实现后续的矩阵乘法运算。所述第二接口语句用于调用所述第一接口,所述第一接口还用于搬移所述第一数据。基于第一接口,能够实现将所述第一数据从全局内存或者共享内存中搬移到局部内存中,以便于后续基于局部内存中的第一数据执行矩阵乘法运算;基于第一接口,还能够实现在矩阵乘法运算执行完毕之后将局部内存中的第一数据搬移至全局内存或者共享内存中,以腾出局部内存中的空间。
本方案中,通过将中间表示中指示搬移数据的语句替换为第二接口语句,以使得在搬移数据的过程中调用第一接口来实现。在第一接口指示了实例执行数据搬移的基础上,第一接口还进一步指定了各个实例与数据之间的映射关系,能够保证同一阶段内执行的多个实例在数据搬移过程中不会同时访问同一个bank内的数据,避免了bank冲突,保证了数据搬移的效率,从而提高执行运算的效率。
在一种可能的实现方式中,终端解析所述算子描述,生成目标代码,包括:终端解析所述算子描述,得到中间表示;终端将所述中间表示中的第三语句替换为第三接口语句,以得到所述目标代码,所述第三语句用于指示执行融合运算,所述融合运算的输入包括所述矩阵乘法运算的输出,所述第三接口语句用于调用第二接口,所述第二接口用于执行融合运算。其中,融合运算是指将多个算子(例如矩阵乘法运算的算子)合并在一起的运算,即能够实现矩阵乘法运算的输出与其他算子的合并运算。具体地,所述融合运算可以为基于矩阵乘法运算的输出进行逐元素运算的一种运算。示例性地,所述融合运算例如可以包括以下运算中的至少一种:加法运算、减法运算、乘法运算、除法运算、结果向下取整的除法运算、取模运算和结果向下取整的取模运算。在融合运算为加法运算时,对于两个参与融合运算的矩阵,融合运算的过程为:将这两个参与融合运算的矩阵中位于相同位置上的元素逐个相加,最终得到融合运算的结果。
本实施例中,通过在算子描述中包括矩阵乘法运算和融合运算,可以实现多个算子的融合,提高终端的运算效率以及提升终端的资源利用率。
在一种可能的实现方式中,在终端所生成的目标代码中,目标代码所调用的第一接口 还用于指示得到所述第一数据的逻辑存储结构以及所述第一数据的数据类型,并根据所述逻辑存储结构和所述数据类型确定数据加载指针的大小。
其中,所述第一数据是指参与矩阵乘法运算的数据。所述第一数据实际上可以为一个矩阵数据,矩阵数据中包括多个元素。所述第一数据的逻辑存储结构是指所述第一数据在内存中存储的逻辑结构形式。所述第一数据的数据类型用于指示作为矩阵数据的第一数据中的元素的数据量。所述数据加载指针的大小用于指示所述实例单次加载数据的数据量。例如,数据加载指针的大小为128比特时,该数据加载指针的大小则指示实例单次加载数据的数据量为128比特。
本方案中,通过基于第一数据的逻辑存储结构以及第一数据的数据类型来确定实例单次加载数据的数据量,能够实现数据最大限度的向量化加载,保证数据的合并访问以及扩大数据存取的吞吐量,提高执行运算的效率。
在一种可能的实现方式中,所述方法还包括:
终端基于所述算子描述,生成用于矩阵分块的参数,所述用于矩阵分块的参数用于指示矩阵划分的方式。然后,终端根据用于矩阵分块的参数,对目标矩阵执行分块操作,得到所述目标矩阵的划分结果,所述目标矩阵为参与所述矩阵乘法运算的矩阵。最后,终端根据所述目标矩阵的划分结果,在所述中间表示中添加数据搬移语句,所述数据搬移语句用于指示在内存中搬移所述目标矩阵的数据。所述目标矩阵的数据为参与所述矩阵乘法运算的目标矩阵内的数据。例如,所述数据搬移语句可以是用于指示将所述目标矩阵的数据从全局内存搬移至共享内存,或者是将所述目标矩阵的数据从共享内存搬移至局部内存。
由于参与矩阵乘法运算的数据需要存储于局部内存中,因此本实施例中设计了多层级的内存提升机制,即将数据由全局内存根据矩阵分块的大小提升至数据读写速度较高的共享内存,再二次提升至数据读写速度更高的局部内存。终端基于矩阵的划分结果,将外层的矩阵对应的数据从全局内存提前搬移至共享内存,然后将外层矩阵中的内层矩阵所对应的数据再从共享内存搬移至局部内存,提高数据加载的效率。
此外,通过对矩阵执行分块操作,可以使得终端在执行矩阵乘法运算的过程中,将矩阵乘法运算分成多个部分来执行,实现多实例并行执行矩阵乘法运算,从而提高运算效率。
在一种可能的实现方式中,所述目标矩阵的划分结果可以是包括第一矩阵,所述第一矩阵包括所述第二矩阵。所述终端在目标代码中添加数据搬移语句具体可以包括:终端在指示划分第一矩阵的语句后添加第一数据搬移语句,以及在指示划分第二矩阵的语句后添加第二数据搬移语句。其中,所述第一数据搬移语句用于指示将所述第一矩阵的数据从全局内存搬移至共享内存,所述第二数据搬移语句用于指示将所述第二矩阵的数据从共享内存搬移至局部内存。一般来说,全局内存的容量大于共享内存的容量,共享内存的容量大于局部内存的容量;局部内存的数据读写速度大于共享内存的读写速度,共享内存的读写速度大于全局内存。将数据从全局内存搬移至共享内存,以及将数据从共享内存搬移至局部内存均能够有效提升数据的访问速度。
在一种可能的实现方式中,所述目标代码中还包括第二映射关系,所述第二映射关系是实例与数据搬移语句之间的映射关系,所述第二映射关系用于指示执行数据搬移语句的实例,所述第二映射关系是基于实例的数据以及划分后的矩阵的数据结构建立的。
本方案中,终端在生成目标代码的过程中指定实例与数据搬移语句之间的映射关系,从而保证实例与数据搬移语句之间的合理匹配,保证数据的访问的局部性,提高了在目标代码执行过程中实例搬移数据的效率。
在一种可能的实现方式中,所述方法还包括:终端根据参与所述矩阵乘法运算的实例的总数,确定线程束warp的数量,其中,每个warp中包括相同数量的实例。一般地,在GPU中,每32个实例组成一个warp,warp是调度和运行的基本单元。终端基于所述warp的数量以及目标矩阵的数据结构,在所述中间表示中建立warp与所述目标矩阵中的轴之间的第三映射关系,所述第三映射关系用于指示执行矩阵中的轴的运算的warp。
由于终端在目标代码中调用了有PTX指令代码的第一接口,而第一接口的统一操作层级为warp层级,因此本方案中通过建立warp层级的计算语句映射,能够保证多实例与计算语句之间的映射关系更为合理,以进一步优化运算的效率。
本申请第二方面提供一种编译装置,包括:获取单元和处理单元;所述获取单元,用于获取神经网络模型的算子描述,所述算子描述包括对矩阵乘法运算的描述;所述处理单元,用于解析所述算子描述,生成目标代码;
其中,所述目标代码调用第一接口,所述第一接口用于指示多个第一映射关系,所述第一映射关系是一个实例与第一数据之间的映射关系,所述实例用于处理与所述实例对应的第一数据,所述第一数据为参与所述矩阵乘法运算的数据,其中,同一阶段内并行执行的多个实例分别与位于不同存储体bank内的第一数据具有所述第一映射关系。
在一种可能的实现方式中,所述处理单元还用于:解析所述算子描述,得到中间表示;将所述中间表示中的第一语句替换为第一接口语句,以得到所述目标代码,所述第一语句用于指示执行所述矩阵乘法运算,所述第一接口语句用于调用所述第一接口,所述第一接口用于执行所述矩阵乘法运算。
在一种可能的实现方式中,所述第一接口包括有PTX指令代码。
在一种可能的实现方式中,所述处理单元还用于:解析所述算子描述,得到中间表示;将所述中间表示中的第二语句替换为第二接口语句,以得到所述目标代码,所述第二语句用于指示搬移所述第一数据,所述第二接口语句用于调用所述第一接口,所述第一接口用于搬移所述第一数据。
在一种可能的实现方式中,所述处理单元还用于:解析所述算子描述,得到中间表示;将所述中间表示中的第三语句替换为第三接口语句,以得到所述目标代码,所述第三语句用于指示执行融合运算,所述融合运算的输入包括所述矩阵乘法运算的输出,所述第三接口语句用于调用第二接口,所述第二接口用于执行融合运算。
在一种可能的实现方式中,所述融合运算包括以下运算中的至少一种:加法运算、减法运算、乘法运算、除法运算、结果向下取整的除法运算、取模运算和结果向下取整的取 模运算。
在一种可能的实现方式中,所述第一接口还用于指示获取所述第一数据的逻辑存储结构以及所述第一数据的数据类型,并根据所述逻辑存储结构和所述数据类型确定数据加载指针的大小,所述数据加载指针的大小用于指示所述实例单次加载数据的数据量。
在一种可能的实现方式中,所述处理单元还用于:基于所述算子描述,生成用于矩阵分块的参数;根据所述用于矩阵分块的参数,对目标矩阵执行分块操作,以得到所述目标矩阵的划分结果,所述目标矩阵为参与所述矩阵乘法运算的矩阵;根据所述目标矩阵的划分结果,在所述目标代码中添加数据搬移语句,所述数据搬移语句用于指示在内存中搬移所述目标矩阵的数据。
在一种可能的实现方式中,所述目标矩阵的划分结果包括第一矩阵,所述第一矩阵包括第二矩阵;所述处理单元还用于:在所述目标代码中指示划分第一矩阵的语句后添加第一数据搬移语句,以及在所述目标代码中指示划分第二矩阵的语句后添加第二数据搬移语句;
其中,所述第一数据搬移语句用于指示将所述第一矩阵的数据从全局内存搬移至共享内存,所述第二数据搬移语句用于指示将所述第二矩阵的数据从共享内存搬移至局部内存。
在一种可能的实现方式中,所述目标代码中还包括第二映射关系,所述第二映射关系是实例与数据搬移语句之间的映射关系,所述第二映射关系用于指示执行数据搬移语句的实例,所述第二映射关系是基于实例的数据以及划分后的矩阵的数据结构建立的。
在一种可能的实现方式中,所述目标代码中还包括线程束warp与所述目标矩阵中的轴之间的第三映射关系;其中,所述第三映射关系用于指示执行矩阵中的轴的运算的warp,所述warp的数量是基于参与所述矩阵乘法运算的实例的总数确定的,每个warp中包括相同数量的实例,所述目标矩阵为参与所述矩阵乘法运算的矩阵。
本申请第三方面提供了一种编译装置,可以包括处理器,处理器和存储器耦合,存储器存储有程序指令,当存储器存储的程序指令被处理器执行时实现上述第一方面所述的方法。对于处理器执行第一方面的各个可能实现方式中的步骤,具体均可以参阅第一方面,此处不再赘述。
本申请第四方面提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面所述的方法。
本申请第五方面提供了一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行上述第一方面所述的方法。
本申请第六方面提供了一种计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面所述的方法。
本申请第七方面提供了一种芯片系统,该芯片系统包括处理器,用于支持服务器或门限值获取装置实现上述第一方面中所涉及的功能,例如,发送或处理上述方法中所涉及的数据和/或信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存服务器或通信设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括 芯片和其他分立器件。
附图说明
图1为本申请实施例提供的两种数据排布方式的示意图;
图2为本申请实施例提供的TVM使能TensorCore的矩阵乘法运算的流程示意图;
图3为本申请实施例提供的一种MindSpore图算融合特性的架构图;
图4为本申请实施例提供的一种编译方法400的流程示意图;
图5为本申请实施例提供的一种矩阵乘法运算的示意图;
图6为本申请实施例提供的一种多实例处理数据的示意图;
图7为本申请实施例提供的一种非融合算子的计算流程示意图;
图8为本申请实施例提供的一种融合算子的计算流程示意图;
图9为本申请实施例提供的一种第一数据的逻辑存储结构的示意图;
图10为本申请实施例提供的一种编译方法1000的流程示意图;
图11为本申请实施例提供的一种基于多面体模型对中间表示进行调度优化的流程示意图;
图12为现有流程的数据访问方式;
图13为本申请实施例提供的一种基于Bank冲突避免Pass优化后的数据访问排布示意图;
图14为相关技术中的一种计算顺序示意图;
图15为本申请实施例提供的一种经过数据流水Pass优化后的计算顺序示意图;
图16为本申请实施例提供的一种添加数据流水Pass后的伪代码逻辑;
图17为本申请实施例提供的一种PTX内联库的示意图;
图18为本申请实施例提供的一种编译装置的结构示意图;
图19为本申请实施例提供的执行设备的一种结构示意图;
图20为本申请实施例提供的芯片的一种结构示意图。
具体实施方式
下面结合本申请实施例中的附图对本申请实施例进行描述。本申请的实施方式部分使用的术语仅用于对本申请的具体实施例进行解释,而非旨在限定本申请。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
为便于理解,以下将对本申请实施例所涉及的技术术语进行解释。
编译:是指利用编译程序从源语言编写的源程序产生目标代码的过程。目标代码是介于高级语言和机器语言之间的语言。目标代码能够被进一步转换为可执行的二进制机器代码。简单来说,编译是将由高级语言编写的源程序转换为由更接近机器语言的目标代码。
中间代码:是源程序的一种内部表示,又可以称为中间表示(Intermediate Representation,IR)。中间表示的作用是可使编译程序的结构在逻辑上更为简单明确,特别是使目标代码的优化比较容易实现。中间表示的复杂性介于源程序语言和机器语言之间。
代码优化:是指对程序进行多种等价变换,使得从变换后的程序出发,能生成更有效的目标代码。所谓等价,是指不改变程序的运行结果。所谓有效,主要指目标代码运行时间较短,以及占用的存储空间较小。这种变换称为优化。
优化Pass:优化Pass是编译框架中的重要部分。优化Pass对中间表示进行分析和修改。在代码优化的过程中,由多个优化Pass对中间表示进行分析和修改,每个Pass完成特定的优化工作。
自动算子生成(Auto Kernel Generator,AKG):一种人工智能编译框架。
线程束(Warp):在图形处理器(graphics processing unit,GPU)中,连续32个线程组成一个Warp,是GPU调度和运行的基本单元。在Warp之上,还包括网格(Grid)和实例块(Block)。一般来说,一个Grid包括多个Block,一个Block包括多个Warp,一个Warp包括32个线程。
共享内存:是指允许两个不相关的实例访问同一个逻辑内存。共享内存是两个正在运行的实例之间共享和传递数据的一种非常有效的方式。不同实例之间共享的内存通常为同一段物理内存。实例可以将同一段物理内存连接到自己的地址空间中,所有的实例都可以访问共享内存中的地址。如果某个实例向共享内存写入数据,所做的改动将立即影响到可以访问同一段共享内存的任何其他实例。
存储体(Bank):共享内存被分成32个大小相等的Bank。每个Bank的带宽可以为32比特(bit)或64bit。以每个Bank的带宽为32bit为例,连续的32bit数据被存储于一个Bank中,下一个连续的32bit数据被存储于下一个bank中。
Bank冲突:当不同实例同时访问同一个Bank,则会产生Bank冲突。如果在使用共享内存时发生了Bank冲突,实例的访问请求将会变为串行的,即排队进行。Bank冲突会大大降低内存带宽,导致运行性能大幅劣化。
算子融合:通过分析和优化现有图逻辑,将原有逐个计算逻辑进行重组,形成融合子图逻辑。算子融合能够在大幅减少算子执行间隙的开销的同时提升设备资源利用率。
矩阵乘法运算:对于矩阵A和矩阵B,矩阵A和矩阵B执行矩阵乘法运算后得到矩阵C。矩阵C的第m行第n列的元素等于矩阵A的第m行的元素与矩阵B的第n列对应元素乘积之和。设A为m×p的矩阵,B为p×n的矩阵,那么称m×n的矩阵C为矩阵A与B的乘积,记作C=A*B。其中,矩阵A可以表示为[m,p],矩阵B可以表示为[p,n],矩阵C可以表示为[m,n]。矩阵C中的第i行第j列元素可以表示为:
C[i,j]=A[i,1]*B[1,j]+A[i,2]*B[2,j]+…+A[i,p]*B[p,j],其中i=1,2,…,m,j=1,2,…,n。
具体地,假设矩阵A、矩阵B和矩阵C的一种可能的示例如下所示:
(原文此处以附图形式给出矩阵A、矩阵B和矩阵C的一组具体数值示例,此处从略。)
一般来说,矩阵乘法运算只有在第一个矩阵的列数(column)和第二个矩阵的行数(row)相同时才有意义。一个m×n的矩阵就是m×n个数排成m行n列的一个数阵。
在矩阵A表示为[m,p],矩阵B表示为[p,n],矩阵C表示为[m,n]的情况下,在执行矩阵乘法运算后,矩阵A中的p轴以及矩阵B中的p轴均被消去,矩阵C中只保留了m轴和n轴。矩阵A和矩阵B中被消去的轴可以称为规约轴。
批次(Batch):在矩阵乘法运算中,有一或多根Batch轴,不参与乘累加运算。例如,对于矩阵A,矩阵B,矩阵C以及矩阵D,D[Batch1,Batch2,Batch……,M,N]=C[Batch1,Batch2,Batch……,M,N]+A[Batch1,Batch2,Batch……,M,K]*B[Batch1,Batch2,Batch……,K,N]。
数据排布方式:对于矩阵乘法运算来说,输入矩阵参与乘累加运算的轴具有四种排布方式。以输入矩阵为矩阵A和矩阵B为例,四种数据排布方式分别为:
1,A[M,K]且B[N,K],此时矩阵A为行优先(row_major)排布,矩阵B为列优先(col_major)排布;
2,A[M,K]且B[K,N],此时矩阵A为row_major排布,矩阵B为row_major排布;
3,A[K,M]且B[N,K],此时矩阵A为col_major排布,矩阵B为col_major排布;
4,A[K,M]且B[K,N],此时矩阵A为col_major排布,矩阵B为row_major排布。
在计算机中,row_major和col_major是存储器存储多维数组的方法。两种顺序的区别在于数组中的哪些元素在内存中是连续的。对于row_major排布方式,数组中行的连续元素在内存中彼此相邻;对于col_major排布方式,数组中列的连续元素在内存中彼此相邻。示例性地,可以参阅图1,图1为本申请实施例提供的两种数据排布方式的示意图。如图1所示,对于同一个矩阵,在矩阵为row_major排布方式的情况下,该矩阵中的元素是从左到右逐行存储的,相同行中的元素在内存中是连续的;在矩阵为col_major排布方式的情况下,该矩阵中的元素是从上到下逐列存储的,相同列中的元素在内存中是连续的。
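为便于理解两种排布方式在地址计算上的差别,下面给出一段示意性代码(仅为便于说明的示例,函数名为假设,并非本申请限定的实现):对于逻辑大小为[rows,cols]的矩阵,row_major排布下元素(i,j)的线性地址为i*cols+j,col_major排布下则为j*rows+i。

```cpp
// 示意代码:row_major与col_major两种排布方式下元素(i, j)的线性地址计算(假设性示例)
__host__ __device__ inline int index_row_major(int i, int j, int rows, int cols) {
  (void)rows;                 // row_major排布:同一行的元素在内存中连续
  return i * cols + j;
}

__host__ __device__ inline int index_col_major(int i, int j, int rows, int cols) {
  (void)cols;                 // col_major排布:同一列的元素在内存中连续
  return j * rows + i;
}
```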
Float:一种数据类型。Float数据类型用于存储单精度浮点数或双精度浮点数。一般地,数据类型float16指的是16比特的数据,数据类型float32指的是32比特的数据。
取模运算:是指求两个整数相除的余数。取模运算可以用符号“MOD”来表示。例如,83 MOD 10=3。
整除运算:是指在整数运算中求一个整数除以另一个整数时取整数商的运算,且不考虑运算的余数。整除运算也称为DIV运算,整除运算可以用符号“DIV”来表示。例如,83 DIV 10=8。
多面体模型:一种调度编译优化技术。多面体模型的实质是将程序表达中的仿射循环嵌套进行抽象表示,通过这些多面体上的几何操作来分析和优化程序的对应调度的编译优化,以扩大程序的自动并行性。
Pluto算法:应用于求解高效的多面体调度,本质是一种以通信数据量最优化为目的的代价模型,并基于该代价模型依次求解调度变换所需的划分平面。给定循环嵌套序列的多面体表示,该算法可用于确定有效的调度,以便满足读写依赖关系。
图算融合:一种网络性能优化技术。图算融合可以通过自动分析,优化现有网络计算图逻辑,并结合目标硬件能力,对计算图进行计算化简、算子拆分和融合、算子特例化编译等优化,以提升设备计算资源利用率,实现对网络性能的整体优化。相比传统优化技术,图算融合具有多算子跨边界联合优化、与算子编译跨层协同、基于多面体自动调度优化的算子即时编译等独特优势。
随着人工智能应用成熟度的提升,人工智能相关的应用被辐射到众多领域。深度学习方法,是近年来人工智能领域发展的一个关键推动力,在多种任务中取得了令人瞩目的效果。目前,基于深度学习的网络模型的规模和复杂度呈指数级地增大,尤其是应用于自动驾驶、机器人和内容推荐等热门综合场景下的网络模型。示例性地,部分主流网络模型的规模统计如表1所示。
表1网络模型的规模统计
网络模型 内存(MB) 参数(百万) 计算量(百万)
AlexNet 200+ 60 720
VGG16 500+ 138 15300
Inception-v3 90~100 23.2 5000
GPT-3 350000 175000 /
在基于深度学习的网络模型中,大部分的计算都来源于矩阵乘法(General Matrix Multiplication,GEMM)运算,因此对矩阵乘法运算的优化是至关重要的。
鉴于矩阵乘法运算在网络模型中的重要性,不同架构针对矩阵乘法运算设计的特殊单元也应运而生。以GPU为例,显卡厂商NVIDIA在Volta架构中引入了TensorCore矩阵乘法运算单元。作为GPU架构的矩阵乘法核心单元,TensorCore单元具有极其强大的吞吐能力。目前,针对TensorCore的技术方案主要包括以张量虚拟机(Tensor Virtual Machine,TVM)为代表的人工智能编译框架。
TVM作为人工智能编译框架,实现了使能TensorCore单元的矩阵乘法运算优化。可以参阅图2,图2为本申请实施例提供的TVM使能TensorCore的矩阵乘法运算的流程示意 图。如图2所示,TVM使能TensorCore的矩阵乘法运算的流程包括以下的步骤201-204。
步骤201,对矩阵乘法运算对应的算子描述进行解析编译,生成中间表示。
步骤202,将生成的中间表示作为TensorCore优化Pass的输入,由优化Pass对中间表示进行优化。优化Pass对中间表示进行矩阵乘法模式识别和TensorCore功能单元匹配。如模式匹配成功,则对该中间表示进行标注。
步骤203,根据对中间表示所进行的标注,在编译后端的代码生成模块,解析中间表达,并调用WMMA,以通过WMMA接口来执行矩阵乘法运算。其中,WMMA是统一计算设备架构(Compute Unified Device Architecture,CUDA)提供的编程接口(Application Programming Interface,API)。CUDA是一种由NVIDIA推出的通用并行计算架构,该架构使GPU能够解决复杂的计算问题。
步骤204,最终,基于中间表示生成可使能TensorCore的目标代码。
然而,经申请人研究发现,基于调用WMMA接口的目标代码执行矩阵乘法运算的过程中,在共享内存中会存在数据访问冲突,即bank冲突,降低了执行运算的效率。
有鉴于此,本申请实施例提供了一种编译方法,通过在对算子描述进行编译的过程中,在生成的目标代码中调用包括特定指令代码的接口,以指定同一阶段内并行执行的实例对应于不同bank内的数据,即可保证所有实例均不会同时访问一个bank内的数据,从而避免了共享内存中所发生的bank冲突。
本申请实施例所提供的编译方法可以应用于人工智能场景下的网络模型的训练和推理,例如MindSpore框架中图算融合核心特性下的网络模型。其中,MindSpore是一种全场景下的人工智能计算框架。
本申请实施例提出的编译方法具体在MindSpore框架下模型运算的应用流程如图3所示,图3为本申请实施例提供的一种MindSpore图算融合特性的架构图。
如图3所示,在MindSpore图层编译框架(3001)中,首先在MindSpore前端对网络模型(3011)进行公共优化(3012)后生成MindSpore前端表示(3013)。基于MindSpore前端表示,进行图层前端图优化(3014)和后端计算优化(3015)。优化后生成矩阵乘法运算相关的算子描述(3021),并传入算子层编译框架(3002)。在算子层编译框架(3002)中,对算子描述进行解析,得到中间表示(3022)。对中间表示进行优化(3023)后,解析中间表示并调用运算库(3024),生成目标代码(3025),以供图层编译框架(3001)调用运行。
示例性地,本申请实施例所提供的编译方法可以应用于终端上。本申请实施例所提供的终端例如可以是手机(mobile phone)、个人电脑(personal computer,PC)、笔记本电脑、服务器、平板电脑、智慧电视、移动互联网设备(mobile internet device,MID)、可穿戴设备,虚拟现实(virtual reality,VR)设备、增强现实(augmented reality,AR)设备、工业控制(industrial control)中的无线终端、无人驾驶(self driving)中的无线终端、远程手术(remote medical surgery)中的无线终端、智能电网(smart grid)中的无线终端、运输安全(transportation safety)中的无线终端、智慧城市(smart city)中的无线终端、智慧家 庭(smart home)中的无线终端等。
可以参阅图4,图4为本申请实施例提供的一种编译方法400的流程示意图。如图4所示,该编译方法400包括以下的步骤401-403。
步骤401,获取神经网络模型的算子描述,所述算子描述包括对矩阵乘法运算的描述。
在编译过程中,终端可以获取到需要进行编译的算子描述,该算子描述可以为神经网络模型的算子描述。其中,该算子描述可以是由领域专用语言(Domain-Special Language,DSL)编写的,用于定义代码运行期间需要执行的运算。具体地,该算子描述包括对矩阵乘法运算的描述,即算子描述中定义了代码运行期间需要执行的矩阵乘法运算,以及执行矩阵乘法运算所需的数据。
步骤402,解析所述算子描述,得到中间表示。
本实施例中,终端通过对算子描述进行解析,可以得到中间表示。其中,中间表示的作用是使得待编译的算子描述的结构在逻辑上更为简单明确,从而使得对最终的目标代码的优化比较容易实现。一般来说,中间表示的复杂性介于算子描述的编写语言和目标代码对应的编写语言之间。
步骤403,解析所述中间表示,以生成目标代码。
本实施例中,所述目标代码调用有第一接口,所述第一接口包括有指令代码,例如第一接口包括有PTX指令代码。其中,PTX指令代码为汇编级代码,PTX指令代码的格式为asm()。所述第一接口用于指示多个第一映射关系,所述第一映射关系是一个实例与第一数据之间的映射关系。所述实例用于处理与所述实例对应的第一数据,所述第一数据为参与所述矩阵乘法运算的数据。同一阶段内并行执行的多个实例分别与位于不同存储体(bank)内的第一数据具有所述第一映射关系。在所述指令代码的指示下,实例只处理与其具有映射关系的第一数据。也就是说,在同一阶段内并行执行的多个实例分别用于处理位于不同bank内的第一数据。这样,在同一阶段内执行的所有实例并不会访问同一个bank内的第一数据,从而有效避免了共享内存中所发生的bank冲突,保证了执行运算的效率。
本实施例中,实例可以是包括线程或超线程。在大部分的场景下,本申请实施例中所提及的实例可以为线程。
一般来说,在实例向量化处理数据的情况下,共同执行矩阵乘法运算的多个实例会分成多个阶段来处理参与矩阵乘法运算的数据。例如,假设共同执行矩阵乘法运算的实例一共有32个。基于一条指示处理矩阵乘法运算的指令,32个实例可以分成多个阶段(例如两个阶段或四个阶段)来处理参与矩阵乘法运算的数据。以32个实例分为四个阶段来处理数据为例,每个阶段内会有8个实例并行处理参与矩阵乘法运算的数据,且每个实例都是在唯一的一个阶段内处理参与矩阵乘法运算的数据。例如,32个实例记为t0-t31,t0-t7在第一阶段内处理数据,t8-15在第二阶段内处理数据,t16-t23在第三阶段内处理数据,t24-t31在第四阶段内处理数据。
因此,本实施例中,通过指定同一阶段内并行执行的实例对应于不同bank内的数据,即可保证所有实例均不会同时访问一个bank内的数据,从而避免了共享内存中所发生的 bank冲突。
为了便于理解,以下将详细介绍本实施例提供的编译方法能够有效避免bank冲突的原理。
可以理解的是,由于矩阵乘法运算实际上是两个矩阵的行数据与列数据之间的乘累加运算,因此在基于多个实例执行矩阵乘法运算时,不同的实例可能需要用到相同的一行数据或者相同的一列数据。
示例性地,可以参阅图5,图5为本申请实施例提供的一种矩阵乘法运算的示意图。如图5所示,矩阵A为4×7大小的矩阵,矩阵B为7×4大小的矩阵,矩阵A和矩阵B执行矩阵乘法运算后得到的矩阵C为4×4大小的矩阵。其中,矩阵C中位于第一行及第一列的数据是由矩阵A中的第一行数据与矩阵B中的第一列执行乘积累加运算得到的,矩阵C中位于第四行及第四列的数据是由矩阵A中的第四行数据与矩阵B中的第四列执行乘积累加运算得到的。对于矩阵C来说,矩阵C中每一行的数据均是基于矩阵A中的相同行的数据得到的,例如矩阵C中第一行的数据均是基于矩阵A中的第一行数据得到的。
那么,在这种情况下,采用多个实例来执行矩阵C的矩阵乘法运算时,则不同的实例可能需要用到相同的一行数据或者相同的一列数据。例如,实例A用于求取矩阵C中第一行第一列的数据,实例B用于求取矩阵C中第一行第二列的数据,那么实例A和实例B都需要用到矩阵A中第一行的数据。此时,实例A和实例B同时访问矩阵A中第一行的数据,则会产生数据访问冲突,实例A和实例B的数据访问请求需要排队执行,影响了运算效率。
由于参与矩阵乘法运算的数据通常存储于共享内存中,以供不同的实例访问,因此数据访问冲突的问题可以理解为共享内存中的bank冲突。即,不同的实例同时访问共享内存中的同一个bank时,则产生bank冲突,影响运算效率。
基于此,本实施例中通过指定同一阶段内并行执行的实例对应于不同bank内的数据,即可保证所有实例均不会同时访问一个bank内的数据,从而避免了共享内存中所发生的bank冲突。
示例性地,可以参阅图6,图6为本申请实施例提供的一种多实例处理数据的示意图。如图6所示,FragmentA[16,4]表示矩阵A中的数据,t0~t31分别表示不同的32个实例。FragmentA中连续的两个数据属于同一个bank,即数据0、数据1属于bank0,数据2、数据3属于bank1…数据62、数据63属于bank31。
在本方案中,单个实例处理FragmentA中的一行数据。其中,实例t0-t7属于在第一阶段内并行执行的实例,实例t8-15属于在第二阶段内并行执行的实例,实例t16-t23属于在第三阶段内并行执行的实例,实例t24-t31属于在第四阶段内并行执行的实例。实例t0-t7对应于FragmentA中的前八行数据,且每个实例对应于不同行的数据。例如,实例t0对应于第一行的数据,实例t1对应于第二行的数据。这样一来,可以确保同一阶段内并行执行的实例t0-t7分别对应于不同bank内的数据,避免了实例t0-t7的执行期间发生bank冲突。
实例t8-t15同样对应于FragmentA中的前四行数据。由于实例t0-t7与实例t8-t15属于 不同阶段内的实例,因此实例t0-t7与实例t8-t15对应于相同bank内的数据并不会产生bank冲突。
示例性地,基于图6对应的实施例,可以参阅以下的代码实现。
(原文此处以附图形式给出相应的代码实现,其关键语句在下文说明。)
在以上的代码中,代码“const unsigned lane_id=get_lane_id()”表示获取实例ID,代码“const unsigned row”表示执行位运算,该位运算转换为数学运算,对应为row=lane_id MOD 8+int(lane_id/16)*8。其中,row表示行数据,lane_id表示实例ID。基于执行位运算的代码,能够实现行数据与实例ID之间的绑定,即指示了实例与数据之间的映射关系。另外,行与行之间的数据的索引偏移为ldm*row。基于上述的代码,可以确定实例t0与实例t8共同绑定数据地址为p+0的数据(即FragmentA第一行的数据),t1与t9共同绑定数据地址为p+ldm(ldm表示第一行数据与第二行数据的偏移)的数据……以此类推,最终实现所有数据与实例之间的绑定。
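作为补充,下面给出一段与上述位运算逻辑对应的示意性CUDA代码(其中load_fragment_a的函数名、每个实例加载4个元素等设定均为便于说明而假设,并非原文附图中的完整实现),用于展示通过实例ID计算行号、进而将同一阶段内并行执行的实例绑定到不同bank数据的方式:

```cpp
// 示意代码:基于实例ID建立实例与行数据之间的第一映射关系(假设性示例)
__device__ __forceinline__ unsigned get_lane_id() {
  unsigned lane_id;
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane_id));  // PTX内联:读取warp内的实例ID
  return lane_id;
}

template <typename T>
__device__ void load_fragment_a(T *frag, const T *p, int ldm) {
  const unsigned lane_id = get_lane_id();
  // row = lane_id MOD 8 + (lane_id DIV 16) * 8,
  // 使同一阶段内并行执行的8个实例分别绑定不同行,从而访问不同bank内的数据
  const unsigned row = (lane_id & 0x7) + ((lane_id >> 4) << 3);
  const T *src = p + row * ldm;          // 行与行之间的索引偏移为 ldm * row
  #pragma unroll
  for (int i = 0; i < 4; ++i) {          // 假设每个实例加载所在行的4个元素
    frag[i] = src[i];
  }
}
```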
以上介绍了基于第一接口中的指令代码实现避免bank冲突的过程,以下将详细介绍基于中间表示生成目标代码的过程。
在一个可能的实施例中,终端可以解析所述算子描述,得到中间表示;然后,终端将所述中间表示中的第一语句替换为第一接口语句,以得到所述目标代码,所述第一语句用于指示执行所述矩阵乘法运算,所述第一接口语句用于调用所述第一接口,所述第一接口用于执行所述矩阵乘法运算。其中,中间表示中指示执行矩阵乘法运算的第一语句可以是矩阵乘法运算的表达式。
示例性地,终端可以解析所述中间表示,得到运算信息,所述运算信息包括所述矩阵乘法运算的信息。基于所述运算信息,终端确定所述中间表示中所述矩阵乘法运算所在的位置,并将所述矩阵乘法运算所在的位置的语句(即上述的第一语句)替换为第一接口语 句,以得到所述目标代码,所述第一接口语句用于调用所述第一接口。
简单来说,终端可以通过匹配中间表示中的矩阵乘法运算的表达式,确定中间表示中的矩阵乘法运算以及矩阵乘法运算的信息。然后,终端在矩阵乘法运算所在的位置替换调用第一接口的第一接口语句,从而实现生成调用有第一接口的目标代码。其中,所述第一接口语句包括有矩阵乘法运算的信息,以使得在执行目标代码时,能够基于所述矩阵乘法运算的信息实现矩阵乘法运算。
具体地,矩阵乘法运算的信息包括参与矩阵乘法运算的第一数据、所述第一数据的排布方式、所述第一数据的数据类型(例如float16或float32等数据类型)和所述矩阵乘法运算的表达式(例如D=C+A*B)。
在一个可能的实施例中,终端解析所述算子描述,得到中间表示;终端将所述中间表示中的第二语句替换为第二接口语句,以得到所述目标代码。所述第二语句用于指示将所述第一数据搬移至局部内存或者将第一数据搬移出局部内存。其中,参与矩阵乘法运算的第一数据通常需要搬移至局部内存,以实现后续的矩阵乘法运算。所述第二接口语句用于调用所述第一接口,所述第一接口用于搬移所述第一数据。
本方案中,通过将中间表示中指示搬移数据的语句替换为第二接口语句,以使得在搬移数据的过程中调用第一接口来实现。由于第一接口能够指定实例与数据之间的映射关系,能够保证数据搬移过程中不发生bank冲突,保证了数据搬移的效率,从而提高执行运算的效率。
在一个可能的实施例中,终端解析所述算子描述,得到中间表示;终端将所述中间表示中的第三语句替换为第三接口语句,以得到所述目标代码,所述第三语句用于指示执行融合运算,所述融合运算的输入包括所述矩阵乘法运算的输出,所述第三接口语句用于调用第二接口,所述第二接口用于执行融合运算。
示例性地,在终端解析所述中间表示所得到的运算信息中还包括融合运算的信息,所述融合运算的输入包括所述矩阵乘法运算的输出。其中,融合运算是指将多个算子(例如矩阵乘法运算的算子)合并在一起的运算,即能够实现矩阵乘法运算的输出与其他算子的合并运算。
在终端解析得到的运算信息中还包括融合运算的信息的情况下,所述终端解析中间表示,生成目标代码的过程,还包括:终端基于所述运算信息,确定所述中间表示中的所述融合运算所在的位置,并将所述融合运算所在的位置的语句(即上述的第三接口语句)替换为第二接口语句,得到所述目标代码。所述第二接口语句用于调用第二接口,所述第二接口包括用于指示执行融合运算的指令代码。
示例性地,所述融合运算可以包括以下运算中的至少一种:加法运算、减法运算、乘法运算、除法运算、结果向下取整的除法运算、取模运算和结果向下取整的取模运算。例如,所述融合运算具体可以是对两个不同的矩阵乘法运算的输出进行逐元素相加。所述融合运算的表达式可以为:D=A+B,其中D为融合运算的输出,A为一个矩阵乘法运算的输出,B为另一个矩阵乘法运算的输出。
本实施例中,通过在算子描述中包括矩阵乘法运算和融合运算,可以实现多个算子的融合,提高终端的运算效率以及提升终端的资源利用率。
示例性地,可以参阅图7和图8,图7为本申请实施例提供的一种非融合算子的计算流程示意图;图8为本申请实施例提供的一种融合算子的计算流程示意图。图7示出了未进行融合的多个矩阵运算,而图8示出了对应的融合后的矩阵运算。对于图7所示的运算,在CPU端需要耗时四次去调用四个不同的算子(即矩阵乘法算子7001、数据广播算子7002、加法算子7003、数据变形算子7004),且每个算子执行时需要在效率最低的全局内存加载数据(即步骤7011~7016),计算后再将计算结果存储回全局内存(即步骤7014~7017)。即该运算共有4次算子调用和7次全局内存访问(7011~7017)的执行开销。
然而,在图8中,实现算子融合之后,即矩阵乘法算子8001、数据广播算子8002、加法算子8003、数据变形算子8004融合为融合算子8021,只需要进行1次算子调用(8021)和4次全局内存访问(8011~8014)。对比图7和图8可知,算子融合能够在大幅减少算子执行间隙的开销的同时提升设备资源利用率。
可以理解的是,在终端基于目标代码执行矩阵乘法运算的过程中,终端往往需要先从内存中加载参与矩阵乘法运算的数据,然后再基于这些参与矩阵乘法运算的数据来执行矩阵乘法运算。因此,在终端执行矩阵乘法运算的过程中,高效地实现数据加载,能够提高终端执行矩阵乘法运算的效率。
在一个可能的实施例中,在终端所生成的目标代码中,目标代码所调用的第一接口中的指令代码还用于指示获取所述第一数据的逻辑存储结构以及所述第一数据的数据类型,并根据所述逻辑存储结构和所述数据类型确定数据加载指针的大小。
其中,所述第一数据是指参与矩阵乘法运算的数据。所述第一数据的逻辑存储结构是指所述第一数据在内存中存储的逻辑结构形式。所述第一数据实际上可以为一个矩阵数据,矩阵数据中包括多个元素。某个矩阵中参与矩阵乘法运算的数据可以称为Fragment数据(即矩阵数据),例如矩阵A中参与矩阵乘法运算的数据可以称为FragmentA数据。Fragment数据对应于一个warp中所有实例对应的矩阵数据集合。因此,Fragment数据的逻辑存储结构例如可以表示为[16,4]或[16,8],即Fragment数据的逻辑存储结构为一个16*4大小的矩阵,或者是一个16*8大小的矩阵。在确定第一数据的逻辑存储结构后,即可基于数据存储接口确定第一数据中的数据总个数。例如,在第一数据的逻辑存储结构为[16,4]时,第一数据中的数据总个数为16*4=64。
所述第一数据的数据类型用于指示作为矩阵数据的第一数据中的元素的数据量。对于第一数据而言,在第一数据的数据类型确定的情况下,第一数据中的每个数据的数据量都是固定不变的。例如,在第一数据的数据类型为Float16的情况下,第一数据中的单个数据的数据量为16比特;在第一数据的数据类型为Float32的情况下,第一数据中的单个数据的数据量为32比特。
所述数据加载指针用于指示所述实例单次加载数据的数据量。例如,数据加载指针的大小为128比特时,该数据加载指针则指示实例单次加载数据的数据量为128比特。
通过基于第一数据的逻辑存储结构以及第一数据的数据类型来确定实例单次加载数据的数据量,能够实现数据最大限度的向量化加载,保证数据的合并访问和扩大数据存取的吞吐量,提高执行运算的效率。
示例性地,在第一数据的数据类型为Float16,且第一数据的逻辑存储结构为[16,4]的情况下,第一数据在内存区域中是每4个数据连续在一起。具体地,可以参阅图9,图9为本申请实施例提供的一种第一数据的逻辑存储结构的示意图。如图9所示,第一数据为FragmentA数据,FragmentA数据存储于[16,4]大小的内存区域中,共享内存的大小则为[128,32]。在FragmentA数据中,相邻的两行数据在内存地址上实际相差32个数据。因此,对于加载数据的单个实例而言,单个实例可以加载FragmentA数据中单行的所有数据。因此,终端可以确定数据加载指针的大小即为第一数据中在内存方向上连续的数据的数据量,即16*4=64。
在第一数据的数据类型为Float16,且第一数据的逻辑存储结构为[16,8]的情况下,第一数据在内存区域中是每8个数据连续在一起。此时,终端可以确定数据加载指针的大小为16*8=128。
此外,由于GPU加载指令的最大带宽为128比特,因此,在实际应用中,数据加载指针的大小需要设置为小于或等于128比特。
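下面给出一段示意性代码,说明如何由第一数据的逻辑存储结构与数据类型推导数据加载指针的大小,并据此进行向量化搬移(其中kCols、kElemBits等取值沿用上文[16,8]、float16的例子,vectorized_load为便于说明而假设的函数名):

```cpp
// 示意代码:由逻辑存储结构与数据类型推导数据加载指针的大小(假设性示例)
#include <cuda_fp16.h>

constexpr int kMaxLoadBits = 128;   // GPU单条加载指令的最大带宽为128比特
constexpr int kCols        = 8;     // 逻辑存储结构[16, 8]中内存方向上连续的元素个数
constexpr int kElemBits    = 16;    // 数据类型float16,单个元素16比特
// 数据加载指针的大小:内存方向上连续数据的比特数,且不超过128比特
constexpr int kLoadBits =
    (kCols * kElemBits < kMaxLoadBits) ? kCols * kElemBits : kMaxLoadBits;
static_assert(kLoadBits == 128, "此例中单次加载128比特,即8个float16元素");

__device__ void vectorized_load(half *dst, const half *src) {
  // 以int4(128比特)指针一次搬移8个half元素,实现最大限度的向量化加载
  *reinterpret_cast<int4 *>(dst) = *reinterpret_cast<const int4 *>(src);
}
```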
在一些可能的实施例中,终端在得到中间表示中,可以对中间表示进行进一步的调度优化,以得到优化后的中间表示。
具体地,在所述终端解析所述中间表示之前,所述方法还包括:
终端基于所述中间表示,生成用于矩阵分块的参数,所述用于矩阵分块的参数用于指示矩阵划分的方式。然后,终端根据用于矩阵分块的参数,对目标矩阵执行分块操作,得到所述目标矩阵的划分结果,所述目标矩阵为参与所述矩阵乘法运算的矩阵。最后,终端根据所述目标矩阵的划分结果,在所述中间表示中添加数据搬移语句,以使得目标代码中包括所述数据搬移语句。所述数据搬移语句用于指示在内存中搬移所述目标矩阵的数据。例如,所述数据搬移语句可以是用于指示将所述目标矩阵的数据从全局内存搬移至共享内存,或者是将所述目标矩阵的数据从共享内存搬移至局部内存。
由于参与矩阵乘法运算的数据需要存储于局部内存中,因此本实施例中设计了多层级的内存提升机制,即将数据由全局内存根据矩阵分块的大小提升至数据读写速度较高的共享内存,再二次提升至数据读写速度更高的局部内存。终端基于矩阵的划分结果,将矩阵对应的数据从全局内存提前搬移至共享内存,然后将外层矩阵中的内层矩阵所对应的数据再从共享内存搬移至局部内存,提高数据加载的效率。
此外,通过对矩阵执行分块操作,可以使得终端在执行矩阵乘法运算的过程中,将矩阵乘法运算分成多个部分来执行,实现多实例并行执行矩阵乘法运算,从而提高运算效率。
示例性地,所述目标矩阵的划分结果可以是包括第一矩阵,所述第一矩阵包括所述第二矩阵。所述终端在中间表示中添加数据搬移语句具体可以包括:终端在指示划分第一矩阵的语句后添加第一数据搬移语句,以及在指示划分第二矩阵的语句后添加第二数据搬移 语句。其中,所述第一数据搬移语句用于指示将所述第一矩阵的数据从全局内存搬移至共享内存,所述第二数据搬移语句用于指示将所述第二矩阵的数据从共享内存搬移至局部内存。一般来说,全局内存的容量大于共享内存的容量,共享内存的容量大于局部内存的容量;局部内存的数据读写速度大于共享内存的读写速度,共享内存的读写速度大于全局内存。将数据从全局内存搬移至共享内存,以及将数据从共享内存搬移至局部内存均能够有效提升数据的访问速度。
示例性地,对于大小为[768,768]的矩阵,终端可以将该矩阵划分为多个第一矩阵和多个第二矩阵。首先,终端可以将[768,768]大小的矩阵划分为36个[128,128]大小的第一矩阵;然后,终端将每个[128,128]大小的第一矩阵划分为64个[16,16]大小的第二矩阵。终端在执行矩阵乘法运算的过程中,实际上是对[16,16]大小的多个第二矩阵执行矩阵乘法运算,从而实现对大小为[768,768]的矩阵所执行的矩阵乘法运算。在终端划分矩阵的过程中,终端可以在指示划分第一矩阵的语句后添加第一数据搬移语句,第一数据搬移语句则是指将[128,128]大小的第一矩阵对应的数据从全局内存搬移至共享内存;以及,终端在指示划分第二矩阵的语句后添加第二数据搬移语句,第一数据搬移语句则是指将[16,16]大小的第二矩阵对应的数据从共享内存搬移至局部内存。
简单来说,矩阵可以被划分为多个层级,外层的矩阵可以进一步被划分为多个内层的矩阵,且每个层级对应于不同类型的数据搬移语句。这样一来,在终端执行目标代码的过程中,终端可以在将矩阵划分为外层的矩阵时,将外层的矩阵对应的数据先搬移至共享内存;然后,终端在将外层的矩阵划分为内层的矩阵时,终端将内层的矩阵对应的数据从共享内存搬移至局部内存,从而便于终端在执行内层的矩阵对应的矩阵乘法运算时能够从局部内存中快速获取到相应的数据。此外,终端在执行外层的矩阵中的其他内层矩阵之前,终端也可以是从共享内存中提前将待执行矩阵乘法运算的数据搬移至局部内存中,保证了数据访问的效率。即,终端根据矩阵分块的大小,将参与矩阵乘法运算的数据由全局内存提升至共享内存,再二次提升至局部内存。
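下面给出一段示意性的内核代码骨架,用于展示上述多层级内存提升与矩阵分块在目标代码中的大致结构(其中分块大小沿用上文示例,load_tile、load_fragment、mma_step等函数名均为便于说明而假设,仅示意调用位置,并非本申请实际生成的代码):

```cpp
// 示意代码:多层级内存提升(全局内存 -> 共享内存 -> 局部内存)的整体结构(假设性示例)
#include <cuda_fp16.h>

__global__ void tiled_matmul(const half *A, const half *B, float *C, int M, int N, int K) {
  __shared__ half A_shared[128][32];   // 外层分块:[128,32]的数据由全局内存提前搬移至共享内存
  __shared__ half B_shared[128][32];
  half  A_frag[8];                     // 内层分块中单个实例持有的数据,位于局部内存(寄存器)
  half  B_frag[8];
  float C_frag[8] = {0.f};

  for (int ko = 0; ko < K; ko += 32) {
    // 1. 第一数据搬移语句:将外层分块由全局内存搬移至共享内存
    // load_tile(A_shared, A, ko); load_tile(B_shared, B, ko);
    __syncthreads();
    for (int ki = 0; ki < 32; ki += 8) {
      // 2. 第二数据搬移语句:将[16,8]等大小的内层分块由共享内存搬移至局部内存
      // load_fragment(A_frag, A_shared, ki); load_fragment(B_frag, B_shared, ki);
      // 3. 基于局部内存中的数据执行矩阵乘累加
      // mma_step(C_frag, A_frag, B_frag);
    }
    __syncthreads();
  }
  // 4. 运算完毕后将结果由局部内存写回共享内存或全局内存
}
```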
在一个可能的实施例中,为了提高在目标代码执行过程中实例搬移数据的效率,终端还可以生成目标代码的过程中指定实例与数据搬移语句之间的映射关系,从而保证实例与数据搬移语句之间的合理匹配,保证数据的访问的局部性。
具体地,在终端解析所述中间表示之前,所述方法还包括:终端基于实例的数量以及划分后的矩阵的数据结构,在所述中间表示中建立实例与数据搬移语句之间的第二映射关系,以得到目标代码中包括所述第二映射关系。所述第二映射关系用于指示执行数据搬移语句的实例。
一般来说,在相关技术中通常是基于需要计算得到的矩阵来确定实例与矩阵对应的数据之间的映射关系。即,在相关技术中,基于矩阵乘法运算输出部分的矩阵大小来确定实例与运算语句以及数据搬移语句之间的映射关系。但是,与其他算子相比,矩阵乘法运算的特别之处在于:矩阵乘法运算对应的输入矩阵与输出矩阵的大小可能是不一致的。例如,对于输入矩阵A所参与的矩阵乘法运算,输入矩阵A的大小为[128,32],该矩阵乘法运算 的输出矩阵C的大小可能为[32,128]。在基于输出矩阵C的大小来建立实例与数据搬移语句之间的映射关系时,由于映射关系不合理,往往会使得实例执行数据搬移语句的效率较低。
示例性地,假设基于输出矩阵建立了实例与矩阵的映射关系之后,实例的映射为[32,4]。那么,对于输入矩阵A[M=2,K=64],在搬移输入矩阵的数据时,基于已建立的映射关系,会将4个实例绑定K轴,32个实例绑定给M轴。在这种情况下,K轴需要串行执行for循环64/4=16次,而M轴却存在冗余。
基于本实施例提供的方法,终端可以基于实例的总数量以及划分后的矩阵的数据结构,将实例从矩阵的最内轴往最外轴的方向(即从内存连续方向往内存不连续方向)映射。例如,对于输入矩阵A[M=2,K=64],在搬移输入矩阵的数据时,将64个实例绑定给K轴,剩余的128/64=2个实例绑定给M轴。这样一来,128个实例并行执行一次即可加载好输入矩阵A的数据。
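下面给出一段示意性代码,展示“由最内轴向最外轴”为数据搬移语句绑定实例的方式(copy_tile为便于说明而假设的函数名,数据类型亦仅作示意):

```cpp
// 示意代码:按内存连续方向优先的顺序建立实例与数据搬移语句的第二映射关系(假设性示例)
__device__ void copy_tile(float *dst, const float *src, int M, int K, int num_threads) {
  // 先将实例绑定给内轴K,再把剩余实例绑定给外轴M
  int bind_k = (num_threads > K) ? K : num_threads;  // 例:K=64、128个实例时,bind_k=64
  int bind_m = num_threads / bind_k;                 // 剩余128/64=2个实例绑定给M轴
  int tid = threadIdx.x;
  int k = tid % bind_k;
  int m = tid / bind_k;
  for (int i = m; i < M; i += bind_m) {
    for (int j = k; j < K; j += bind_k) {
      dst[i * K + j] = src[i * K + j];               // 128个实例并行执行一次即可完成搬移
    }
  }
}
```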
在一个可能的实施例中,由于终端在目标代码中调用了有PTX指令代码的第一接口,而第一接口的统一操作层级为warp层级,因此本实施例中可以建立warp层级的计算语句映射,从而保证多实例与计算语句之间的映射关系更为合理,以进一步优化运算的效率。
具体地,在所述终端解析所述中间表示之前,所述方法还包括:终端根据参与所述矩阵乘法运算的实例的总数,确定warp的数量,其中,每个warp中包括相同数量的实例。一般地,在GPU中,每32个实例组成一个warp,warp是调度和运行的基本单元。终端基于所述warp的数量以及目标矩阵的数据结构,在所述中间表示中建立warp与所述目标矩阵中的轴之间的第三映射关系,所述第三映射关系用于指示执行矩阵中的轴的运算的warp。
示例性地,终端先根据实例的总数,推导得到warp的数量。然后,终端基于warp的数量,将多个warp尽可能地分配给两个维度w0和w1。两个维度w0和w1中的warp分别用于绑定矩阵乘法运算中的矩阵的M轴和N轴。
例如,在实例的总数为128的情况下,终端可以确定warp的数量为128/32=4。然后,终端对得到的warp的数量值执行开根运算,得到两个维度的值分别为w0=2,w1=2。这样,终端可以在中间表示的接口调用语句层级的上一层添加对应矩阵M/N轴与w0/w1的对应映射。其中,w0/w1可以是以实例表达式的方式来表示。比如,在w0=2绑定M轴,w1=2绑定N轴的情况下,则w0索引表示为threadIdx.x MOD(32*2)div 32;w1索引表示为threadIdx.x div(32*2)。其中,MOD表示取模运算,div表示整除运算。
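按照上述映射方式,w0/w1索引的计算可以用如下示意性代码表示(warp_index为便于说明而假设的函数名):

```cpp
// 示意代码:warp层级的第三映射关系中w0/w1索引的计算(假设性示例)
__device__ void warp_index(int &w0, int &w1) {
  const int warp_size = 32;
  // 128个实例对应4个warp,开根后两个维度分别为w0=2、w1=2,
  // w0绑定矩阵的M轴,w1绑定矩阵的N轴
  w0 = (threadIdx.x % (warp_size * 2)) / warp_size;  // threadIdx.x MOD (32*2) div 32
  w1 = threadIdx.x / (warp_size * 2);                // threadIdx.x div (32*2)
}
```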
以上介绍了本申请实施例提供的编译方法的实现过程,为了便于理解,以下将结合例子详细描述本申请实施例提供的编译方法1000。
可以参阅图10,图10为本申请实施例提供的一种编译方法1000的流程示意图。该编译方法1000可以实现于MindSpore框架中的AKG算子编译优化框架。AKG对深度神经网络中的算子进行优化,并提供特定模式下的算子自动融合功能。AKG与MindSpore图算 融合特性协同工作,可提升异构后端网络运行效率。具体地,该编译方法包括以下的步骤1001-1005。
步骤1001,获取算子描述。
在MindSpore图层编译框架对网络模型进行后端计算优化后,AKG可以接收到算子描述,该算子描述中包括矩阵乘法运算以及矩阵乘法运算相关的融合运算。其中,融合运算的输入包括矩阵乘法运算的输出。
步骤1002,解析算子描述,生成中间表示并记录运算信息。
AKG在接收到算子描述后,解析该算子描述,并生成初始的中间表示。此外,在AKG生成中间表示后,AKG可以基于中间表示分析矩阵乘法运算对应的计算逻辑以及算子融合模式,得到运算信息。其中,矩阵乘法运算对应的计算逻辑包括参与矩阵乘法运算的第一数据、所述第一数据的排布方式、所述第一数据的数据类型和所述矩阵乘法运算的表达式。算子融合模式则包括融合计算的计算逻辑以及融合计算语句的位置。类似地,融合计算的计算逻辑包括参与融合运算的数据、参与融合运算的数据的排布方式、参与融合运算的数据的数据类型和融合运算的表达式。
步骤1003,基于多面体模型对中间表示进行调度优化。
在解析得到算子描述对应的中间表示后,AKG可以基于多面体编译模型,执行用于处理软硬件协同的调度优化。
首先,AKG可以根据中间表示及得到的运算信息,实现自适应地生成GPU配置参数。其中,GPU配置参数包括用于矩阵分块的参数和Grid/Block的配置参数。然后,AKG再根据用于矩阵分块的参数,进行矩阵分块。分块后,AKG基于上述的Grid/Block配置参数,将计算语句进行数据映射的绑定。
其次,AKG将参与运算的数据进行多层级的内存提升,即根据矩阵切分的大小,将数据所在的内存位置由全局内存提升至共享内存,再提升至局部内存。此时,AKG会再将上述内存提升对应的数据搬移语句与Grid/Block参数匹配。
步骤1004,对调度优化后的中间表示进行后端Pass优化。
在步骤1004中,AKG对调度优化后的中间表示进行通用类的优化。具体地,对中间表示进行优化的Pass主要包括:共享内存Bank冲突避免Pass,循环体展开Pass,向量化加载Pass,数据流水化预取Pass等优化Pass。本步骤中,所有优化Pass的执行方式为对中间表示进行模式匹配后进行中间表示标注和修改变换。
步骤1005,解析执行后端Pass优化后的中间表示,并基于融合模式链接库,生成目标代码。
本步骤中,核心流程是解析中间表示,同时根据步骤1002记录的运算信息,调用PTX内联库和Fragment层级的Elem-Wise矩阵运算库,发射对应的API接口,最终生成目标代码。
其中,PTX内联库中包括多个接口,PTX内联库中的多个接口对应于矩阵乘法运算。终端在执行目标代码时,终端基于目标代码中所调用的PTX内联库中的接口来执行矩阵乘法计算。示例性地,PTX内联库中可以包括矩阵乘法运算接口、数据初始化接口、数据加 载接口和数据存储接口。在目标代码的实际执行过程中,终端可以是基于数据加载接口加载参与矩阵乘法运算的数据,基于数据初始化接口来设置Fragment中所有元素的初值,并基于矩阵乘法运算接口来执行矩阵乘法运算,最后基于数据存储接口存储运算得到的数据。
Elem-Wise矩阵运算库包括多个接口,Elem-Wise矩阵运算库中的多个接口对应于融合计算。终端在执行目标代码时,终端基于目标代码中所调用的Elem-Wise矩阵运算库中的接口来执行融合计算。示例性地,Elem-Wise矩阵运算库中可以包括加法运算接口、减法运算接口、乘法运算接口和除法运算接口,分别用于执行不同类型的融合计算。
为便于叙述,以下将结合例子分别详细介绍以上的步骤1002-步骤1005。
步骤1002,解析算子描述,生成中间表示并记录运算信息。
以下将结合具体的代码,描述终端解析算子描述,生成中间表示并记录运算信息的过程。
终端在解析算子描述后,生成中间表示。分析矩阵乘法表达式对应的计算逻辑以及算子融合模式。具体地,终端可以对算子描述执行矩阵乘法运算模式的匹配,得到匹配上的矩阵乘法运算模式。在匹配得到矩阵乘法运算模式后,终端确定参与计算的矩阵的大小、参与计算的矩阵的数据排布方式、参与计算的矩阵的数据类型以及矩阵对应的融合模式。
示例性地,可以参阅以下所展示的中间表示的代码。
(原文此处以附图形式给出该中间表示的代码清单,其中包含realize compute节点、各循环轴for节点、计算语句“compute(i,j)=compute(i,j)+input_1*input_2”以及融合语句“T_add_compute_input_3(ax0,ax1)=(compute(ax0,ax1)+input_3(ax0,ax1))”等内容。)
基于上述的代码,终端执行矩阵乘法运算模式的匹配,并且确定参与计算的矩阵的大小、参与计算的矩阵的数据排布方式、参与计算的矩阵的数据类型以及矩阵对应的融合模式。
运算模式匹配:终端通过分析以上代码中的“compute(i,j)=compute(i,j)+input_1*input_2”,可以确定该计算匹配表达式为D=C+A*B的矩阵乘累加模式,从而匹配到该计算为矩阵乘法运算。
计算矩阵大小:终端可以基于中间表示中的realize节点及对应循环轴的大小,确定参与计算的矩阵的大小。以上述的代码为例,分析realize compute可知输出矩阵D的大小为[768,768],分析循环轴for对应的三个轴的大小可知输入矩阵A的大小为[768,768]、输入矩阵B的大小为[768,768]。
确定矩阵的数据排布方式:终端通过分析参与计算的矩阵compute、input_1、input_2的计算关系,可以确定两个输入矩阵input_1和input_2分别对应一根规约轴reduce_axis。输入矩阵input_1和input_2分别对应的规约轴的位置均处于最内轴,对应排布方式为A[M,K]&B[N,K]。
确定矩阵的数据类型:分析realize节点(即代码“realize compute”所在行)可解析数据类型为float16。
确定矩阵对应的融合模式:结合上述矩阵乘法运算模式匹配,分析上述代码“T_add_compute_input_3(ax0,ax1)=(compute(ax0,ax1)+input_3(ax0,ax1))”可知:矩阵乘法运算的输出矩阵compute参与了融合运算,即矩阵乘法运算的输出矩阵compute作为了融合运算中的输入矩阵。其中,融合运算模式为加法运算,另一个输入矩阵为input_3[ax1=768],代表input_3矩阵需要进行数据广播的方式与compute矩阵进行加法操作。具体地,融合模式的表达式可以表示为:E=D+input3,即E=C+A*B+input3。
此外,对于多根Batch处理轴的矩阵乘法运算,终端在生成中间表示的过程中可以将多根Batch轴进行轴融合。
(原文此处以附图形式给出多Batch轴矩阵乘法运算对应的中间表示代码清单。)
以上述的代码为例,分析计算语句与compute节点可知,计算为四维矩阵乘法运算,其中前两根轴B∈[0,32)与b∈[0,12)为批处理轴,其中,[0,32)和[0,12)为整数区间。本方案将其相乘融合成为一根B.b.fused轴,B.b.fused∈[0,384)。B.b.fused轴对应原始B轴与b轴的索引变为了取模运算和DIV运算的表达式,即compute(floordiv(B.b.fused,12),floormod(B.b.fused,12)。如Batch轴大于2根,也通过如上方式解决,总体乘积融合成一根轴。因此,基于批处理轴融合的方式,中间表示中能够得到固定的矩阵乘计算模式D[Batch_fused,i,j]=C[Batch_fused,i,j]+A[Batch_fused,i,k]*B[Batch_fused,j,k],使后续模块可以直接处理。通过融合批处理轴,可以解决目前编译技术中无法处理多根Batch轴矩阵乘法运算的问题。
简单来说,对于矩阵A[10,10,M,N],矩阵A中的前两个轴为批处理轴。矩阵A[10,10,M,N]可以理解为矩阵大小为[M,N]的多个矩阵被分成了10批,每一批矩阵内还包括10个矩阵。通过将矩阵A[10,10,M,N]中的两个批处理轴进行融合后,得到矩阵A[100,M,N]。此时,矩阵A[100,M,N]可以理解为矩阵大小为[M,N]的多个矩阵被分成了100批,每批矩阵只包括1个矩阵。
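批处理轴融合后,由融合轴恢复原始批处理索引的方式可以用如下示意性代码表示(以上文B∈[0,32)、b∈[0,12)的例子为准,函数名为便于说明而假设):

```cpp
// 示意代码:批处理轴融合后由B.b.fused轴恢复原始索引(假设性示例)
__device__ void recover_batch_index(int fused, int &B_idx, int &b_idx) {
  // B∈[0,32),b∈[0,12),融合后 B.b.fused∈[0,384)
  B_idx = fused / 12;   // floordiv(B.b.fused, 12)
  b_idx = fused % 12;   // floormod(B.b.fused, 12)
}
```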
以上介绍步骤1002的具体执行过程,以下将介绍步骤1003的具体执行过程。
步骤1003,基于多面体模型对中间表示进行调度优化。
具体地,可以参阅图11,图11为本申请实施例提供的一种基于多面体模型对中间表示进行调度优化的流程示意图。如图11所示,基于多面体模型对中间表示进行调度优化的具体过程可以包括以下的步骤1101-1106。
步骤1101,基于Pluto算法对中间表示进行多面体调度优化。
本实施例中,Pluto的定义可以参考上述对技术术语的解释,在此不再赘述。终端基于Pluto算法对中间表示进行多面体调度优化,能够实现有效的基于多面体模型的初始循环嵌套调度优化。
步骤1102,自适应生成配置参数。
在对中间表示所执行的调度优化中涉及到多个配置参数,因此终端可以自适应生成对应的配置参数。示例性地,终端可以根据GPU内存利用率、数据局部性以及运算并发角度来进行参数的配置。具体地,终端输出的配置参数可以包括用于矩阵分块的参数和Grid/Block配置参数。
步骤1103,基于配置参数执行矩阵分块。
在本步骤中,终端可以根据步骤1102中计算出的用于矩阵分块的参数,将所有矩阵进行轴切分,实现矩阵分块,便于后续将分块后的内部矩阵进行内存提升。此外,对矩阵进行分块后所切分的外层切分结果造成的循环操作能够有效绑定Grid实现并行计算。
具体地,终端针对中间表示的操作具体可以为:终端分析终端的schedule节点(即调度语句实例),将矩阵中的所有轴基于矩阵切分参数,切分成多层调度。
示例性地,对于M N K轴均为768的矩阵,矩阵切分参数可以为:外层切分参数为M128N128K32,即M、N、K轴切分后的大小分别为128、128和32;内层切分参数为M16N16K8,即M、N、K轴进一步切分后的大小分别为16、16和8。
终端根据外层切分参数对M N K三根轴对应的schedule调度节点进行切分,通过取模运算/整除运算形成两层调度;再对内层的调度进行第二次M16N16K8的切分,仍通过取模运算/整除运算的方式形成两层调度。此时,原有的一层调度,被切分成三层,分别为[M/128,N/128,K/32],[(MMOD128)/16,(NMOD128)/16,(KMOD32)/8],[MMOD16,NMOD16,KMOD8]。三层调度能够更好地适配后续与GPU硬件的绑定优化。例如,最外层调度绑定Grid,中间层调度绑定Warp,最内层用于匹配上述实施例所述的用于执行矩阵乘法运算的第一接口。
步骤1104,建立计算语句与Grid/Block之间的映射关系。
由于终端在目标代码中调用了有PTX指令代码的第一接口,而第一接口的统一操作层级为warp层级,但原有的多面体调度编译技术只有Block和Thread层级。因此本实施例中可以建立warp层级的计算语句映射,从而保证多实例与计算语句之间的映射关系更为合理,以进一步优化运算的效率。
示例性地,终端先根据实例的总数,推导得到warp的数量。然后,终端基于warp的 数量,将多个warp尽可能地分配给两个维度w0和w1。两个维度w0和w1中的warp分别用于绑定矩阵乘法运算中的矩阵的M轴和N轴。
例如,在实例的总数为128的情况下,终端可以确定warp的数量为128/32=4。然后,终端对得到的warp的数量值执行开根运算,得到两个维度的值分别为w0=2,w1=2。这样,终端可以在中间表示的接口调用语句层级的上一层添加对应矩阵M/N轴与w0/w1的对应映射。其中,w0/w1可以是以实例表达式的方式来表示。比如,在w0=2绑定M轴,w1=2绑定N轴的情况下,则w0索引表示为threadIdx.x MOD(32*2)div 32;w1索引表示为threadIdx.x div(32*2)。其中,MOD表示取模运算,div表示整除运算。
步骤1105,执行多层级内存提升。
在相关技术中,原有的多面体调度技术只会对矩阵进行一次的内存提升。本实施例为了提高执行性能,设计了多层级的内存提升机制,即将数据由全局内存根据矩阵分块的大小提升至共享内存,再二次提升至局部内存。
具体操作流程为,首先在前述步骤1104中添加不同调度层级的标签(例如在代码中添加mark节点,添加标签的逻辑即调度的切分层级)。然后,在多层级内存提升的过程中根据标签计算所需的内存和需要进行提升的内存层级;如内存充足,则添加对应的内存申请语句和数据搬移语句;否则,则缩小内存提升的数据量,直至内存充足。
具体地,添加的数据搬移语句通过在中间表示中对应的mark节点下方的schedule节点中插入孩子(child)节点——extension节点来实现。其中,该extension节点包含了数据搬移的输入输出矩阵名称即索引对应关系。(”->”为数据搬移顺序)。具体地,可以参阅以下的代码,以下的代码给出了A矩阵(input_1)数据由全局内存加载至共享内存的数据搬移语句实例。
(原文此处以附图形式给出将矩阵A(input_1)的数据由全局内存加载至共享内存的数据搬移语句实例,即在schedule节点中插入的extension节点。)
步骤1106,建立数据搬移语句与实例之间的映射关系。
本步骤中,终端可以计算矩阵对应的实例总数,并根据分块后的矩阵大小由内轴(内存连续轴)而外地重新分配实例,直至实例映射完毕。采用此方式能够有效保证数据访问的局部性。
示例性地,终端可以执行以下的三个具体的步骤:
1,Block参数配置整合:根据步骤1102中得到的Block配置信息,合并二维Thread配置乘积,统一记为ThreadIdx.x。例如,原有的配置信息为:ThreadIdx.x=32,ThreadIdx.y=4,则计算返回新的配置为ThreadIdx.x=128,即实例总数为128。
2,Block配置拆分:根据位于共享内存中的矩阵的大小,将实例由最内轴至外轴映射。例如,假设实例总数为128,共享内存所存储的矩阵大小A_shared[M,K]=(128,32),此时新的配置信息计算为
映射内轴K:ThreadIdx.x=(Thread(128)>K(32))?K:Thread=32
映射外轴M:ThreadIdx.y=[Thread(128)/ThreadIdx.x(32)]=4
此时,终端可以将input_2_shared[128,32]与Block[4,32]绑定,则剩余循环轴为[32,1]。
3,实例重表示:为了表示方便,统一采取ThreadIdx.x表示上述的二维Block配置,表示方式可以是采用取模运算/整除运算的方式。例如,假设实例总数为128,分配给虚拟二维信息ThreadIdx.x=32,ThreadIdx.y=4,则最终生成的目标代码分别对应为ThreadIdx.x MOD 32和ThreadIdx.x/32。
为了验证以上步骤1003所执行的调度优化的有益效果,本申请中对上述实施例进行控制变量的性能对比测试,结果如表2所示。
表2
(原文此处以附图形式给出表2的具体数据。)
在表2中,第一列为具体实例的矩阵乘法运算数据量介绍,第二列为基于本实施例优化方法进行的运算的耗时,第三列为基于现有优化方法进行的运算的耗时。第四列为第二列、第三列的性能差距,分析可知本实施例对于不同矩阵乘法运算实例均有不同程度的提升,提升比例大于70%。
以上介绍了步骤1003的具体执行过程,以下将介绍步骤1004的具体执行过程。
步骤1004,对调度优化后的中间表示进行后端Pass优化。
在步骤1004中,用于执行后端优化的Pass具体可以包括共享内存Bank冲突避免Pass,循环体展开Pass,向量化加载Pass,数据流水化预取Pass等。以下将分别介绍上述的各个Pass。
Bank冲突避免Pass:Bank冲突避免Pass能够对中间表示进行修改,调整Fragment数据存储的方式,以使得矩阵乘法运算的执行过程中能够消除Bank冲突。
以矩阵A为输入矩阵为例,可以参阅图12,图12为现有流程的数据访问方式。如图12所示,A_Global表示全局内存,A_shared表示共享内存,Fragment表示用于存储Fragment数据的Fragment区域。共享内存的大小为[128,32],共享内存中包括多组相同的Bank,每一组Bank中包括32个Bank,分别为B0~B31。Fragment区域的大小为[16,8]。
现有流程中,在执行矩阵乘法运算的过程,每次从共享内存读取数据并写入Fragment区域时,此时Fragment数据均处于Bank0~3和Bank16~19中,因此在执行数据读取和写入时会产生Bank冲突。
本申请实施例中,基于Bank冲突避免Pass进行优化,能够实现Fragment数据重排,将Fragment数据存储于连续的共享内存中,即Fragment[16,8]存储于shared[1,128]中。这样一来,相对于现有的Fragment区域,Fragment区域所在的位置变成了[1,128],Fragment区域的大小并没有发生变化。在Fragment区域的位置发生变化的情况下,Fragment区域中的数据分属于不同Bank,从而达到了消除Bank冲突的目的。示例性地,可以参阅图13,图13为本申请实施例提供的一种基于Bank冲突避免Pass优化后的数据访问排布示意图。
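下面给出一段示意性代码,对比Fragment数据重排前后在共享内存中的存放方式(store_fragment_strided、store_fragment_linear均为便于说明而假设的函数名,并非Bank冲突避免Pass的实际实现):

```cpp
// 示意代码:Bank冲突避免Pass对Fragment存放方式的调整(假设性示例)
#include <cuda_fp16.h>

// 优化前:Fragment[16,8]写入A_shared[128][32]时行间跨步为32个元素,
//         数据集中落在少数几个bank中,读写时产生bank冲突。
__device__ void store_fragment_strided(half A_shared[128][32], const half *frag, int row_base) {
  for (int r = 0; r < 16; ++r)
    for (int c = 0; c < 8; ++c)
      A_shared[row_base + r][c] = frag[r * 8 + c];
}

// 优化后:Fragment[16,8]重排为shared[1,128],数据在共享内存中连续存放,
//         使同一阶段内并行访问的数据分属不同bank,从而消除bank冲突。
__device__ void store_fragment_linear(half *shared_linear, const half *frag) {
  for (int i = 0; i < 16 * 8; ++i)
    shared_linear[i] = frag[i];
}
```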
循环体展开Pass:循环体展开Pass用于对中间表示中的for循环进行展开优化,以避免增加过多的指令数。
可以理解的是,在以Warp作为调度执行单位来执行for循环的情况下,内部实例执行for的判断条件或者for里的if条件时,可能会产生分支冲突,从而增加指令数。简单来说,在包括多层for循环的情况下,每一层for循环下包括多个分支。在不展开for循环的情况下,容易产生分支冲突而增加指令数。因此,在这种情况下,可以将for循环展开写出来,即分别展开写各种分支。
示例性地,循环体展开Pass通过将中间表示中的三个参数分别与预先设置的阈值进行比较,来判定是否对中间表示中的某个for循环做展开处理。如果判定结果为某个for循环需要做展开处理,则将中间表示中对应的for节点标记为展开(unrolled)节点,并在最终代码生成阶段,生成对应的unroll指令,即在对应for循环的代码前一行添加一行宏指令代码”#pragma unroll”。上述提出的三个参数以及预先设置的阈值,具体如表3所示。
表3
参数 意义 阈值
auto_max_step For循环内的语句数量 3
auto_max_depth 需要unroll的For的层数(for的嵌套) 8
auto_max_extent For循环的上界 16
如表3所示,三个参数分别为auto_max_step、auto_max_depth和auto_max_extent。auto_max_step表示For循环内的语句数量;auto_max_depth表示需要unroll的For的层数;auto_max_extent表示For循环的上界。其中,三个参数对应的阈值的取值可以根据实际情况进行调整,并不局限于表2所示的数值,只要确保阈值的取值大于0即可。
示例性地,包括多层for循环的中间表示的代码如下所示:
(原文此处以附图形式给出包含多层for循环的中间表示代码清单。)
在上述的代码中,对于第一行的for循环(cc9),参数auto_max_step为该For循环内部的语句数量,即第五行、第七行的两条计算语句,值为2;参数auto_max_depth对应内部for嵌套的数量,加上本身,共两层,值为2;参数auto_max_extent为该for语句的最大执行次数,即cc9∈[0,2),值为2。此时该for循环对应的三个参数均小于表中设置的限制,因此可以进行循环展开。示例性地,上述代码执行循环展开后的展开结果如下所示:
(原文此处以附图形式给出循环展开后的中间表示代码清单。)
由展开结果可以看出:中间表示中该for循环被标注为unrolled节点。
向量化加载Pass:向量化加载类似于单指令多数据流(Single Instruction Multiple Data,SIMD)指令。SIMD指令是通过复制多个操作数,并把它们打包在大型寄存器的一组指令集。由于一个指令能够一次性处理多个数据,因此能够减少总体指令执行的次数,扩大带宽利用率。
本实施例中,经过向量化加载Pass优化中间表示后,基于中间表示编译得到的目标代码能够指示终端采用Float128数据类型格式来读取数据,即终端每次读取数据的大小为 128bits。
具体地,向量化加载Pass首先对中间表示中的数据加载矩阵进行向量化切分,切分系数为Float128数据类型与当前数据类型的比特数的倍数值。例如,在当前数据类型为Float16时,切分系数为Float128/Float16=128/16=8。
示例性地,可以参阅以下的代码,以下的代码为向量化加载Pass处理前的中间表示的代码。
(原文此处以附图形式给出向量化加载Pass处理前的中间表示代码清单。)
对cc3轴进行向量化系数为8的切分,切分出cc8的内层循环,不与GPU的实例绑定,即一个实例处理8个Float16数据类型的数据。同时将对应For循环在IR中标记为vectorized节点。
示例性地,经过向量化加载Pass处理后的中间表示的代码如下所示。
(原文此处以附图形式给出向量化加载Pass处理后的中间表示代码清单。)
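向量化切分后的加载逻辑在目标代码中的大致形态可以用如下示意性代码表示(vectorized_copy为便于说明而假设的函数名;切分系数8对应float16与128比特的关系):

```cpp
// 示意代码:向量化切分后的加载循环结构(假设性示例)
#include <cuda_fp16.h>

__device__ void vectorized_copy(half *dst, const half *src, int n) {
  int tid = threadIdx.x;            // 外层循环与实例绑定
  int num_threads = blockDim.x;
  for (int i = tid; i < n / 8; i += num_threads) {
    // 内层cc8∈[0,8)的循环被向量化为一次int4(128比特)搬移,即一个实例处理8个float16数据
    reinterpret_cast<int4 *>(dst)[i] = reinterpret_cast<const int4 *>(src)[i];
  }
}
```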
数据流水Pass:数据流水Pass用于额外申请部分局部内存,以中转预取得到的数据。基于数据流水Pass,终端在执行目标代码时可以提前读取一部分数据用于后续的计算,在计算过程中同时再进行后续数据的读取。这样一来,可以实现数据读取和计算的同步进行, 节省了时间开销。其中,终端提取读取的部分数据则存储于数据流水Pass所申请的局部内存中。
在相关技术中,数据读取与计算通常为顺序执行,即终端需要先读取完用于计算的数据之后,再基于读取得到的数据执行计算。示例性地,可以参阅图14,图14为相关技术中的一种计算顺序示意图。如图所示,计算语句(compute)需要等待输入数据存储于共享内存(shared)后再进行。即,终端需要将输入数据存储于共享内存后,再执行第一次计算;在第一次计算结束后,再将第二次计算所需的输入数据存储于共享内存,再执行第二次计算。
可以参阅图15,图15为本申请实施例提供的一种经过数据流水Pass优化后的计算顺序示意图。如图15所示,数据流水Pass额外申请了数据读写速度更快的局部内存,用于中转预取得到的数据。在第一次计算的执行期间,终端提前将第二次计算所需的输入数据读取至局部内存中,从而保证了在第一次计算结束后,能够快速地将局部内存中的数据读取至共享内存处,节省了数据读取的时间。此外,可以参阅图16,图16为本申请实施例提供的一种添加数据流水Pass后的伪代码逻辑。
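数据流水Pass优化后的执行结构可以用如下示意性代码骨架表示(load_tile、compute_tile为便于说明而假设的函数名;原文中转缓冲位于局部内存,此处以共享内存双缓冲作结构示意):

```cpp
// 示意代码:数据流水Pass引入的双缓冲预取结构(假设性示例)
#include <cuda_fp16.h>

constexpr int kTileElems = 128 * 32;   // 单个分块的元素个数,仅作示意

__global__ void pipelined_kernel(const half *input, float *output, int num_tiles) {
  // 额外申请一份缓冲,用于中转提前读取的数据
  __shared__ half buf[2][kTileElems];
  int cur = 0;
  // 预取第0块:load_tile(buf[cur], input, 0);
  __syncthreads();
  for (int t = 0; t < num_tiles; ++t) {
    int next = cur ^ 1;
    if (t + 1 < num_tiles) {
      // 在计算第t块的同时,提前读取第t+1块:load_tile(buf[next], input, t + 1);
    }
    // 基于buf[cur]执行第t块的计算:compute_tile(output, buf[cur]);
    __syncthreads();   // 保证下一轮读取不会覆盖正在使用的数据
    cur = next;
  }
}
```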
以上介绍了步骤1004的具体执行过程,以下将介绍步骤1005的具体执行过程。
步骤1005,解析执行后端Pass优化后的中间表示,并基于融合模式链接库,生成目标代码。
终端解析中间表示,同时根据记录的融合模式,调用上述的PTX内联库和Fragment层级Elem-Wise矩阵运算库,以发射对应的接口,最终生成目标代码。
其中,解析中间表示并生成目标代码的过程包括对中间表示中不同节点的分析处理。具体地,终端通过解析中间表示,确定中间表示中的特定节点,并将中间表示中的特定节点转换为对应的代码语句,从而实现生成中间表示对应的目标代码。
示例性地,以下将分别介绍中间表示中可能存在的节点:tvm_load_matrix_sync节点、tvm_fill_fragment节点、tvm_mma_sync节点和tvm_store_matrix_sync节点。
tvm_load_matrix_sync节点:tvm_load_matrix_sync节点用于指示将数据搬入Fragment,具有8个参数。示例性地,tvm_load_matrix_sync节点对应的8个参数如表4所示。
表4
参数index 解析意义
1 目的Fragment矩阵
2~4 TensorCore接口调用模式,如M16N16K8
5 Fragment首地址索引
6 数据加载的源地址
7 Ldm:数据加载的源数据中内轴的大小
8 该Fragment排布方式(row_major/col_major)
tvm_fill_fragment节点:tvm_fill_fragment节点用于实现乘累加矩阵的初始赋值,具有6个参数。示例性地,tvm_fill_fragment节点对应的6个参数如表5所示。
表5
参数index 解析意义
1 需要被赋初值的Fragment矩阵
2~4 TensorCore接口调用模式,如M16N16K8
5 Fragment首地址索引
6 设置的初始值
tvm_mma_sync节点:tvm_mma_sync节点为乘累加计算语句,具有8个参数,每两个参数为一组。示例性地,tvm_mma_sync节点对应的8个参数如表6所示。
表6
参数index 解析意义
1~2 结果Fragment矩阵及其首地址索引
3~4 A矩阵Fragment及其首地址索引
5~6 B矩阵Fragment及其首地址索引
7~8 C矩阵Fragment及其首地址索引
tvm_store_matrix_sync节点:tvm_store_matrix_sync节点用于指示将数据搬出Fragment,具有8个参数。示例性地,tvm_store_matrix_sync节点对应的8个参数如表7所示。
表7
参数index 解析意义
1 源Fragment矩阵D
2~4 TensorCore接口调用模式,如M16N16K8
5 Fragment D首地址索引
6 数据存储的目的地址
7 Ldm:数据存储的目的矩阵中内轴的大小
8 该Fragment排布方式(row_major/col_major)
此外,上述的内联PTX库和Elem-Wise运算库存于wmma.hpp中,在目标代码中添加引用头文件的方式(#include"akg_mma_lib/wmma.hpp"),即可调用API库,调用内联PTX库和Elem-Wise运算库中的接口。
以下介绍终端在目标代码中调用上述的PTX内联库和Fragment层级Elem-Wise矩阵运算库的具体过程。
可以参阅图17,图17为本申请实施例提供的一种PTX内联库的示意图。如图17所示,PTX内联库中包括矩阵乘法运算接口、数据初始化接口、数据加载接口和数据存储接口,其中数据加载接口包括用于加载输入矩阵以及乘累加矩阵的接口。示例性地,数据加载接口在目标代码中的实现可以为:akg::wmma::load_matrix_sync。
具体地,终端可以在目标代码中基于内联PTX指令的函数get_lane_id()来获取实例的ID,即ThreadId。在确定实例的ID后,在后续调用PTX内联库的接口的过程中,指定每个实例如何处理数据,从而实现细粒度的数据和计算控制。
对于数据加载接口和数据存储接口,均可以是基于相同的方式来建立实例与数据之间的映射关系,从而实现数据的加载和存储。具体地,数据加载接口和数据存储接口中建立 实例与数据之间的映射关系的方式可以参考上述图6对应的实施例的描述,在此不再赘述。
此外,数据初始化接口则用于设置Fragment中所有元素的初值。具体地,数据初始化接口可以将常数转换为Fragment对应的数据类型,并赋值至Fragment的每个元素(即通过for循环的方式遍历实现)。
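数据初始化接口的实现逻辑可以用如下示意性代码表示(fill_fragment为便于说明而假设的函数名,与CUDA官方WMMA接口中的同名函数无关):

```cpp
// 示意代码:数据初始化接口——为Fragment中所有元素赋初值(假设性示例)
template <typename T, int N>
__device__ void fill_fragment(T (&frag)[N], float value) {
  #pragma unroll
  for (int i = 0; i < N; ++i) {
    frag[i] = static_cast<T>(value);  // 将常数转换为Fragment对应的数据类型后逐元素赋值
  }
}
```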
为了验证以上PTX内联库的有益效果,本申请中对上述实施例进行控制变量的性能对比测试,结果如表8所示。
表8
(原文此处以附图形式给出表8的具体数据。)
在表8中,第一列为具体实例的矩阵乘法运算数据量介绍,第二列为基于现有的WMMA接口执行矩阵乘法运算的耗时,第三列为基于本实施例提供的PTX内联库执行矩阵乘法运算的耗时。第四列为第二、三列的性能差距,分析可知本实施例对于不同矩阵乘法运算实例均有不同程度的提升,提升比例达到近50%。
此外,本实施例中还提供了一种基于TensorCore计算层级的Fragment级算子融合方案。该实施例与矩阵乘运算的非融合场景的主要区别有二,分别为Fragment层级Elem-Wise矩阵运算库,及步骤1002中记录的运算信息。
首先,对于Elem-Wise矩阵运算库而言,Elem-Wise矩阵运算库中包括多个运算接口,例如加法运算接口、减法运算接口、乘法运算接口和除法运算接口。在Elem-Wise矩阵运算库的接口设计中,对于参与融合运算的输入矩阵,可以采用与矩阵乘法运算中的输入矩阵相同的数据加载方式,即调用上述的数据加载接口akg::wmma::load_matrix_sync。该方式能够将融合矩阵存储为Fragment,同时也保证了融合矩阵Fragment的数据存储方式与矩阵乘法运算部分的Fragment数据存储方式相同。
具体地,ElemWise矩阵运算库中的接口采取Fragment数据结构内逐元素计算的方式。以加法运算接口为例,加法运算接口所接收的输入矩阵A、输入矩阵B以及输出矩阵C均为Fragment结构,加法运算接口通过遍历FragmentC的大小,对每个元素进行A[i]与B[i]的加法运算,存于c[i],最终结果为FragmentC。
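以加法运算接口为例,其Fragment内逐元素计算的方式可以用如下示意性代码表示(fragment_add为便于说明而假设的函数名):

```cpp
// 示意代码:Fragment层级的逐元素加法接口(假设性示例)
template <typename T, int N>
__device__ void fragment_add(T (&c)[N], const T (&a)[N], const T (&b)[N]) {
  #pragma unroll
  for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];  // 遍历Fragment的大小,对每个元素执行A[i]与B[i]的加法,结果存于c[i]
  }
}
```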
同时,在调度上,需要在矩阵乘单算子优化的基础上加入融合模式的识别和对应融合语句的调度优化。终端在解析算子描述后对矩阵乘法运算模式进行匹配,并且在确定矩阵乘法运算不为唯一运算后则判定有融合运算,并将该融合运算的模式(即融合模式)记录。其中,融合模式分析记录主要包含了对于融合语句位置的记录标注,对于融合语句计算逻辑的记录,对于参与融合语句的矩阵相关信息(数据大小、类型、排布方式)的记录。
例如,在分析中间表示后,确定矩阵乘运算的输出矩阵compute参与了融合运算,且融合算子的类型为加法运算。在融合运算中,另一个输入矩阵的相关信息为input_3[ax1=768],代表input_3矩阵需要进行数据广播的方式与compute矩阵进行加法操 作。在对中间表示进行分析后,则存储上述分析得出的信息,即矩阵乘法运算数据信息及Elem-Wise语句运算信息。最终,在后端解析中间表示时,将中间表示中融合位置的语句发射成为Fragment融合接口,调用上述介绍的Fragment级Elem-Wise矩阵运算库中的加法运算接口(例如通过引用库的方式#include"akg_mma_lib/wmma.hpp"),生成最终融合场景的目标代码。
具体地,在分析中间表示的过程中,终端通过遍历中间表示,并根据记录的融合模式(即ElemWise计算与矩阵乘法运算的依赖关系),确认Elem-Wise语句位置。并且,终端根据所记录的相关矩阵乘法运算信息,将中间表示中对应的语句信息(例如D=C+A*B语句,以及对应的C矩阵的清零计算语句,输入矩阵A、输入矩阵B的数据搬入语句),匹配至对应的PTX内联库的接口中。最后,终端根据所记录的ElemWise计算的相关信息,例如融合计算语句、及参与该计算的矩阵信息(大小、数据类型、排布方式)等信息,将中间表示中的融合计算语句,匹配Elem-Wise矩阵运算库中的接口,发射对应的融合计算接口。
为了验证以上Elem-Wise矩阵运算库的有益效果,本申请中对上述实施例进行控制变量的性能对比测试,结果如表9所示。
表9
(原文此处以附图形式给出表9的具体数据。)
在表9中,第一列为具体实例的矩阵乘法运算数据量介绍,第二列为单矩阵乘法运算的耗时,第三列为基于本实施例提供的Elem-Wise矩阵运算库执行融合运算的耗时。第四列为第二、三列的性能差距,分析可知本实施例对于不同矩阵乘法运算,融合占比均小于5%。
可以参阅图18,图18为本申请实施例提供的一种编译装置的结构示意图。如图18所示,本申请实施例提供的一种编译装置,包括:获取单元1801和处理单元1802;所述获取单元1801,用于获取神经网络模型的算子描述,所述算子描述包括对矩阵乘法运算的描述;所述处理单元1802,用于解析所述算子描述,生成目标代码;
其中,所述目标代码调用第一接口,所述第一接口用于指示多个第一映射关系,所述第一映射关系是一个实例与第一数据之间的映射关系,所述实例用于处理与所述实例对应的第一数据,所述第一数据为参与所述矩阵乘法运算的数据,其中,同一阶段内并行执行的多个实例分别与位于不同存储体bank内的第一数据具有所述第一映射关系。
在一种可能的实现方式中,所述处理单元1802还用于:解析所述算子描述,得到中间表示;将所述中间表示中的第一语句替换为第一接口语句,以得到所述目标代码,所述第一语句用于指示执行所述矩阵乘法运算,所述第一接口语句用于调用所述第一接口,所述 第一接口用于执行所述矩阵乘法运算。
在一种可能的实现方式中,所述第一接口包括有并行线程执行PTX指令代码。
在一种可能的实现方式中,所述处理单元1802还用于:解析所述算子描述,得到中间表示;将所述中间表示中的第二语句替换为第二接口语句,以得到所述目标代码,所述第二语句用于指示搬移所述第一数据,所述第二接口语句用于调用所述第一接口,所述第一接口用于搬移所述第一数据。
在一种可能的实现方式中,所述处理单元1802还用于:解析所述算子描述,得到中间表示;将所述中间表示中的第三语句替换为第三接口语句,以得到所述目标代码,所述第三语句用于指示执行融合运算,所述融合运算的输入包括所述矩阵乘法运算的输出,所述第三接口语句用于调用第二接口,所述第二接口用于执行融合运算。
在一种可能的实现方式中,所述融合运算包括以下运算中的至少一种:加法运算、减法运算、乘法运算、除法运算、结果向下取整的除法运算、取模运算和结果向下取整的取模运算。
在一种可能的实现方式中,所述第一接口还用于指示得到所述第一数据的逻辑存储结构以及所述第一数据的数据类型,并根据所述逻辑存储结构和所述数据类型确定数据加载指针的大小,所述数据加载指针的大小用于指示所述实例单次加载数据的数据量。
在一种可能的实现方式中,所述处理单元1802还用于:基于所述算子描述,生成用于矩阵分块的参数;根据所述用于矩阵分块的参数,对目标矩阵执行分块操作,以得到所述目标矩阵的划分结果,所述目标矩阵为参与所述矩阵乘法运算的矩阵;根据所述目标矩阵的划分结果,在所述目标代码中添加数据搬移语句,所述数据搬移语句用于指示在内存中搬移所述目标矩阵的数据。
在一种可能的实现方式中,所述目标矩阵的划分结果包括第一矩阵,所述第一矩阵包括第二矩阵;所述处理单元1802还用于:在所述目标代码中指示划分第一矩阵的语句后添加第一数据搬移语句,以及在所述目标代码中指示划分第二矩阵的语句后添加第二数据搬移语句;
其中,所述第一数据搬移语句用于指示将所述第一矩阵的数据从全局内存搬移至共享内存,所述第二数据搬移语句用于指示将所述第二矩阵的数据从共享内存搬移至局部内存。
在一种可能的实现方式中,所述目标代码中还包括第二映射关系,所述第二映射关系是实例与数据搬移语句之间的映射关系,所述第二映射关系用于指示执行数据搬移语句的实例,所述第二映射关系是基于实例的数据以及划分后的矩阵的数据结构建立的。
在一种可能的实现方式中,所述目标代码中还包括线程束warp与所述目标矩阵中的轴之间的第三映射关系;其中,所述第三映射关系用于指示执行矩阵中的轴的运算的warp,所述warp的数量是基于参与所述矩阵乘法运算的实例的总数确定的,每个warp中包括相同数量的实例,所述目标矩阵为参与所述矩阵乘法运算的矩阵。
接下来介绍本申请实施例提供的一种执行设备,请参阅图19,图19为本申请实施例提供的执行设备的一种结构示意图,执行设备1900具体可以表现为手机、平板、笔记本电 脑、智能穿戴设备、服务器等,此处不做限定。其中,执行设备1900上可以部署有图19对应实施例中所描述的数据处理装置,用于实现图19对应实施例中数据处理的功能。具体的,执行设备1900包括:接收器1901、发射器1902、处理器1903和存储器1904(其中执行设备1900中的处理器1903的数量可以一个或多个,图19中以一个处理器为例),其中,处理器1903可以包括应用处理器19031和通信处理器19032。在本申请的一些实施例中,接收器1901、发射器1902、处理器1903和存储器1904可通过总线或其它方式连接。
存储器1904可以包括只读存储器和随机存取存储器,并向处理器1903提供指令和数据。存储器1904的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器1904存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器1903控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1903中,或者由处理器1903实现。处理器1903可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1903中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1903可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器1903可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1904,处理器1903读取存储器1904中的信息,结合其硬件完成上述方法的步骤。
接收器1901可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器1902可用于通过第一接口输出数字或字符信息;发射器1902还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器1902还可以包括显示屏等显示设备。
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行 的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例提供的执行设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的编译方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图20,图20为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为处理器2000,NPU 2000作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路2003,通过控制器2004控制运算电路2003提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路2003内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路2003是二维脉动阵列。运算电路2003还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路2003是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器2002中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器2001中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)2008中。
统一存储器2006用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)2005,DMAC被搬运到权重存储器2002中。输入数据也通过DMAC被搬运到统一存储器2006中。
BIU为Bus Interface Unit,即总线接口单元2020,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)2009的交互。
总线接口单元2020(Bus Interface Unit,简称BIU),用于取指存储器2009从外部存储器获取指令,还用于存储单元访问控制器2005从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器2006或将权重数据搬运到权重存储器2002中或将输入数据数据搬运到输入存储器2001中。
向量计算单元2007包括多个运算处理单元,在需要的情况下,对运算电路2003的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元2007能将经处理的输出的向量存储到统一存储器2006。例如,向量计算单元2007可以将线性函数;或,非线性函数应用到运算电路2003 的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元2007生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路2003的激活输入,例如用于在神经网络中的后续层中的使用。
控制器2004连接的取指存储器(instruction fetch buffer)2009,用于存储控制器2004使用的指令;
统一存储器2006,输入存储器2001,权重存储器2002以及取指存储器2009均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介 质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (13)

  1. 一种编译方法,其特征在于,包括:
    获取神经网络模型的算子描述,所述算子描述包括对矩阵乘法运算的描述;
    解析所述算子描述,以生成目标代码;
    其中,所述目标代码调用第一接口,所述第一接口用于指示多个第一映射关系,所述第一映射关系是一个实例与第一数据之间的映射关系,所述实例用于处理与所述实例对应的第一数据,所述第一数据为参与所述矩阵乘法运算的数据,其中,同一阶段内并行执行的多个实例分别与位于不同存储体bank内的第一数据具有所述第一映射关系。
  2. 根据权利要求1所述的方法,其特征在于,所述解析所述算子描述,生成目标代码,包括:
    解析所述算子描述,以得到中间表示;
    将所述中间表示中的第一语句替换为第一接口语句,以得到所述目标代码,所述第一语句用于指示执行所述矩阵乘法运算,所述第一接口语句用于调用所述第一接口,所述第一接口还用于执行所述矩阵乘法运算。
  3. 根据权利要求1所述的方法,其特征在于,所述解析所述算子描述,生成目标代码,包括:
    解析所述算子描述,以得到中间表示;
    将所述中间表示中的第二语句替换为第二接口语句,以得到所述目标代码,所述第二语句用于指示搬移所述第一数据,所述第二接口语句用于调用所述第一接口,所述第一接口用于搬移所述第一数据。
  4. 根据权利要求1至3中任意一项所述的方法,其特征在于,所述解析所述算子描述,生成目标代码,包括:
    解析所述算子描述,以得到中间表示;
    将所述中间表示中的第三语句替换为第三接口语句,以得到所述目标代码,所述第三语句用于指示执行融合运算,所述融合运算的输入包括所述矩阵乘法运算的输出,所述第三接口语句还用于调用第二接口,所述第二接口用于执行融合运算。
  5. 根据权利要求4所述的方法,其特征在于,所述融合运算包括以下运算中的至少一种:加法运算、减法运算、乘法运算、除法运算、结果向下取整的除法运算、取模运算和结果向下取整的取模运算。
  6. 根据权利要求1-5任意一项所述的方法,其特征在于,所述方法还包括:
    基于所述算子描述,生成用于矩阵分块的参数;
    根据所述用于矩阵分块的参数,对目标矩阵执行分块操作,以得到所述目标矩阵的划 分结果,所述目标矩阵为参与所述矩阵乘法运算的矩阵;
    根据所述目标矩阵的划分结果,在所述目标代码中添加数据搬移语句,所述数据搬移语句用于指示在内存中搬移所述目标矩阵的数据。
  7. 根据权利要求6所述的方法,其特征在于,所述目标矩阵的划分结果包括第一矩阵,所述第一矩阵包括第二矩阵;
    所述在所述目标代码中添加数据搬移语句包括:
    在所述目标代码中指示划分第一矩阵的语句后添加第一数据搬移语句,以及在所述目标代码中指示划分第二矩阵的语句后添加第二数据搬移语句;
    其中,所述第一数据搬移语句用于指示将所述第一矩阵的数据从全局内存搬移至共享内存,所述第二数据搬移语句用于指示将所述第二矩阵的数据从共享内存搬移至局部内存。
  8. 根据权利要求6或7所述的方法,其特征在于,所述目标代码中还包括第二映射关系,所述第二映射关系是实例与数据搬移语句之间的映射关系,所述第二映射关系用于指示执行数据搬移语句的实例,所述第二映射关系是基于实例的数据以及划分后的矩阵的数据结构建立的。
  9. 根据权利要求1-8任意一项所述的方法,其特征在于,所述目标代码中还包括线程束warp与所述目标矩阵中的轴之间的第三映射关系;
    其中,所述第三映射关系用于指示执行矩阵中的轴的运算的warp,所述warp的数量是基于参与所述矩阵乘法运算的实例的总数确定的,每个warp中包括相同数量的实例,所述目标矩阵为参与所述矩阵乘法运算的矩阵。
  10. 根据权利要求1-9任意一项所述的方法,其特征在于,所述第一接口包括有并行线程执行PTX指令代码。
  11. 一种编译装置,其特征在于,包括存储器和处理器;所述存储器存储有代码,所述处理器被配置为执行所述代码,当所述代码被执行时,所述编译装置执行如权利要求1至10任意一项所述的方法。
  12. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有指令,所述指令在由计算机执行时使得所述计算机实施权利要求1至10任意一项所述的方法。
  13. 一种计算机程序产品,其特征在于,所述计算机程序产品存储有指令,所述指令在由计算机执行时使得所述计算机实施权利要求1至10任意一项所述的方法。
PCT/CN2022/094998 2021-06-02 2022-05-25 一种编译方法及相关装置 WO2022253075A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110615376.3 2021-06-02
CN202110615376.3A CN115437637A (zh) 2021-06-02 2021-06-02 一种编译方法及相关装置

Publications (1)

Publication Number Publication Date
WO2022253075A1 true WO2022253075A1 (zh) 2022-12-08

Family

ID=84271760

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094998 WO2022253075A1 (zh) 2021-06-02 2022-05-25 一种编译方法及相关装置

Country Status (2)

Country Link
CN (1) CN115437637A (zh)
WO (1) WO2022253075A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702855A (zh) * 2023-04-27 2023-09-05 珠海市芯动力科技有限公司 神经网络计算图的优化方法、装置及相关设备
CN117764122A (zh) * 2023-12-29 2024-03-26 苏州亿铸智能科技有限公司 计算图处理方法、装置、电子设备及存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116560666B (zh) * 2023-07-10 2023-09-22 上海燧原科技有限公司 基于多层级代码生成的ai前端统一计算方法、装置及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598688A (zh) * 2016-12-09 2017-04-26 曙光信息产业(北京)有限公司 一种深度学习汇编优化中的寄存器冲突避免方法
US20190187964A1 (en) * 2017-12-20 2019-06-20 Advanced Micro Devices, Inc. Method and Apparatus for Compiler Driven Bank Conflict Avoidance
US20200319861A1 (en) * 2019-04-02 2020-10-08 Graphcore Limited Compiling a Program from a Graph
CN112328227A (zh) * 2020-11-03 2021-02-05 清华大学 编译方法、装置、计算设备和介质


Also Published As

Publication number Publication date
CN115437637A (zh) 2022-12-06

Similar Documents

Publication Publication Date Title
US20230251861A1 (en) Accelerating linear algebra kernels for any processor architecture
WO2022253075A1 (zh) 一种编译方法及相关装置
Mittal et al. A survey of deep learning on cpus: opportunities and co-optimizations
WO2021000970A1 (zh) 深度学习算法的编译方法、装置及相关产品
US20220012575A1 (en) Methods and apparatus for localized processing within multicore neural networks
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
Ben-Nun et al. Memory access patterns: The missing piece of the multi-GPU puzzle
US7937567B1 (en) Methods for scalably exploiting parallelism in a parallel processing system
Mittal A survey of accelerator architectures for 3D convolution neural networks
US20200004514A1 (en) High parallelism computing system and instruction scheduling method thereof
US20210334234A1 (en) Distributed graphics processor unit architecture
Xu et al. A dedicated hardware accelerator for real-time acceleration of YOLOv2
US8615770B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
WO2021000971A1 (zh) 操作数据的生成方法、装置及相关产品
US12026606B2 (en) Fractal calculating device and method, integrated circuit and board card
US20230289292A1 (en) Method and apparatus for efficient access to multidimensional data structures and/or other large data blocks
CN113469336A (zh) 优化神经网络模型的编译方法、执行方法及相关产品
WO2021115149A1 (zh) 神经网络处理器、芯片和电子设备
Kirtzic et al. A parallel algorithm development model for the GPU architecture
WO2023030507A1 (zh) 编译优化方法、装置、计算机设备以及存储介质
US8959497B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
CN116755878A (zh) 程序运行方法、装置、设备、介质和程序产品
Guo et al. Novel accelerated methods for convolution neural network with matrix core
US20230289304A1 (en) Method and apparatus for efficient access to multidimensional data structures and/or other large data blocks
US20220019531A1 (en) Allocating Variables to Computer Memory

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22815121

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22815121

Country of ref document: EP

Kind code of ref document: A1