CN116306918A - FPGA neural network model deployment method based on MLIR - Google Patents

FPGA neural network model deployment method based on MLIR

Info

Publication number
CN116306918A
Authority
CN
China
Prior art keywords
fpga
mlir
file
neural network
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310107219.0A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shencun Technology Wuxi Co ltd
Original Assignee
Shencun Technology Wuxi Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shencun Technology Wuxi Co ltd filed Critical Shencun Technology Wuxi Co ltd
Priority to CN202310107219.0A priority Critical patent/CN116306918A/en
Publication of CN116306918A publication Critical patent/CN116306918A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/30Creation or generation of source code
    • G06F8/37Compiler construction; Parser generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/447Target code generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The application discloses an MLIR-based method for deploying a neural network model on an FPGA, in the field of FPGA technology. The method extracts the original neural network model to be deployed to the FPGA platform; converts the original neural network model into an open-format onnx model with the onnx tooling, and then converts the onnx model into an mlir intermediate representation file with the open-source onnx-mlir tool; converts and compiles the mlir intermediate representation file with the MLIR tools to obtain a first program file for running on the FPGA kernel, where the first program file is in a high-level language and is used to operate the FPGA chip; and writes an external interactive program, jointly compiles and links the interactive program with the first program file, and deploys them onto the FPGA chip. The scheme avoids writing inefficient RTL model code, runs deep neural network models efficiently on the FPGA, and improves deployment efficiency.

Description

FPGA neural network model deployment method based on MLIR
Technical Field
The embodiment of the application relates to the technical field of FPGA, in particular to an FPGA neural network model deployment method based on MLIR.
Background
In recent years, with the rapid development and application of deep learning, the deployment of neural network models has become increasingly important, yet there is currently no efficient solution for deploying deep learning models to FPGA platforms. To run a deep learning model on an FPGA, the most common approach today is to write it in a hardware register-transfer-level language (RTL) such as Verilog. However, RTL productivity is even lower than that of C or C++, let alone model description languages such as Python or Lua; merely keeping the program functionally correct can consume most of the software effort, and the RTL programming burden is correspondingly heavy. The industry is therefore exploring how the co-design of domain-specific languages and systems can reduce the difficulty of programming FPGAs.
The Xilinx platform provides a set of tools for users to deploy models. However, the toolchain is closed and insufficiently open: many operations cannot be deployed through the Xilinx tools, and many operators in models are not supported, which greatly increases the difficulty of deploying a model. Moreover, even when substitute operators are adopted, accuracy is lost to such an extent that the result is barely usable.
Disclosure of Invention
The embodiment of the application provides an MLIR-based FPGA neural network model deployment method, which addresses the low deployment efficiency and high programming difficulty of neural network models. The method comprises the following steps:
s1, extracting an original neural network model to be deployed to an FPGA platform;
s2, converting the original neural network model into an onnx model in an open source format through an onnx tool, and then converting the onnx model into an mlir intermediate representation file through an open source onnx-mlir tool;
s3, converting and compiling the converted MLIR intermediate representation file through an MLIR tool to obtain a first program file for running on an FPGA kernel; the first program file is in a high-level language and is used to operate the FPGA chip;
s4, writing an external interactive program, compiling and linking the interactive program and the first program file in a combined mode, and deploying the interactive program and the first program file into an FPGA chip.
Specifically, the original neural network model is based on the TensorFlow, PyTorch, or MXNet framework and is written in the Python language; the model type is a Caffe, Chainer, CoreML, Keras, LibSVM, LightGBM, or Scikit-learn model.
Specifically, S3 includes:
s31, converting the mlir intermediate representation file into an llvm intermediate representation file through mlir-opt sinking, wherein the llvm intermediate representation file is a low-level intermediate language representation file;
s32, compiling the LLVM intermediate representation file through an LLVM tool to generate the first program file and a calling function for operating the FPGA chip kernel.
Specifically, LLVM ("low-level virtual machine") is a collection of modular, reusable compiler and toolchain technologies; S32 includes:
determining the type of the running terminal, and compiling the llvm intermediate representation file into a dynamic link library through the clang compiler in the running terminal; the dynamic link library comprises all the driver code for driving the FPGA chip and the operation functions for executing computation, wherein the operation functions at least comprise the four arithmetic operations (addition, subtraction, multiplication and division) and combined operations such as convolution and pooling, all of which are formed from combinations of the four arithmetic operations;
compiling the dynamic link library into the first program file and providing an entry function for invoking the methods inside the dynamic library.
Specifically, the process of generating the operation function includes:
redefining the calculation function to be converted in the llvm intermediate representation file through a linalg operation in mlir;
converting the calculation function to be converted into an operation function composed of the four arithmetic operations through an instruction of the form mlir-opt conv1.mlir -linalg-bufferize -arith-bufferize -tensor-bufferize -func-bufferize -buffer-deallocation -convert-linalg-to-loops -convert-scf-to-cf -convert-linalg-to-llvm; the operation function composed of the four arithmetic operations is used directly for FPGA core computation.
Specifically, S4 includes:
s41, determining model parameters of an original neural network model and input data for executing an operation function;
s42, writing a program entry function, generating an executable file, compiling the first program file and the executable file through a gcc compiler, and deploying the first program file and the executable file on terminal equipment with an FPGA chip.
Specifically, the compiling process of the gcc compiler includes:
generating a target compiling code of a target computing operation according to the FPGA core, the input data and the operation function; the target compiling code is a write operation function for operating the FPGA core and is used for directly transmitting the data matrix to the corresponding FPGA core; the FPGA core at least comprises a convolution core, a multiplication core, a division core, an addition core, a subtraction core and a pooling core, which are respectively used for executing corresponding calculation.
Specifically, the gcc compiler also generates a read operation function corresponding to the write operation; after the FPGA core receives the data and completes the computation, the calculation result is obtained through the passed-in read operation function.
The beneficial effects of the technical scheme provided by the embodiments of the present application include at least the following: after the original neural network model is converted into an open-format onnx model through the onnx tooling, the model is opened up and can conveniently be converted into an mlir file through the onnx-mlir tool; the mlir-opt flow designed in this scheme converts it into a low-level intermediate language representation file, which the LLVM toolchain then turns into a first program file that can be loaded and run directly on the FPGA kernel. When performing deployment, only a few small external interactive programs need to be written, jointly compiled with the first program file, and written onto the FPGA device. The whole process requires no inefficient RTL files, which improves deployment efficiency while preserving the efficient execution of the FPGA.
Drawings
FIG. 1 is a flowchart of an MLIR-based FPGA neural network model deployment method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an MLIR-based FPGA neural network model deployment method according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
References herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects.
Taking the targeted Xilinx platform as an example, it provides the Vitis tool suite for users to deploy models. However, the tool is closed and insufficiently open: many operations cannot be deployed through the Xilinx tools, and due to defects in the Xilinx tools themselves, operators in many models cannot run on the FPGA, so accuracy and speed must be traded off. If operators supported by Xilinx are substituted for the unsupported operators, accuracy is lost; if the FPGA is not used for the computation, speed is greatly compromised. In this situation, the present scheme provides a deployment approach that, through the tools shown herein, lets operators unsupported by the Xilinx tools generate code that can run on the FPGA, while preserving both accuracy and speed.
Fig. 1 is a flowchart of an FPGA neural network model deployment method based on MLIR according to an embodiment of the present application, including the following steps:
s1, extracting an original neural network model to be deployed to an FPGA platform.
The device targeted by this scheme is a device or platform on which an FPGA chip is mounted. First, the original neural network model to be deployed is obtained; the design of this network is entirely up to the user. The original neural network model may be based on the TensorFlow, PyTorch, or MXNet framework and written in the Python language; users can develop their own deep neural network models for different scenarios, and the model type may be a Caffe, Chainer, CoreML, Keras, LibSVM, LightGBM, or Scikit-learn model, among others, with development targeted at these model types.
S2, converting the original neural network model into an onnx model in an open source format through an onnx tool, and then converting the onnx model into an mlir intermediate representation file through an open source onnx-mlir tool.
ONNX is an open-source format for exchanging deep learning models between different frameworks and tools. It allows developers to create and deploy machine learning models on different platforms (e.g., Windows, Linux, and macOS) and on different devices (e.g., CPU, GPU, and FPGA). ONNX enables interoperability between machine learning frameworks such as TensorFlow, PyTorch, and Caffe2, allowing developers to easily create and deploy machine learning models in a variety of environments. This step converts the original neural network model into an onnx model in the open-source format for subsequent modification and conversion. For a TensorFlow model, for example, the following command performs the conversion:
python -m tf2onnx.convert --saved-model <model path to be converted> --output <model path to be output>
It should be noted that the conversion of other model types is similar. After conversion to the onnx model, the onnx model is converted into an mlir intermediate representation file by the open-source onnx-mlir tool, i.e., the .onnx model is converted into a .mlir file. MLIR (Multi-Level Intermediate Representation) is a framework for defining and optimizing intermediate representations for many kinds of compilers. It provides a generic representation that helps compilers optimize more code, and multiple programming languages can readily be translated into MLIR representations. MLIR is designed to be extensible: its functionality can be extended by adding new language-specific optimizations and representations. It also has a rich library of operations that helps compiler developers implement common optimizations and conversions. It is exactly this property that this scheme uses to achieve the intermediate transformations of the model.
S3, converting and compiling the converted MLIR intermediate representation file through an MLIR tool to obtain a first program file for running on the FPGA kernel; the first program file is in a high-level language for operating the FPGA chip.
This step mainly converts the intermediate mlir file into code that can run directly on the FPGA kernel, and specifically comprises the following steps:
s31, converting the mlir intermediate representation file into an llvm intermediate representation file through mlir-opt sinking, wherein the llvm intermediate representation file is a low-level intermediate language representation file.
In the related art, this step is where most of the precision loss arises, because the RTL editing-and-conversion approach replaces operators in the original model, which inevitably reduces the precision of some operators.
As shown in FIG. 2, in this scheme the mlir intermediate representation file is first sunk through mlir-opt into an llvm intermediate representation file, that is, the .mlir file is converted into a .ll file. The .ll file is essentially a low-level intermediate language representation, an intermediate language that the FPGA cannot run directly, so further conversion is required.
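As a concrete illustration, the sinking can be driven from the command line in two stages, first lowering within MLIR and then translating the LLVM-dialect output into a .ll file (a sketch assuming the standard upstream MLIR tools of that generation; the file names are illustrative, not from the original disclosure):
mlir-opt model.mlir -linalg-bufferize -arith-bufferize -tensor-bufferize -func-bufferize -buffer-deallocation -convert-linalg-to-loops -convert-scf-to-cf -convert-linalg-to-llvm -o model-llvm.mlir
mlir-translate --mlir-to-llvmir model-llvm.mlir -o model.ll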
S32, compiling the LLVM intermediate representation file through the LLVM tool, generating a first program file and providing a calling function for operating the FPGA chip core.
The compiling process converts the low-level programming language and its operators, and is completed through the LLVM toolchain: compilation converts the various operators and corresponding operation functions contained therein into a format that runs on the FPGA kernel, namely the first program file. For example, the four arithmetic operations (addition, subtraction, multiplication, division) are converted, as are logic operations such as convolution, pooling, and normalization; the specific operation functions are determined by the tasks the model needs to perform and the associated processing, which this embodiment does not limit. The calling function is invoked by the externally written program during deployment and is used to operate the FPGA chip kernel to execute a specific computing task. Specifically, the method further comprises the following steps:
and a, determining the type of the running terminal, and compiling the llvm intermediate representation file into a dynamic link library through a clang compiler in the running terminal.
LLVM is in effect a collection of toolkits, composed of modular, reusable compiler and toolchain technologies, that provides compilers with a common intermediate representation (IR). It is used to build compilers for various programming languages, including C, C++, and Fortran, and is also used in a variety of other applications such as static analysis tools, debuggers, and runtime environments.
When executed, this process first needs to determine the platform type of the terminal to be deployed/run on, such as AIE (a Xilinx computing platform) or an x86 platform. The llvm intermediate representation file is then compiled into a dynamic link library by the clang compiler, for example: clang -O2 -target=AIE xx.ll, where target is the type of the corresponding running terminal and AIE refers to the AIE computing platform introduced by Xilinx (it can be replaced with other computing platforms). The dynamic link library contains all the driver code for driving the FPGA chip and the operation functions for executing computation. The converted operation functions contain only the simple four arithmetic operations (addition, subtraction, multiplication, division); all other operations are expanded into them. Complex operations such as convolution and pooling are expanded into combinations of the simple four operations; a convolution operation, for instance, becomes a combination of for loops and the four arithmetic operations.
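As an illustration of this step (the file names are hypothetical, and an ordinary host target is assumed in place of the AIE target used in the scheme), the .ll file could be compiled into a dynamic link library as follows:
clang -O2 -shared -fPIC model.ll -o libmodel.so
where -shared and -fPIC make clang emit a shared library rather than an executable.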
The process specifically further comprises the following steps:
1. Redefining the calculation function to be converted in the llvm intermediate representation file through a linalg operation in mlir.
2. Converting the calculation function to be converted into an operation function composed of the four arithmetic operations through an instruction of the form mlir-opt xxx.mlir -linalg-bufferize -arith-bufferize -tensor-bufferize -func-bufferize -buffer-deallocation -convert-linalg-to-loops -convert-scf-to-cf -convert-linalg-to-llvm.
Linalg is a dialect in MLIR that covers a wide variety of higher-order operations, including convolution, pooling, and the like. Memref is MLIR's representation of data stored in memory. With these, the definition of a simple computing operation can be completed, and the mlir representation appears as simple four-operation arithmetic.
As an example, a convolution operation is expressed in mlir as follows:
func.func @conv_1d(%arg0: memref<?xf32>, %arg1: memref<?xf32>, %arg2: memref<?xf32>) {
  linalg.conv_1d ins(%arg0, %arg1 : memref<?xf32>, memref<?xf32>)
                 outs(%arg2 : memref<?xf32>)
  return
}
The converted result is shown below (excerpted):
func.func @conv_1d(%arg0: memref<?xf32>, %arg1: memref<?xf32>, %arg2: memref<?xf32>) {
  cf.br ^bb1(%c0 : index)
^bb1(%8: index):  // 2 preds: ^bb0, ^bb5
  %9 = builtin.unrealized_conversion_cast %8 : index to i64
  %10 = arith.cmpi slt, %8, %7 : index
  cf.cond_br %10, ^bb2, ^bb6
^bb2:  // pred: ^bb1
  cf.br ^bb3(%c0 : index)
  ... (omitted)
  // multiplication in the convolution operation
  %25 = arith.mulf %18, %21 : f32
  // addition in the convolution operation
  %26 = arith.addf %24, %25 : f32
  %27 = llvm.extractvalue %1[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<1 x i64>, array<1 x i64>)>
  %28 = llvm.getelementptr %27[%9] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
  // assignment
  llvm.store %26, %28 : !llvm.ptr<f32>
  %29 = arith.addi %11, %c1 : index
  cf.br ^bb3(%29 : index)
^bb5:  // pred: ^bb3
  %30 = arith.addi %8, %c1 : index
  cf.br ^bb1(%30 : index)
^bb6:  // pred: ^bb1
  return
}
The above procedure successfully converts the convolution operation into program steps that contain only the four arithmetic operations.
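To make the "for loop plus the four arithmetic operations" structure concrete, the following is a minimal C sketch of the computation that the lowered 1-D convolution performs (an illustration written for this description, not output of the toolchain; the names and sizes are hypothetical):
/* 1-D convolution expressed with only loops, multiplication, addition and
   assignment, mirroring the structure of the lowered code above. */
void conv_1d(const float *input, int in_len,
             const float *kernel, int k_len,
             float *output) {
    for (int i = 0; i <= in_len - k_len; i++) {   /* outer loop over outputs */
        float acc = 0.0f;
        for (int j = 0; j < k_len; j++) {         /* inner loop over the kernel */
            acc = acc + input[i + j] * kernel[j]; /* multiply, then add */
        }
        output[i] = acc;                          /* assignment */
    }
}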
b. Compiling the dynamic link library into the first program file and providing an entry function for invoking the methods inside the dynamic library.
This compilation follows the clang stage: the output is repackaged to generate the first program file that the FPGA operates on directly, i.e., a file used to interface directly with the FPGA chip cores, such as a .cpp or .c file.
S4, writing an external interactive program, combining, compiling and linking the interactive program and the first program file, and deploying the interactive program and the first program file into the FPGA chip.
The external program written in this step is not in an RTL language but in another high-level language, such as C; it is compiled through the external program entry function main to generate files, such as an .elf file, to be executed by the FPGA. Specifically, the method comprises the following steps:
s41, determining model parameters of the original neural network model and input data for executing an operation function.
The model parameters include the weight information and other parameter information; the input data is determined by the computing task to be executed, e.g., picture data, computation matrices, or vectors.
S42, writing a program entry function, generating an executable file, compiling the first program file and the executable file through a gcc compiler, and deploying the first program file and the executable file on terminal equipment with an FPGA chip.
In this step, simple code is written externally that calls the calling function generated by the preceding process and combines it with the FPGA cores to produce an executable file; the gcc compiler then compiles the first program file and the executable file to generate an .elf file.
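The following is a minimal sketch of such an external interactive program, assuming the first program file has been packaged as a dynamic link library that exports an entry function; the library name, entry-function name, and its signature are hypothetical illustrations, not taken from the original disclosure:
#include <stdio.h>
#include <dlfcn.h>  /* POSIX dynamic-loading API */

/* Hypothetical signature of the entry function exported by the generated
   library: runs the model on `input` and fills `output`. */
typedef int (*model_entry_fn)(const float *input, int in_len,
                              float *output, int out_len);

int main(void) {
    /* Load the generated dynamic link library (name is illustrative). */
    void *lib = dlopen("./libmodel.so", RTLD_NOW);
    if (!lib) { fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

    /* Resolve the entry function provided for operating the library. */
    model_entry_fn run_model = (model_entry_fn)dlsym(lib, "model_entry");
    if (!run_model) { fprintf(stderr, "dlsym: %s\n", dlerror()); return 1; }

    float input[784] = {0};  /* e.g., a 28x28 handwritten-digit image */
    float output[10] = {0};  /* e.g., ten class scores */
    if (run_model(input, 784, output, 10) != 0) {
        fprintf(stderr, "model execution failed\n");
        return 1;
    }
    printf("class 0 score: %f\n", output[0]);
    dlclose(lib);
    return 0;
}
This would be compiled and linked together with the first program file along the lines of gcc main.c libmodel.so -ldl -o model.elf (the command is likewise illustrative).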
As described above, the simple four arithmetic operations can be executed in the FPGA: the code, once compiled, is placed on an FPGA core to perform the operation, so that as soon as data is fed into an FPGA core, the data the FPGA outputs after the operation is, for example, the result of the convolution. Therefore, to pass data to different FPGA cores, the different FPGA cores must be operated individually; for example, the picture and convolution-kernel data are passed to the convolution-core FPGA, while two matrices are passed to the multiplication-core FPGA.
Thus, the compilation process of the gcc compiler can then generate target compiled code for the target computing operation from the FPGA core, the input data, and the operation functions. The target compiled code is a write operation function that operates the FPGA core and is used to transfer the data matrix directly to the corresponding FPGA core. The FPGA cores at least comprise a convolution core, a multiplication core, a division core, an addition core, a subtraction core, a pooling core, and the like; the different cores perform their corresponding computations. It should be noted that the FPGA is a programmable logic device, and the definition of a specific core depends on the user's prior definition of the chip, which this application does not limit.
In one possible implementation, we define the convolution core with the label a and the multiplication core with the label b; the following write operation code then needs to be generated automatically:
(The write-operation listing appears only as an image in the original publication.)
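Since that listing is only available as an image, the following is a hypothetical sketch of a write operation function of the kind described; the device paths, labels, and plain write() protocol are assumptions made for illustration:
/* Sketch: push a data matrix to a specific FPGA core.
   'a' = convolution core, 'b' = multiplication core, as labeled above. */
#include <stddef.h>
#include <fcntl.h>
#include <unistd.h>

int write_to_core(char core_label, const float *data, size_t n) {
    const char *dev = (core_label == 'a') ? "/dev/fpga_core_a"
                                          : "/dev/fpga_core_b";
    int fd = open(dev, O_WRONLY);
    if (fd < 0) return -1;
    ssize_t sent = write(fd, data, n * sizeof(float)); /* transfer the matrix */
    close(fd);
    return (sent == (ssize_t)(n * sizeof(float))) ? 0 : -1;
}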
After the FPGA core receives the corresponding code and executes it efficiently, it directly returns the calculation result. Accordingly, the gcc compiler also generates a read operation function corresponding to the write operation; after the FPGA core has received the data and completed the computation, the calculation result is obtained through the passed-in read operation function. Corresponding to the write operation above, the read operation is expressed as follows:
(The read-operation listing appears only as images in the original publication.)
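Again, since the original listing is an image, the following hypothetical sketch mirrors the write sketch above (same assumed device paths and API):
/* Sketch: read the computation result back from an FPGA core. */
#include <stddef.h>
#include <fcntl.h>
#include <unistd.h>

int read_from_core(char core_label, float *result, size_t n) {
    const char *dev = (core_label == 'a') ? "/dev/fpga_core_a"
                                          : "/dev/fpga_core_b";
    int fd = open(dev, O_RDONLY);
    if (fd < 0) return -1;
    ssize_t got = read(fd, result, n * sizeof(float)); /* fetch the result */
    close(fd);
    return (got == (ssize_t)(n * sizeof(float))) ? 0 : -1;
}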
Take the LeNet-5 model, a simple handwritten-character recognition model, deployed on a Xilinx VCK190 FPGA development board, as an example. First the network is written with PyTorch; the PyTorch network is then converted into an onnx model through the onnx tooling; the converted onnx model is converted into an mlir intermediate representation through the onnx-mlir tool; the mlir intermediate representation is lowered into the llvm lower-level intermediate representation; the llvm code is converted into high-level C/C++ code through mlir-aie; external calling code is then written to pass in the handwritten characters and the model weights; everything is compiled into an .elf file; and finally the generated .elf file is transferred to the VCK190 development board to run. As a result, the process avoids the drawback of hand-writing FPGA code and can deploy the model on the FPGA quickly and efficiently.
In summary, after the original neural network model is converted into an open-format onnx model through the onnx tooling, the model is opened up and can conveniently be converted into an mlir file through the onnx-mlir tool; the mlir-opt flow designed in this scheme converts it into a low-level intermediate language representation file, which the LLVM toolchain then turns into a first program file that can be loaded and run directly on the FPGA kernel. When performing deployment, only a few small external interactive programs need to be written, jointly compiled with the first program file, and written onto the FPGA device. The whole process requires no inefficient RTL files, which improves deployment efficiency while preserving the efficient execution of the FPGA.
The foregoing describes preferred embodiments of the present invention. It should be understood that the invention is not limited to the specific embodiments described above; devices and structures not described in detail should be understood as being implemented in the manner common in the art. Any person skilled in the art may make many possible variations and modifications, or adapt them to equivalent embodiments, without departing from the technical solution of the present invention, and these do not affect its essential content. Therefore, any simple modification, equivalent variation, or adaptation of the above embodiments in accordance with the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (8)

1. An FPGA neural network model deployment method based on MLIR, which is characterized by comprising the following steps:
s1, extracting an original neural network model to be deployed to an FPGA platform;
s2, converting the original neural network model into an onnx model in an open source format through an onnx tool, and then converting the onnx model into an mlir intermediate representation file through an open source onnx-mlir tool;
s3, converting and compiling the converted MLIR intermediate representation file through an MLIR tool to obtain a first program file for running on an FPGA kernel; the first program file is in a high-level language and is used to operate the FPGA chip;
s4, writing an external interactive program, compiling and linking the interactive program and the first program file in a combined mode, and deploying the interactive program and the first program file into an FPGA chip.
2. The MLIR-based FPGA neural network model deployment method of claim 1, wherein the original neural network model is based on the TensorFlow, PyTorch, or MXNet framework and is written in the Python language; the model type is a Caffe, Chainer, CoreML, Keras, LibSVM, LightGBM, or Scikit-learn model.
3. The MLIR-based FPGA neural network model deployment method of claim 2, wherein S3 comprises:
s31, converting the mlir intermediate representation file into an llvm intermediate representation file through mlir-opt sinking, wherein the llvm intermediate representation file is a low-level intermediate language representation file;
s32, compiling the LLVM intermediate representation file through an LLVM tool to generate the first program file and a calling function for operating the FPGA chip kernel.
4. The MLIR-based FPGA neural network model deployment method of claim 3, wherein LLVM ("low-level virtual machine") is a collection of modular, reusable compiler and toolchain technologies; S32 includes:
determining the type of the running terminal, and compiling the llvm intermediate representation file into a dynamic link library through the clang compiler in the running terminal; the dynamic link library comprises all the driver code for driving the FPGA chip and the operation functions for executing computation, wherein the operation functions at least comprise the four arithmetic operations (addition, subtraction, multiplication and division) and combined operations such as convolution and pooling, all of which are formed from combinations of the four arithmetic operations;
compiling the dynamic link library into the first program file and providing an entry function for invoking the methods inside the dynamic library.
5. The MLIR-based FPGA neural network model deployment method of claim 4, wherein the process of generating the operation functions comprises:
redefining the calculation function to be converted in the llvm intermediate representation file through a linalg operation in mlir;
converting the calculation function to be converted into an operation function composed of the four arithmetic operations through an instruction of the form mlir-opt xxx.mlir -linalg-bufferize -arith-bufferize -tensor-bufferize -func-bufferize -buffer-deallocation -convert-linalg-to-loops -convert-scf-to-cf -convert-linalg-to-llvm; the operation function composed of the four arithmetic operations is used directly for FPGA core computation, and xxx.mlir is the file to be converted.
6. The MLIR-based FPGA neural network model deployment method of claim 5, wherein S4 comprises:
s41, determining model parameters of an original neural network model and input data for executing an operation function;
s42, writing a program entry function, generating an executable file, compiling the first program file and the executable file through a gcc compiler, and deploying the first program file and the executable file on terminal equipment with an FPGA chip.
7. The MLIR-based FPGA neural network model deployment method of claim 6, wherein the compilation process of the gcc compiler comprises:
generating a target compiling code of a target computing operation according to the FPGA core, the input data and the operation function; the target compiling code is a write operation function for operating the FPGA core and is used for directly transmitting the data matrix to the corresponding FPGA core; the FPGA core at least comprises a convolution core, a multiplication core, a division core, an addition core, a subtraction core and a pooling core, which are respectively used for executing corresponding calculation.
8. The MLIR-based FPGA neural network model deployment method of claim 7, wherein the gcc compiler further generates a read operation function corresponding to the write operation, and after the FPGA core receives the data and completes the computation, the calculation result is obtained through the passed-in read operation function.
CN202310107219.0A 2023-02-13 2023-02-13 FPGA neural network model deployment method based on MLIR Pending CN116306918A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310107219.0A CN116306918A (en) 2023-02-13 2023-02-13 FPGA neural network model deployment method based on MLIR

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310107219.0A CN116306918A (en) 2023-02-13 2023-02-13 FPGA neural network model deployment method based on MLIR

Publications (1)

Publication Number Publication Date
CN116306918A true CN116306918A (en) 2023-06-23

Family

ID=86829659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310107219.0A Pending CN116306918A (en) 2023-02-13 2023-02-13 FPGA neural network model deployment method based on MLIR

Country Status (1)

Country Link
CN (1) CN116306918A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118378549A (en) * 2024-06-25 2024-07-23 北京航空航天大学 Deployment method and device of intelligent reduced order model, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
US10949182B2 (en) Systems and methods for generating code for parallel processing units
Moldovan et al. AG: Imperative-style Coding with Graph-based Performance
CN109933327B (en) OpenCL compiler design method and system based on code fusion compiling framework
CN109799971B (en) File generation device and method
CN111309292B (en) MATLAB/Simulink-based full-model executable program construction method
CN110149800A (en) It is a kind of for handling the device of abstract syntax tree associated with the source code of source program
JPH11513512A (en) Method of manufacturing digital signal processor
US20160246622A1 (en) Method and system for implementing invocation stubs for the application programming interfaces embedding with function overload resolution for dynamic computer programming languages
CN114399019A (en) Neural network compiling method, system, computer device and storage medium
CN116306918A (en) FPGA neural network model deployment method based on MLIR
CN107851002A (en) A kind of code compiling method and code encoder
US20150020051A1 (en) Method and apparatus for automated conversion of software applications
CN105867911B (en) A kind of implementation method of Iris 2D graphics engine
Berz et al. COSY INFINITY version 8.1 programming manual
Marangoni et al. Togpu: Automatic source transformation from c++ to cuda using clang/llvm
CN116225452A (en) Multi-level intermediate code-based graph neural network compiling optimization method
CN113835688B (en) Object packaging method of scientific computing language interpreter
Rudi et al. CodeFlow: A code generation system for Flash-X orchestration runtime
US20090112568A1 (en) Method for Generating a Simulation Program Which Can Be Executed On a Host Computer
Dearing et al. LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes
Pahade et al. Introduction to Compiler and its Phases
Sunitha Compiler construction
KR101277145B1 (en) Method For Transforming Intermediate Language by Using Common Representation, System And Computer-Readable Recording Medium with Program Therefor
Lin et al. A Platform-Independent Code Model for Operation and Information Technologies Convergences in Industrial Edge Applications
WO2022233246A1 (en) Methods, devices, and media for two-pass source code transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination