CN111104124B - PyTorch framework-based rapid deployment method of convolutional neural network on FPGA - Google Patents

PyTorch framework-based rapid deployment method of convolutional neural network on FPGA

Info

Publication number
CN111104124B
Authority
CN
China
Prior art keywords
layer
neural network
fpga
file
model
Prior art date
Legal status
Active
Application number
CN201911084126.0A
Other languages
Chinese (zh)
Other versions
CN111104124A (en)
Inventor
姜宏旭
韩琪
刘晓戬
李波
张永华
林珂玉
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201911084126.0A priority Critical patent/CN111104124B/en
Publication of CN111104124A publication Critical patent/CN111104124A/en
Application granted granted Critical
Publication of CN111104124B publication Critical patent/CN111104124B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/067 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using optical means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Design And Manufacture Of Integrated Circuits (AREA)

Abstract

The invention discloses a PyTorch framework-based method for rapidly deploying a convolutional neural network on an FPGA (field programmable gate array). The method comprises establishing a model fast mapping mechanism, constructing a reconfigurable computing unit, and carrying out an adaptive processing flow based on rule mapping. When a convolutional neural network is defined under the PyTorch framework, the model fast mapping mechanism is established through constructed naming rules. Optimization strategy calculation is performed under hardware resource constraints, a template library based on the hardware optimization strategy is established, and the reconfigurable computing unit is constructed at the FPGA end. Finally, in the adaptive processing flow based on rule mapping, the complex network model file is decomposed at the FPGA end, the network is abstracted into a directed acyclic graph, and the neural network accelerator is generated, realizing an integrated flow from a model file of the PyTorch framework to FPGA deployment. The method establishes the directed acyclic graph of the network through the model fast mapping mechanism, requires only hardware design variables as input during FPGA deployment, and is simple and highly general.

Description

PyTorch framework-based rapid deployment method of convolutional neural network on FPGA
Technical Field
The invention belongs to the technical field of hardware acceleration of convolutional neural networks, and relates to a PyTorch framework-based method for rapidly deploying a convolutional neural network on an FPGA (field programmable gate array).
Background
In recent years, convolutional neural networks have been widely used in fields such as natural language processing and computer vision. Work with a neural network generally comprises two stages: training and testing. Training is the process of extracting model parameters from training data and a neural network model (ResNet, RNN, etc.) using a CPU or GPU. Testing checks the results obtained by running test data through the trained model (the neural network model plus the model parameters). For these tasks, frameworks such as PyTorch and TensorFlow uniformly abstract the data links involved in the training process into usable frameworks.
At present, deep learning models deployed on FPGAs in industry mostly obtain the network structure by parsing a Caffe prototxt file and look up the parameter values in the corresponding Caffe model (e.g., DeePhi Tech, Horizon Robotics, and SenseTime). Networks defined under the PyTorch framework have a simple structure and are flexible to use, so PyTorch is widely used in academia. However, since a PyTorch model does not contain the topology information of the network, a model trained in PyTorch can be deployed only after being converted into a Caffe model through a tool such as ONNX; the ONNX tool supports only conventional Caffe layers and cannot convert custom layers defined in PyTorch. Even after a successful conversion, considerable effort is required to align the output of PyTorch with the output of Caffe.
Therefore, providing a fast method for deploying a convolutional neural network based on the PyTorch framework on an FPGA is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
In order to achieve the above purpose, the invention provides a PyTorch framework-based method for rapidly deploying a convolutional neural network on an FPGA, which facilitates rapid deployment of a convolutional neural network trained under the PyTorch framework on the FPGA. The method mainly comprises establishing a model fast mapping mechanism, constructing a reconfigurable computing unit, and carrying out an adaptive processing flow based on rule mapping. An efficient and convenient naming rule is created so that the upper- and lower-layer topological order of the model is stored in the model file; the complex model file is then decomposed, each layer is stored as a renamed binary file, and the establishment of the model fast mapping mechanism is completed. Next, optimization strategy calculation is carried out under hardware resource constraints, and the optimization strategy adopted for FPGA deployment is selected according to the current hardware resources. Meanwhile, a template library based on the hardware optimization strategy is established; the corresponding template files in the template library, which contains the basic convolutional neural network structures, are called directly during FPGA deployment. Finally, a specific structure (struct) is created: the weight file of each layer is read at the FPGA end, the struct is updated according to the configuration information, and the name of the next layer is stored so that the information of the lower layer to be executed can be found among the weight files. The rapid deployment of a convolutional neural network under the PyTorch framework on an FPGA is thus completed. The specific scheme for achieving this purpose is as follows:
step one, establishing a model fast mapping mechanism: naming each layer of the convolutional neural network model topological structure under the PyTorch framework according to the input and output order of the upper and lower layers, decomposing and storing each layer of the neural network model file obtained after model training, and completing the establishment of the network topological structure of the model file under the PyTorch framework;
step two, constructing a reconfigurable computing unit, which comprises optimization strategy calculation under hardware resource constraints and establishment of a template library based on the hardware optimization strategy, and is used for generating the reconfigurable computing unit at the FPGA end;
and step three, analyzing the configuration information of each layer in the neural network model file in the adaptive processing flow based on rule mapping, performing adaptive configuration of the FPGA control logic through the reconfigurable computing unit at the FPGA end, and finally generating the neural network accelerator.
Preferably, the first step specifically includes:
(1) constructing naming rules for each layer of the model under the PyTorch framework, and renaming each layer of the model according to the naming rules;
(2) training the renamed network model to obtain a neural network model file with a network topological structure;
(3) decomposing the neural network model file, storing each layer as a renamed binary file, and completing the establishment of the model fast mapping mechanism.
Preferably, in the step (1), each layer in the convolutional neural network is named, and the naming rule is the name of the layer + the name of the lower layer + the configuration information.
Preferably, the convolutional layer configuration information is: convolution kernel size _ step size _ zero padding; the pooling layer configuration information is: pooling window size _ step size _ zero padding; the BN layer and the activation layer need no configuration information.
Preferably, in step (3), in the neural network model file decomposition and storage stage, the trained neural network model file is first propagated forward once; each time a layer of the neural network model is read, its parameters are stored in a binary file whose file name is the name of the corresponding layer. In particular, when a layer has no parameters, it is saved as an empty binary file.
Preferably, the second step specifically includes:
(1) performing optimization strategy calculation under the constraint condition of hardware resources, and selecting an optimization strategy adopted by FPGA deployment according to the resources of the current hardware;
(2) establishing a template library based on the hardware optimization strategy; the corresponding template files in the template library are called directly when the FPGA is deployed.
Preferably, the hardware optimization strategy in step (1) includes setting the feature map block size, the input feature map parallelism, and the output feature map parallelism.
Preferably, the template library based on the hardware optimization strategy in step (2) mainly includes a convolution module, a BN layer module, an activation layer module, a pooling module, a fully connected layer calculation module, and an input/output feature map buffer module.
Preferably, the third step specifically includes: creating a structure (struct), reading the weight file of the corresponding layer at the FPGA end, parsing the configuration information according to the name of the weight file, updating the struct according to the configuration information, and storing the name of the next layer so as to find the lower-layer information to be executed from the weight files.
Preferably, the information maintained by the struct includes hardware optimization parameters, convolutional layer configuration parameters, BN layer configuration parameters, pooling layer configuration parameters, and the names of the current layer and the lower layer.
Preferably, the struct comprises the following parameters: feature map block size, output feature map parallelism, input feature map parallelism, convolutional layer flag bit, BN layer flag bit, activation layer flag bit, pooling layer flag bit, fully connected layer flag bit, convolutional layer convolution kernel size, convolution window sliding step size, zero-padding number of the convolutional layer input feature map, pooling window size, pooling window sliding step size, zero-padding number of the pooling layer input feature map, and fully connected layer computation kernel size.
Compared with the prior art, the invention has the following beneficial effects:
1. A model fast mapping mechanism is established through efficient and convenient naming rules together with a mechanism for decomposing and storing the complex model file; the convolutional neural network topology information is stored in the model parameter file, which solves the problem that the model file obtained through PyTorch framework training contains no network topology information.
2. Performing optimization strategy calculation under hardware resource constraints makes it convenient to select suitable optimization and acceleration strategies for different hardware. Meanwhile, the template library based on the hardware optimization strategy supports common operations in convolutional neural networks and has a certain universality and extensibility.
3. An adaptive processing flow based on rule mapping is established, so that the network can be abstracted into a directed acyclic graph in the network inference stage on the FPGA and the FPGA control logic can be configured adaptively according to the specific struct, reducing the steps requiring human participation in FPGA deployment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only embodiments of the invention, and that for a person skilled in the art, other drawings can be obtained from the provided drawings without inventive effort.
FIG. 1 is a flowchart of the PyTorch framework-based method for rapidly deploying a convolutional neural network on an FPGA according to the present invention;
fig. 2 is a diagram illustrating the input/output feature map buffer module constructed by the reconfigurable computing unit according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating generation of a network topology directed acyclic graph in an adaptive processing flow based on rule mapping according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which is a flowchart of the PyTorch framework-based method for rapidly deploying a convolutional neural network on an FPGA, the design and implementation of the method are divided into three main parts: establishing a model fast mapping mechanism, constructing a reconfigurable computing unit, and carrying out an adaptive processing flow based on rule mapping.
S1 Establishing the model fast mapping mechanism
Firstly, each layer of the model is renamed when the convolutional neural network model is defined under the PyTorch framework: each layer in the model is named according to the pattern "name of this layer + name of the lower layer + configuration information", with two underscores separating "name of this layer" from "name of the lower layer". For example, a convolutional layer with a convolution kernel size of 3×3, a step size of 1, and a padding of 0, whose output feeds the layer conv2, is named "conv1__conv2__3_1_0". In this way, when the neural network model of the PyTorch framework is built, the upper- and lower-layer topological structure of the model is saved in the layer names, which facilitates the next step of converting the PyTorch model file into the binary files recognized by the FPGA. In particular, when the output of a layer feeds several next layers, the layer is named "name of this layer + lower layer 1 + lower layer 2 … + configuration information", for example "conv2__conv3__conv4__3_2_1". After the convolutional neural network model is named under the PyTorch framework, model training is carried out, so that the upper- and lower-layer topological order of the model is stored in the model file.
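A minimal sketch of this naming rule in PyTorch follows; the class name, layer names, and channel sizes are hypothetical, not taken from the patent, and the terminal fully connected layer is left without a successor suffix since the patent does not specify how the last layer is named.

import torch
import torch.nn as nn

class RenamedCNN(nn.Module):
    """Toy CNN whose layer names follow "thisLayer__nextLayer__kernel_stride_padding"."""
    def __init__(self):
        super().__init__()
        # conv1 feeds pool1; kernel 3, stride 1, padding 0
        self.add_module("conv1__pool1__3_1_0", nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=0))
        # pool1 feeds conv2; window 2, stride 2, padding 0
        self.add_module("pool1__conv2__2_2_0", nn.MaxPool2d(kernel_size=2, stride=2, padding=0))
        # conv2 feeds fc1; kernel 3, stride 2, padding 1
        self.add_module("conv2__fc1__3_2_1", nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1))
        # terminal fully connected layer (32 channels of 8x8 for a 3x32x32 input)
        self.add_module("fc1", nn.Linear(32 * 8 * 8, 10))

    def forward(self, x):
        for module in self.children():
            if isinstance(module, nn.Linear):
                x = torch.flatten(x, 1)
            x = module(x)
        return x

model = RenamedCNN()
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])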
The obtained neural network model file is then loaded into the convolutional neural network model under the PyTorch framework for one forward propagation; each time a layer of the model is read, the parameters of that layer are saved into a binary bin file whose name is the layer name. In particular, when a layer has no parameters, it is saved as an empty binary file.
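A sketch of this decomposition step, assuming the toy model above; for brevity it iterates over the named layers directly rather than hooking the forward propagation, and the flat float32 file layout and the checkpoint name are assumptions:

import os
import numpy as np
import torch

def decompose_model(model: torch.nn.Module, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    for name, module in model.named_children():
        path = os.path.join(out_dir, name + ".bin")
        params = [p.detach().cpu().numpy().ravel() for p in module.parameters()]
        if params:
            # parameterized layers: weights (and biases) flattened into one bin file named after the layer
            np.concatenate(params).astype(np.float32).tofile(path)
        else:
            # layers without parameters (pooling, activation) become empty files
            open(path, "wb").close()

model = RenamedCNN()
# in practice the trained checkpoint would be loaded first, e.g.:
# model.load_state_dict(torch.load("model.pth"))  # hypothetical checkpoint name
decompose_model(model, "fpga_weights")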
For the convolutional layer, the configuration information is "convolutional kernel size _ step _ padding", for the pooling layer, the configuration information is "pooling window size _ step _ padding", and for the BN layer and the activation function layer, no configuration information is needed.
S2 Constructing the reconfigurable computing unit
Constructing the reconfigurable computing unit comprises the optimization strategy calculation under hardware resource constraints and the establishment of a template library based on the hardware optimization strategy. The specific steps of the optimization strategy calculation under hardware resource constraints are shown in the following table:
[Table: steps of the optimization strategy search under hardware resource constraints, rendered as an image in the original; the procedure is described in the following paragraph.]
The input feature map size and the input and output channel sizes of the neural network are traversed, searching for Tm (output feature map parallelism), Tn (input feature map parallelism), Tr (height of the feature map block) and Tc (width of the feature map block) that satisfy the following formulas. If the resources occupied by a candidate group of hardware design parameters are less than the total hardware resources and its overall computation latency is the minimum so far, the parameter group is saved; otherwise the search continues. Finally, the hardware design parameters achieving the minimum latency are output as the optimization strategy under the hardware resource constraints.
[Equation (1): overall computation latency LAT as a function of R, C, M, N, K and the tiling parameters Tm, Tn, Tr, Tc; Equation (2): BRAM occupancy as a function of the tiling parameters, the bit width Bw, and the per-block BRAM size BS. Both are rendered as images in the original.]
DSP = 18 × Tm × Tn (3)
In the formulas, LAT represents the overall computation latency under the current optimization strategy, BRAM represents the storage resource occupancy under the current optimization strategy, and DSP represents the DSP computation resource occupancy under the current optimization strategy. R represents the height of the feature map, C its width, M the number of output channels, N the number of input channels, K the convolution kernel size, Bw the data computation bit width, and BS the size of each BRAM block in the FPGA.
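The sketch below shows this search procedure in Python. The DSP model follows equation (3); since equations (1) and (2) appear only as images in the original, the LAT and BRAM expressions below are assumptions (a standard tiled-loop cost model), as are the resource totals and the power-of-two candidate grid used to keep the example small:

import math
from itertools import product

def pow2_up_to(n):
    """Power-of-two candidates up to n, to keep this illustrative search small."""
    return [v for v in (1, 2, 4, 8, 16, 32, 64, 128, 256) if v <= n]

def search_strategy(R, C, M, N, K, Bw=16, BS=18 * 1024, DSP_TOTAL=2520, BRAM_TOTAL=1824):
    best = None
    for Tm, Tn, Tr, Tc in product(pow2_up_to(M), pow2_up_to(N), pow2_up_to(R), pow2_up_to(C)):
        # assumed stand-in for Eq. (1): number of tiles times per-tile MAC cycles
        lat = (math.ceil(M / Tm) * math.ceil(N / Tn) * math.ceil(R / Tr)
               * math.ceil(C / Tc) * Tr * Tc * K * K)
        # assumed stand-in for Eq. (2): input, weight and output buffers in BRAM blocks
        bram = (math.ceil(Tn * Tr * Tc * Bw / BS)
                + math.ceil(Tm * Tn * K * K * Bw / BS)
                + math.ceil(Tm * Tr * Tc * Bw / BS))
        dsp = 18 * Tm * Tn  # Eq. (3)
        if dsp <= DSP_TOTAL and bram <= BRAM_TOTAL and (best is None or lat < best[0]):
            best = (lat, dict(Tm=Tm, Tn=Tn, Tr=Tr, Tc=Tc))
    return best

print(search_strategy(R=224, C=224, M=64, N=3, K=3))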
The template library based on the hardware optimization strategy comprises a convolution module, a BN layer module, an activation layer module, a pooling module, a fully connected layer calculation module, and an input/output feature map buffer module. For the entire reconfigurable computing unit, the reconfigurable parameters include the feature map block size, the input feature map parallelism, the output feature map parallelism, and so on, as shown in the following table:
Reconfigurable parameter | Meaning
Tr, Tc | feature map block size (block height and width)
Tm | output feature map parallelism
Tn | input feature map parallelism
conv / BN / activation / pooling / FC flag bits | operations contained in the current layer
C_k | convolutional layer convolution kernel size
conv stride, conv zero padding | convolution window sliding step size and zero-padding number of the convolutional input feature map
P_k | pooling layer pooling window size
pool stride, pool zero padding | pooling window sliding step size and zero-padding number of the pooling input feature map
F_k | fully connected layer computation kernel size
The convolution calculation module comprises a block input feature map buffer, a weight parameter buffer, multipliers, and adders. Before convolution starts, the convolution calculation module is configured according to the convolutional layer configuration parameters, including the convolution kernel size, the convolution window sliding step size, and the input and output feature map parallelism. After configuration is completed, Tm × Tn × C_k × C_k weights are loaded into the weight parameter buffer as a data stream, and Tn block input feature maps of size Tr × Tc are loaded into the input feature map buffer in the same way. For the sliding-window convolution calculation, the module contains a convolution calculation unit composed of C_k × C_k DSPs. When a convolution operation starts, C_k × C_k input feature values from each of the Tn input channels and the same number of weights are fed into the convolution calculation unit; after the multiply-accumulate calculation is completed, the window slides over the block input feature maps and the next group of feature values and weights is loaded for convolution calculation. This continues until the window has slid across the Tr × Tc positions and the convolution calculation of this group of input feature maps is finished; the convolution output feature maps are then stored in the output feature map buffer of the reconfigurable computing IP core.
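A behavioral model of the tiled computation performed by the convolution module, written in NumPy as a software sketch (loop order and buffering are simplified relative to the hardware, and no padding is modeled):

import numpy as np

def tiled_conv(x, w, Tm=4, Tn=4, Tr=8, Tc=8, stride=1):
    """x: (N_in, H, W) input feature maps; w: (M, N_in, K, K) weights."""
    N_in, H, W = x.shape
    M, _, K, _ = w.shape
    R = (H - K) // stride + 1
    C = (W - K) // stride + 1
    y = np.zeros((M, R, C), dtype=np.float32)
    for m0 in range(0, M, Tm):               # Tm output channels per tile
        for n0 in range(0, N_in, Tn):        # Tn input channels per tile
            for r0 in range(0, R, Tr):       # Tr x Tc output block
                for c0 in range(0, C, Tc):
                    # accumulate partial sums for this block over the Tn-channel slice
                    for m in range(m0, min(m0 + Tm, M)):
                        for r in range(r0, min(r0 + Tr, R)):
                            for c in range(c0, min(c0 + Tc, C)):
                                win = x[n0:n0 + Tn, r * stride:r * stride + K,
                                        c * stride:c * stride + K]
                                y[m, r, c] += np.sum(win * w[m, n0:n0 + Tn])
    return y

x = np.random.rand(3, 16, 16).astype(np.float32)
w = np.random.rand(8, 3, 3, 3).astype(np.float32)
print(tiled_conv(x, w).shape)  # (8, 14, 14)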
The pooling calculation module samples the feature maps output by the convolution. Before pooling starts, the module is configured according to the pooling layer configuration parameters, including the pooling window size and the output feature map parallelism. After configuration, the Tm output feature maps obtained from the convolution calculation are loaded into the pooling-layer input feature map buffer. When the pooling operation starts, the P_k × P_k feature values at the same positions of each of the Tm output feature maps are fed into a comparison unit, which stores the maximum of the P_k × P_k values as the feature value of the output feature map in the output feature map buffer of the reconfigurable computing IP core.
The fully connected calculation module is similar to the convolution calculation module in hardware structure and calculation process, and comprises an input feature map buffer, a weight parameter buffer, multipliers, and adders. Before the fully connected layer calculation starts, the module is configured according to the fully connected layer configuration parameters, including the fully connected layer computation kernel size and the input and output feature map parallelism. After configuration, Tm × Tn × F_k × F_k weights are loaded into the weight parameter buffer as a data stream, and Tn block input feature maps of size Tr × Tc are loaded into the input feature map buffer in the same way; if the input feature map of the fully connected layer is smaller than Tr × Tc, the whole input feature map is loaded into the input feature map buffer. When the fully connected calculation starts, F_k × F_k input feature values from each of the Tn input channels are fed into the fully connected calculation unit together with the same number of weights. After the Tm output feature maps are calculated, they are added to the corresponding Tm offsets and output to the output feature map buffer as the result of the fully connected calculation module.
The input/output feature map buffer module comprises an input feature map ping-pong RAM and an output feature map ping-pong RAM, as shown in fig. 2. Before calculation starts, the first group of input feature maps is stored in I_ram1. Once calculation starts, the reconfigurable computing IP core reads the first group of input feature maps for calculation while the second group of input feature maps is stored in I_ram2; after the calculation on the first group is completed, the IP core reads the data in I_ram2 while the third group of input feature maps is stored in I_ram1. This reduces the waiting time for transmitting input feature maps. Similarly, after the reconfigurable computing IP core finishes a calculation, the first group of output feature maps is stored in O_ram1; while the first group of data is being output, the second group of output feature maps is written to O_ram2, and the two output feature map buffers output alternately in the same way, reducing the waiting time for transmitting output feature maps.
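A toy timing model illustrating why the ping-pong scheme hides transfer latency; the cycle counts are arbitrary illustrative values:

def pingpong_time(n_groups, t_load=3.0, t_compute=5.0):
    # Load group 0 up front; thereafter each compute overlaps the next load,
    # so a step costs the longer of the two; the last group has no pending load.
    return t_load + (n_groups - 1) * max(t_compute, t_load) + t_compute

def single_buffer_time(n_groups, t_load=3.0, t_compute=5.0):
    # Without ping-pong buffers, every load serializes with its compute.
    return n_groups * (t_load + t_compute)

print(pingpong_time(8), single_buffer_time(8))  # 43.0 vs 64.0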
S3 Adaptive processing flow based on rule mapping
A specific structure (struct) is first created, as shown below.
[Struct definition, rendered as an image in the original; the fields it maintains are enumerated in the following paragraph.]
The information maintained by the struct includes the hardware optimization parameters, the convolutional layer configuration parameters, the BN layer configuration parameters, the pooling layer configuration parameters, and the names of the current layer and the lower layer. The weight file of the layer is read at the FPGA end, the configuration information of the layer is parsed from the weight file name, the struct is updated, and the name of the next layer is stored so that the lower-layer information to be executed can be found among the weight files.
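Since the struct definition is reproduced only as an image in the original, the following ctypes sketch reconstructs it from the field list given above and in the claims; all field names and the string lengths are hypothetical:

import ctypes

class LayerDescriptor(ctypes.Structure):
    """Python mirror of the FPGA-side struct; names and types are illustrative."""
    _fields_ = [
        ("Tr", ctypes.c_int), ("Tc", ctypes.c_int),      # feature map block size
        ("Tm", ctypes.c_int),                            # output feature map parallelism
        ("Tn", ctypes.c_int),                            # input feature map parallelism
        ("is_conv", ctypes.c_int), ("is_bn", ctypes.c_int),
        ("is_act", ctypes.c_int), ("is_pool", ctypes.c_int),
        ("is_fc", ctypes.c_int),                         # layer-type flag bits
        ("C_k", ctypes.c_int), ("conv_stride", ctypes.c_int),
        ("conv_pad", ctypes.c_int),                      # convolutional layer configuration
        ("P_k", ctypes.c_int), ("pool_stride", ctypes.c_int),
        ("pool_pad", ctypes.c_int),                      # pooling layer configuration
        ("F_k", ctypes.c_int),                           # fully connected computation kernel size
        ("this_layer", ctypes.c_char * 64),              # name of the current layer
        ("next_layer", ctypes.c_char * 64),              # name of the lower layer to execute
    ]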
In the inference stage of the neural network on the FPGA, as shown in fig. 3, the weight file name of the current layer is first parsed at the FPGA end and the struct is updated. The weights and the input feature map of the layer are then read from off-chip memory. According to the maintained struct, it is judged whether the layer is a convolutional layer; if so, the convolution calculation module in the template library based on the hardware optimization strategy is called; if not, it is judged whether the layer is a pooling layer and, in the same way, the pooling calculation module in the template library is called. After judging whether an activation operation or a fully connected operation is included, the data are written to the output buffer and written back from the output buffer to off-chip memory. Through this series of judgment and execution logic, a given neural network model topology is abstracted, via the maintained struct, into the directed acyclic graph of the network's execution, thereby configuring the order of layer-by-layer serial execution.
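A host-side sketch of this judgment-and-dispatch loop; the parsing follows the naming rule of S1, while the module calls are placeholders for the template-library IP cores and all helper names are hypothetical:

import os

def parse_weight_file_name(fname):
    """'conv1__pool1__3_1_0.bin' -> ('conv1', ['pool1'], (3, 1, 0))."""
    parts = fname[: -len(".bin")].split("__")
    if parts[-1][:1].isdigit():                  # trailing "size_stride_padding" block
        *layers, cfg = parts
        size, stride, pad = (int(v) for v in cfg.split("_"))
    else:                                        # BN/activation layers carry no configuration
        layers, (size, stride, pad) = parts, (0, 0, 0)
    return layers[0], layers[1:], (size, stride, pad)

def find_weight_file(layer_name, weight_dir="fpga_weights"):
    """Hypothetical helper: locate the bin file belonging to a successor layer."""
    for f in os.listdir(weight_dir):
        if f == layer_name + ".bin" or f.startswith(layer_name + "__"):
            return f
    raise FileNotFoundError(layer_name)

def run_network(first_file):
    pending = [first_file]                       # layers to execute, in topological order
    while pending:
        this_layer, successors, cfg = parse_weight_file_name(pending.pop(0))
        if this_layer.startswith("conv"):
            pass   # configure and call the convolution IP core with cfg
        elif this_layer.startswith("pool"):
            pass   # configure and call the pooling IP core with cfg
        # BN, activation and fully connected dispatch follow the same pattern
        pending.extend(find_weight_file(s) for s in successors)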
In the adaptive processing flow based on rule mapping, the FPGA reads the generated weight file names, parses them to maintain the specific struct, abstracts the directed acyclic graph of the network in the judgment logic, and then calls each module in the template library based on the hardware optimization strategy. In this process, the rapid deployment from a convolutional neural network model under the PyTorch framework to the FPGA is completed adaptively.
The PyTorch framework-based method for rapidly deploying a convolutional neural network on an FPGA provided by the present invention has been described in detail above. A specific example is applied herein to explain the principle and implementation of the present invention, and the description of the above embodiment is only intended to help in understanding the method of the present invention and its core idea; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation on the present invention.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (9)

1. A rapid deployment method of a convolutional neural network on an FPGA based on the PyTorch framework, characterized by comprising the following steps:
step one, establishing a model fast mapping mechanism: naming each layer of the convolutional neural network model topological structure under the PyTorch framework according to the input and output order of the upper and lower layers, constructing naming rules for each layer of the model under the PyTorch framework, and renaming each layer of the model according to the naming rules; training the renamed network model to obtain a neural network model file with the network topological structure, and decomposing and storing each layer of the neural network model file obtained after model training, wherein each layer of the neural network model file is stored as a renamed binary file, thereby completing the establishment of the network topological structure of the model file under the PyTorch framework;
step two, constructing a reconfigurable computing unit, which comprises optimization strategy calculation under hardware resource constraints and establishment of a template library based on the hardware optimization strategy, and is used for generating the reconfigurable computing unit at the FPGA end;
and step three, analyzing the configuration information of each layer in the neural network model file in the adaptive processing flow based on rule mapping, performing adaptive configuration of the FPGA control logic through the reconfigurable computing unit at the FPGA end, and finally generating the neural network accelerator.
2. The method according to claim 1, wherein in step (1), each layer in the convolutional neural network is named, and the naming rule is: name of this layer + name of the lower layer + configuration information.
3. The method of claim 2, wherein the convolutional layer configuration information is: convolution kernel size _ step size _ zero padding; the pooling layer configuration information is: pooling window size _ step size _ zero padding; and the BN layer and the activation layer need no configuration information.
4. The method according to claim 1, wherein in the step (3), in the neural network model file decomposition and storage stage, the trained neural network model file is first propagated forward once, and each time a layer of the neural network model is read, parameters in the trained neural network model file are stored in a binary file, where a file name of the binary file is a name of the corresponding layer.
5. The method according to claim 1, wherein the second step specifically comprises:
(1) performing optimization strategy calculation under the constraint condition of hardware resources, and selecting an optimization strategy adopted by FPGA deployment according to the resources of the current hardware;
(2) establishing a template library based on the hardware optimization strategy; the corresponding template files in the template library are called directly when the FPGA is deployed.
6. The fast deployment method of the convolutional neural network based on the PyTorch framework on the FPGA as claimed in claim 5, wherein the hardware optimization strategy in step (1) includes setting the feature map block size, the input feature map parallelism, and the output feature map parallelism.
7. The fast deployment method of the PyTorch framework-based convolutional neural network on the FPGA as claimed in claim 5, wherein the template library based on the hardware optimization strategy in step (2) mainly comprises a convolution module, a BN layer module, an activation layer module, a pooling module, a fully connected layer calculation module, and an input/output feature map buffer module.
8. The method for rapidly deploying the convolutional neural network based on the PyTorch framework on the FPGA as claimed in claim 7, wherein the convolution module comprises a weight parameter buffer; the convolution module is configured according to the configuration parameters of the convolutional layer before the start of convolution, and after the configuration is completed the weights are loaded into the weight parameter buffer in the form of a data stream, namely the weight file; the third step specifically comprises: creating a struct, reading the weight file of the corresponding layer at the FPGA end, parsing the configuration information according to the name of the weight file, updating the struct according to the configuration information, and storing the name of the next layer so as to find the lower-layer information to be executed from the weight files.
9. The method for rapidly deploying a convolutional neural network on an FPGA based on the PyTorch framework as claimed in claim 8, wherein the information maintained by the struct includes hardware optimization parameters, convolutional layer configuration parameters, BN layer configuration parameters, pooling layer configuration parameters, and the names of the current layer and the lower layer.
CN201911084126.0A 2019-11-07 2019-11-07 PyTorch framework-based rapid deployment method of convolutional neural network on FPGA Active CN111104124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911084126.0A CN111104124B (en) 2019-11-07 2019-11-07 PyTorch framework-based rapid deployment method of convolutional neural network on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911084126.0A CN111104124B (en) 2019-11-07 2019-11-07 PyTorch framework-based rapid deployment method of convolutional neural network on FPGA

Publications (2)

Publication Number Publication Date
CN111104124A CN111104124A (en) 2020-05-05
CN111104124B (en) 2021-07-20

Family

ID=70420627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911084126.0A Active CN111104124B (en) 2019-11-07 2019-11-07 PyTorch framework-based rapid deployment method of convolutional neural network on FPGA

Country Status (1)

Country Link
CN (1) CN111104124B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931913B (en) * 2020-08-10 2023-08-01 西安电子科技大学 Deployment method of convolutional neural network on FPGA (field programmable gate array) based on Caffe
CN112596718B (en) * 2020-12-24 2023-04-14 中国航空工业集团公司西安航空计算技术研究所 Hardware code generation and performance evaluation method
CN113222121B (en) * 2021-05-31 2023-08-29 杭州海康威视数字技术股份有限公司 Data processing method, device and equipment
US20240211724A1 (en) * 2021-08-11 2024-06-27 Baidu Usa Llc Multiple-model heterogeneous computing
CN113780542B (en) * 2021-09-08 2023-09-12 北京航空航天大学杭州创新研究院 Method for constructing multi-target network structure facing FPGA
CN113778459A (en) * 2021-09-08 2021-12-10 北京航空航天大学杭州创新研究院 Operator library design method for deploying optimization on FPGA and DSP

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416438A (en) * 2018-05-30 2018-08-17 济南浪潮高新科技投资发展有限公司 A kind of convolutional neural networks hardware module dispositions method
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN109460827A (en) * 2018-11-01 2019-03-12 郑州云海信息技术有限公司 A kind of deep learning environment is built and optimization method and system
WO2019126585A1 (en) * 2017-12-21 2019-06-27 Paypal, Inc Robust features generation architecture for fraud modeling

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748913A (en) * 2017-11-09 2018-03-02 睿魔智能科技(东莞)有限公司 A kind of general miniaturization method of deep neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019126585A1 (en) * 2017-12-21 2019-06-27 Paypal, Inc Robust features generation architecture for fraud modeling
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN108416438A (en) * 2018-05-30 2018-08-17 济南浪潮高新科技投资发展有限公司 A kind of convolutional neural networks hardware module dispositions method
CN109460827A (en) * 2018-11-01 2019-03-12 郑州云海信息技术有限公司 A kind of deep learning environment is built and optimization method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FPGA-based convolutional neural network accelerator (基于FPGA的卷积神经网络加速器); Yu Zijian (余子健) et al.; Computer Engineering (《计算机工程》); 2016-04-20; Vol. 43, No. 1; full text *

Also Published As

Publication number Publication date
CN111104124A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
CN111104124B (en) PyTorch framework-based rapid deployment method of convolutional neural network on FPGA
US11514324B2 (en) Methods of optimization of computational graphs of neural networks
US20180204110A1 (en) Compressed neural network system using sparse parameters and design method thereof
CN115456160A (en) Data processing method and data processing equipment
CN111105029B (en) Neural network generation method, generation device and electronic equipment
CN111047563B (en) Neural network construction method applied to medical ultrasonic image
CN111104120B (en) Neural network compiling method and system and corresponding heterogeneous computing platform
CN114595580B (en) Complex workflow engine method meeting optimization design of large flexible blade
US20200226458A1 (en) Optimizing artificial neural network computations based on automatic determination of a batch size
CN110796233A (en) Self-adaptive compression method of deep residual convolution neural network based on transfer learning
US11461656B2 (en) Genetic programming for partial layers of a deep learning model
JP3741544B2 (en) Sequential circuit state search method and apparatus, and recording medium storing state search program
Xiao et al. An efficient algorithm for dynamic shortest path tree update in network routing
Batyuk et al. Streaming process discovery method for semi-structured business processes
CN115599918B (en) Graph enhancement-based mutual learning text classification method and system
CN116306424A (en) PISA architecture chip resource arrangement method based on dynamic amplification layer-by-layer optimization algorithm with adjustable level margin improvement
WO2021238734A1 (en) Method for training neural network, and related device
CN112214683B (en) Mixed recommendation model processing method, system and medium based on heterogeneous information network
CN111931913B (en) Deployment method of convolutional neural network on FPGA (field programmable gate array) based on Caffe
Gelle et al. Constraint satisfaction methods for applications in engineering
US7305373B1 (en) Incremental reduced error pruning
CN114912570A (en) Method, device and equipment for accelerating neural network model optimization and readable medium
Gowda et al. Dual Cognitive Architecture: Incorporating Biases and Multi-Memory Systems for Lifelong Learning
US20190065586A1 (en) Learning method, method of using result of learning, generating method, computer-readable recording medium and learning device
Shahan et al. Bayesian networks for set-based collaborative design

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant