CN110738311A - LSTM network acceleration method based on high-level synthesis - Google Patents
LSTM network acceleration method based on high-level synthesis
- Publication number
- CN110738311A (application CN201910975595.5A)
- Authority
- CN
- China
- Prior art keywords
- lstm network
- fitting
- lstm
- acceleration
- error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
The invention discloses an LSTM network acceleration method based on high-level synthesis, belongs to the field of embedded online application of deep neural networks, and aims to solve the problems that the existing LSTM network is computationally complex and runs slowly on embedded platforms. The specific process of the invention is as follows: construct an LSTM network model in MATLAB and train it; fit the activation functions piecewise; when the fitting error of the activation functions is within the threshold range, convert the LSTM network model into high-level language code and optimize the code structure to obtain optimization instructions; add the optimization instructions in high-level synthesis and replace the data type with fixed point to obtain the LSTM acceleration network; on a Zynq platform, run the unoptimized LSTM network on the PS end and the LSTM acceleration network on the PL end to obtain their running times; calculate the acceleration ratio and the error, and when both are within their threshold ranges the optimized acceleration is complete. The invention is used for accelerating LSTM networks.
Description
Technical Field
The invention relates to an LSTM network acceleration method based on high-level synthesis, and belongs to the field of embedded online application of deep neural networks.
Background
The LSTM (Long Short-Term Memory) network is a type of recurrent neural network (RNN). It is generally used for multi-dimensional time-series prediction over relatively long intervals and delays, and differs from an ordinary RNN in the structure of its neurons.
An ASIC, as a special-purpose circuit, offers higher running speed and lower power consumption than an FPGA, but has poor generality, a complex design flow, and higher cost. The FPGA offers high parallel processing speed, flexible design, and strong generality; it suits different embedded platforms and can provide different optimized acceleration schemes.
In summary, to make the LSTM network practical for embedded online applications under constraints on time, power consumption, and volume, it is imperative to design acceleration methods that address its slow operation speed and poor real-time performance.
Disclosure of Invention
The invention aims to solve the problems of complex computation and low running speed of the existing LSTM network on embedded platforms, and provides an LSTM network acceleration method based on high-level synthesis.
The LSTM network acceleration method based on high-level synthesis according to the invention comprises the following specific process:
S1, constructing an LSTM network model using MATLAB, and training the LSTM network model;
S2, piecewise fitting the activation functions of the LSTM network model using MATLAB;
S3, obtaining the fitting error of the activation functions, and judging whether the fitting error is within the fitting error threshold range; if not, returning to S2; if so, proceeding to S4;
S4, converting the LSTM network model into high-level language code, and optimizing the code structure to obtain optimization instructions;
S5, adding the optimization instructions in high-level synthesis, and replacing the data type with fixed point to obtain the LSTM acceleration network;
S6, creating a project in Vivado on a Zynq platform, running the unoptimized LSTM network on the PS end to obtain the running time of the unoptimized network model, and running the LSTM acceleration network obtained in S5 on the PL end to obtain the running time of the optimized network model;
S7, calculating the acceleration ratio of the running times and the error, and judging whether the acceleration ratio is within the acceleration ratio threshold range and whether the error is within the error threshold range; if not, returning to S5; if so, finishing the optimized acceleration.
Preferably, the activation functions fitted piecewise in S2 are the sigmoid function and the tanh function.
Preferably, the specific method of the piecewise fitting in S2 is as follows:
a cubic function is used for fitting on the (0,1) interval and a quadratic function on the other intervals.
Preferably, the specific process of obtaining the fitting error of the activation function in S3 is as follows:
S3-1, obtaining the curve of the original activation function;
S3-2, obtaining the curve of the piecewise-fitted activation function;
S3-3, subtracting the curve of S3-2 from the curve of S3-1; the difference is the fitting error.
Preferably, the fitting error threshold range of S3 is: smaller than the order of 10⁻³.
Preferably, in S4 the LSTM network model is converted into C++ code.
Preferably, optimizing the code structure includes:
using the memset() function to complete data initialization;
using intermediate variables to replace repeated multiplications;
setting parameter caches for data in the calculation process;
transferring array elements using pointers;
and using a cache array to receive the data stream of the function data interface.
Preferably, the specific method of replacing the data type with fixed point in S5 is as follows:
24-bit fixed-point data is used, with one sign bit, three integer bits, and the remaining 20 bits as fractional bits.
Preferably, the method for calculating the acceleration ratio of the running time in S7 is: dividing the unoptimized network-model running time on the PS end by the optimized network-model running time on the PL end;
the error calculation method comprises the following steps: and (4) making a difference between the operation result of the PL-terminal optimized network model obtained in the step (S6) and the operation result of the LSTM network model obtained in the step (S1), wherein the difference is an error.
Preferably, the acceleration ratio threshold range of S7 is: greater than or equal to 50;
the error threshold range of S7 is: on the order of 10⁻⁹ or less.
The invention has the advantage of providing an LSTM network acceleration method based on high-level synthesis to meet the requirements of optimizing and accelerating LSTM network models in different scenarios. The invention uses high-level synthesis to address the low-speed problem, with a Xilinx Zynq-7000 as the operating platform. The running time of the existing LSTM network on the PS end is 7.23 ms, the running time with the network acceleration method of the invention on the PL end is 132.27 us, and the acceleration ratio is 54.66. The calculation error of the LSTM network on the PS end is 2.29051e-14, and the calculation error on the PL end using the network acceleration method of the invention is 3.95783e-09.
Drawings
FIG. 1 is a flow chart of the LSTM network acceleration method based on high-level synthesis according to the present invention.
FIG. 2 is a fitting error curve of sigmoid function and tanh function.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments; it is apparent that the described embodiments are only some embodiments of the present invention, rather than all embodiments.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention will now be described in further detail with reference to the figures and examples, which are not to be taken as limiting the invention.
In a specific embodiment, described with reference to FIG. 1, the specific process of the LSTM network acceleration method based on high-level synthesis is as follows:
S1, constructing an LSTM network model using MATLAB, and training the LSTM network model;
S2, piecewise fitting the activation functions of the LSTM network model using MATLAB;
S3, obtaining the fitting error of the activation functions, and judging whether the fitting error is within the fitting error threshold range; if not, returning to S2; if so, proceeding to S4;
S4, converting the LSTM network model into high-level language code, and optimizing the code structure to obtain optimization instructions;
S5, adding the optimization instructions in high-level synthesis, and replacing the data type with fixed point to obtain the LSTM acceleration network;
S6, creating a project in Vivado on a Zynq platform, running the unoptimized LSTM network on the PS end to obtain the running time of the unoptimized network model, and running the LSTM acceleration network obtained in S5 on the PL end to obtain the running time of the optimized network model;
S7, calculating the acceleration ratio of the running times and the error, and judging whether the acceleration ratio is within the acceleration ratio threshold range and whether the error is within the error threshold range; if not, returning to S5; if so, finishing the optimized acceleration.
In this embodiment, high-level synthesis (HLS) refers to describing a design in a high-level language and then synthesizing that description into a usable netlist file (for example, an NGC netlist) for implementation on the device.
In this embodiment, a Xilinx Zynq-7000 is used as the operating platform, model Zynq-XC7Z045 SoC. The PS end consists of a dual-core ARM Cortex-A9 processor with a main frequency of 666.666 MHz. The PL end is FPGA fabric with a clock frequency of 66.666 MHz.
Further, the activation functions fitted piecewise in S2 are the sigmoid function and the tanh function.
In this embodiment, the LSTM network neurons use two activation functions, the sigmoid function and the tanh function. Both involve exponential and division operations, whose hardware implementation on an FPGA occupies a large amount of resources and takes a long time to compute. Therefore, the sigmoid and tanh functions are fitted piecewise to replace the exponential and division operations of the LSTM network model and thereby achieve acceleration.
Further, the specific method of the piecewise fitting in S2 is:
a cubic function is used for fitting on the (0,1) interval and a quadratic function on the other intervals.
In this embodiment, the piecewise-fitting coefficients of the sigmoid and tanh activation functions are shown in Tables 1 and 2:
TABLE 1
| Interval | Cubic coefficient | Quadratic coefficient | Linear coefficient | Constant term |
| --- | --- | --- | --- | --- |
| (0, 0.3) | -0.020321834 | -1.306422e-04 | 0.25001000 | 0.4999998 |
| (0.3, 0.6) | -0.016878724 | -0.003441058 | 0.25110966 | 0.4998746 |
| (0.6, 1) | -0.010110341 | -0.016191247 | 0.25922768 | 0.4981305 |
| (1, 1.5) | —— | -0.047769817 | 0.29248601 | 0.4863312 |
| (1.5, 2) | —— | -0.044297020 | 0.28135895 | 0.4952337 |
| (2, 2.5) | —— | -0.034908787 | 0.24360674 | 0.5332612 |
| (2.5, 3.5) | —— | -0.020618677 | 0.16971086 | 0.6290141 |
| (3.5, 5) | —— | -0.006967799 | 0.07381673 | 0.7980843 |
| (5, 7) | —— | -0.001314399 | 0.01849083 | 0.9339095 |
TABLE 2
The fitting errors of the two activation functions are shown in FIG. 2, where curve a is the fitting-error curve of the sigmoid function and curve b is that of the tanh function. The fitting errors are small enough that the fitted functions can replace the originals.
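The piecewise fit described above can be sketched in C++, the language the patent targets for high-level synthesis. The segment coefficients below are taken from Table 1 (the sigmoid fit); the handling of negative inputs via the identity sigmoid(-x) = 1 - sigmoid(x), the saturation beyond x = 7, and the tanh-from-sigmoid identity are our assumptions for a self-contained sketch (the patent fits tanh with its own Table 2 coefficients).

```cpp
#include <cassert>
#include <cmath>

// Piecewise polynomial approximation of sigmoid on (0, 7), using the
// Table 1 coefficients: cubic segments on (0, 1), quadratic elsewhere.
// Negative inputs use sigmoid(-x) = 1 - sigmoid(x); inputs beyond 7
// saturate to 1. (Symmetry/saturation handling is assumed, not stated.)
double sigmoid_fit(double x) {
    if (x < 0.0) return 1.0 - sigmoid_fit(-x);
    struct Seg { double hi, c3, c2, c1, c0; };
    static const Seg segs[] = {
        {0.3, -0.020321834, -1.306422e-04, 0.25001000, 0.4999998},
        {0.6, -0.016878724, -0.003441058, 0.25110966, 0.4998746},
        {1.0, -0.010110341, -0.016191247, 0.25922768, 0.4981305},
        {1.5, 0.0, -0.047769817, 0.29248601, 0.4863312},
        {2.0, 0.0, -0.044297020, 0.28135895, 0.4952337},
        {2.5, 0.0, -0.034908787, 0.24360674, 0.5332612},
        {3.5, 0.0, -0.020618677, 0.16971086, 0.6290141},
        {5.0, 0.0, -0.006967799, 0.07381673, 0.7980843},
        {7.0, 0.0, -0.001314399, 0.01849083, 0.9339095},
    };
    for (const Seg& s : segs)
        if (x < s.hi)  // Horner evaluation: c3*x^3 + c2*x^2 + c1*x + c0
            return ((s.c3 * x + s.c2) * x + s.c1) * x + s.c0;
    return 1.0;        // saturated region (assumed)
}

// tanh can reuse the same table via the identity tanh(x) = 2*sigmoid(2x) - 1;
// this replaces the separate Table 2 fit purely for illustration.
double tanh_fit(double x) { return 2.0 * sigmoid_fit(2.0 * x) - 1.0; }
```

No exponential or division appears in the fitted path, which is exactly why the piecewise form maps cheaply onto FPGA multipliers and adders.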
Further, the specific process of obtaining the fitting error of the activation function in S3 is:
S3-1, obtaining the curve of the original activation function;
S3-2, obtaining the curve of the piecewise-fitted activation function;
S3-3, subtracting the curve of S3-2 from the curve of S3-1; the difference is the fitting error.
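The steps above amount to sampling both curves and taking the difference; a minimal sketch, shown for one quadratic segment of the sigmoid fit (the (2, 2.5) interval of Table 1). The sampling grid step is an illustrative choice, not specified in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// S3 fitting-error check: sample the original activation function and its
// piecewise fit on a grid and keep the largest absolute difference.
double max_fit_error() {
    double worst = 0.0;
    for (double x = 2.0; x <= 2.5; x += 0.001) {
        // Quadratic segment coefficients from Table 1, interval (2, 2.5)
        double fit = -0.034908787 * x * x + 0.24360674 * x + 0.5332612;
        double ref = 1.0 / (1.0 + std::exp(-x));  // original sigmoid
        worst = std::max(worst, std::fabs(fit - ref));
    }
    return worst;  // compared against the 10^-3 threshold in S3
}
```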
Further, the fitting error threshold range of S3 is: smaller than the order of 10⁻³.
Further, in S4 the LSTM network model is converted into C++ code.
Further, optimizing the code structure includes:
using the memset() function to complete data initialization;
using intermediate variables to replace repeated multiplications;
setting parameter caches for data in the calculation process;
transferring array elements using pointers;
and using a cache array to receive the data stream of the function data interface.
In this embodiment, using the memset() function to complete data initialization avoids the time consumed by loop-based, element-by-element initialization.
In this embodiment, intermediate variables replace repeated multiplications. For example, if a function repeatedly needs the product a·x, a·x can be computed once and then reused in subsequent calculations, reducing the time consumed by a large number of multiplication operations.
In the embodiment, parameter cache data is set in the calculation process, so that calculation delay caused by dynamic memory allocation is avoided.
In this embodiment, pointers are used to transfer array elements, reducing the number of cycles of array assignments.
In this embodiment, the data stream of the function data interface is received by using the cache array, so that the problem of interface data stream blockage is avoided.
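The optimizations above can be illustrated on a toy matrix-vector step; all sizes and names here are illustrative, not from the patent.

```cpp
#include <cassert>
#include <cstring>

const int N = 4;  // toy dimension, illustrative only

// Sketch of the code-structure optimizations: memset-based initialization,
// pointer transfer of array elements, an intermediate accumulator instead
// of repeated indexed multiplications, and a local cache array that is
// copied out once at the end.
void lstm_step(const float W[N][N], const float* x, float* out) {
    float acc[N];
    std::memset(acc, 0, sizeof(acc));   // memset data initialization
    for (int i = 0; i < N; ++i) {
        const float* row = W[i];        // transfer array elements via pointer
        float sum = 0.0f;               // intermediate variable: accumulate
        for (int j = 0; j < N; ++j)     // once instead of re-reading acc[i]
            sum += row[j] * x[j];
        acc[i] = sum;                   // results cached locally
    }
    std::memcpy(out, acc, sizeof(acc)); // single copy out of the cache array
}
```

In an HLS flow, each of these patterns also makes the loops easier for the tool to pipeline and partition.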
Further, the specific method of replacing the data type with fixed point in S5 is:
24-bit fixed-point data is used, with one sign bit, three integer bits, and the remaining 20 bits as fractional bits.
In this embodiment, the LSTM network structure is described in a high-level language. Floating-point data gives accurate results, but the LSTM network performs a large number of multiplications that occupy substantial hardware resources, and floating-point arithmetic is complex and slow. Using a fixed-point data type reduces data precision and increases calculation error, but greatly improves calculation speed. Within the acceptable range of the final calculation error, choosing a suitable data width, number of integer bits, and number of fractional bits yields a good acceleration effect.
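A software sketch of the 24-bit format described above (1 sign bit, 3 integer bits, 20 fractional bits). In an actual Vivado HLS design this would presumably be declared as an arbitrary-precision fixed-point type such as ap_fixed<24,4>; the plain-integer arithmetic below only illustrates the quantization involved.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Q3.20-style layout: values scaled by 2^20, stored in a signed integer.
// Representable range is roughly [-8, 8) with a resolution of 2^-20.
const int FRAC_BITS = 20;
const double SCALE = double(1 << FRAC_BITS);

int32_t to_fixed(double x)    { return int32_t(std::lround(x * SCALE)); }
double  from_fixed(int32_t q) { return double(q) / SCALE; }
```

Round-tripping a value through this format loses at most half an LSB (about 4.8e-7), consistent with the patent's reported PL-end error on the order of 10⁻⁹ after a full network pass being acceptable.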
Further, the acceleration ratio of the running time in S7 is calculated by dividing the unoptimized network-model running time on the PS end by the optimized network-model running time on the PL end;
the error calculation method comprises the following steps: and (4) making a difference between the operation result of the PL-terminal optimized network model obtained in the step (S6) and the operation result of the LSTM network model obtained in the step (S1), wherein the difference is an error.
Further, the acceleration ratio threshold range of S7 is: greater than or equal to 50;
the error threshold range of S7 is: on the order of 10⁻⁹ or less.
Further, the PS end is the processing system of the Zynq platform, the PL end is the programmable logic of the Zynq platform, and the PS and PL ends communicate using the AXI bus protocol.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.
Claims (10)
1. An LSTM network acceleration method based on high-level synthesis, characterized in that its specific process is as follows:
S1, constructing an LSTM network model using MATLAB, and training the LSTM network model;
S2, piecewise fitting the activation functions of the LSTM network model using MATLAB;
S3, obtaining the fitting error of the activation functions, and judging whether the fitting error is within the fitting error threshold range; if not, returning to S2; if so, proceeding to S4;
S4, converting the LSTM network model into high-level language code, and optimizing the code structure to obtain optimization instructions;
S5, adding the optimization instructions in high-level synthesis, and replacing the data type with fixed point to obtain the LSTM acceleration network;
S6, creating a project in Vivado on a Zynq platform, running the unoptimized LSTM network on the PS end to obtain the running time of the unoptimized network model, and running the LSTM acceleration network obtained in S5 on the PL end to obtain the running time of the optimized network model;
S7, calculating the acceleration ratio of the running times and the error, and judging whether the acceleration ratio is within the acceleration ratio threshold range and whether the error is within the error threshold range; if not, returning to S5; if so, finishing the optimized acceleration.
2. The LSTM network acceleration method based on high-level synthesis of claim 1, wherein the activation functions fitted piecewise in S2 are the sigmoid function and the tanh function.
3. The LSTM network acceleration method based on high-level synthesis according to claim 1 or 2, wherein the piecewise fitting method of S2 is as follows:
a cubic function is used for fitting on the (0,1) interval and a quadratic function on the other intervals.
4. The LSTM network acceleration method based on high-level synthesis according to claim 3, wherein the specific process of obtaining the fitting error of the activation function in S3 is as follows:
S3-1, obtaining the curve of the original activation function;
S3-2, obtaining the curve of the piecewise-fitted activation function;
S3-3, subtracting the curve of S3-2 from the curve of S3-1; the difference is the fitting error.
5. The LSTM network acceleration method based on high-level synthesis according to claim 4, wherein the fitting error threshold range of S3 is: smaller than the order of 10⁻³.
6. The LSTM network acceleration method based on high-level synthesis according to claim 1, wherein in S4 the LSTM network model is converted into C++ code.
7. The LSTM network acceleration method based on high-level synthesis according to claim 1 or 6, wherein optimizing the code structure comprises:
using the memset() function to complete data initialization;
using intermediate variables to replace repeated multiplications;
setting parameter caches for data in the calculation process;
transferring array elements using pointers;
and using a cache array to receive the data stream of the function data interface.
8. The LSTM network acceleration method based on high-level synthesis according to claim 1, wherein the specific method of replacing the data type with fixed point in S5 is:
24-bit fixed-point data is used, with one sign bit, three integer bits, and the remaining 20 bits as fractional bits.
9. The LSTM network acceleration method based on high-level synthesis according to claim 1, wherein the acceleration ratio of the running time in S7 is calculated by: dividing the unoptimized network-model running time on the PS end by the optimized network-model running time on the PL end;
the error calculation method comprises the following steps: and (4) making a difference between the operation result of the PL-terminal optimized network model obtained in the step (S6) and the operation result of the LSTM network model obtained in the step (S1), wherein the difference is an error.
10. The LSTM network acceleration method based on high-level synthesis according to claim 9, wherein the acceleration ratio threshold range of S7 is: greater than or equal to 50;
the error threshold range of S7 is: on the order of 10⁻⁹ or less.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910975595.5A CN110738311A (en) | 2019-10-14 | 2019-10-14 | LSTM network acceleration method based on high-level synthesis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910975595.5A CN110738311A (en) | 2019-10-14 | 2019-10-14 | LSTM network acceleration method based on high-level synthesis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110738311A true CN110738311A (en) | 2020-01-31 |
Family
ID=69268892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910975595.5A Pending CN110738311A (en) | 2019-10-14 | 2019-10-14 | LSTM network acceleration method based on high-level synthesis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110738311A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111507465A (en) * | 2020-06-16 | 2020-08-07 | 电子科技大学 | Configurable convolutional neural network processor circuit |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902507A (en) * | 2014-03-28 | 2014-07-02 | 中国科学院自动化研究所 | Matrix multiplication calculating device and matrix multiplication calculating method both oriented to programmable algebra processor |
CN106775905A (en) * | 2016-11-19 | 2017-05-31 | 天津大学 | Higher synthesis based on FPGA realizes the method that Quasi-Newton algorithm accelerates |
CN108090560A (en) * | 2018-01-05 | 2018-05-29 | 中国科学技术大学苏州研究院 | The design method of LSTM recurrent neural network hardware accelerators based on FPGA |
CN108256636A (en) * | 2018-03-16 | 2018-07-06 | 成都理工大学 | A kind of convolutional neural networks algorithm design implementation method based on Heterogeneous Computing |
CN109144469A (en) * | 2018-07-23 | 2019-01-04 | 上海亮牛半导体科技有限公司 | Pipeline organization neural network matrix operation framework and method |
US20190114548A1 (en) * | 2017-10-17 | 2019-04-18 | Xilinx, Inc. | Static block scheduling in massively parallel software defined hardware systems |
CN109934337A (en) * | 2019-03-14 | 2019-06-25 | 哈尔滨工业大学 | A kind of detection method of the spacecraft telemetry exception based on integrated LSTM |
CN109948784A (en) * | 2019-01-03 | 2019-06-28 | 重庆邮电大学 | A kind of convolutional neural networks accelerator circuit based on fast filtering algorithm |
CN110084363A (en) * | 2019-05-15 | 2019-08-02 | 电科瑞达(成都)科技有限公司 | A kind of deep learning model accelerated method based on FPGA platform |
- 2019-10-14: CN application CN201910975595.5A, patent CN110738311A/en, active, Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902507A (en) * | 2014-03-28 | 2014-07-02 | 中国科学院自动化研究所 | Matrix multiplication calculating device and matrix multiplication calculating method both oriented to programmable algebra processor |
CN106775905A (en) * | 2016-11-19 | 2017-05-31 | 天津大学 | Higher synthesis based on FPGA realizes the method that Quasi-Newton algorithm accelerates |
US20190114548A1 (en) * | 2017-10-17 | 2019-04-18 | Xilinx, Inc. | Static block scheduling in massively parallel software defined hardware systems |
CN108090560A (en) * | 2018-01-05 | 2018-05-29 | 中国科学技术大学苏州研究院 | The design method of LSTM recurrent neural network hardware accelerators based on FPGA |
CN108256636A (en) * | 2018-03-16 | 2018-07-06 | 成都理工大学 | A kind of convolutional neural networks algorithm design implementation method based on Heterogeneous Computing |
CN109144469A (en) * | 2018-07-23 | 2019-01-04 | 上海亮牛半导体科技有限公司 | Pipeline organization neural network matrix operation framework and method |
CN109948784A (en) * | 2019-01-03 | 2019-06-28 | 重庆邮电大学 | A kind of convolutional neural networks accelerator circuit based on fast filtering algorithm |
CN109934337A (en) * | 2019-03-14 | 2019-06-25 | 哈尔滨工业大学 | A kind of detection method of the spacecraft telemetry exception based on integrated LSTM |
CN110084363A (en) * | 2019-05-15 | 2019-08-02 | 电科瑞达(成都)科技有限公司 | A kind of deep learning model accelerated method based on FPGA platform |
Non-Patent Citations (2)
Title |
---|
Peng Xinlei et al.: "Research on design and optimization methods of convolutional neural networks based on high-level synthesis", Microelectronics & Computer *
Wang Xiaolu: "Design of an LS-SVM algorithm accelerator based on Zynq", China Master's Theses Full-text Database, Information Science and Technology Series *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111507465A (en) * | 2020-06-16 | 2020-08-07 | 电子科技大学 | Configurable convolutional neural network processor circuit |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111459877B (en) | Winograd YOLOv2 target detection model method based on FPGA acceleration | |
Guo et al. | Software-hardware codesign for efficient neural network acceleration | |
CN106570559A (en) | Data processing method and device based on neural network | |
CN108196822A (en) | A kind of method and system of double-precision floating point extracting operation | |
CN106775577B (en) | A kind of design method of the non-precision redundant manipulators multiplier of high-performance | |
Chang et al. | A mixed-pruning based framework for embedded convolutional neural network acceleration | |
CN112257844B (en) | Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof | |
CN109739470A (en) | A kind of computing system based on 2 type hyperbolic CORDIC arbitrary characteristics functions | |
CN104133656A (en) | Floating point number divider adopting shift and subtraction operation by tail codes and floating point number division operation method adopting shift and subtraction operation by tail codes | |
JP2023516521A (en) | Quantum error correction decoding system, method, fault tolerant quantum error correction system and chip | |
CN112632874A (en) | Optimization method and system for numerical simulation of helicopter flow field | |
CN110222305B (en) | Logarithmic function calculation system and method based on hyperbolic CORDIC | |
CN110738311A (en) | LSTM network acceleration method based on high-level synthesis | |
Zong-ling et al. | The design of lightweight and multi parallel CNN accelerator based on FPGA | |
CN114021710A (en) | Deep learning convolution acceleration method and processor by using bit-level sparsity | |
CN113313244A (en) | Near-storage neural network accelerator facing to addition network and acceleration method thereof | |
CN110825346B (en) | Low logic complexity unsigned approximation multiplier | |
Yin et al. | FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode | |
Hsieh et al. | A multiplier-less convolutional neural network inference accelerator for intelligent edge devices | |
CN101286185A (en) | Numerical frequency synthesis circuit compiler accomplishing method based on linear interpolation structure | |
CN114925627B (en) | Helicopter flow field numerical simulation system and method based on graphic processor | |
WO2022174733A1 (en) | Neuron accelerated processing method and apparatus, and device and readable storage medium | |
CN116384455A (en) | Non-uniform piecewise linearization activation function hardware implementation method | |
CN116303219A (en) | Grid file acquisition method and device and electronic equipment | |
CN113434034B (en) | Large-scale cluster energy-saving method for adjusting CPU frequency of calculation task by utilizing deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||