CN115983354B - High-precision adjustable general activation function implementation method - Google Patents

High-precision adjustable general activation function implementation method

Info

Publication number
CN115983354B
CN115983354B (application CN202310052328.7A)
Authority
CN
China
Prior art keywords: segments, approximation, activation function, error, precision
Prior art date
Legal status: Active
Application number
CN202310052328.7A
Other languages
Chinese (zh)
Other versions
CN115983354A (en)
Inventor
马艳华
徐琪灿
陈聪聪
宋泽睿
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202310052328.7A priority Critical patent/CN115983354B/en
Publication of CN115983354A publication Critical patent/CN115983354A/en
Application granted granted Critical
Publication of CN115983354B publication Critical patent/CN115983354B/en


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention belongs to the technical field of field-programmable gate array (FPGA) hardware accelerators and discloses a high-precision, adjustable, general-purpose activation function implementation method. The method achieves high-precision activation function approximation with a small amount of storage and on-chip resources, and the precision can be set on demand to balance accuracy against storage space. The method can accurately estimate the accuracy achievable by a given segmentation strategy for the activation function, allowing the strategy to be adjusted to a target approximation accuracy and avoiding the on-chip resource waste caused by precision overflow. Compared with traditional methods, the proposed method achieves higher precision and a larger adjustable range, and consumes fewer hardware resources than other methods capable of high precision.

Description

High-precision adjustable general activation function implementation method
Technical Field
The invention belongs to the technical field of field-programmable gate array (FPGA) hardware accelerators, and provides an approximation method that effectively handles a variety of common nonlinear activation functions. The method reduces FPGA hardware resource consumption while achieving high precision, provides a large adjustable range, and avoids the resource waste caused by precision overflow. In particular, the invention relates to a high-precision, adjustable, general-purpose activation function implementation method.
Background
Nonlinear activation functions provide the nonlinearity of a neural network and are an important component of it. In general, nonlinear activation functions are complex to calculate and difficult to implement exactly on an FPGA. Therefore, when a designer needs a nonlinear activation function on an FPGA, the function must be approximated by some approximation method.
In recent years, scholars at home and abroad have conducted research on improving activation function approximation accuracy. FPGA-oriented activation function approximation methods fall mainly into two categories. The first is piecewise approximation: the target activation function is divided into several regions according to a specific segmentation scheme, and each region is described by a different linearized expression, thereby approximating the original function ("FPGA implementation for the sigmoid with piecewise linear fitting method based on curvature analysis", Electronics, 2022). The second is the lookup table method, in which all input/output values of the activation function are stored in memory and read by table lookup ("A twofold lookup table architecture for efficient approximation of activation functions", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2020). This approach can achieve very high accuracy but requires a large amount of memory, placing significant pressure on the FPGA's on-chip storage. In addition, hybrid approaches have emerged in current research that improve approximation accuracy while using only a small amount of memory ("A modular approximation methodology for efficient fixed-point hardware implementation of the sigmoid function", IEEE Transactions on Industrial Electronics, 2022). However, the above methods consider only Sigmoid and functions derived from it by simple mathematical transformations, such as Tanh, and ignore newer activation functions such as Swish and Mish. Given the progress of neural network hardware acceleration research, an activation function approximation method should be designed to be general.
If only the lookup table method is used, storing every possible output of the activation function in the table is unfriendly to the FPGA's on-chip memory. An activation function approximation method that is general, high-precision, adjustable, and small in storage footprint is therefore necessary.
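The two baseline approaches discussed above can be sketched in a few lines of Python (a minimal illustration; the segment count, sample counts, and table size below are illustrative assumptions, not values from the cited designs):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pwl_table(f, lo=0.0, hi=8.0, n_seg=16, samples=64):
    """Baseline 1: one least-squares line per equal-width segment."""
    table = []
    width = (hi - lo) / n_seg
    for i in range(n_seg):
        a, b = lo + i * width, lo + (i + 1) * width
        xs = np.linspace(a, b, samples)
        slope, icpt = np.polyfit(xs, f(xs), 1)   # degree-1 fit: slope, intercept
        table.append((a, b, slope, icpt))
    return table

def pwl_eval(table, x):
    for a, b, slope, icpt in table:
        if a <= x <= b:
            return slope * x + icpt
    raise ValueError("x outside approximated range")

def build_lut(f, lo=0.0, hi=8.0, bits=10):
    """Baseline 2: precompute the output for every quantized input."""
    xs = np.linspace(lo, hi, 1 << bits)
    return xs, f(xs)

table = pwl_table(sigmoid)
xs, ys = build_lut(sigmoid)
err_pwl = abs(pwl_eval(table, 1.3) - sigmoid(1.3))
err_lut = abs(ys[np.argmin(np.abs(xs - 1.3))] - sigmoid(1.3))
```

The trade-off is visible directly: the lookup table stores 2¹⁰ entries to the piecewise method's 16 line coefficients, which is the memory pressure the background describes.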
Disclosure of Invention
Because the hybrid methods in existing research rely on a small amount of memory combined with function-specific mathematical transformations, they apply only to the Sigmoid activation function and functions derived from it by simple calculation, such as Tanh, and therefore lack generality. Moreover, the fixed mathematical calculation process leaves the realized approximation little room for precision adjustment. To address these problems, the invention provides a general hybrid approximation method for activation functions on FPGAs, which achieves high-precision approximation with a small amount of storage and on-chip resources and allows the precision to be set on demand, balancing accuracy against storage space.
The technical scheme of the invention is as follows:
a high-precision adjustable general activation function implementation method comprises the following steps:
step 1: assuming that the bit width of the input data x is n, uniformly dividing the activation function f (x) into 16 segments, using an expected error E to represent the expected activation function precision, and dividing all segments into three types through calculation according to the expected error;
step 1.1: calculating an approximation error E when approximating an activation function using a piecewise linear approach of 16-segment equipartition 1avg Average curvature of each of the 16 segments and maximum curvature C of the entire activation function max
Step 1.2: determine the required constant coefficients K_1 and K_2; the formulas are as follows:
step 1.3: the average curvature of each segment is rearranged in order from big to small, and the segments with the largest average curvature are counted, so that k segments with the largest average curvature are obtained as the first class, wherein k is the minimum integer which needs to satisfy the following inequality:
k<C sum K 2 –16(E 1avg –E)K 1 K 2
wherein C is sum Representing the sum of the average curvatures of the first class of segments of number k;
step 1.4: estimating margin E of approximation error 2 The formula is as follows:
according to margin E 2 Size, from the smallest fraction of the average curvatureStarting counting of the segments, merging adjacent segments, calculating the error increment after merging as error summation before merging the segments and multiplying the error summation by the number of the segments; under the condition of meeting the allowance requirement, combining as many segments as possible to obtain segments to be combined as a third type segment, wherein the number of the segments is m, the rest segments are the second type segments, and the number of the segments is 16-k-m;
step 2: three different approximation methods are used for three different classes of segments;
step 2.1: for the first class of segments, a method of nonlinear approximation is used; firstly, calculating a tangent g (x) of a left end point of the segment, then squaring the lowest n-5 effective bits of the x, taking the first ten bits of the result, forming a corresponding relation with f (x) -g (x), training a single-layer perceptron as a data set, and adding the result of the single-layer perceptron with the g (x) to obtain an approximation result of the f (x);
step 2.2: aiming at the second class of segments, a linear approximation method is used, and approximation is carried out through a least square method;
step 2.3: for the third class of segments, several adjacent segments are combined and then a linear approximation method, i.e. a least squares method, is used.
Step 3: completing hardware deployment according to the algorithm designed in the step 1 and the step 2;
step 3.1: calculating to obtain all weights, biases and coefficients required in the step 1 and the step 2;
step 3.2: coding the highest 4-bit valid bit of x according to the segmentation condition, reading the coefficient and bias of the straight line as an address, if the segmentation uses nonlinear approximation, reading the weight of a single-layer perceptron, and accumulating the bias of the single-layer perceptron to the bias of the straight line;
step 3.3: and (3) cutting the weight of the single-layer perceptron, only leaving the data valid bit, calculating an approximation error, if the approximation error is smaller than the expected error, further cutting the weight, and cutting the bit width of all the weights to be consistent with the weight of the least valid bit.
The invention has the following beneficial effects: the curvature-based approximation precision prediction method can accurately estimate the precision achievable by a proposed activation function segmentation strategy, enabling the strategy to be adjusted for a given target approximation precision and avoiding the on-chip resource waste caused by precision overflow. Compared with traditional methods, the proposed method achieves higher precision and a larger adjustable range, and consumes fewer hardware resources than other methods capable of high precision.
Drawings
FIG. 1 is the hardware deployment of the method used by the present invention, where x denotes the input data, f(x) the output data, and W_i, i = 1, …, 9, the weights of the single-layer perceptron; k_1 and b_1 are the slope and offset of the straight line.
Fig. 2 is a flow chart of an algorithm of a nonlinear approximation method used in the present invention.
FIG. 3 is a schematic diagram of weight truncation for the single-layer perceptron in the present invention, where each point represents a bit: (a) the untruncated state, (b) only the significant data bits retained, (c) the significant bits of all weights truncated to the same width.
Detailed Description
The invention is further described with reference to the drawings and a specific embodiment, in which the proposed method approximates the Sigmoid activation function with an expected error of 7×10⁻⁵.
Step 1: the bit width of the input data x is 16, comprising one sign bit, three integer bits and twelve fractional bits; the formula of the Sigmoid activation function f(x) is as follows:
since the activation function is symmetrical about a point on the vertical axis, only data ranging between 0 and 8 is approximated, dividing the function in this range into 16 segments, each segment having a length of 0.5. The expected activation function approximation error is 7×10 -5 All segments are divided into three categories by calculation based on the expected error.
Step 1.1: the approximation error E_1avg when approximating the activation function with the uniform 16-segment piecewise linear method is calculated to be 7.11×10⁻³, and the maximum curvature C_max of the entire activation function is 0.0924. The average curvatures of the 16 segments are, respectively, 0.0275, 0.0717, 0.0908, 0.0862, 0.0690, 0.0496, 0.0334, 0.0216, 0.0136, 0.0084, 0.0052, 0.0032, 0.0019, 0.0012, 7.15×10⁻⁴, and 4.34×10⁻⁴.
Step 1.2: the required constant coefficients are calculated: K_1 is 1.299×10³ and K_2 is 12.6.
Step 1.3: the average curvatures are sorted in descending order and the segments with the largest average curvature are counted; the number k of first-class segments computed from the formula has a minimum integer value of 9, so the first class consists of the 9 segments with the largest average curvature, i.e., the 9 segments covering 0 to 4.5.
Step 1.4: the margin E_2 of the approximation error is estimated from the formula to be 6.7×10⁻⁶. According to this margin, counting from the segment with the smallest average curvature, adjacent segments are merged, with the post-merge error increment estimated as the sum of the pre-merge segment errors multiplied by the number of merged segments. Under the margin requirement, the segments that can be merged are 7 to 7.5 and 7.5 to 8, so the third class has 2 segments; the remaining segments form the second class, numbering 5.
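The merge-cost rule of Step 1.4 can be sketched for the two flattest segments. This is a rough illustration under assumptions: the per-segment error is taken as the mean absolute error of a least-squares line, and the margin formula itself is not reproduced here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def seg_error(f, a, b, samples=1025):
    # Mean absolute error of the least-squares line over [a, b]
    xs = np.linspace(a, b, samples)
    slope, icpt = np.polyfit(xs, f(xs), 1)
    return np.abs(slope * xs + icpt - f(xs)).mean()

# Cost of merging [7, 7.5] and [7.5, 8]: pre-merge error sum times the
# number of merged segments, per the rule stated above.
merge_cost = (seg_error(sigmoid, 7.0, 7.5) + seg_error(sigmoid, 7.5, 8.0)) * 2
merged_err = seg_error(sigmoid, 7.0, 8.0)
```

Segments are merged greedily from the flattest end while the accumulated cost fits within the margin E_2.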
Step 2: three different approaches are used for three different classes of segments.
Step 2.1: for the first class of segments, the nonlinear approximation method is used: first the tangent g(x) at the left endpoint of the segment is calculated; then the lowest 11 significant bits of x are squared and the first ten bits of the result are taken; these are paired with f(x) − g(x) to form a data set on which a single-layer perceptron is trained; the approximation of f(x) is the perceptron output added to g(x). The flow of the nonlinear approximation is shown in fig. 2.
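A sketch of this nonlinear step for one first-class segment follows. It is hedged throughout: the 1-sign/3-integer/12-fraction input format and the segment starting at 1.0 are assumptions, and ordinary least squares stands in for the unspecified perceptron training procedure. The intuition is that the tangent's residual is ≈ ½f″·δ², so the squared low bits (δ²) are a natural feature for a linear unit:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

FRAC = 12        # fractional bits of x (1-sign/3-integer/12-fraction, assumed)
SEG_LO = 1.0     # left endpoint of one first-class segment (illustrative)

def tangent(x):
    # g(x): tangent line to f at the segment's left endpoint
    f0 = sigmoid(SEG_LO)
    d0 = f0 * (1.0 - f0)                 # sigmoid'(x) = f(x) * (1 - f(x))
    return f0 + d0 * (x - SEG_LO)

# Offsets within the segment = the low 11 bits of the fixed-point input.
low_bits = np.arange(1 << 11)                     # 0 .. 2047
xs = SEG_LO + low_bits / (1 << FRAC)
sq = low_bits.astype(np.int64) ** 2               # square (at most 22 bits)
top10 = sq >> 12                                  # first ten bits of the square
bits = ((top10[:, None] >> np.arange(9, -1, -1)) & 1).astype(float)

# "Train" the single-layer perceptron (a linear unit with ten bit inputs
# and a bias) on the residual f(x) - g(x), via least squares.
A = np.hstack([bits, np.ones((len(xs), 1))])
resid = sigmoid(xs) - tangent(xs)
w, *_ = np.linalg.lstsq(A, resid, rcond=None)
max_err = np.abs(tangent(xs) + A @ w - sigmoid(xs)).max()
tangent_err = np.abs(resid).max()
```

The ten bit inputs plus one bias match the W_0–W_9 and Bias rows of Table 1, and the fitted unit reduces the tangent-only error by roughly an order of magnitude in this sketch.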
Step 2.2: for the second class of segments, a linear approximation method is used, and approximation is performed by a least square method.
Step 2.3: for the third class of segments, several adjacent segments are combined and then a linear approximation method, that is, a least square method, is used.
Step 3: the hardware deployment is completed according to the algorithm designed in Step 1 and Step 2.
Step 3.1: all weights, biases and coefficients required in the step 1 and the step 2 are obtained through calculation. The weights using the nonlinear approximation method are shown in table 1. The coefficients and offsets of the straight lines are shown in table 2.
Step 3.2: the most significant 4 bits of x are encoded according to the segmentation and used as an address to read the line's coefficients and bias; if the segment uses the nonlinear approximation, the single-layer perceptron's weights are also read and its bias is added to the line's bias. The hardware deployment is shown in fig. 1.
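The address encoding can be sketched as follows, under the assumed 1-sign/3-integer/12-fraction format: the 4 most significant magnitude bits of the input equal ⌊x / 0.5⌋, which directly indexes the 16 half-unit segments. Coefficient values are placeholders, not those of Table 2:

```python
FRAC = 12   # assumed fixed-point format: 1 sign, 3 integer, 12 fractional bits

def segment_index(x_fixed):
    """The most significant 4 magnitude bits of the fixed-point input
    select one of the 16 half-unit segments covering [0, 8)."""
    return (x_fixed >> 11) & 0xF

# Per-segment (slope, bias) table addressed by that index (placeholders):
coeffs = [(0.0, 0.5)] * 16

def line_approx(x_fixed):
    slope, bias = coeffs[segment_index(x_fixed)]
    return slope * (x_fixed / (1 << FRAC)) + bias

idx = segment_index(int(1.5 * 4096))   # x = 1.5 falls in segment [1.5, 2.0)
```

In hardware this is just a 4-bit ROM address; for nonlinear segments the same address additionally fetches the perceptron weights.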
Step 3.3: the single-layer perceptron's weights are truncated, leaving only the significant data bits; the approximation error is calculated to be 6.54×10⁻⁵, smaller than the expected error, so the weights are truncated further, cutting the bit widths of all weights to match the weight with the fewest significant bits and increasing the approximation error to 6.82×10⁻⁵.
The approximation results and resource usage of the invention are shown in Table 3. As can be seen, the invention provides an FPGA-based approximation method effective for a variety of common nonlinear activation functions. The method mainly consists of adopting different approximation methods according to the average curvature of the segments, estimating the number of segments assigned to each approximation method from the expected activation function approximation error, and the corresponding hardware deployment. It achieves high precision while reducing hardware resource consumption, and provides a large adjustable range.
Table 1 is the weights and biases used in the present invention when approximating the Sigmoid activation function using the nonlinear method

Left endpoint of interval | 0     | 0.5   | 1     | 1.5   | 2     | 2.5   | 3     | 3.5   | 4
W_0 (×10⁻⁶)               | 5.72  | 13.4  | 14.3  | 11.4  | 7.63  | 3.81  | 2.86  | 2.86  | 1.91
W_1 (×10⁻⁶)               | 9.54  | 21.9  | 24.8  | 22.9  | 15.3  | 8.58  | 6.68  | 4.77  | 3.81
W_2 (×10⁻⁶)               | 19.1  | 39.1  | 47.7  | 45.8  | 31.5  | 19.1  | 14.3  | 9.54  | 8.58
W_3 (×10⁻⁵)               | 3.53  | 7.34  | 9.25  | 9.16  | 6.39  | 4.10  | 3.05  | 2.00  | 1.62
W_4 (×10⁻⁵)               | 8.49  | 14.3  | 18.4  | 18.3  | 13.2  | 8.77  | 6.20  | 4.01  | 3.15
W_5 (×10⁻⁵)               | 14.8  | 28.4  | 36.7  | 36.4  | 26.6  | 18.0  | 12.8  | 8.30  | 6.29
W_6 (×10⁻⁴)               | 2.57  | 5.67  | 7.32  | 7.26  | 5.41  | 3.74  | 2.61  | 1.68  | 1.23
W_7 (×10⁻⁴)               | 4.84  | 11.34 | 14.66 | 14.42 | 10.94 | 7.66  | 5.27  | 3.41  | 2.40
W_8 (×10⁻⁴)               | 9.65  | 22.89 | 29.45 | 28.73 | 21.98 | 15.46 | 10.61 | 6.87  | 4.74
W_9 (×10⁻⁴)               | 19.28 | 46.49 | 58.07 | 56.46 | 43.48 | 30.59 | 20.66 | 13.41 | 8.98
Bias (×10⁻⁶)              | 132   | 7.63  | −2.86 | 1.91  | −37.2 | −42.9 | −25.7 | −16.2 | 2.86
Table 2 shows the linear slopes and offsets of all segments when approximating the Sigmoid activation function in the present invention
Table 3 shows the results of the present invention compared with other designs in terms of accuracy and hardware resource usage

Design            | Average error | Maximum error | Lookup tables | Flip-flops | Storage
Present invention | 6.82×10⁻⁵     | 2.91×10⁻⁴     | 200           | 61         | 827 bits
Piecewise linear  | 5.87×10⁻³     | 1.89×10⁻²     | 158           | 46         | 0 bits
Lookup table      | 6.17×10⁻⁵     | 1.22×10⁻⁴     | 0             | 0          | 10⁶ bits

Claims (1)

1. A high-precision adjustable general activation function implementation method, characterized by comprising the following steps:
step 1: assume the bit width of the input data x is n, uniformly divide the activation function f(x) into 16 segments, use an expected error E to represent the expected activation function precision, and divide all segments into three classes by calculation based on the expected error;
step 1.1: calculate the approximation error E_1avg when approximating the activation function with a uniform 16-segment piecewise linear method, the average curvature of each of the 16 segments, and the maximum curvature C_max of the entire activation function;
step 1.2: determine the constant coefficients K_1 and K_2; the formulas are as follows:
step 1.3: sort the segments by average curvature in descending order and count from the largest, taking the k segments with the largest average curvature as the first class, where k is the smallest integer that satisfies the following inequality:
k < C_sum·K_2 − 16·(E_1avg − E)·K_1·K_2
where C_sum denotes the sum of the average curvatures of the k first-class segments;
step 1.4: estimate the margin E_2 of the approximation error; the formula is as follows:
according to the margin E_2, count from the segment with the smallest average curvature and merge adjacent segments, computing the post-merge error increment as the sum of the pre-merge segment errors multiplied by the number of merged segments; while the margin requirement is satisfied, merge as many segments as possible; the merged segments form the third class, of number m, and the remaining segments form the second class, of number 16 − k − m;
step 2: use three different approximation methods for the three different classes of segments;
step 2.1: for the first class of segments, use a nonlinear approximation method: first calculate the tangent g(x) at the left endpoint of the segment; then square the lowest n−5 significant bits of x and take the first ten bits of the result; pair these bits with f(x) − g(x) to form a data set and train a single-layer perceptron on it; the approximation of f(x) is the perceptron output added to g(x);
step 2.2: for the second class of segments, use a linear approximation method, with the approximation performed by least squares;
step 2.3: for the third class of segments, merge the several adjacent segments and then use the linear approximation method, namely least squares;
step 3: complete the hardware deployment according to the algorithm designed in step 1 and step 2;
step 3.1: calculate all the weights, biases and coefficients required in step 1 and step 2;
step 3.2: encode the most significant 4 bits of x according to the segmentation and use them as an address to read the line's coefficient and bias; if the segment uses the nonlinear approximation, also read the single-layer perceptron's weights and add the perceptron's bias to the line's bias;
step 3.3: truncate the single-layer perceptron's weights, leaving only the significant data bits, and calculate the approximation error; if the approximation error is smaller than the expected error, truncate the weights further, cutting the bit widths of all weights to match the weight with the fewest significant bits.
CN202310052328.7A 2023-02-02 2023-02-02 High-precision adjustable general activation function implementation method Active CN115983354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310052328.7A CN115983354B (en) 2023-02-02 2023-02-02 High-precision adjustable general activation function implementation method


Publications (2)

Publication Number Publication Date
CN115983354A CN115983354A (en) 2023-04-18
CN115983354B true CN115983354B (en) 2023-08-22

Family

ID=85975984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310052328.7A Active CN115983354B (en) 2023-02-02 2023-02-02 High-precision adjustable general activation function implementation method

Country Status (1)

Country Link
CN (1) CN115983354B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537332A (en) * 2018-04-12 2018-09-14 合肥工业大学 A kind of Sigmoid function hardware-efficient rate implementation methods based on Remez algorithms
CN110210612A (en) * 2019-05-14 2019-09-06 北京中科汇成科技有限公司 A kind of integrated circuit accelerated method and system based on dispositif de traitement lineaire adapte approximating curve
CN110659015A (en) * 2018-06-29 2020-01-07 英特尔公司 Deep neural network architecture using piecewise linear approximation
CN111680782A (en) * 2020-05-20 2020-09-18 河海大学常州校区 FPGA-based RBF neural network activation function implementation method
CN113837365A (en) * 2021-09-22 2021-12-24 中科亿海微电子科技(苏州)有限公司 Model for realizing sigmoid function approximation, FPGA circuit and working method
CN115423081A (en) * 2022-09-21 2022-12-02 重庆邮电大学 Neural network accelerator based on CNN _ LSTM algorithm of FPGA


Also Published As

Publication number Publication date
CN115983354A (en) 2023-04-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant