Configurable activation function device and method suitable for deep learning hardware accelerator
Technical Field
The invention relates to the technical field of electronic circuits, and in particular to a configurable activation function hardware structure with 64-bit intermediate precision suitable for a deep learning hardware accelerator, together with an implementation method.
Background
Deep learning is a field of machine learning closely tied to artificial intelligence; it aims to build neural networks that simulate the way the human brain learns and analyzes. The main idea of deep learning is to stack multiple layers, taking the output of a lower layer as the input of a higher layer; a multilayer perceptron with several hidden layers is the embodiment of a deep learning structure. In this way, deep learning can discover a distributed feature representation of data by combining lower-level features to form more abstract, higher-level representation attributes. Raising the precision of deep learning operations is a difficult problem facing many engineers.
CN109389212A discloses a reconfigurable activation, quantization, and pooling system for low-bit-width convolutional neural networks; it targets low-precision networks (4-bit precision or less) and cannot meet the market's demand for high precision.
Disclosure of Invention
In view of the defects in the prior art, the present invention aims to provide a configurable activation function device and method suitable for a deep learning hardware accelerator.
The invention provides a configurable activation function device suitable for a deep learning hardware accelerator, which comprises:
a first arithmetic unit: its input is connected to a signed integer input data source to obtain signed integer input data, and it performs operations according to the operation parameters;
a multiplexer: its two inputs are connected to the output of the first arithmetic unit and to the signed integer input data source, respectively, and it selects one input to pass to its output according to a preset requirement;
a second arithmetic unit: its input is connected to the output of the multiplexer, and it performs operations according to the operation parameters;
a rectified linear unit: its input is connected to the output of the second arithmetic unit, and it performs the rectified linear operation on the result of the second arithmetic unit;
a third arithmetic unit: its input is connected to the output of the rectified linear unit, and it performs operations according to the operation parameters.
Preferably, the first arithmetic unit, the second arithmetic unit, and the third arithmetic unit each include:
an adder: adds the input signed integer data to an operation parameter;
a multiplier: its input is connected to the output of the adder, and it multiplies the addition result by an operation parameter;
an arithmetic shifter: its input is connected to the output of the multiplier, and it arithmetically shifts the multiplication result, as sketched below.
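For orientation, the datapath formed by these components can be modeled in software. The following minimal Python sketch is an assumption-laden model, not the hardware itself: the names arith_unit and activation_device are hypothetical, bit-width truncation is ignored here, and the parameters of each unit are simplified to one (bias, slope, shift) triple.

```python
def arith_unit(x, bias, slope, shift):
    # one arithmetic unit: adder -> multiplier -> arithmetic shifter
    # (Python's >> on signed integers is an arithmetic shift)
    return ((x + bias) * slope) >> shift

def activation_device(x, p1, p2, p3, bn_before_act, is_relu):
    a = arith_unit(x, *p1)              # first arithmetic unit
    m = a if bn_before_act else x       # multiplexer picks one of its two inputs
    b = arith_unit(m, *p2)              # second arithmetic unit
    r = max(0, b) if is_relu else b     # rectified linear unit (identity otherwise)
    return arith_unit(r, *p3)           # third arithmetic unit
```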
Preferably, the signed integer input data provided by the signed integer input data source is 32-bit signed integer data.
Preferably, the operation parameters include:
bias parameters: transmitted to the adders of the first, second, and third arithmetic units, with a bit width of 32 bits;
slope and shift parameters: transmitted to the multipliers of the first, second, and third arithmetic units, with a bit width of 64 bits.
Preferably, the bias parameters are stored in a first on-chip SRAM cache, and the slope and shift parameters are stored in a second on-chip SRAM cache.
The invention also provides a configurable activation function method suitable for a deep learning hardware accelerator, carried out on the above configurable activation function device and comprising the following steps:
step 1: selecting a calculation operation according to the type of the activation function currently to be calculated;
step 2: if the current activation function needs to accumulate bias parameters, loading the bias parameters from the outside into the first on-chip SRAM cache; if the current activation function does not need to accumulate a bias value, filling the entire first on-chip SRAM cache with zero values;
step 3: if the calculation operation selected in step 1 is the rectified linear unit, filling the entire second on-chip SRAM cache with zero values and then proceeding to step 4; otherwise, loading the slope and shift parameters from the outside into the second on-chip SRAM cache and then proceeding to step 4;
step 4: if the current calculation requires a batch normalization operation that must be completed before the activation function is calculated, the calculation result of the first arithmetic unit is output to the multiplexer, the multiplexer passes this result to the second arithmetic unit, and the method proceeds to step 5; if the current calculation does not require batch normalization, the multiplexer passes the signed integer input data to the second arithmetic unit, and the method proceeds to step 5;
step 5: the second arithmetic unit adds the bias parameter to the output of the multiplexer, the sum is multiplied by the slope parameter, and the product is shifted by the arithmetic shifter and output to the rectified linear unit;
step 6: denoting the input data as x and the output data as f(x), if the type of the activation function currently being calculated is the rectified linear unit, the rectified linear unit performs the following operation on the signed integer input data x:
f(x) = max(0, x);
if the type of the activation function currently being calculated is not the rectified linear unit, the rectified linear unit performs the following operation on the signed integer input data x:
f(x) = x;
step 7: if the current calculation requires a batch normalization operation that must be completed after the activation function calculation, the third arithmetic unit performs the addition, multiplication, and shift operations required by the batch normalization, and the operation result is used as the final output data; if the current calculation requires no batch normalization, the output of the rectified linear unit is taken directly as the final output data. The overall control flow of steps 1 to 7 is sketched below.
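Taken together, steps 1 to 7 amount to the control flow below, reusing the arith_unit sketch from above. Which activation functions accumulate a bias, how the first and third units are parameterized, and the treatment of a zero-filled slope/shift cache as the identity (slope 1, shift 0) are all assumptions not fixed by the method text:

```python
def configure_and_run(x, func_type, bn_before, bn_after,
                      ext_bias=0, ext_slope=1, ext_shift=0,
                      p1=(0, 1, 0), p3=(0, 1, 0)):
    # step 2: load the bias cache from outside, or zero-fill it
    # (assumption: the smooth functions are the ones accumulating a bias)
    bias = ext_bias if func_type in ("elu", "sigmoid", "tanh") else 0
    # step 3: zero-fill the slope/shift cache for plain ReLU, else load it
    # (assumption: the zero-filled case acts as the identity, slope 1, shift 0)
    slope, shift = (1, 0) if func_type == "relu" else (ext_slope, ext_shift)
    # step 4: the multiplexer picks the unit-1 result (BN first) or the raw input
    m = arith_unit(x, *p1) if bn_before else x
    # step 5: unit 2 adds the bias, multiplies by the slope, shifts the product
    b = arith_unit(m, bias, slope, shift)
    # step 6: rectified linear operation for ReLU, identity for the other types
    r = max(0, b) if func_type == "relu" else b
    # step 7: unit 3 performs BN after the activation when required
    return arith_unit(r, *p3) if bn_after else r
```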
Preferably, the calculation operation in step 1 includes:
a rectified linear unit, a parametric rectified linear unit, a leaky rectified linear unit, an exponential linear unit, a sigmoid activation function, and a hyperbolic tangent activation function.
Preferably, in step 5, the second arithmetic unit produces a 32-bit sum when adding the bias parameter to the output of the multiplexer, produces a 64-bit product when multiplying the sum by the slope parameter, and shifts the product with the arithmetic shifter to obtain a 32-bit output.
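As a worked example of these widths, with hypothetical numbers: a slope of 2^20 paired with a right shift of 20 encodes a fixed-point multiplier of exactly 1.0, which is how fractional slopes fit into integer hardware.

```python
m, bias = 5, 3               # multiplexer output and bias parameter (32-bit)
slope, shift = 1 << 20, 20   # Q20 fixed point: slope / 2**shift == 1.0
s = m + bias                 # 32-bit sum: 8
p = s * slope                # 64-bit product: 8_388_608
y = p >> shift               # arithmetic shift back to a 32-bit output: 8
assert y == 8
```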
Preferably, the first arithmetic unit, the second arithmetic unit, and the third arithmetic unit each include:
an adder: adds the input signed integer data to an operation parameter;
a multiplier: its input is connected to the output of the adder, and it multiplies the addition result by an operation parameter;
an arithmetic shifter: its input is connected to the output of the multiplier, and it arithmetically shifts the multiplication result.
Preferably, the bias parameters are 32 bits wide, and the slope and shift parameters are 64 bits wide.
Compared with the prior art, the invention has the following beneficial effects:
the invention can support hardware accelerating units of various activating function operations, and the supported activating function types comprise: modified Linear Unit (ReLu), Parametric modified Linear Unit (Parametric ReLu), leakage modified Linear Unit (leak ReLu), Exponential Linear Unit (ELU), sigmoid activation function (sigmoid), hyperbolic tangent activation function (Tanh). Batch Normalization operations (Batch Normalization) may also be supported. The input and output data are integer data, the highest precision of the input and output data can reach 32 bits, and the precision of the intermediate calculation result can reach 64 bits.
Drawings
Other features, objects, and advantages of the invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the drawings:
FIG. 1 is a schematic structural diagram of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit it in any way. It should be noted that various changes and modifications will be apparent to those skilled in the art without departing from the spirit of the invention, and all of these fall within the scope of the present invention.
As shown in FIG. 1, the present embodiment provides a configurable activation function device suitable for a deep learning hardware accelerator, including:
a first arithmetic unit 1: its input is connected to a signed integer input data source to obtain signed integer input data, and it performs operations according to the operation parameters;
a multiplexer 2: its two inputs are connected to the output of the first arithmetic unit and to the signed integer input data source, respectively, and it selects one input to pass to its output according to a preset requirement;
a second arithmetic unit 3: its input is connected to the output of the multiplexer, and it performs operations according to the operation parameters;
a rectified linear unit 4: its input is connected to the output of the second arithmetic unit, and it performs the rectified linear operation on the result of the second arithmetic unit;
a third arithmetic unit 5: its input is connected to the output of the rectified linear unit, and it performs operations according to the operation parameters.
In this embodiment, the first arithmetic unit, the second arithmetic unit, and the third arithmetic unit each include:
an adder 6: adds the input signed integer data to an operation parameter; it takes two 32-bit signed integers as inputs, performs a 32-bit signed addition, and outputs a 32-bit signed integer;
a multiplier 7: its input is connected to the output of the adder; it multiplies the addition result by an operation parameter, taking two 32-bit signed integers as inputs and outputting a 64-bit signed integer;
an arithmetic shifter 8: its input is connected to the output of the multiplier; it arithmetically shifts the multiplication result, taking a 64-bit signed integer as input and outputting a 32-bit signed integer, as modeled below.
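Under these widths, one arithmetic unit can be modeled bit-exactly as below; treating overflow as two's-complement wrap-around (rather than saturation) is an assumption, since the text does not specify overflow behavior.

```python
def to_signed(v, bits):
    """Interpret the low `bits` bits of v as a two's-complement signed integer."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >= 1 << (bits - 1) else v

def arith_unit_exact(x, bias, slope, shift):
    s = to_signed(x + bias, 32)       # adder 6: 32 + 32 -> 32-bit sum (assumed wrap)
    p = to_signed(s * slope, 64)      # multiplier 7: 32 x 32 -> 64-bit product
    return to_signed(p >> shift, 32)  # arithmetic shifter 8: 64-bit in, 32-bit out
```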
In FIG. 1, X is the 32-bit signed input data, Y is the 32-bit signed output data, and the data stream moves in the direction of the arrows in the figure.
The operation parameters comprise:
bias parameters: stored in the first on-chip SRAM cache and transmitted to the adders of the first, second, and third arithmetic units; the bit width is 32 bits and the depth is 1024;
slope and shift parameters: stored in the second on-chip SRAM cache and transmitted to the multipliers of the first, second, and third arithmetic units; the bit width is 64 bits and the depth is 64. An assumed layout of these caches is sketched below.
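The 64-entry slope/shift cache suggests that the smooth activations (Sigmoid, Tanh, ELU) are evaluated as piecewise-linear approximations, one slope/shift pair per segment. The storage sketch below is entirely an assumption, including the packing of a 32-bit slope and a 6-bit shift amount into each 64-bit word and the independent indexing of the two caches (their depths differ); to_signed is as defined in the previous sketch.

```python
BIAS_SRAM = [0] * 1024         # first cache: 32-bit words, depth 1024
SLOPE_SHIFT_SRAM = [0] * 64    # second cache: 64-bit words, depth 64

def read_params(bias_idx, seg_idx):
    word = SLOPE_SHIFT_SRAM[seg_idx]
    slope = to_signed(word & 0xFFFFFFFF, 32)  # assumed: low 32 bits hold the slope
    shift = (word >> 32) & 0x3F               # assumed: next 6 bits hold the shift
    return BIAS_SRAM[bias_idx], slope, shift
```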
The working principle of the invention is as follows:
step 1: according to the type of the activation function currently to be calculated, one of the following calculation operations is selected:
a) rectified linear unit (ReLU);
b) parametric rectified linear unit (PReLU);
c) leaky rectified linear unit (Leaky ReLU);
d) exponential linear unit (ELU);
e) sigmoid activation function (Sigmoid);
f) hyperbolic tangent activation function (Tanh);
step 2: if the current activation function needs to accumulate bias parameters, loading the bias parameters from the outside into the first on-chip SRAM cache; if the current activation function does not need to accumulate a bias value, filling the entire first on-chip SRAM cache with zero values;
step 3: if the calculation operation selected in step 1 is the rectified linear unit, filling the entire second on-chip SRAM cache with zero values and then proceeding to step 4; otherwise, loading the slope and shift parameters from the outside into the second on-chip SRAM cache and then proceeding to step 4;
step 4: if the current calculation requires a batch normalization operation that must be completed before the activation function is calculated, the calculation result of the first arithmetic unit is output to the multiplexer, the multiplexer passes this result to the second arithmetic unit, and the method proceeds to step 5; if the current calculation does not require batch normalization, the multiplexer passes the signed integer input data to the second arithmetic unit, and the method proceeds to step 5;
step 5: the second arithmetic unit adds the bias parameter to the output of the multiplexer, the sum is multiplied by the slope parameter, and the product is shifted by the arithmetic shifter and output to the rectified linear unit;
step 6: denoting the input data as x and the output data as f(x), if the type of the activation function currently being calculated is the rectified linear unit, the rectified linear unit performs the following operation on the signed integer input data x:
f(x) = max(0, x);
if the type of the activation function currently being calculated is not the rectified linear unit, the rectified linear unit performs the following operation on the signed integer input data x:
f(x) = x;
step 7: if the current calculation requires a batch normalization operation that must be completed after the activation function calculation, the third arithmetic unit performs the addition, multiplication, and shift operations required by the batch normalization, and the operation result is used as the final output data; if the current calculation requires no batch normalization, the output of the rectified linear unit is taken directly as the final output data.
After these seven steps, the calculation of the activation function and the batch normalization operation is complete; the input data is X and the output data is Y, as illustrated by the example below.
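As a concrete illustration, running the configure_and_run sketch from the method description above with hypothetical values, for a ReLU followed by batch normalization in the third unit:

```python
# ReLU on x = -7, then BN modeled in unit 3 as ((x + 2) * 3) >> 1.
y = configure_and_run(x=-7, func_type="relu",
                      bn_before=False, bn_after=True, p3=(2, 3, 1))
assert y == 3   # ReLU(-7) = 0; (0 + 2) * 3 >> 1 = 3
```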
The foregoing has described specific embodiments of the present invention. It is to be understood that the present invention is not limited to the specific embodiments described above; those skilled in the art may make various changes or modifications within the scope of the appended claims without departing from the spirit of the invention. The embodiments of the present application, and the features within them, may be combined with one another arbitrarily provided they do not conflict.