Configurable activation function device and method suitable for a deep learning hardware accelerator
Technical field
The present invention relates to the field of electronic circuit technology, and in particular to the hardware configuration and implementation method of a configurable, 64-bit-precision activation function suitable for a deep learning hardware accelerator.
Background art
Deep learning is a field of machine learning closely related to artificial intelligence; its purpose is to build neural networks that simulate the learning and analysis mechanisms of the human brain. The central idea of deep learning is to stack multiple layers, using the output of a lower layer as the input of the next higher layer; a multilayer perceptron containing several hidden layers is one embodiment of a deep learning structure. In this way, deep learning can combine low-level features into more abstract high-level representations, thereby discovering distributed feature representations of the data. How to raise the precision of deep learning computation is a problem faced by many engineers.
CN109389212A discloses a "reconfigurable activation-quantization-pooling system for low-bit-width convolutional neural networks". That invention targets low-precision convolutional networks (with precision less than or equal to 4 bits) and cannot satisfy the market's demand for high precision.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a configurable activation function device and method suitable for a deep learning hardware accelerator.
A configurable activation function device suitable for a deep learning hardware accelerator, provided according to the present invention, comprises:
a first arithmetic unit, whose input terminal is connected to a signed-integer input data source so as to obtain signed-integer input data, and which executes operations according to operation parameters;
a multiplexer, whose two input terminals are connected respectively to the output terminal of the first arithmetic unit and to the signed-integer input data source, and which, according to a predetermined requirement, selects one of the two inputs and transmits it to its output terminal;
a second arithmetic unit, whose input terminal is connected to the output terminal of the multiplexer, and which executes operations according to operation parameters;
a rectified linear unit, whose input terminal is connected to the output terminal of the second arithmetic unit, and which applies the rectified linear operation to the operation result of the second arithmetic unit;
a third arithmetic unit, whose input terminal is connected to the output terminal of the rectified linear unit, and which executes operations according to operation parameters.
Preferably, the first arithmetic unit, the second arithmetic unit and the third arithmetic unit each comprise:
an adder, which adds the signed-integer input data to the operation parameter;
a multiplier, whose input terminal is connected to the output terminal of the adder, and which multiplies the addition result by the operation parameter;
an arithmetic shifter, whose input terminal is connected to the output terminal of the multiplier, and which arithmetically shifts the multiplication result.
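The adder-multiplier-shifter chain described above can be modelled behaviorally. The following sketch is illustrative only and is not part of the claimed hardware; the function names, the wrap-around overflow behavior of the adder, and the treatment of the shift amount as a separate operand are assumptions. The bit widths (32-bit operands, 64-bit product, 32-bit narrowed result) follow the embodiment described later in this document.

```python
def to_signed(value, bits):
    # Interpret the low `bits` of value as a two's-complement signed integer.
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value

def arithmetic_unit(x, bias, slope, shift):
    # Adder: 32-bit signed + 32-bit signed -> 32-bit signed (wraps on overflow).
    s = to_signed(x + bias, 32)
    # Multiplier: 32-bit signed x 32-bit signed -> full 64-bit signed product.
    p = to_signed(s * slope, 64)
    # Arithmetic shifter: shift the 64-bit product right, narrow to 32 bits.
    return to_signed(p >> shift, 32)
```

For example, arithmetic_unit(5, 3, 2, 0) computes (5 + 3) * 2 = 16. A saturating variant could replace the wrap-around narrowing if the actual hardware saturates; the text does not say which behavior is intended.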
Preferably, the signed-integer input data provided by the signed-integer input data source is 32-bit signed-integer input data.
Preferably, the operation parameters include:
a bias parameter, transmitted to the adders of the first, second and third arithmetic units, with a bit width of 32 bits;
a slope-and-intercept parameter, transmitted to the multipliers of the first, second and third arithmetic units, with a bit width of 64 bits.
Preferably, the bias parameter is stored in a first on-chip SRAM cache, and the slope-and-intercept parameter is stored in a second on-chip SRAM cache.
A configurable activation function method suitable for a deep learning hardware accelerator, provided according to the present invention, uses the configurable activation function device described above and comprises the following steps:
Step 1: select a calculation operation according to the type of the activation function currently to be calculated;
Step 2: if the current activation function requires an accumulated bias, load the bias parameter from external memory into the first on-chip SRAM cache; if it does not, fill the entire first on-chip SRAM cache with the value 0;
Step 3: if the calculation operation selected in Step 1 is the rectified linear unit, fill the entire second on-chip SRAM cache with the value 0 and go to Step 4; otherwise, load the slope-and-intercept parameter from external memory into the second on-chip SRAM cache and go to Step 4;
Step 4: if the current calculation requires a batch normalization operation that must be completed before the activation function is calculated, the first arithmetic unit outputs its result to the multiplexer, which forwards it to the second arithmetic unit, and the method proceeds to Step 5; if no batch normalization is required, the multiplexer forwards the signed-integer input data directly to the second arithmetic unit and the method proceeds to Step 5;
Step 5: the second arithmetic unit adds the bias parameter to the output of the multiplexer, multiplies the addition result by the slope-and-intercept parameter, shifts the product with the arithmetic shifter, and outputs the result to the rectified linear unit;
Step 6: denote the input data as x and the output data as f(x). If the type of the activation function currently calculated is the rectified linear unit, the rectified linear unit applies the following operation to the signed-integer input data x:
f(x) = max(0, x);
if the type of the activation function currently calculated is not the rectified linear unit, the rectified linear unit applies the following operation:
f(x) = x;
Step 7: if the current calculation requires a batch normalization operation that must be completed after the activation function is calculated, the third arithmetic unit performs the addition, multiplication and shift operations required by batch normalization, and its result is the final output data; if no batch normalization is required, the output of the rectified linear unit serves directly as the final output data.
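The seven steps above can be summarized as a behavioral sketch. It is illustrative only: plain Python integers stand in for the fixed-point units, the default slope of 1 for the plain rectified linear unit is an assumption (the text zero-fills the second SRAM in that case, which presumably corresponds to bypassing the multiplier), and the optional batch normalization callable abstracts the first or third arithmetic unit.

```python
def activation_pipeline(x, func, bias=0, slope=1, shift=0,
                        batch_norm=None, bn_before=True):
    # Step 4: batch normalization before the activation is performed by the
    # first arithmetic unit; the multiplexer then forwards its result.
    if batch_norm is not None and bn_before:
        x = batch_norm(x)
    # Step 5: second arithmetic unit -- add bias, multiply by slope, shift.
    y = ((x + bias) * slope) >> shift
    # Step 6: rectified linear unit, or pass-through for other function types.
    y = max(0, y) if func == "relu" else y
    # Step 7: batch normalization after the activation is performed by the
    # third arithmetic unit; otherwise the ReLU output is the final result.
    if batch_norm is not None and not bn_before:
        y = batch_norm(y)
    return y
```

For example, activation_pipeline(4, "relu", bias=1, slope=2, shift=1) yields (4 + 1) * 2 >> 1 = 5.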
Preferably, the calculation operations in Step 1 include:
rectified linear unit, parametric rectified linear unit, leaky rectified linear unit, exponential linear unit, sigmoid activation function and hyperbolic tangent activation function.
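The text does not state how the smooth functions in this list (the exponential linear unit, sigmoid and hyperbolic tangent) are realized by add-multiply-shift hardware, but the small slope-and-intercept SRAM described in the embodiment is consistent with piecewise-linear approximation, in which each stored entry holds the slope and intercept of one linear segment. The following floating-point sketch of that technique is therefore an assumption, and its segment count, input range and function names are illustrative only.

```python
import math

def build_segments(fn, lo=-8.0, hi=8.0, n=64):
    # Precompute (slope, intercept) for n equal-width linear segments of fn.
    step = (hi - lo) / n
    segments = []
    for i in range(n):
        x0 = lo + i * step
        slope = (fn(x0 + step) - fn(x0)) / step
        segments.append((slope, fn(x0) - slope * x0))  # y = slope*x + intercept
    return segments

def pwl_eval(segments, x, lo=-8.0, hi=8.0):
    # Select the segment containing x (clamping at the ends) and evaluate it.
    n = len(segments)
    i = min(n - 1, max(0, int((x - lo) / (hi - lo) * n)))
    slope, intercept = segments[i]
    return slope * x + intercept

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
segments = build_segments(sigmoid)
```

With 64 segments on [-8, 8], the approximation error for sigmoid stays well below 0.01, which makes a 64-entry parameter SRAM plausible for this purpose.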
Preferably, in Step 5, the sum of the bias parameter and the output of the multiplexer computed by the second arithmetic unit is a 32-bit value, the product of that sum and the slope-and-intercept parameter is a 64-bit value, and the arithmetic shifter shifts the product to produce a 32-bit output.
Preferably, the first arithmetic unit, the second arithmetic unit and the third arithmetic unit each comprise:
an adder, which adds the signed-integer input data to the operation parameter;
a multiplier, whose input terminal is connected to the output terminal of the adder, and which multiplies the addition result by the operation parameter;
an arithmetic shifter, whose input terminal is connected to the output terminal of the multiplier, and which arithmetically shifts the multiplication result.
Preferably, the bit width of the bias parameter is 32 bits, and the bit width of the slope-and-intercept parameter is 64 bits.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention provides a hardware acceleration unit that supports a variety of activation function operations. The supported activation function types include the rectified linear unit (ReLU), the parametric rectified linear unit (Parametric ReLU, PReLU), the leaky rectified linear unit (Leaky ReLU), the exponential linear unit (ELU), the sigmoid activation function and the hyperbolic tangent activation function (Tanh). Batch normalization is also supported. The input and output data are of integer type; the maximum precision of the input and output data is 32 bits, and the precision of intermediate calculation results can reach 64 bits.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of the non-limiting embodiments with reference to the following drawings:
Fig. 1 is a structural diagram of the present invention.
Specific embodiment
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but they do not limit the invention in any way. It should be noted that persons of ordinary skill in the art may make several changes and improvements without departing from the inventive concept; all of these fall within the protection scope of the present invention.
As shown in Fig. 1, the configurable activation function device suitable for a deep learning hardware accelerator provided in this embodiment comprises:
a first arithmetic unit 1, whose input terminal is connected to a signed-integer input data source so as to obtain signed-integer input data, and which executes operations according to operation parameters;
a multiplexer 2, whose two input terminals are connected respectively to the output terminal of the first arithmetic unit and to the signed-integer input data source, and which, according to a predetermined requirement, selects one of the two inputs and transmits it to its output terminal;
a second arithmetic unit 3, whose input terminal is connected to the output terminal of the multiplexer, and which executes operations according to operation parameters;
a rectified linear unit 4, whose input terminal is connected to the output terminal of the second arithmetic unit, and which applies the rectified linear operation to the operation result of the second arithmetic unit;
a third arithmetic unit 5, whose input terminal is connected to the output terminal of the rectified linear unit, and which executes operations according to operation parameters.
In this embodiment, the first arithmetic unit, the second arithmetic unit and the third arithmetic unit each comprise:
an adder 6, which adds the signed-integer input data to the operation parameter; its inputs are two 32-bit signed-integer values, it performs the addition of two 32-bit signed integers, and its output is a 32-bit signed-integer value;
a multiplier 7, whose input terminal is connected to the output terminal of the adder, and which multiplies the addition result by the operation parameter; its inputs are two 32-bit signed-integer values and its output is a 64-bit signed-integer value;
an arithmetic shifter 8, whose input terminal is connected to the output terminal of the multiplier, and which arithmetically shifts the multiplication result; its input is a 64-bit signed integer and its output is a 32-bit signed integer.
In Fig. 1, X is the 32-bit signed-integer input data and Y is the 32-bit signed-integer output data; the data flow follows the direction of the arrows in the figure.
The operation parameters include:
a bias parameter, stored in the first on-chip SRAM cache and transmitted to the adders of the first, second and third arithmetic units, with a bit width of 32 bits and a depth of 1024;
a slope-and-intercept parameter, stored in the second on-chip SRAM cache and transmitted to the multipliers of the first, second and third arithmetic units, with a bit width of 64 bits and a depth of 64.
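The two parameter buffers of this embodiment can be modelled as follows. The widths and depths (32 bits x 1024 entries for the bias parameter, 64 bits x 64 entries for the slope-and-intercept parameter) are taken from the text above; the class and its method names are illustrative, as is the choice to model the zero-fill of Steps 2 and 3 as an explicit method.

```python
class ParamSram:
    # Behavioral model of one on-chip parameter SRAM cache.
    def __init__(self, width_bits, depth):
        self.width_bits = width_bits
        self.data = [0] * depth

    def fill_zero(self):
        # Steps 2 and 3: an unused parameter SRAM is filled entirely with 0.
        self.data = [0] * len(self.data)

    def load(self, values):
        # Load parameters from external memory, checking width and depth.
        assert len(values) <= len(self.data)
        limit = 1 << (self.width_bits - 1)
        assert all(-limit <= v < limit for v in values)
        self.data[:len(values)] = list(values)

bias_sram = ParamSram(width_bits=32, depth=1024)   # first on-chip SRAM cache
slope_sram = ParamSram(width_bits=64, depth=64)    # second on-chip SRAM cache
```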
The working principle of the present invention is as follows:
Step 1: according to the type of the activation function currently to be calculated, select one of the following calculation operations:
a) rectified linear unit (ReLU);
b) parametric rectified linear unit (Parametric ReLU, PReLU);
c) leaky rectified linear unit (Leaky ReLU);
d) exponential linear unit (ELU);
e) sigmoid activation function;
f) hyperbolic tangent activation function (Tanh);
Step 2: if the current activation function requires an accumulated bias, load the bias parameter from external memory into the first on-chip SRAM cache; if it does not, fill the entire first on-chip SRAM cache with the value 0;
Step 3: if the calculation operation selected in Step 1 is the rectified linear unit, fill the entire second on-chip SRAM cache with the value 0 and go to Step 4; otherwise, load the slope-and-intercept parameter from external memory into the second on-chip SRAM cache and go to Step 4;
Step 4: if the current calculation requires a batch normalization operation that must be completed before the activation function is calculated, the first arithmetic unit outputs its result to the multiplexer, which forwards it to the second arithmetic unit, and the method proceeds to Step 5; if no batch normalization is required, the multiplexer forwards the signed-integer input data directly to the second arithmetic unit and the method proceeds to Step 5;
Step 5: the second arithmetic unit adds the bias parameter to the output of the multiplexer, multiplies the addition result by the slope-and-intercept parameter, shifts the product with the arithmetic shifter, and outputs the result to the rectified linear unit;
Step 6: denote the input data as x and the output data as f(x). If the type of the activation function currently calculated is the rectified linear unit, the rectified linear unit applies the following operation to the signed-integer input data x:
f(x) = max(0, x);
if the type of the activation function currently calculated is not the rectified linear unit, the rectified linear unit applies the following operation:
f(x) = x;
Step 7: if the current calculation requires a batch normalization operation that must be completed after the activation function is calculated, the third arithmetic unit performs the addition, multiplication and shift operations required by batch normalization, and its result is the final output data; if no batch normalization is required, the output of the rectified linear unit serves directly as the final output data.
After these seven steps, the calculation of the activation function and the batch normalization operation is complete; the input data is X and the output data is Y.
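As a worked integer example of the flow just described (the concrete values, the slope of 3 and the shift of 1 are chosen purely for illustration, and the single helper function stands in for each of the three arithmetic units):

```python
def unit(x, bias, slope, shift):
    # One arithmetic unit: add the bias, multiply by the slope, then
    # arithmetic-shift the product right (Python's >> floors, matching an
    # arithmetic right shift of a two's-complement value).
    return ((x + bias) * slope) >> shift

# Step 4: batch normalization before the activation, performed by the
# first arithmetic unit: (10 + 2) * 3 >> 1 = 18.
x = 10
normalized = unit(x, bias=2, slope=3, shift=1)

# Step 5: for a plain ReLU the second arithmetic unit passes the value
# through unchanged (modelled here as bias 0, slope 1, shift 0 -- an
# assumption, since the text zero-fills the second SRAM in this case).
pre_activation = unit(normalized, bias=0, slope=1, shift=0)

# Step 6: the rectified linear unit produces the final output Y.
y = max(0, pre_activation)
```

With the input X = 10 this yields Y = 18; a negative pre-activation value, such as the one produced by X = -20 under the same parameters, is clamped to 0 by the rectified linear unit.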
Specific embodiments of the present invention have been described above. It is to be understood that the invention is not limited to these particular implementations, and that those skilled in the art may make various changes or modifications within the scope of the claims without affecting the substantive content of the invention. In the absence of conflict, the embodiments of the present application and the features within the embodiments may be combined with one another arbitrarily.