Configurable activation function device and method suitable for a deep learning hardware accelerator
Technical field
The present invention relates to the field of electronic circuit technology, and in particular to the hardware configuration and implementation method of a configurable, 64-bit-precision activation function suitable for a deep learning hardware accelerator.
Background art
Deep learning is a field of machine learning closely related to artificial intelligence; its purpose is to build neural networks that simulate the learning and analysis mechanisms of the human brain. The central idea of deep learning is to stack multiple layers, using the output of a lower layer as the input of the next higher layer; a multilayer perceptron containing several hidden layers is one embodiment of a deep learning structure. In this way, deep learning can combine low-level features into more abstract high-level representations, thereby discovering distributed feature representations of the data. How to raise the precision of deep learning computation is a problem faced by many engineers.
CN109389212A discloses a "reconfigurable activation-quantization-pooling system for low-bit-width convolutional neural networks". That invention targets low-precision convolutional networks (with precision less than or equal to 4 bits) and cannot satisfy the market's demand for high precision.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a configurable activation function device and method suitable for a deep learning hardware accelerator.
A configurable activation function device suitable for a deep learning hardware accelerator, provided according to the present invention, comprises:
a first arithmetic unit, whose input terminal is connected to a signed-integer input data source so as to obtain signed-integer input data, and which executes operations according to operation parameters;
a multiplexer, whose two input terminals are connected respectively to the output terminal of the first arithmetic unit and to the signed-integer input data source, and which, according to a predetermined requirement, selects one of the two inputs and transmits it to its output terminal;
a second arithmetic unit, whose input terminal is connected to the output terminal of the multiplexer, and which executes operations according to operation parameters;
a rectified linear unit, whose input terminal is connected to the output terminal of the second arithmetic unit, and which applies the rectified linear operation to the operation result of the second arithmetic unit;
a third arithmetic unit, whose input terminal is connected to the output terminal of the rectified linear unit, and which executes operations according to operation parameters.
Preferably, the first arithmetic unit, the second arithmetic unit and the third arithmetic unit each comprise:
an adder, which adds the signed-integer input data to the operation parameter;
a multiplier, whose input terminal is connected to the output terminal of the adder, and which multiplies the addition result by the operation parameter;
an arithmetic shifter, whose input terminal is connected to the output terminal of the multiplier, and which arithmetically shifts the multiplication result.
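The adder-multiplier-shifter chain described above can be modelled behaviorally. The following sketch is illustrative only and is not part of the claimed hardware; the function names, the wrap-around overflow behavior of the adder, and the treatment of the shift amount as a separate operand are assumptions. The bit widths (32-bit operands, 64-bit product, 32-bit narrowed result) follow the embodiment described later in this document.

```python
def to_signed(value, bits):
    # Interpret the low `bits` of value as a two's-complement signed integer.
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value

def arithmetic_unit(x, bias, slope, shift):
    # Adder: 32-bit signed + 32-bit signed -> 32-bit signed (wraps on overflow).
    s = to_signed(x + bias, 32)
    # Multiplier: 32-bit signed x 32-bit signed -> full 64-bit signed product.
    p = to_signed(s * slope, 64)
    # Arithmetic shifter: shift the 64-bit product right, narrow to 32 bits.
    return to_signed(p >> shift, 32)
```

For example, arithmetic_unit(5, 3, 2, 0) computes (5 + 3) * 2 = 16. A saturating variant could replace the wrap-around narrowing if the actual hardware saturates; the text does not say which behavior is intended.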
Preferably, the signed-integer input data provided by the signed-integer input data source is 32-bit signed-integer input data.
Preferably, the operation parameters include:
a bias parameter, transmitted to the adders of the first, second and third arithmetic units, with a bit width of 32 bits;
a slope-and-intercept parameter, transmitted to the multipliers of the first, second and third arithmetic units, with a bit width of 64 bits.
Preferably, the bias parameter is stored in a first on-chip SRAM cache, and the slope-and-intercept parameter is stored in a second on-chip SRAM cache.
A configurable activation function method suitable for a deep learning hardware accelerator, provided according to the present invention, uses the configurable activation function device described above and comprises the following steps:
Step 1: select a calculation operation according to the type of the activation function currently to be calculated;
Step 2: if the current activation function requires an accumulated bias, load the bias parameter from external memory into the first on-chip SRAM cache; if it does not, fill the entire first on-chip SRAM cache with the value 0;
Step 3: if the calculation operation selected in Step 1 is the rectified linear unit, fill the entire second on-chip SRAM cache with the value 0 and go to Step 4; otherwise, load the slope-and-intercept parameter from external memory into the second on-chip SRAM cache and go to Step 4;
Step 4: if the current calculation requires a batch normalization operation that must be completed before the activation function is calculated, the first arithmetic unit outputs its result to the multiplexer, which forwards it to the second arithmetic unit, and the method proceeds to Step 5; if no batch normalization is required, the multiplexer forwards the signed-integer input data directly to the second arithmetic unit and the method proceeds to Step 5;
Step 5: the second arithmetic unit adds the bias parameter to the output of the multiplexer, multiplies the addition result by the slope-and-intercept parameter, shifts the product with the arithmetic shifter, and outputs the result to the rectified linear unit;
Step 6: denote the input data as x and the output data as f(x). If the type of the activation function currently calculated is the rectified linear unit, the rectified linear unit applies the following operation to the signed-integer input data x:
f(x) = max(0, x);
if the type of the activation function currently calculated is not the rectified linear unit, the rectified linear unit applies the following operation:
f(x) = x;
Step 7: if the current calculation requires a batch normalization operation that must be completed after the activation function is calculated, the third arithmetic unit performs the addition, multiplication and shift operations required by batch normalization, and its result is the final output data; if no batch normalization is required, the output of the rectified linear unit serves directly as the final output data.
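The seven steps above can be summarized as a behavioral sketch. It is illustrative only: plain Python integers stand in for the fixed-point units, the default slope of 1 for the plain rectified linear unit is an assumption (the text zero-fills the second SRAM in that case, which presumably corresponds to bypassing the multiplier), and the optional batch normalization callable abstracts the first or third arithmetic unit.

```python
def activation_pipeline(x, func, bias=0, slope=1, shift=0,
                        batch_norm=None, bn_before=True):
    # Step 4: batch normalization before the activation is performed by the
    # first arithmetic unit; the multiplexer then forwards its result.
    if batch_norm is not None and bn_before:
        x = batch_norm(x)
    # Step 5: second arithmetic unit -- add bias, multiply by slope, shift.
    y = ((x + bias) * slope) >> shift
    # Step 6: rectified linear unit, or pass-through for other function types.
    y = max(0, y) if func == "relu" else y
    # Step 7: batch normalization after the activation is performed by the
    # third arithmetic unit; otherwise the ReLU output is the final result.
    if batch_norm is not None and not bn_before:
        y = batch_norm(y)
    return y
```

For example, activation_pipeline(4, "relu", bias=1, slope=2, shift=1) yields (4 + 1) * 2 >> 1 = 5.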
Preferably, the calculation operations in Step 1 include:
rectified linear unit, parametric rectified linear unit, leaky rectified linear unit, exponential linear unit, sigmoid activation function and hyperbolic tangent activation function.
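The text does not state how the smooth functions in this list (the exponential linear unit, sigmoid and hyperbolic tangent) are realized by add-multiply-shift hardware, but the small slope-and-intercept SRAM described in the embodiment is consistent with piecewise-linear approximation, in which each stored entry holds the slope and intercept of one linear segment. The following floating-point sketch of that technique is therefore an assumption, and its segment count, input range and function names are illustrative only.

```python
import math

def build_segments(fn, lo=-8.0, hi=8.0, n=64):
    # Precompute (slope, intercept) for n equal-width linear segments of fn.
    step = (hi - lo) / n
    segments = []
    for i in range(n):
        x0 = lo + i * step
        slope = (fn(x0 + step) - fn(x0)) / step
        segments.append((slope, fn(x0) - slope * x0))  # y = slope*x + intercept
    return segments

def pwl_eval(segments, x, lo=-8.0, hi=8.0):
    # Select the segment containing x (clamping at the ends) and evaluate it.
    n = len(segments)
    i = min(n - 1, max(0, int((x - lo) / (hi - lo) * n)))
    slope, intercept = segments[i]
    return slope * x + intercept

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
segments = build_segments(sigmoid)
```

With 64 segments on [-8, 8], the approximation error for sigmoid stays well below 0.01, which makes a 64-entry parameter SRAM plausible for this purpose.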
Preferably, in Step 5, the sum of the bias parameter and the output of the multiplexer computed by the second arithmetic unit is a 32-bit value, the product of that sum and the slope-and-intercept parameter is a 64-bit value, and the arithmetic shifter shifts the product to produce a 32-bit output.
Preferably, the first arithmetic unit, the second arithmetic unit and the third arithmetic unit each comprise:
an adder, which adds the signed-integer input data to the operation parameter;
a multiplier, whose input terminal is connected to the output terminal of the adder, and which multiplies the addition result by the operation parameter;
an arithmetic shifter, whose input terminal is connected to the output terminal of the multiplier, and which arithmetically shifts the multiplication result.
Preferably, the bit width of the bias parameter is 32 bits, and the bit width of the slope-and-intercept parameter is 64 bits.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention provides a hardware acceleration unit that supports a variety of activation function operations. The supported activation function types include the rectified linear unit (ReLU), the parametric rectified linear unit (Parametric ReLU, PReLU), the leaky rectified linear unit (Leaky ReLU), the exponential linear unit (ELU), the sigmoid activation function and the hyperbolic tangent activation function (Tanh). Batch normalization is also supported. The input and output data are of integer type; the maximum precision of the input and output data is 32 bits, and the precision of intermediate calculation results can reach 64 bits.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of the non-limiting embodiments with reference to the following drawings:
Fig. 1 is a structural diagram of the present invention.
Specific embodiment
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but they do not limit the invention in any way. It should be noted that persons of ordinary skill in the art may make several changes and improvements without departing from the inventive concept; all of these fall within the protection scope of the present invention.
As shown in Fig. 1, the configurable activation function device suitable for a deep learning hardware accelerator provided in this embodiment comprises:
a first arithmetic unit 1, whose input terminal is connected to a signed-integer input data source so as to obtain signed-integer input data, and which executes operations according to operation parameters;
a multiplexer 2, whose two input terminals are connected respectively to the output terminal of the first arithmetic unit and to the signed-integer input data source, and which, according to a predetermined requirement, selects one of the two inputs and transmits it to its output terminal;
a second arithmetic unit 3, whose input terminal is connected to the output terminal of the multiplexer, and which executes operations according to operation parameters;
a rectified linear unit 4, whose input terminal is connected to the output terminal of the second arithmetic unit, and which applies the rectified linear operation to the operation result of the second arithmetic unit;
a third arithmetic unit 5, whose input terminal is connected to the output terminal of the rectified linear unit, and which executes operations according to operation parameters.
In this embodiment, the first arithmetic unit, the second arithmetic unit and the third arithmetic unit each comprise:
an adder 6, which adds the signed-integer input data to the operation parameter; its inputs are two 32-bit signed-integer values, it performs the addition of two 32-bit signed integers, and its output is a 32-bit signed-integer value;
a multiplier 7, whose input terminal is connected to the output terminal of the adder, and which multiplies the addition result by the operation parameter; its inputs are two 32-bit signed-integer values and its output is a 64-bit signed-integer value;
an arithmetic shifter 8, whose input terminal is connected to the output terminal of the multiplier, and which arithmetically shifts the multiplication result; its input is a 64-bit signed integer and its output is a 32-bit signed integer.
In Fig. 1, X is the 32-bit signed-integer input data and Y is the 32-bit signed-integer output data; the data flow follows the direction of the arrows in the figure.
The operation parameters include:
a bias parameter, stored in the first on-chip SRAM cache and transmitted to the adders of the first, second and third arithmetic units, with a bit width of 32 bits and a depth of 1024;
a slope-and-intercept parameter, stored in the second on-chip SRAM cache and transmitted to the multipliers of the first, second and third arithmetic units, with a bit width of 64 bits and a depth of 64.
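The two parameter buffers of this embodiment can be modelled as follows. The widths and depths (32 bits x 1024 entries for the bias parameter, 64 bits x 64 entries for the slope-and-intercept parameter) are taken from the text above; the class and its method names are illustrative, as is the choice to model the zero-fill of Steps 2 and 3 as an explicit method.

```python
class ParamSram:
    # Behavioral model of one on-chip parameter SRAM cache.
    def __init__(self, width_bits, depth):
        self.width_bits = width_bits
        self.data = [0] * depth

    def fill_zero(self):
        # Steps 2 and 3: an unused parameter SRAM is filled entirely with 0.
        self.data = [0] * len(self.data)

    def load(self, values):
        # Load parameters from external memory, checking width and depth.
        assert len(values) <= len(self.data)
        limit = 1 << (self.width_bits - 1)
        assert all(-limit <= v < limit for v in values)
        self.data[:len(values)] = list(values)

bias_sram = ParamSram(width_bits=32, depth=1024)   # first on-chip SRAM cache
slope_sram = ParamSram(width_bits=64, depth=64)    # second on-chip SRAM cache
```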
The working principle of the present invention is as follows:
Step 1: according to the type of the activation function currently to be calculated, select one of the following calculation operations:
a) rectified linear unit (ReLU);
b) parametric rectified linear unit (Parametric ReLU, PReLU);
c) leaky rectified linear unit (Leaky ReLU);
d) exponential linear unit (ELU);
e) sigmoid activation function;
f) hyperbolic tangent activation function (Tanh);
Step 2: if the current activation function requires an accumulated bias, load the bias parameter from external memory into the first on-chip SRAM cache; if it does not, fill the entire first on-chip SRAM cache with the value 0;
Step 3: if the calculation operation selected in Step 1 is the rectified linear unit, fill the entire second on-chip SRAM cache with the value 0 and go to Step 4; otherwise, load the slope-and-intercept parameter from external memory into the second on-chip SRAM cache and go to Step 4;
Step 4: if the current calculation requires a batch normalization operation that must be completed before the activation function is calculated, the first arithmetic unit outputs its result to the multiplexer, which forwards it to the second arithmetic unit, and the method proceeds to Step 5; if no batch normalization is required, the multiplexer forwards the signed-integer input data directly to the second arithmetic unit and the method proceeds to Step 5;
Step 5: the second arithmetic unit adds the bias parameter to the output of the multiplexer, multiplies the addition result by the slope-and-intercept parameter, shifts the product with the arithmetic shifter, and outputs the result to the rectified linear unit;
Step 6: denote the input data as x and the output data as f(x). If the type of the activation function currently calculated is the rectified linear unit, the rectified linear unit applies the following operation to the signed-integer input data x:
f(x) = max(0, x);
if the type of the activation function currently calculated is not the rectified linear unit, the rectified linear unit applies the following operation:
f(x) = x;
Step 7: if the current calculation requires a batch normalization operation that must be completed after the activation function is calculated, the third arithmetic unit performs the addition, multiplication and shift operations required by batch normalization, and its result is the final output data; if no batch normalization is required, the output of the rectified linear unit serves directly as the final output data.
After these seven steps, the calculation of the activation function and the batch normalization operation is complete; the input data is X and the output data is Y.
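As a worked integer example of the flow just described (the concrete values, the slope of 3 and the shift of 1 are chosen purely for illustration, and the single helper function stands in for each of the three arithmetic units):

```python
def unit(x, bias, slope, shift):
    # One arithmetic unit: add the bias, multiply by the slope, then
    # arithmetic-shift the product right (Python's >> floors, matching an
    # arithmetic right shift of a two's-complement value).
    return ((x + bias) * slope) >> shift

# Step 4: batch normalization before the activation, performed by the
# first arithmetic unit: (10 + 2) * 3 >> 1 = 18.
x = 10
normalized = unit(x, bias=2, slope=3, shift=1)

# Step 5: for a plain ReLU the second arithmetic unit passes the value
# through unchanged (modelled here as bias 0, slope 1, shift 0 -- an
# assumption, since the text zero-fills the second SRAM in this case).
pre_activation = unit(normalized, bias=0, slope=1, shift=0)

# Step 6: the rectified linear unit produces the final output Y.
y = max(0, pre_activation)
```

With the input X = 10 this yields Y = 18; a negative pre-activation value, such as the one produced by X = -20 under the same parameters, is clamped to 0 by the rectified linear unit.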
Specific embodiments of the present invention have been described above. It is to be understood that the invention is not limited to these particular implementations, and that those skilled in the art may make various changes or modifications within the scope of the claims without affecting the substantive content of the invention. In the absence of conflict, the embodiments of the present application and the features within the embodiments may be combined with one another arbitrarily.