CN112540946B - Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor - Google Patents

Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor

Info

Publication number
CN112540946B
CN112540946B (application CN202011511272.XA)
Authority
CN
China
Prior art keywords
processing unit
reconfigurable
data
neural network
processing units
Prior art date
Legal status
Active
Application number
CN202011511272.XA
Other languages
Chinese (zh)
Other versions
CN112540946A (en)
Inventor
尹首一
邓大峥
谷江源
韩慧明
刘雷波
魏少军
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202011511272.XA
Publication of CN112540946A
Application granted
Publication of CN112540946B
Status: Active


Abstract

The embodiment of the invention provides a reconfigurable processor and a method for calculating a plurality of neural network activation functions on the reconfigurable processor. The method comprises: splitting the neural network activation function into basic operations; and, according to the calculation sequence of the basic operations in the neural network activation function, sequentially implementing each basic operation by reading input data from a shared memory through a reconfigurable processing array of the reconfigurable processor. Processing units on the peripheral edge of the reconfigurable processing array execute memory access operations and are called memory access processing units; processing units other than those on the peripheral edge execute arithmetic operations and are called arithmetic processing units. Each processing unit on the peripheral edge transmits data to and from the arithmetic processing units in its row or column, and each processing unit in the array transmits data to and from its adjacent processing units in the up, down, left, and right directions.

Description

Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
Technical Field
The invention relates to the technical field of reconfigurable processors, in particular to a reconfigurable processor and a method for calculating activation functions of various neural networks on the reconfigurable processor.
Background
In recent years, with the development of technologies such as artificial intelligence, cloud computing, and big data, the demand for computing power, and hence for chip performance, keeps growing. However, as chip feature sizes shrink, Moore's law is approaching its physical limits and the performance of integrated circuits is difficult to keep increasing, so chip design must shift from raw performance gains to gains in energy efficiency and flexibility. Domain-specific chip architectures, optimized for a particular field, have therefore become the mainstream of chip design, and balancing high performance, high energy efficiency, and high flexibility has become a key design criterion.
Meanwhile, as neural networks continue to develop, network structures and activation functions keep changing. For a dedicated ASIC neural network accelerator, once the network structure or activation function changes, the acceleration effect degrades, and the accelerator may even become unsuitable for the new network.
Disclosure of Invention
The embodiment of the invention provides a method for calculating various neural network activation functions on a reconfigurable processor, which aims to solve the technical problem in the prior art that an ASIC neural network accelerator suffers a reduced acceleration effect after the network structure or activation function is changed. The method comprises the following steps:
splitting the neural network activation function into basic operations;
according to the calculation sequence of the basic operations in the neural network activation function, sequentially implementing each basic operation by reading input data from a shared memory through a reconfigurable processing array of the reconfigurable processor, wherein the processing units on the peripheral edge of the reconfigurable processing array are used for executing memory access operations and are called memory access processing units, the processing units other than those on the peripheral edge are used for executing arithmetic operations and are called arithmetic processing units, each processing unit on the peripheral edge transmits data to and from the arithmetic processing units in its row or column, and each processing unit in the array transmits data to and from its adjacent processing units in the up, down, left, and right directions.
The embodiment of the invention also provides a reconfigurable processor for implementing the calculation of the various neural network activation functions, so as to solve the technical problem in the prior art that an ASIC neural network accelerator suffers a reduced acceleration effect after the network structure or activation function is changed. The reconfigurable processor comprises:
a shared memory for storing input data;
and a reconfigurable processing array, used for reading input data from the shared memory to sequentially implement each basic operation according to the calculation sequence of the basic operations after the neural network activation function is split, wherein the processing units on the peripheral edge of the reconfigurable processing array are used for executing memory access operations and are called memory access processing units, the processing units other than those on the peripheral edge are used for executing arithmetic operations and are called arithmetic processing units, each processing unit on the peripheral edge transmits data to and from the arithmetic processing units in its row or column, and each processing unit in the array transmits data to and from its adjacent processing units in the up, down, left, and right directions.
In the embodiment of the invention, the neural network activation function is split into basic operations, and the basic operations are then implemented in sequence by reading input data from the shared memory through the reconfigurable processing array according to their calculation order in the activation function. The operation of the activation function is thus realized on the existing reconfigurable processing array structure: no change to the array structure and no additional circuitry are required. Different processing units in the array are simply configured, according to the algorithmic requirements of different activation functions, to perform the corresponding operations, so that complicated activation-function computations can be realized on the array using basic operations such as addition, subtraction, multiplication, and shifting. This helps simplify the circuit design for activation-function computation and improves circuit operation speed and throughput.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flowchart of a method for computing activation functions of multiple neural networks on a reconfigurable processor according to an embodiment of the present invention;
FIG. 2 is a graph illustrating a relu function provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the calculation flow of the relu function according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the arrangement of processing units in the reconfigurable processing array when computing the relu function according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a sigmoid function provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a calculation flow of a sigmoid function according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the arrangement of processing units in a reconfigurable processing array when a sigmoid function is calculated according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a time-division function image of an operational sigmoid function provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram of a time-division function image after accumulating a time-division function image according to an embodiment of the present invention;
FIG. 10 is a graph illustrating a tanh function according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a calculation flow of a tanh function according to an embodiment of the present invention;
FIG. 12 is a schematic diagram showing the arrangement of processing units in a reconfigurable processing array when calculating the tanh function according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a calculation flow of anti-overflow processing according to an embodiment of the present invention;
FIG. 14 is a schematic diagram illustrating an arrangement of processing units in a reconfigurable processing array during anti-overflow processing according to an embodiment of the present invention;
FIG. 15 is a schematic diagram of the calculation flow for calculating e^x according to an embodiment of the present invention;
FIG. 16 is a schematic diagram showing the arrangement of processing units in the reconfigurable processing array when e^x is calculated according to an embodiment of the present invention;
FIG. 17 is a schematic diagram of the calculation flow for calculating ln(Σe^x) according to an embodiment of the present invention;
FIG. 18 is a schematic diagram showing the arrangement of processing units in the reconfigurable processing array when ln(Σe^x) is calculated according to an embodiment of the present invention;
fig. 19 is a block diagram of a reconfigurable processor for implementing computation of various neural network activation functions according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.
The inventors have found that coarse-grained reconfigurable processor architectures are receiving increasing attention because of their low power consumption, high performance and energy efficiency, and flexible dynamic reconfigurability. The flexibility of a reconfigurable computing architecture lies between that of a general-purpose processor and an ASIC, while its efficiency can approach an ASIC's through optimization, so it combines the advantages of both. These characteristics make it well suited for data-intensive operations, which matches the computational requirements of neural networks exactly. In neural network computation, the implementation of the activation function, as the part that provides nonlinearity, is particularly important. Unlike dedicated ASIC processors, however, coarse-grained reconfigurable processors have no circuitry dedicated to the activation function, and adding activation-function circuitry to a reconfigurable computing architecture tends to create redundancy, while the resulting complex circuit design also reduces performance and increases power consumption. The inventors therefore propose methods for calculating multiple neural network activation functions on a reconfigurable processor, realizing the relatively complex activation-function operations on the existing, simpler circuit design of the reconfigurable processing array.
In an embodiment of the present invention, a method for calculating activation functions of multiple types of neural networks on a reconfigurable processor is provided, as shown in fig. 1, where the method includes:
Step 102: splitting the neural network activation function into basic operations;
Step 104: according to the calculation sequence of the basic operations in the neural network activation function, sequentially implementing each basic operation by reading input data from a shared memory through the reconfigurable processing array of the reconfigurable processor, wherein the processing units on the peripheral edge of the reconfigurable processing array are used for executing memory access operations and are called memory access processing units, the processing units other than those on the peripheral edge are used for executing arithmetic operations and are called arithmetic processing units, each processing unit on the peripheral edge transmits data to and from the arithmetic processing units in its row or column, and each processing unit in the array transmits data to and from its adjacent processing units in the up, down, left, and right directions.
As can be seen from the flow shown in fig. 1, in the embodiment of the invention the neural network activation function is split into basic operations, which are then implemented in sequence by reading input data from the shared memory through the reconfigurable processing array according to their calculation order in the activation function. The operation of the activation function is thus realized on the existing reconfigurable processing array structure, with no change to the array structure and no additional circuitry: different processing units in the array are configured, according to the algorithmic requirements of different activation functions, to perform the corresponding operations, so that complex activation-function computations can be realized using basic operations such as addition, subtraction, multiplication, and shifting. This helps simplify the circuit design for activation-function computation and improves circuit operation speed and throughput.
In a specific implementation, for different neural network activation functions, the operation of the activation function can be split into basic operations, and the reconfigurable processing array then reads input data from the shared memory to implement each basic operation in sequence. For the same activation function, the computation is scalable: by adjusting how finely the operation is split into basic operations, different splitting schemes can meet different precision and throughput requirements. For example, under a low-precision requirement the activation function can be split coarsely into fewer basic operations, trading precision for throughput; under a high-precision requirement it can be split finely into more basic operations to improve precision.
In a specific implementation, the basic operations may include simple operations such as addition, subtraction, multiplication, multiply-accumulate operations, shift operations, and selection operations, so that the operation of a complex neural network activation function is realized by executing simple basic operations on the reconfigurable processing array.
In practice, for a linear piecewise neural network activation function, the function may be computed on the reconfigurable processing array as follows.
Splitting the neural network activation function into basic operations, including:
Splitting a neural network activation function into selection operations for a linear piecewise neural network activation function;
according to the calculation sequence of each basic operation in the neural network activation function, sequentially realizing each basic operation through the reconfigurable processing array, comprising:
reading input data from the shared memory through a plurality of memory access processing units in the reconfigurable processing array; transmitting, by each memory access processing unit, the input data to an arithmetic processing unit in its row or column to perform the selection operation; and transmitting, by the arithmetic processing unit, the calculation result of the selection operation to a memory access processing unit in its row or column, which stores the result into the shared memory, wherein the memory access processing units that read input data and the memory access processing units that store calculation results are different memory access processing units, and the calculation results output by different arithmetic processing units are transmitted to different memory access processing units.
In a specific implementation, the above linear piecewise neural network activation function is exemplified by the linear rectification function (the relu function), f(x) = max(0, x). As shown in fig. 2, the curve is monotonically increasing and easy to differentiate.
In particular, implementing the relu function on a reconfigurable computing architecture comes down to mapping the ASIC hardware implementation of the relu function onto that architecture. Following the ASIC circuit implementation principle of the relu function, the input data x is fetched from the shared memory of the reconfigurable processing array, a sel operation determines the sign of the input, and either 0 or x is selected as the final output.
In the following, a 4×4 reconfigurable processing array PEA (one quarter of the overall reconfigurable processing array, which is typically 8×8) is taken as an example to illustrate the implementation of the relu function. The basic operations into which the relu function is split are shown in table 1 below, and the calculation flow in fig. 3. A processing unit PE on the edge of the array (a memory access processing unit) executes a Load operation to take the input data from the shared memory; a processing unit PE inside the array (an arithmetic processing unit) implements the sel operation, selecting whether 0 or x is output; and finally a processing unit PE on the edge of the array executes a Save operation to store the result into the shared memory. The arrangement of the operations executed by each processing unit is shown in fig. 4. The memory access processing units that read input data and the memory access processing units that store results are different units, which enables pipelined execution, and the results output by different arithmetic processing units are transmitted to different memory access processing units, so that different units store the results of different arithmetic processing units into the shared memory and data overwriting is avoided. A software sketch of this dataflow is given after table 1.
TABLE 1
Operator  Meaning
Load      Fetch: read data from memory
Sel       Select: inputs a, b, c; output b or c according to the value of a
Save      Store: write data to memory
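To make the table 1 dataflow concrete, the following minimal Python sketch models the load → sel → save pipeline in software. The helper names, memory layout, and result addresses are illustrative assumptions, not the processor's actual instruction set or address map.

```python
# Minimal software model of the table 1 dataflow for relu on the PE array.
# Each helper models one PE operation; names and addresses are illustrative.

def load(memory, addr):
    # Edge PE: Load - fetch data from the shared memory
    return memory[addr]

def sel(a, b, c):
    # Inner PE: Sel - inputs a, b, c; output b or c according to the value of a
    return b if a >= 0 else c

def save(memory, addr, value):
    # Edge PE: Save - store the result back to the shared memory
    memory[addr] = value

shared_memory = {0: 3.5, 1: -2.0, 2: 0.0}
for addr in list(shared_memory):
    x = load(shared_memory, addr)        # access PE on the array edge
    y = sel(x, x, 0.0)                   # arithmetic PE: relu(x) = max(0, x)
    save(shared_memory, addr + 16, y)    # a different access PE stores result
print([shared_memory[a + 16] for a in (0, 1, 2)])  # [3.5, 0.0, 0.0]
```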
In practice, for a neural network activation function that is symmetric and allows a piecewise Taylor-expansion fit, the function may be computed on the reconfigurable processing array as follows.
Splitting the neural network activation function into basic operations, including:
For a symmetrical neural network activation function that allows piecewise Taylor-expansion fitting, splitting the neural network activation function into a first symmetrical part and a second symmetrical part according to its symmetry; dividing the input data of the first symmetrical part into a plurality of data segments; splitting the operation of each data segment in sequence into a subtraction operation, a selection operation, and a multiply-accumulate operation; performing an addition operation on the multiply-accumulate results of the data segments; subtracting the output maximum value of the first symmetrical part from the accumulated result and performing a selection operation to obtain the output data of the first symmetrical part; and subtracting the output data of the first symmetrical part from the output maximum value of the first symmetrical part and performing a selection operation to obtain the output data of the second symmetrical part;
according to the calculation sequence of each basic operation in the neural network activation function, sequentially realizing each basic operation through the reconfigurable processing array, comprising:
reading a value from the shared memory through one memory access processing unit in the reconfigurable processing array and subtracting the endpoint values of the divided data segments from it; forming a first-stage selector from a plurality of arithmetic processing units, each corresponding to one data segment, where each arithmetic processing unit in the first-stage selector outputs, based on the subtraction results, the minimum of the read value and the maximum value of its data segment; forming a second-stage selector from a plurality of arithmetic processing units, each corresponding to the previous data segment, where the first arithmetic processing unit in the second-stage selector outputs the output of the first arithmetic processing unit in the first-stage selector, and each other arithmetic processing unit in the second-stage selector outputs the maximum of its corresponding first-stage output and the maximum value of the previous data segment; performing multiply-accumulate operations on the outputs of the second-stage selector through arithmetic processing units; performing an addition operation on the multiply-accumulate results through an arithmetic processing unit; subtracting 1 from the addition result and performing a selection operation through arithmetic processing units to obtain the output data of the first symmetrical part; and subtracting the output data of the first symmetrical part from 1 and performing a selection operation through arithmetic processing units to obtain the output data of the second symmetrical part.
In a specific implementation, the above symmetric, piecewise-Taylor-fittable neural network activation functions are exemplified by the S-shaped growth curve function (the sigmoid function) and the hyperbolic tangent function (the tanh function). The sigmoid function, sigmoid(x) = 1 / (1 + e^(−x)), is a common S-shaped function in biology. It maps the input variable into (0, 1) and, as shown in fig. 5, is monotonically increasing and easy to differentiate. In a neural network, if the output unit handles a binary classification problem, the sigmoid function can be derived from the generalized linear model, and the output follows a Bernoulli distribution.
In practice, it is difficult to implement a pipelined lookup table on a reconfigurable array, because the fetch address changes with the input data. On a general reconfigurable array, the fetch address of a processing unit is typically formed from a base address and an offset address; if a lookup table were implemented on the array, the changing fetch address would stall the pipeline. This embodiment therefore proposes to fit the function piecewise, accumulate the pieces, and implement the calculation in a pipelined manner. The basic operations into which the sigmoid function is split are shown in table 2 below.
TABLE 2
In a practical implementation, by the symmetry of the sigmoid function, only the part with input greater than 0 (the first symmetrical part) needs to be calculated; the other half of the function (the second symmetrical part) is then obtained by a rotational transformation. All input data are therefore mapped to [0, +∞).
Next, Taylor expansions are performed on different parts of the sigmoid function to obtain its approximation function. On the reconfigurable processing array, the input data of the sigmoid function on [0, +∞) is split into 4 data segments (in practice the number of segments can be chosen according to the precision requirement, and more segments give higher precision), for example [0, 4), [4, 8), [8, 15), and [15, +∞).
First, the data range of the input is determined using the sel operation. The sel operation takes inputs a, b, and c, and outputs either b or c according to the value of a; the range of the input is judged by first performing subtractions on it.
A two-stage selection function is constructed from the processing units. The first-stage selection function is implemented by three processing units, each selecting the smaller of its two inputs for output; the second-stage selection function is implemented by three processing units, each selecting the larger of its two inputs for output.
As shown in figs. 6 and 7, the input data is compared by subtraction against 4, 8, and 15 (the endpoint values of the divided data segments), and its range is determined from the subtraction results. Input data in the 3 interval segments are analyzed below, taking 1, 6, and 18 as example inputs.
When the input data is 1, it passes through the first stage selector, where the first selector (inputs 1 and 4) will output 1, the second selector (inputs 1 and 8) will output 1, and the third selector (inputs 1 and 15) will output 1. And the output data of the first-stage selector is output through the second-stage selector, wherein the output of the first selector is 1, the output of the first selector in the first-stage selector is directly output through the first selector in the second-stage selector through routing operation, the output of the second selector (with the inputs of 1 and 4) is 4, and the output of the third selector (with the inputs of 1 and 8) is 8.
Similarly, when the input data is 6, the first selector will output 4, the second selector will output 6, and the third selector will output 6 when passing through the first stage selector. The output data of the first stage selector is passed through the second stage selector, the first selector output is 4, the second selector (inputs 6 and 4) output is 6, and the third selector (inputs 6 and 8) output is 8.
Similarly, when the input data is 18, in the first-stage selector the first selector will output 4, the second will output 8, and the third will output 15. Passing the first-stage outputs through the second-stage selector, the first selector outputs 4, the second selector (inputs 8 and 4) outputs 8, and the third selector (inputs 15 and 8) outputs 15.
In summary, the sel operation function formed by the two-stage selector can be expressed as the following formula (1):
sel(x, y, z) = max(min(x, y), z), y = 4, 8, 15; z = 4, 8 (1)
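To check the selector behavior against the three worked examples above, here is a small Python model of the two-stage network; it is illustrative only and does not represent the actual PE configuration.

```python
# Two-stage selector network from figs. 6-7, modeled in software. For each
# input x it produces the three per-path values that feed the MAC stages,
# following sel(x, y, z) = max(min(x, y), z) from formula (1).

def two_stage(x):
    # first stage: min against the segment endpoints 4, 8, 15
    s1 = [min(x, 4), min(x, 8), min(x, 15)]
    # second stage: first path routes through; others take max with 4 and 8
    return [s1[0], max(s1[1], 4), max(s1[2], 8)]

for x in (1, 6, 18):
    print(x, two_stage(x))  # 1 -> [1, 4, 8]; 6 -> [4, 6, 8]; 18 -> [4, 8, 15]
```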
The outputs of the three second-stage selectors each pass through a MAC operation on a separate path, i.e., through the Taylor expansion functions generated at the three different expansion points, and the results are accumulated to give the final output. As shown in fig. 8, the solid curve is half of the sigmoid function, and the marked curves are the Taylor functions expanded on [0, 4), [4, 8), and [8, 15), together with the constant 1. Stitching these together, i.e., adding them up, yields a new function; as shown in fig. 9, the Taylor-expanded function fits the sigmoid function image well.
In practice, considering precision loss, the segment intervals of the sigmoid function are exemplified by [0, 4), [4, 8), [8, 15), and [15, +∞). On [15, +∞) the result is taken to be 1, with a negligible accuracy loss of about 10^-7. On [0, 15) a piecewise Taylor expansion to the third order is used to obtain the approximation function. The specific precision losses and Taylor expansion functions are shown in table 3 below; only the non-negative interval is shown, and the negative interval is obtained by central symmetry about x = 0. A numerical check of this piecewise approximation is sketched after table 3.
TABLE 3 Table 3
Interval   Taylor expansion function                                      Maximum accuracy loss
[0, 4)     3.56×10^-3·x^3 − 5.71×10^-2·x^2 + 2.93×10^-1·x + 4.92×10^-1    7.53×10^-3
[4, 8)     4.96×10^-4·x^3 − 1.05×10^-2·x^2 + 7.51×10^-2·x + 8.19×10^-1    5.23×10^-4
[8, 15)    3.21×10^-6·x^3 − 1.22×10^-4·x^2 + 1.54×10^-3·x + 9.94×10^-1    3.71×10^-5
[15, ∞)    1                                                              3.06×10^-7
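As a numerical check of this piecewise scheme, the following Python sketch evaluates the table 3 polynomials in Horner (MAC-style) form together with the symmetry rule. It models only the arithmetic, not the PE mapping, and since the table's coefficients are printed rounded, the observed errors may differ slightly from the listed bounds.

```python
import math

# Piecewise third-order Taylor coefficients from table 3 (x^3, x^2, x^1, x^0),
# as printed, i.e. already rounded.
SEGMENTS = [
    (4.0,  (3.56e-3, -5.71e-2, 2.93e-1, 4.92e-1)),   # [0, 4)
    (8.0,  (4.96e-4, -1.05e-2, 7.51e-2, 8.19e-1)),   # [4, 8)
    (15.0, (3.21e-6, -1.22e-4, 1.54e-3, 9.94e-1)),   # [8, 15)
]

def sigmoid_approx(x):
    # symmetry: sigmoid(-x) = 1 - sigmoid(x), so only x >= 0 is evaluated
    if x < 0:
        return 1.0 - sigmoid_approx(-x)
    for end, (a3, a2, a1, a0) in SEGMENTS:
        if x < end:
            # Horner form mirrors the MAC (multiply-accumulate) operations
            return ((a3 * x + a2) * x + a1) * x + a0
    return 1.0                                        # [15, +inf) -> 1

for x in (-6.0, 1.0, 5.0, 10.0, 20.0):
    exact = 1.0 / (1.0 + math.exp(-x))
    print(f"x={x:6.1f}  approx={sigmoid_approx(x):.6f}  "
          f"error={abs(sigmoid_approx(x) - exact):.2e}")
```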
In implementation, when a PE on the reconfigurable processing array performs an ordinary operation with inputs a and b, it outputs the function f(a, b) configured for that PE; one of a and b can also be selected as the output, the choice depending on the positions of the inputs a and b in the compiled instruction that configures the PE. The specific operation and output of each PE in the array can therefore be set through configuration.
Specifically, the sigmoid operation adopts an implementation based on piecewise integral accumulation and finally, using the symmetry of the function, realizes a pipelined calculation of the function; it can be implemented with 3 global PEs and 28 processing units PE.
In particular, the tanh function is tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)). As shown in fig. 10, like the sigmoid function it is monotonically increasing and easy to differentiate, and it maps the input variable into (−1, 1).
Specifically, the tanh function can be computed similarly to the sigmoid function, except that the segment intervals differ; [0, 1), [1, 2), [2, 4), and [4, +∞) are taken as an example. The calculation flow for tanh is shown in fig. 11, and the arrangement of processing units in the reconfigurable processing array when calculating tanh is shown in fig. 12. The specific precision losses and Taylor expansion functions are shown in table 4 below; only the non-negative interval is shown, and the negative interval is obtained by central symmetry about x = 0.
TABLE 4 Table 4
Interval   Taylor expansion function                                      Maximum accuracy loss
[0, 1)     5.70×10^-2·x^3 − 4.57×10^-1·x^2 + 1.17×10^0·x − 1.50×10^-2     1.50×10^-2
[1, 2)     8.69×10^-2·x^3 − 5.59×10^-1·x^2 + 1.27×10^0·x − 3.83×10^-2     3.27×10^-4
[2, 4)     7.93×10^-3·x^3 − 8.42×10^-2·x^2 + 3.01×10^-1·x + 6.37×10^-1    1.04×10^-3
[4, ∞)     1                                                              6.71×10^-4
In a specific implementation, a neural network activation function including division is computed on the reconfigurable processing array as follows.
Splitting the neural network activation function into basic operations, including:
subtracting the maximum value of the input data from the input data of the neural network activation function to avoid overflow, converting the division in the neural network activation function into subtraction, and dividing the parameters participating in operation into different operation items according to the subtraction in the neural network activation function;
according to the calculation sequence of each basic operation in the neural network activation function, sequentially realizing each basic operation through the reconfigurable processing array, comprising:
the operation of each operation item is realized in turn through the reconfigurable processing array.
In particular, the neural network activation function including division is exemplified by softmax, whose expression is
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
By anti-overflow processing (i.e., replacing the input data x with x − x_max), the softmax function can be converted to
softmax(x_i) = e^(x_i − x_max) / Σ_j e^(x_j − x_max)
that is, the number x − x_max is input, which prevents the result of the e^x function from overflowing by being too large. Because division is complex in circuitry, subtraction is adopted in its place, reducing the power consumption and resources consumed and improving operation speed and efficiency. Using a logarithmic transformation, the softmax function can be further converted into
softmax(x_i) = e^(x_i − x_max − ln(Σ_j e^(x_j − x_max)))
The softmax operation is therefore divided into four parts. The first part is the anti-overflow part, i.e., solving x − x_max (the first operation term above). The second part calculates e^x (the second operation term). The third part accumulates the obtained e^x values and solves ln(Σe^x) (the third operation term). The fourth part solves e^(x − x_max − ln(Σe^x)) (the fourth operation term).
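A plain-math model of this four-part decomposition is given below as a reference for the per-part sections that follow; on the array each part is realized with the basic operations described later, while this sketch simply chains the four formulas.

```python
import math

# Functional model of the four-part softmax decomposition:
# (1) x_max, (2) e^(x - x_max), (3) ln(sum of e^(x - x_max)),
# (4) e^(x - x_max - ln(sum ...)).

def softmax_four_parts(xs):
    x_max = max(xs)                                     # part 1: anti-overflow
    exps = [math.exp(x - x_max) for x in xs]            # part 2: e^(x - x_max)
    log_sum = math.log(sum(exps))                       # part 3: ln(sum e^...)
    return [math.exp(x - x_max - log_sum) for x in xs]  # part 4

probs = softmax_four_parts([100.0, 101.0, 102.0])  # no overflow despite big x
print(probs, sum(probs))                           # probabilities sum to 1.0
```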
In particular, to implement the anti-overflow processing, the maximum of the input data is subtracted from the input data. To find this maximum, the input data is divided into a plurality of data groups. For each data group, a memory access processing unit reads the input data, and an arithmetic processing unit receives the data and performs selection operations on it, outputting the maximum of that group; the data groups are processed in parallel to obtain all group maxima. A memory access processing unit then reads the group maxima, an arithmetic processing unit performs selection operations on them, and the maximum of the group maxima, i.e., the maximum of the input data, is output.
Specifically, taking the first-step operation term of softmax as an example, the input data may be divided into 16 data groups, and the operations for determining the maximum of the input data may include those shown in table 5 below. The comparison operations of the RPU processing array are performed in parallel over the 16 data groups: as shown in figs. 13 and 14, memory access processing units perform load operations to read the input data of each group from the shared memory, arithmetic processing units perform subtraction and selection operations to select the maximum within each of the 16 groups, and memory access processing units perform save operations to store each group maximum into the shared memory. Finally, the maxima of the 16 groups are compared with one another to obtain the maximum of the input data. Exploiting the RPU's ability to process data in parallel accelerates the data processing and improves efficiency.
TABLE 5
Operator  Meaning
Load      Fetch: read data from memory
Sel       Select: inputs a, b, c; output b or c according to the value of a
−         Subtraction: inputs a, b; output a − b
Save      Store: write data to memory
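The following sketch models this two-level maximum search. The 16-way grouping follows the text; the subtract-then-select comparison inside group_max mirrors the "−" and Sel operations of table 5, while the array's actual parallelism is only simulated sequentially here.

```python
# Two-level maximum search used for anti-overflow: split the input into 16
# groups, reduce each group to its maximum (done in parallel on the array),
# then reduce the 16 group maxima to the overall maximum.

def group_max(group):
    # one reduction path: repeated subtract (a - b) and select over a group
    m = group[0]
    for v in group[1:]:
        m = m if (m - v) >= 0 else v   # subtraction drives the sel condition
    return m

def array_max(data, n_groups=16):
    groups = [data[i::n_groups] for i in range(n_groups) if data[i::n_groups]]
    partial = [group_max(g) for g in groups]   # level 1: per-group maxima
    return group_max(partial)                  # level 2: maximum of maxima

data = [3.1, -7.2, 15.9, 0.4, 8.8, -1.0, 15.8, 2.2] * 4
print(array_max(data))  # 15.9
```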
In particular, for the exponential function with base e in the operation term, this embodiment proposes the following: a memory access processing unit reads the input data; an arithmetic processing unit subtracts the maximum of the input data from the input data; and an arithmetic processing unit multiplies the subtraction result by log2(e), converting the exponential function into one with base 2, the multiplication result serving as the input of the base-converted exponential function. This input consists of an integer part and a fractional part. Taylor-expanding the base-2 exponential of the fractional part yields a polynomial, which an arithmetic processing unit evaluates to obtain the output of the base-2 exponential with the fractional part as exponent; an arithmetic processing unit then performs a shift operation on that output by the integer part to obtain the output of the exponential function, and an arithmetic processing unit accumulates the outputs of the exponential function.
Specifically, taking the second-step operation term of softmax as an example, e^x is calculated. Notably, x_max has already been subtracted from the input data to prevent overflow. First, by the change-of-base formula, e^(y_i) is rewritten as
e^(y_i) = 2^(y_i · log2(e)) = 2^(u_i + v_i)
where u_i is the integer part of the input data after the change of base, v_i is the fractional part, and y_i = x − x_max. According to the characteristics of binary numbers, the above formula can be rewritten again as
2^(u_i + v_i) = 2^(u_i) · 2^(v_i)
At this point the range of the input data is reduced to [−1, 0], so a Taylor-expansion solution can be adopted: expanding 2^(v_i) gives
2^(v_i) = e^(v_i·ln2) ≈ 1 + v_i·ln2 + (v_i·ln2)^2/2 + (v_i·ln2)^3/6 (7)
Finally, the obtained result is shifted by u_i to obtain the value of e^(y_i).
Specifically, the processing for calculating 2^(u_i + v_i) adopts the basic operations shown in table 6 below. A memory access processing unit performs a load operation to fetch the input data from memory, and the anti-overflow update is completed by subtracting the x_max obtained in the previous-stage anti-overflow processing. The data after anti-overflow processing is then multiplied by log2(e) to obtain u_i + v_i; u_i and v_i are extracted separately through AND operations, u_i is stored, and arithmetic processing units perform multiply-accumulate operations to evaluate the polynomial in v_i, specifically formula (7). Finally, u_i is fetched from memory by a load operation, the polynomial result is shifted accordingly, and the final output is stored into memory.
All the e^x values are accumulated to give Σe^x, which is stored in memory for the next calculation.
TABLE 6
Operator  Meaning
Load      Fetch: read data from memory
Sel       Select: inputs a, b, c; output b or c according to the value of a
And       AND operation: inputs a, b; output a & b
>>        Shift operation: input a; output a after shifting
+         Addition: inputs a, b; output a + b
−         Subtraction: inputs a, b; output a − b
*         Multiplication: inputs a, b; output a × b
MAC       Multiply-accumulate: inputs a, b, c; compute a·b + c
Save      Store: write data to memory
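A minimal Python sketch of this base-2 decomposition follows: multiply by log2(e), split into integer part u_i and fractional part v_i in [−1, 0], approximate 2^(v_i) with the third-order Taylor polynomial of formula (7), and apply the shift by u_i (modeled with math.ldexp). Using ceil for the integer part is an assumption that keeps v_i in [−1, 0] as the text requires; on the array the split is done with AND/mask and shift operations instead.

```python
import math

LOG2E = math.log2(math.e)   # the constant multiplied in after the subtraction
LN2 = math.log(2.0)

def exp2_taylor(v):
    # formula (7): 2^v = e^(v*ln2) ~ third-order Taylor, valid for v in [-1, 0]
    t = v * LN2
    return 1.0 + t + t * t / 2.0 + t * t * t / 6.0

def exp_shifted(x, x_max):
    y = x - x_max                          # anti-overflow: y <= 0
    t = y * LOG2E                          # base change: e^y = 2^t
    u = math.ceil(t)                       # integer part u_i
    v = t - u                              # fractional part v_i, in [-1, 0]
    return math.ldexp(exp2_taylor(v), u)   # shift by u_i: 2^u * 2^v

xs = [1.0, 2.5, 4.0]
x_max = max(xs)
for x in xs:
    print(f"{exp_shifted(x, x_max):.6f} vs {math.exp(x - x_max):.6f}")
```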
In a specific implementation, for the logarithmic function with base e in the operation term, whose input is the accumulation of the base-e exponential functions, this embodiment proposes the following: the accumulated sum is rewritten as the product of 2^w and k; an arithmetic processing unit performs a leading-zero count to obtain the value of w; a shift operation applied to the accumulated sum yields the value of k; the logarithmic function is Taylor-expanded on the basis of w and k to obtain a polynomial; and an arithmetic processing unit evaluates the polynomial to obtain the output of the logarithmic function.
Specifically, taking the third-step operation term of softmax as an example, the obtained e^x values are accumulated and ln(Σe^x) is computed. The accumulation can be carried out synchronously during the second-step operation term: each time a result is calculated, it is accumulated into the global register. The central idea of calculating ln(Σe^x) is Taylor-function expansion. Applying the following transformation to ln(Σe^x) gives
ln(Σe^x) = ln(2^w · k) (8)
From the characteristics of e^x, the value of Σe^x must be positive, so in binary this number is stored as its original (unsigned) code. The value of w is obtained by a leading-zero count; after obtaining w, the value of k is obtained by applying a shift operation to Σe^x, which also reduces the data to the [0, 1] interval for the Taylor-expansion calculation. Transforming formula (8) as ln(2^w · k) = w·ln2 + ln(k) and Taylor-expanding ln(k) yields the final calculation formula (9). The basic operations adopted in calculating ln(Σe^x) are shown in table 7 below, the calculation flow is shown in fig. 17, and the arrangement of processing units in the reconfigurable processing array is shown in fig. 18.
TABLE 7
Operator  Meaning
Load      Fetch: read data from memory
Clz       Leading-zero count: count the leading zeros of the input data
+         Addition: inputs a, b; output a + b
−         Subtraction: inputs a, b; output a − b
*         Multiplication: inputs a, b; output a × b
MAC       Multiply-accumulate: inputs a, b, c; compute a·b + c
Save      Store: write data to memory
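The sketch below models this Clz-based logarithm on an assumed 16-fractional-bit fixed-point word: the leading-one position gives w, a shift gives k = Σe^x / 2^w in [0.5, 1), and ln(k) is approximated by a third-order Taylor expansion around k = 1, so ln(Σe^x) = w·ln2 + ln(k) as in formula (8). The word width and expansion order are illustrative assumptions; raising the order tightens the accuracy.

```python
import math

FRAC_BITS = 16   # assumed fixed-point format for the illustration
LN2 = math.log(2.0)

def ln_via_clz(s):
    # s > 0. Write s = 2^w * k with k in [0.5, 1): w comes from a
    # leading-zero count on the fixed-point word, k from a shift.
    fixed = int(s * (1 << FRAC_BITS))      # positive => plain binary code
    w = fixed.bit_length() - FRAC_BITS     # exponent of the leading 1 (clz)
    k = s / (2.0 ** w)                     # shift: k in [0.5, 1)
    d = k - 1.0                            # Taylor of ln(k) around k = 1
    ln_k = d - d * d / 2.0 + d * d * d / 3.0
    return w * LN2 + ln_k                  # ln(2^w * k) = w*ln2 + ln(k)

s = sum(math.exp(x) for x in (-0.5, 0.3, 1.2))  # stands in for sum of e^x
print(f"{ln_via_clz(s):.4f} vs {math.log(s):.4f}")
```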
In a specific implementation, the fourth step of computing softmax solves e^(x − x_max − ln(Σe^x)). Since the first step has already solved for x_max and the third step for ln(Σe^x), the number to be subtracted is updated to x_max + ln(Σe^x), and the e^x function calculation of the second step is then performed; the calculation flow is identical to that of the second step.
In a specific implementation, while the basic operations are being realized in sequence by the reconfigurable processing array, when an arithmetic processing unit needs to exchange data with a processing unit outside its own row or column, it either performs a routing operation through a processing unit with which it is interconnected for data transmission, thereby transferring data to the processing unit outside its row or column, or it outputs its data to a global register for storage, so that the data can be read by processing units outside its row or column.
In a specific implementation, the above methods for calculating multiple neural network activation functions on the reconfigurable processor were simulation-tested in the Python language: the input data are random numbers in (−101, 101), the number of inputs is a random number in (1, 100), and the test was repeated 100 times. According to the final simulation results, the maximum error is about 0.01, i.e., a precision of roughly 6-7 binary fractional bits. The precision can be improved by raising the order of the Taylor expansion; it is not raised here in order to reduce power consumption while maintaining operation speed.
The above methods for calculating neural network activation functions on a reconfigurable processor realize the activation-function operations on the reconfigurable architecture mainly through Taylor expansion. Moreover, in the softmax calculation, subtraction replaces division, and the change-of-base formula combined with shifting replaces the direct computation of e^x, reducing the coefficients that must be stored and the operation time, thereby lowering the processor's hardware resource overhead and further reducing area and power consumption.
In addition, these methods for calculating multiple neural network activation functions on the reconfigurable processor have a certain flexibility: the expansion order can be customized according to the application, meeting various data-precision requirements and achieving a better balance among power consumption, computational efficiency, and precision.
Based on the same inventive concept, a reconfigurable processor for implementing calculation of various neural network activation functions is also provided in the embodiments of the present invention, as described in the following embodiments. Because the principle of solving the problem by the reconfigurable processor for implementing the calculation of the multiple neural network activation functions is similar to that of the multiple neural network activation function calculation methods on the reconfigurable processor, the implementation of the reconfigurable processor for implementing the calculation of the multiple neural network activation functions can be referred to the implementation of the multiple neural network activation function calculation methods on the reconfigurable processor, and the repetition is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 19 is a block diagram of a reconfigurable processor for implementing computation of a plurality of neural network activation functions according to an embodiment of the present invention, as shown in FIG. 19, including:
A shared memory 1902 for storing input data;
And a reconfigurable processing array 1904, configured to read input data from the shared memory and sequentially implement each basic operation according to the calculation sequence of the basic operations after the neural network activation function is split, wherein the processing units on the peripheral edge of the reconfigurable processing array are used for executing memory access operations and are called memory access processing units, the processing units other than those on the peripheral edge are used for executing arithmetic operations and are called arithmetic processing units, each processing unit on the peripheral edge transmits data to and from the arithmetic processing units in its row or column, and each processing unit in the array transmits data to and from its adjacent processing units in the up, down, left, and right directions.
In another embodiment, there is also provided software for executing the technical solutions described in the foregoing embodiments and preferred embodiments.
In another embodiment, there is also provided a storage medium having the software stored therein, including but not limited to: optical discs, floppy discs, hard discs, erasable memory, etc.
The embodiment of the invention realizes the following technical effects: according to the method, a neural network activation function is split into basic operations, then the basic operations are sequentially realized by reading input data from a shared memory through a reconfigurable processing array according to the calculation sequence of the basic operations in the neural network activation function, the operation of the neural network activation function is realized on the existing reconfigurable processing array structure, the reconfigurable processing array structure is not required to be changed, and a circuit structure is not required to be added, namely, different processing units in the reconfigurable processing array are configured according to the algorithm requirements of different neural network activation functions to perform corresponding operations, so that complex activation function operations can be realized on the reconfigurable processing array structure by utilizing basic operations such as addition, subtraction, multiplication and displacement, thereby being beneficial to simplifying the circuit design of the activation function operations, and improving the circuit operation speed and throughput rate.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing describes specific embodiments for the purpose of illustrating the general principles of the invention and is not intended to limit the invention to these particular embodiments; any modifications, equivalents, improvements, and the like that fall within the spirit and principles of the invention are intended to be included within its scope.

Claims (9)

1. A method for computing activation functions of a plurality of neural networks on a reconfigurable processor, comprising:
Splitting the neural network activation function into basic operations; the basic operation includes: addition, subtraction, multiplication, multiply-accumulate operations, and selection operations;
according to the calculation sequence of the basic operations in the neural network activation function, sequentially implementing each basic operation by reading input data from a shared memory through a reconfigurable processing array of the reconfigurable processor, wherein the processing units on the peripheral edge of the reconfigurable processing array are used for executing memory access operations and are called memory access processing units, the processing units other than those on the peripheral edge are used for executing arithmetic operations and are called arithmetic processing units, each processing unit on the peripheral edge transmits data to and from the arithmetic processing units in its row or column, and each processing unit in the array transmits data to and from its adjacent processing units in the up, down, left, and right directions.
2. The method of computing a plurality of neural network activation functions on a reconfigurable processor of claim 1,
Splitting the neural network activation function into basic operations, including:
Splitting a neural network activation function into selection operations for a linear piecewise neural network activation function;
according to the calculation sequence of each basic operation in the neural network activation function, sequentially realizing each basic operation through the reconfigurable processing array, comprising:
and reading input data from the shared memory through a plurality of memory access processing units in the reconfigurable processing array, transmitting the input data through each memory access processing unit to an arithmetic processing unit in its row or column for the selection operation, transmitting the calculation result of the selection operation through the arithmetic processing unit to a memory access processing unit in its row or column, and storing the calculation result into the shared memory, wherein the memory access processing unit for reading input data and the memory access processing unit for storing the calculation result are different memory access processing units, and the calculation results output by different arithmetic processing units are transmitted to different memory access processing units.
3. The method of computing a plurality of neural network activation functions on a reconfigurable processor of claim 1,
wherein splitting the neural network activation function into basic operations comprises:
for a symmetric neural network activation function that can be fitted by piecewise Taylor expansion, dividing the activation function into a first symmetric part and a second symmetric part according to its symmetry; dividing the input data of the first symmetric part into a plurality of data segments; splitting the operation on each data segment into, in order, a subtraction operation, a selection operation, and a multiply-accumulate operation; performing an addition operation on the multiply-accumulate results of the data segments; comparing the accumulated result with the output maximum value of the first symmetric part and performing a selection operation to obtain the output data of the first symmetric part; and subtracting the output data of the first symmetric part from the output maximum value of the first symmetric part and performing a selection operation to obtain the output data of the second symmetric part;
and wherein, according to the calculation sequence of the basic operations in the neural network activation function, sequentially implementing each basic operation through the reconfigurable processing array comprises:
subtracting, through one memory access processing unit in the reconfigurable processing array, the endpoint values of the divided data segments from the value read from the shared memory; forming a first-stage selector from a plurality of operation processing units, wherein each operation processing unit in the first-stage selector corresponds to one data segment and, based on the subtraction results, outputs the minimum of the read value and the maximum value of its corresponding data segment; forming a second-stage selector from a plurality of operation processing units, wherein each operation processing unit in the second-stage selector corresponds to the preceding data segment, the first operation processing unit in the second-stage selector outputs the output of the first operation processing unit in the first-stage selector, and each remaining operation processing unit in the second-stage selector outputs the maximum of the corresponding first-stage output and the maximum value of the preceding data segment; performing, through operation processing units, multiply-accumulate operations on the outputs of the second-stage selector; adding the multiply-accumulate results through an operation processing unit; subtracting, through an operation processing unit, the addition result from the output maximum value of the first symmetric part and performing a selection operation to obtain the output data of the first symmetric part; and subtracting, through an operation processing unit, the output data of the first symmetric part from the output maximum value of the first symmetric part and performing a selection operation to obtain the output data of the second symmetric part.
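A minimal sketch of the claim-3 dataflow, assuming sigmoid as the symmetric function and linear pieces (one multiply-accumulate per segment); the claim also admits higher-order Taylor terms per segment, and the segment endpoints here are an assumption. The two selector stages clamp the input into each segment exactly as the claim's min/max chain does, and symmetry supplies the second half as max_out - f(|x|).

    import math

    def sig(x):
        return 1.0 / (1.0 + math.exp(-x))

    EDGES = [0.0, 1.0, 2.0, 4.0, 8.0]   # assumed segment endpoints
    SLOPES = [(sig(EDGES[i + 1]) - sig(EDGES[i])) / (EDGES[i + 1] - EDGES[i])
              for i in range(len(EDGES) - 1)]
    MAX_OUT = 1.0                        # output maximum of the first half

    def sigmoid_pwl(x):
        negative = x < 0.0
        x = -x if negative else x                 # fold onto the first symmetric part
        acc = sig(0.0)                            # value at the fold point
        for i, slope in enumerate(SLOPES):
            lo, hi = EDGES[i], EDGES[i + 1]
            first_stage = min(x, hi)              # first-stage selector: min(x, segment max)
            second_stage = max(first_stage, lo)   # second-stage selector: max(.., previous segment max)
            acc += slope * (second_stage - lo)    # multiply-accumulate per segment
        y = min(acc, MAX_OUT)                     # final selection against the output maximum
        return MAX_OUT - y if negative else y     # second symmetric part

    print(round(sigmoid_pwl(1.5), 3), round(sig(1.5), 3))   # 0.806 0.818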
4. A method for computing a plurality of neural network activation functions on a reconfigurable processor as claimed in any one of claims 1 to 3,
wherein splitting the neural network activation function into basic operations comprises:
subtracting the maximum value of the input data from the input data of the neural network activation function to prevent overflow, converting the division in the activation function into a subtraction, and dividing the parameters participating in the operation into different operation terms according to the subtraction;
and wherein, according to the calculation sequence of the basic operations in the neural network activation function, sequentially implementing each basic operation through the reconfigurable processing array comprises:
implementing the operation of each operation term in turn through the reconfigurable processing array.
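The patent never names the function, but claims 4 through 7 match the standard decomposition of softmax, so the sketches below assume softmax. First, the overflow guard and the division-to-subtraction rewrite of this claim, end to end:

    import math

    def softmax_stable(xs):
        m = max(xs)                                 # claim 5: maximum of the input data
        shifted = [x - m for x in xs]               # subtracting the max prevents exp overflow
        log_denom = math.log(sum(math.exp(s) for s in shifted))   # claims 6 and 7
        # the division exp(s) / sum becomes a subtraction in the exponent
        return [math.exp(s - log_denom) for s in shifted]

    print([round(v, 4) for v in softmax_stable([1.0, 2.0, 3.0])])
    # [0.09, 0.2447, 0.6652]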
5. The method for computing a plurality of neural network activation functions on a reconfigurable processor of claim 4, wherein sequentially implementing the operations of the operation terms through the reconfigurable processing array comprises:
dividing the input data into a plurality of data groups; for each data group, reading the input data through a memory access processing unit, receiving the input data through an operation processing unit, performing a selection operation on the input data, and outputting the maximum value of the data group; processing the plurality of data groups in parallel to obtain the maximum value of each data group; and reading the maximum values of the data groups through a memory access processing unit, receiving them through an operation processing unit, performing a selection operation on the received data, and outputting the largest of the group maxima as the maximum value of the input data.
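A sketch of the claim-5 two-level reduction; the group size of 4 is an assumption. On the array each group's selector chain runs in parallel, whereas this model only simulates the parallelism with a list comprehension.

    def grouped_max(data, group_size=4):
        groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
        partial = [max(g) for g in groups]   # per-group selection operations, parallel in hardware
        return max(partial)                  # final selection over the group maxima

    print(grouped_max([3, 9, 1, 4, 7, 2, 8, 5]))   # 9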
6. The method for computing a plurality of neural network activation functions on a reconfigurable processor of claim 4, wherein sequentially implementing the operations of the operation terms through the reconfigurable processing array comprises:
for an exponential function with base e in an operation term: reading the input data through a memory access processing unit; subtracting the maximum value of the input data from the input data through an operation processing unit; multiplying the subtraction result by log₂e through an operation processing unit, thereby converting the base-e exponential function into a base-2 exponential function whose input is the multiplication result; splitting that input into an integer part and a fractional part; performing a Taylor expansion of the base-2 exponential with the fractional part as exponent to obtain a polynomial; evaluating the polynomial through operation processing units to obtain the output of the base-2 exponential; performing a shift operation on that output by the integer part through an operation processing unit to obtain the output of the exponential function; and performing an accumulation operation on the outputs of the exponential function through operation processing units.
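A hedged sketch of the claim-6 exponential path. The three-term Taylor polynomial is an assumption (the claim fixes no degree), and the final multiplication by 2**n stands in for the hardware shift operation.

    import math

    LOG2E = math.log2(math.e)
    LN2 = math.log(2.0)

    def exp2_frac(f):
        # short Taylor expansion of 2**f = e**(f*ln2) around 0
        t = f * LN2
        return 1.0 + t + t * t / 2.0 + t ** 3 / 6.0

    def exp_approx(x, x_max):
        u = (x - x_max) * LOG2E           # e**(x - max) rewritten as 2**u
        n = math.floor(u)                 # integer part -> shift amount
        f = u - n                         # fractional part in [0, 1)
        return exp2_frac(f) * (2.0 ** n)  # "shift" by the integer part

    print(round(exp_approx(1.0, 3.0), 5), round(math.exp(-2.0), 5))
    # 0.13534 0.13534 (the 3-term polynomial already matches to 5 places here)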
7. The method for computing a plurality of neural network activation functions on a reconfigurable processor of claim 6, wherein sequentially implementing the operations of the operation terms through the reconfigurable processing array comprises:
for a logarithmic function with base e in an operation term, wherein the input of the logarithmic function is the accumulation of base-e exponential functions: converting the accumulated sum into the product of a factor k and a base-2 exponential with exponent w; performing a leading-zero count operation through an operation processing unit to obtain the value of w; performing a shift operation on the accumulated sum to obtain the value of k; performing a Taylor expansion of the logarithmic function based on the values of w and k to obtain a polynomial; and evaluating the polynomial through operation processing units to obtain the output of the logarithmic function.
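A sketch of the claim-7 logarithm, assuming a five-term Taylor series (the claim fixes no length). Hardware derives w from a leading-zero count and k by shifting; math.frexp plays both roles here, returning S = k * 2**w with k in [0.5, 1).

    import math

    def ln_approx(S):
        k, w = math.frexp(S)              # stands in for leading-zero count + shift
        t = k - 1.0                       # Taylor variable for ln(1 + t)
        ln_k = t - t**2 / 2 + t**3 / 3 - t**4 / 4 + t**5 / 5
        return w * math.log(2.0) + ln_k   # ln(S) = w*ln2 + ln(k)

    print(round(ln_approx(10.0), 3), round(math.log(10.0), 3))   # 2.303 2.303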
8. The method of computing a plurality of neural network activation functions on a reconfigurable processor of claim 4, wherein sequentially implementing each basic operation through the reconfigurable processing array comprises:
in the process of sequentially implementing the basic operations through the reconfigurable processing array, when an operation processing unit needs to transmit data to a processing unit that is not in its row or column, either performing a routing operation through the processing units interconnected with the operation processing unit, so that the data reaches the target processing unit; or outputting the data of the operation processing unit to a global register for storage, so that the data can be read by the processing unit that is not in its row or column.
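A toy sketch of claim 8's two escape paths when producer and consumer processing units share neither a row nor a column: an L-shaped hop-by-hop route through interconnected neighbours, or a spill to a global register. Both are illustrative models, not the hardware protocol.

    GLOBAL_REGS = {}   # models the global register alternative

    def route(src, dst):
        """L-shaped path: walk along the row first, then down the column."""
        (r, c), (r2, c2) = src, dst
        path = [src]
        while c != c2:
            c += 1 if c2 > c else -1
            path.append((r, c))
        while r != r2:
            r += 1 if r2 > r else -1
            path.append((r, c))
        return path

    def spill(reg, value):
        GLOBAL_REGS[reg] = value   # producer PE writes once

    def fetch(reg):
        return GLOBAL_REGS[reg]    # any PE may read, regardless of row/column

    print(route((1, 1), (3, 4)))
    # [(1, 1), (1, 2), (1, 3), (1, 4), (2, 4), (3, 4)]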
9. A reconfigurable processor for implementing a plurality of neural network activation function calculations, comprising:
a shared memory for storing input data;
a reconfigurable processing array for reading input data from the shared memory and sequentially implementing each basic operation according to the calculation sequence of the basic operations obtained by splitting the neural network activation function, wherein the basic operations include: addition, subtraction, multiplication, multiply-accumulate operations, and selection operations; the processing units on the peripheral edge of the reconfigurable processing array are used for executing memory access operations and are called memory access processing units; the processing units other than those on the peripheral edge are used for executing arithmetic operations and are called operation processing units; each processing unit on the peripheral edge performs data transmission with the operation processing units in its row or column; and each processing unit in the reconfigurable processing array performs data transmission with its adjacent processing units in the up, down, left, and right directions.
CN202011511272.XA 2020-12-18 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor Active CN112540946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011511272.XA CN112540946B (en) 2020-12-18 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011511272.XA CN112540946B (en) 2020-12-18 Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor

Publications (2)

Publication Number Publication Date
CN112540946A CN112540946A (en) 2021-03-23
CN112540946B CN112540946B (en) 2024-06-28

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FPGA-based hardware implementation method for neural networks; Li Ang; Wang Qin; Li Zhancai; Wan Yong; Journal of University of Science and Technology Beijing; 2007-01-25 (No. 01); full text *
Reconfigurable array architecture for logarithmic and exponential functions; Lü Qing; Jiang Lin; Deng Junyong; Li Xueting; Microelectronics & Computer; 2016-10-05 (No. 10); full text *

Similar Documents

Publication Publication Date Title
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
CN110383300B (en) Computing device and method
CN110163356B (en) Computing device and method
CN107305484B (en) Nonlinear function operation device and method
JP7292297B2 (en) probabilistic rounding logic
US20230221924A1 (en) Apparatus and Method for Processing Floating-Point Numbers
Ramachandran et al. Performance analysis of mantissa multiplier and Dadda tree multiplier and implementing with DSP architecture
Roohi et al. Rnsim: Efficient deep neural network accelerator using residue number systems
US20220156043A1 (en) Apparatus and Method for Processing Floating-Point Numbers
CN102004627B (en) Multiplication rounding implementation method and device
CN112540946B (en) Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
US20200311545A1 (en) Information processor, information processing method, and storage medium
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
US20200192633A1 (en) Arithmetic processing device and method of controlling arithmetic processing device
CN110659014B (en) Multiplier and neural network computing platform
CN112540946A (en) Reconfigurable processor and method for calculating activation functions of various neural networks on reconfigurable processor
WO2022126630A1 (en) Reconfigurable processor and method for computing multiple neural network activation functions thereon
CN109298848A Dual-mode floating-point division and square-root circuit
CN212569855U Hardware implementation device for activation function
JP7238376B2 (en) Information processing system and information processing system control method
CN113191494A (en) Efficient LSTM accelerator based on FPGA
Hsiao et al. Design of a low-cost floating-point programmable vertex processor for mobile graphics applications based on hybrid number system
WO2020008643A1 (en) Data processing device, data processing circuit, and data processing method
Zhang et al. A High Energy Efficiency and Low Resource Consumption FPGA Accelerator for Convolutional Neural Network
Ueki et al. Aqss: Accelerator of quantization neural networks with stochastic approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant