WO2022126630A1 - Reconfigurable processor and method for computing multiple neural network activation functions thereon - Google Patents

Reconfigurable processor and method for computing multiple neural network activation functions thereon

Download PDF

Info

Publication number
WO2022126630A1
WO2022126630A1 (PCT/CN2020/137702)
Authority
WO
WIPO (PCT)
Prior art keywords
processing unit
neural network
reconfigurable
data
output
Prior art date
Application number
PCT/CN2020/137702
Other languages
English (en)
French (fr)
Inventor
尹首一
邓大峥
谷江源
韩慧明
刘雷波
魏少军
Original Assignee
清华大学 (Tsinghua University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学 (Tsinghua University)
Priority to PCT/CN2020/137702 priority Critical patent/WO2022126630A1/zh
Publication of WO2022126630A1 publication Critical patent/WO2022126630A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/167Interprocessor communication using a common memory, e.g. mailbox
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/177Initialisation or configuration control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks

Definitions

  • The present invention relates to the technical field of reconfigurable processors, and in particular to a reconfigurable processor and methods for computing various neural network activation functions on it.
  • As Moore's law approaches its physical limits, chip design is required to shift from improvements in power performance to improvements in energy efficiency and flexibility. Domain-specific chip architectures, optimized for a particular field, have therefore become the mainstream of chip design, and balancing high performance, a high energy-efficiency ratio, and high flexibility has become a key design metric.
  • As neural networks develop, network structures and activation functions also keep changing.
  • For a dedicated ASIC neural network accelerator, once the network structure and activation function change, the acceleration effect declines and the accelerator may no longer apply to the new network.
  • Embodiments of the present invention provide a method for computing multiple neural network activation functions on a reconfigurable processor, to solve the technical problem in the prior art that ASIC neural network accelerators accelerate poorly after the network structure and activation function change.
  • The method includes: splitting the neural network activation function into basic operations; and,
  • according to the computation order of the basic operations in the activation function, reading input data from a shared memory through the reconfigurable processing array of the reconfigurable processor to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the array perform memory-access operations and are called memory-access processing units; the other processing units of the array perform arithmetic operations and are called arithmetic processing units; the edge processing units exchange data with the arithmetic processing units in their own row or column; and each processing unit of the array exchanges data with the adjacent processing units present above, below, to its left, and to its right.
  • An embodiment of the present invention also provides a reconfigurable processor for computing various neural network activation functions, to solve the same technical problem that ASIC neural network accelerators accelerate poorly after the network structure and activation function change.
  • The reconfigurable processor includes a shared memory for storing input data, and a reconfigurable processing array that reads the input data from the shared memory, according to the computation order of the basic operations into which the neural network activation function has been split, to implement the basic operations in sequence,
  • wherein the processing units on the surrounding edges of the array perform memory-access operations and are called memory-access processing units,
  • the other processing units of the array perform arithmetic operations and are called arithmetic processing units,
  • the edge processing units exchange data with the arithmetic processing units in their own row or column,
  • and each processing unit of the array exchanges data with its adjacent processing units.
  • In embodiments of the present invention, the neural network activation function is split into basic operations, and the reconfigurable processing array then reads input data from the shared memory, following the computation order of those operations, to implement them in sequence. The activation function is thus computed on the existing reconfigurable processing array structure, without changing that structure or adding circuitry to it.
  • According to the algorithmic requirements of different activation functions, different processing units of the array are configured to perform the corresponding operations, so that complex activation functions are realized on the array from basic operations such as addition, subtraction, multiplication, and shifting.
  • This simplifies the circuit design for activation-function computation and improves circuit speed and throughput; and because the operations of the processing units can be flexibly configured and pipelined input/output is used, the scheme accommodates varying activation functions, is scalable, and improves processing-unit utilization.
  • FIG. 1 is a flowchart of a method for calculating multiple neural network activation functions on a reconfigurable processor provided by an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a curve of a relu function provided by an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a calculation flow of a relu function provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the arrangement of processing units in a reconfigurable processing array when a relu function is operated according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of a curve of a sigmoid function provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a calculation flow of a sigmoid function provided by an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of the arrangement of processing units in a reconfigurable processing array when a sigmoid function is calculated according to an embodiment of the present invention
  • FIG. 8 is a schematic diagram of a segmented function image when a sigmoid function is calculated according to an embodiment of the present invention
  • FIG. 9 is a schematic diagram of the accumulation of segmented function images when a sigmoid function is calculated according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of a curve of a tanh function provided by an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a calculation flow of a tanh function provided by an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of the arrangement of processing units in a reconfigurable processing array when operating a tanh function according to an embodiment of the present invention
  • FIG. 13 is a schematic diagram of a calculation flow of an overflow prevention process provided by an embodiment of the present invention.
  • FIG. 14 is a schematic diagram of the arrangement of processing units in a reconfigurable processing array during overflow prevention processing provided by an embodiment of the present invention
  • FIG. 15 is a schematic diagram of a calculation flow for calculating e^x according to an embodiment of the present invention.
  • FIG. 16 is a schematic diagram of the arrangement of processing units in a reconfigurable processing array when calculating e^x according to an embodiment of the present invention.
  • FIG. 17 is a schematic diagram of a calculation flow for calculating ln(Σe^x) according to an embodiment of the present invention.
  • FIG. 18 is a schematic diagram of the arrangement of processing units in a reconfigurable processing array when calculating ln(Σe^x) according to an embodiment of the present invention.
  • FIG. 19 is a structural block diagram of a reconfigurable processor for implementing multiple neural network activation function calculations according to an embodiment of the present invention.
  • The inventors of the present application found that coarse-grained reconfigurable processor architectures are attracting more and more attention for their low energy consumption, high performance, high energy efficiency, and flexible dynamic reconfigurability.
  • The flexibility of a reconfigurable computing architecture lies between that of general-purpose processors and ASICs, while optimization can bring its efficiency close to that of an ASIC, so it combines the advantages of both. These characteristics make it well suited to data-intensive computation, which matches the computational requirements of neural networks exactly. In neural network computation, the activation function, as the part that provides nonlinearity, is particularly important to implement; unlike dedicated ASIC processors, however, coarse-grained reconfigurable processors have no circuits dedicated to activation functions.
  • The inventors therefore propose the methods described herein for computing various neural network activation functions on a reconfigurable processor, which realize relatively complex activation functions on an existing, comparatively simple reconfigurable processing array circuit design.
  • a method for calculating multiple neural network activation functions on a reconfigurable processor includes:
  • Step 102: Split the neural network activation function into basic operations;
  • Step 104: According to the computation order of the basic operations in the activation function, read input data from the shared memory through the reconfigurable processing array of the reconfigurable processor to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the array perform memory-access operations (memory-access processing units); the other processing units perform arithmetic operations (arithmetic processing units); the edge processing units exchange data with the arithmetic processing units in their own row or column; and each processing unit exchanges data with the adjacent processing units present above, below, to its left, and to its right.
  • As the flow of FIG. 1 shows, the activation function is split into basic operations, and the reconfigurable processing array reads input data from the shared memory to implement each basic operation in sequence; the activation function is thus computed on the existing array structure without changing it or adding circuitry to it.
  • That is, according to the algorithmic requirements of different activation functions, different processing units of the array are configured to perform the corresponding operations, so that complex activation functions are realized from basic operations such as addition, subtraction, multiplication, and shifting. This simplifies the circuit design and improves circuit speed and throughput; since the processing units' operation algorithms can be flexibly configured and pipelined input/output is used, the scheme accommodates varying activation functions, provides scalability, and improves processing-unit utilization.
  • In specific implementations, for different activation functions, the computation of the function can be split into basic operations, and the basic operations are then implemented in sequence by reading input data from a shared memory through the reconfigurable processing array.
  • For one and the same activation function, the granularity and the scheme of the split can be adjusted, which makes the computation scalable and able to meet different precision and throughput requirements. For example, under low-precision requirements the function can be split coarsely into fewer basic operations, trading accuracy for throughput; under high-precision requirements it can be split finely into more basic operations to improve accuracy.
  • The basic operations may include basic, simple operations such as addition, subtraction, multiplication, multiply-accumulate, shift, and selection.
  • For a linearly piecewise activation function, the computation can proceed on the reconfigurable processing array through the following steps: the activation function is split into selection operations,
  • and the basic operations are implemented in sequence through the array as follows:
  • the input data is read from the shared memory through several memory-access processing units of the array; each memory-access processing unit passes the input data to an arithmetic processing unit in its own row or column for the selection operation; the arithmetic processing unit passes the result of the selection back to a memory-access processing unit in its own row or column, which stores the result in the shared memory,
  • wherein the memory-access processing units that read input data and those that store results are different memory-access processing units, which enables pipelined execution, and the results output by different arithmetic processing units are passed to different memory-access processing units, so that the results are stored in the shared memory without overwriting one another.
  • For an activation function that is symmetric and can be fitted by piecewise Taylor expansion, the computation can proceed on the array through the following steps.
  • The function is split according to its symmetry into a first symmetric part and a second symmetric part; the input data of the first symmetric part is divided into several data segments;
  • the computation of each segment is split into subtraction, selection, and multiply-accumulate operations; the multiply-accumulate results of the segments are added; the accumulated result is subtracted from the output maximum of the first symmetric part and a selection operation is applied to obtain the output data of the first symmetric part; and the output data of the second symmetric part is obtained by subtracting the output data of the first symmetric part from that output maximum and applying a selection operation.
  • The basic operations are then implemented in sequence through the array as follows:
  • one memory-access processing unit of the array reads one value of each data segment at a time from the shared memory; several arithmetic processing units subtract the endpoint values of the data segments from the value read;
  • several arithmetic processing units form a first-stage selector, in which each unit corresponds to one data segment and, based on the subtraction result, outputs the minimum of the value read and the maximum of its segment;
  • several arithmetic processing units form a second-stage selector, in which each unit corresponds to the previous data segment: the first unit outputs the output of the first unit of the first-stage selector,
  • and each other unit outputs the maximum of the corresponding first-stage output and the maximum of the previous data segment;
  • arithmetic processing units then apply multiply-accumulate operations to the outputs of the second-stage selector and add the multiply-accumulate results together; an arithmetic processing unit subtracts 1 from the sum and applies a selection operation to obtain the output data of the first symmetric part; and an arithmetic processing unit subtracts the output data of the first symmetric part from 1 and applies a selection operation to obtain the output data of the second symmetric part.
  • Symmetric activation functions that admit piecewise Taylor-expansion fitting are exemplified by the S-shaped growth-curve function (the sigmoid function) and the hyperbolic tangent function (the tanh function).
  • The sigmoid function is an S-shaped function common in biology. It maps the input variable into (0, 1), as shown in FIG. 5, and is monotonically increasing and easy to differentiate.
  • If the output unit handles a binary classification problem, the generalized linear model yields the sigmoid function, and the output follows a Bernoulli distribution.
  • The fetch address of a processing unit is generally formed from a base address and an offset address; if the reconfigurable array were used to implement a lookup table, the fetch address would change with the input data and stall the pipeline. This embodiment therefore integrates and accumulates the function piecewise so that it can be computed in a pipelined manner. The basic operations into which the sigmoid function is split are shown in Table 2 below.
  • The inputs of the Sel operation are a, b, and c; according to the value of a, it selects either b or c for output.
  • The first-stage selection function uses three processing units, each outputting the smaller of its two inputs; the second-stage selection function uses three processing units, each outputting the larger of its two inputs.
  • The input data is subtracted by 4, 8, and 15 (the endpoint values of the data segments above); the results of the subtractions determine the range into which the input data falls.
  • Input data in three of the intervals is analyzed as examples, taking the values 1, 6, and 18.
  • When the input is 1, in the first-stage selector the first selector (inputs 1 and 4) outputs 1, the second selector (inputs 1 and 8) outputs 1, and the third selector (inputs 1 and 15) outputs 1. The outputs of the first stage then pass through the second-stage selector: the output of the first first-stage selector, 1, is forwarded directly through the first second-stage selector by a routing operation; the second selector (inputs 1 and 4) outputs 4; and the third selector (inputs 1 and 8) outputs 8.
  • Likewise, when the input is 18, the first-stage selectors output 4, 8, and 15. Passing these through the second stage, the first selector outputs 4, the second selector (inputs 8 and 4) outputs 8, and the third selector (inputs 8 and 18) outputs 18.
  • The segment intervals of the sigmoid function are taken as [0, 4), [4, 8), [8, 15), and [15, ∞). Beyond 15 the result can be taken as 1, with a precision loss of about 10^-7, which is negligible.
  • On [0, 15], piecewise Taylor expansion to the third order is used to obtain the approximating functions.
  • For an ordinary operation, the inputs of a PE are a and b, and its output is the function f(a, b) executed by the PE; either a or b can also be selected as the output, the specific choice depending on the positions of a and b in the compilation instructions that configure the PE. The specific operation and output of each PE in the reconfigurable processing array can therefore be set through configuration.
  • The sigmoid computation above uses an implementation based on piecewise integration and accumulation and, by the symmetry of the function, is computed in a pipeline; it can be realized with 3 global PEs and 28 processing-unit PEs.
  • The tanh function, shown in FIG. 10, is similar to the sigmoid function: it is monotonically increasing, easy to differentiate, and maps the input variable into (-1, 1).
  • The tanh function can be computed in the same way as the sigmoid function, only with different segment intervals, taken here as [0, 1), [1, 2), [2, 4), and [4, ∞).
  • The flow of the tanh computation is shown in FIG. 11, and the arrangement of processing units in the reconfigurable processing array when computing tanh is shown in FIG. 12.
  • For an activation function that includes division, the computation proceeds on the reconfigurable processing array through the following steps:
  • the maximum of the input data is subtracted from the input data of the activation function to avoid overflow; the division in the activation function is converted into subtraction; and, according to that subtraction, the parameters involved in the computation are divided into different operation terms;
  • the basic operations are then implemented in sequence through the array by
  • implementing the operation terms in sequence through the array.
  • Softmax is taken as the example of an activation function that includes division; its expression is softmax(x_i) = e^(x_i) / Σ_j e^(x_j).
  • Through anti-overflow processing, the softmax function is transformed so that
  • the input becomes x - x_max, which prevents the result of the e^x function from growing large enough to overflow. Since division is comparatively complex to implement in a circuit, the present invention replaces division with subtraction, reducing power consumption and resource usage and thereby improving the speed and efficiency of the computation. With a logarithmic transformation, softmax becomes softmax(x_i) = e^(x_i - x_max - ln(Σ_j e^(x_j - x_max))).
  • The softmax computation is therefore divided into four parts.
  • The first part is the anti-overflow part, solving x - x_max (an operation term).
  • The second part computes e^x (an operation term).
  • The third part accumulates the computed e^x values and computes ln(Σe^x) (an operation term).
  • The fourth part solves e^(x - x_max - ln(Σe^(x - x_max))) (an operation term).
  • The maximum of the input data is found in the following manner.
  • The input data is divided into several data groups; for each group, a memory-access processing unit reads the input data, an arithmetic processing unit receives it and applies selection operations, and the group maximum is output; the groups are processed in parallel to obtain each group's maximum.
  • A memory-access processing unit then reads the group maxima, an arithmetic processing unit receives them and applies selection operations, and the largest of the group maxima is output, giving the maximum of the input data.
  • Taking a division of the input data into 16 groups as an example, the operations for determining the maximum are shown in Table 5 below; the comparison operations of the RPU's processing array compare the 16 groups in parallel.
  • Memory-access processing units execute Load operations to read each group's input data from the shared memory; arithmetic processing units perform subtraction and selection operations to select the maximum within each of the 16 groups; memory-access processing units then store each group maximum in the shared memory. Finally, the maxima of the 16 data groups are compared with one another to obtain the maximum of the input data.
  • Because the RPU can process data in parallel, this speeds up data processing and improves efficiency.
  • For the base-e exponential in an operation term, this embodiment reads the input data through a memory-access processing unit, subtracts the maximum of the input data from the input through an arithmetic processing unit, and multiplies the result of the subtraction by log2(e) through an arithmetic processing unit. After the exponential is rewritten with base 2, the result of the multiplication is the input of the base-2 exponential, consisting of an integer part and a fractional part.
  • The base-2 exponential of the fractional part is Taylor-expanded into a polynomial, which the arithmetic processing units evaluate to obtain the output of the base-2 exponential of the fractional part; a shift operation on that
  • output by the integer part gives the output of the exponential, and the arithmetic processing units accumulate the outputs of the exponential.
  • Here u_i is the integer part of the transformed input after the change-of-base formula is applied,
  • v_i is the fractional part,
  • and y_i = x - x_max.
  • For the base-e logarithm in an operation term, whose input is the accumulation of base-e exponentials, this embodiment converts the accumulated value into
  • the product of k and a base-2 exponential with exponent w; the value of w is obtained by a leading-zero operation in an arithmetic processing unit, and the value of k by shifting the accumulated value of the base-e exponentials; based on the values of w and k,
  • the logarithm is Taylor-expanded into a polynomial, which the arithmetic processing units evaluate to obtain the output of the logarithm.
  • The computed values of e^x are accumulated, and ln(Σe^x) is obtained.
  • The accumulation can be carried out concurrently with the second step of the softmax computation: each time a result is produced, it is added into the global register.
  • The central idea for computing ln(Σe^x) is Taylor expansion: rewriting ln(Σe^x) as ln(2^w * k) makes the expansion possible.
  • The computation flow for ln(Σe^x) is shown in FIG. 17, and the arrangement of processing units in the processing array is shown in FIG. 18.
  • Since the first step has solved x_max and the third step has solved ln(Σe^(x - x_max)), the quantity to be subtracted is updated to x_max + ln(Σe^(x - x_max)) and fed into the e^x computation of the second step; the computation flow is exactly the same as that of the second step.
  • When an arithmetic processing unit needs to exchange data with a processing unit that is not in its own row or column, a processing unit that has a data-transfer interconnection with it performs a routing operation so that the data can reach the target unit; alternatively, the data of the arithmetic processing unit is output to the global register for storage,
  • where processing units outside the arithmetic processing unit's row or column can read it.
  • The methods above for computing multiple activation functions on the reconfigurable processor were simulated in the Python language, with the input data drawn as random numbers in (-101, 101), the number of inputs drawn as a random number in (1, 100), and 100 rounds of testing.
  • The maximum error is about 0.01, a precision of 6 to 7 binary decimal places.
  • The precision can be improved by increasing the order of the Taylor expansions; here, to limit power consumption, the expansion order was not increased further.
  • The methods above realize the computation of activation functions on the reconfigurable architecture mainly by means of Taylor expansion. In the softmax computation, subtraction replaces division, and the change-of-base formula combined with shifting replaces e^x, which reduces the coefficients that must be stored and the computation time, further reducing hardware-resource overhead and hence area and power consumption.
  • The methods are also flexible: the expansion order can be customized for the application to meet various precision requirements, achieving a good balance among power consumption, computational efficiency, and accuracy.
  • Based on the same inventive concept, an embodiment of the present invention also provides a reconfigurable processor for computing various neural network activation functions, as described in the following embodiments. Since the principle by which this reconfigurable processor solves the problem is similar to that of the methods above, its
  • implementation may refer to the implementation of those methods, and the repeated parts are not described again.
  • the term "unit” or "module” may be a combination of software and/or hardware that implements a predetermined function.
  • FIG. 19 is a structural block diagram of a reconfigurable processor for realizing the calculation of various neural network activation functions according to an embodiment of the present invention, as shown in FIG. 19 , including:
  • a shared memory 1902 for storing input data;
  • a reconfigurable processing array 1904 for reading the input data from the shared memory, according to the computation order of the basic operations into which the neural network activation function has been split, to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the array perform memory-access operations (memory-access processing units); the other processing units perform arithmetic operations (arithmetic processing units); the edge processing units exchange data with the arithmetic processing units in their own row or column; and each processing unit of the array exchanges data with the adjacent processing units present above, below, to its left, and to its right.
  • In another embodiment, software is also provided for executing the technical solutions described in the foregoing embodiment and its preferred implementations.
  • In another embodiment, a storage medium is also provided in which the above software is stored; the storage medium includes but is not limited to an optical disk, a floppy disk, a hard disk, a rewritable memory, and the like.
  • The embodiments of the present invention achieve the following technical effects: the neural network activation function is split into basic operations, and the reconfigurable processing array reads input data from the shared memory, following the computation order of those operations,
  • to implement them in sequence; the activation function is thus computed on the existing array structure without changing it or adding circuitry to it.
  • Different processing units of the array are configured according to the algorithmic requirements of different activation functions, so that complex activation functions are realized on the array from basic operations such as addition, subtraction, multiplication, and shifting.
  • embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data-processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Advance Control (AREA)

Abstract

A reconfigurable processor and a method for computing multiple neural network activation functions on it, wherein the method includes: splitting a neural network activation function into basic operations (102); according to the computation order of the basic operations in the neural network activation function, reading input data from a shared memory through the reconfigurable processing array of the reconfigurable processor to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the reconfigurable processing array can perform memory-access operations as well as other operations and are called memory-access processing units; the processing units of the array other than those on the surrounding edges can only perform arithmetic operations and are called arithmetic processing units; the processing units on the surrounding edges exchange data with the arithmetic processing units in their own row or column; and each processing unit in the array exchanges data with the adjacent processing units present above, below, to its left, and to its right (104).

Description

Reconfigurable processor and method for computing multiple neural network activation functions thereon

Technical Field

The present invention relates to the technical field of reconfigurable processors, and in particular to a reconfigurable processor and methods for computing multiple neural network activation functions on it.

Background

In recent years, with the development of artificial intelligence, cloud computing, big data, and related technologies, the demand for computation, and hence for chip performance, has grown continually. However, as chip feature sizes shrink, Moore's law is approaching its physical limits and the power of integrated circuits can hardly keep increasing, so chip design is required to shift from improvements in power performance to improvements in energy efficiency and flexibility. Domain-specific chip architectures, optimized for a particular field, have therefore become the mainstream of chip design, and balancing high performance, a high energy-efficiency ratio, and high flexibility has become a key metric of chip design today.

At the same time, as neural networks keep developing, network structures and activation functions keep changing. For a dedicated ASIC neural network accelerator, once the network structure and activation function change, the acceleration effect declines and the accelerator may no longer apply to the new type of network.
Summary of the Invention

Embodiments of the present invention provide a method for computing multiple neural network activation functions on a reconfigurable processor, to solve the technical problem in the prior art that ASIC neural network accelerators accelerate poorly after the network structure and activation function change. The method includes:

splitting a neural network activation function into basic operations;

according to the computation order of the basic operations in the neural network activation function, reading input data from a shared memory through the reconfigurable processing array of the reconfigurable processor to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the reconfigurable processing array are used to perform memory-access operations and are called memory-access processing units; the processing units of the array other than those on the surrounding edges are used to perform arithmetic operations and are called arithmetic processing units; the processing units on the surrounding edges exchange data with the arithmetic processing units in their own row or column; and each processing unit in the array exchanges data with the adjacent processing units present above, below, to its left, and to its right.

An embodiment of the present invention further provides a reconfigurable processor for computing multiple neural network activation functions, to solve the same technical problem. The reconfigurable processor includes:

a shared memory for storing input data; and

a reconfigurable processing array for reading the input data from the shared memory, according to the computation order of the basic operations into which the neural network activation function has been split, to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the array are used to perform memory-access operations (memory-access processing units); the other processing units are used to perform arithmetic operations (arithmetic processing units); the edge processing units exchange data with the arithmetic processing units in their own row or column; and each processing unit exchanges data with the adjacent processing units above, below, to its left, and to its right.
In embodiments of the present invention, the neural network activation function is split into basic operations, and the reconfigurable processing array then reads input data from the shared memory, following the computation order of those basic operations, to implement them in sequence. The activation function is thus computed on an existing reconfigurable processing array without changing the array structure or adding circuitry to it: according to the algorithmic requirements of different activation functions, different processing units of the array are configured to perform the corresponding operations, so that complex activation functions are realized on the array from basic operations such as addition, subtraction, multiplication, and shifting. This simplifies the circuit design for activation-function computation and improves circuit speed and throughput; and because the operations of the processing units can be flexibly configured and pipelined input/output is used, the scheme accommodates varying activation functions, is scalable, and improves processing-unit utilization.
Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:
FIG. 1 is a flowchart of a method for computing multiple neural network activation functions on a reconfigurable processor according to an embodiment of the present invention;

FIG. 2 is a curve of the relu function according to an embodiment of the present invention;

FIG. 3 is a schematic computation flow of the relu function according to an embodiment of the present invention;

FIG. 4 shows the arrangement of processing units in the reconfigurable processing array when computing the relu function according to an embodiment of the present invention;

FIG. 5 is a curve of the sigmoid function according to an embodiment of the present invention;

FIG. 6 is a schematic computation flow of the sigmoid function according to an embodiment of the present invention;

FIG. 7 shows the arrangement of processing units in the reconfigurable processing array when computing the sigmoid function according to an embodiment of the present invention;

FIG. 8 shows the piecewise function images used when computing the sigmoid function according to an embodiment of the present invention;

FIG. 9 shows the accumulated piecewise function images when computing the sigmoid function according to an embodiment of the present invention;

FIG. 10 is a curve of the tanh function according to an embodiment of the present invention;

FIG. 11 is a schematic computation flow of the tanh function according to an embodiment of the present invention;

FIG. 12 shows the arrangement of processing units in the reconfigurable processing array when computing the tanh function according to an embodiment of the present invention;

FIG. 13 is a schematic computation flow of the anti-overflow processing according to an embodiment of the present invention;

FIG. 14 shows the arrangement of processing units in the reconfigurable processing array during anti-overflow processing according to an embodiment of the present invention;

FIG. 15 is a schematic computation flow for computing e^x according to an embodiment of the present invention;

FIG. 16 shows the arrangement of processing units in the reconfigurable processing array when computing e^x according to an embodiment of the present invention;

FIG. 17 is a schematic computation flow for computing ln(Σe^x) according to an embodiment of the present invention;

FIG. 18 shows the arrangement of processing units in the reconfigurable processing array when computing ln(Σe^x) according to an embodiment of the present invention;

FIG. 19 is a structural block diagram of a reconfigurable processor for computing multiple neural network activation functions according to an embodiment of the present invention.
Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the drawings. The illustrative embodiments and their descriptions explain the invention and do not limit it.

The inventors of the present application found that coarse-grained reconfigurable processor architectures are receiving more and more attention for their low energy consumption, high performance, high energy efficiency, and flexible dynamic reconfigurability. The flexibility of a reconfigurable computing architecture lies between that of general-purpose processors and ASICs, and optimization can bring its efficiency close to that of an ASIC, so it combines the advantages of both. These characteristics make it well suited to data-intensive computation, which matches the computational requirements of neural networks exactly. In neural network computation, the activation function, as the part that provides nonlinearity, is particularly important to implement; however, unlike dedicated ASIC processors, coarse-grained reconfigurable processors have no circuits dedicated to activation functions, and adding activation-function circuits to a reconfigurable computing architecture would inevitably introduce redundancy, while the more complex circuit design would degrade performance and raise power consumption. The inventors therefore propose the methods below for computing multiple neural network activation functions on a reconfigurable processor, which realize relatively complex activation functions on an existing, comparatively simple reconfigurable processing array circuit design.
In an embodiment of the present invention, a method for computing multiple neural network activation functions on a reconfigurable processor is provided. As shown in FIG. 1, the method includes:

Step 102: Split the neural network activation function into basic operations;

Step 104: According to the computation order of the basic operations in the activation function, read input data from the shared memory through the reconfigurable processing array of the reconfigurable processor to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the array are used to perform memory-access operations (memory-access processing units); the other processing units are used to perform arithmetic operations (arithmetic processing units); the edge processing units exchange data with the arithmetic processing units in their own row or column; and each processing unit exchanges data with the adjacent processing units present above, below, to its left, and to its right.

As the flow in FIG. 1 shows, the activation function is split into basic operations, and the reconfigurable processing array then reads input data from the shared memory, following the computation order of those operations, to implement them in sequence. The activation function is thus computed on an existing reconfigurable processing array without changing the array structure or adding circuitry to it: different processing units of the array are configured according to the algorithmic requirements of different activation functions, so that complex activation functions are realized from basic operations such as addition, subtraction, multiplication, and shifting. This simplifies the circuit design, improves circuit speed and throughput, and, since the processing units' operation algorithms are flexibly configurable and pipelined input/output is used, accommodates varying activation functions, provides scalability, and improves processing-unit utilization.
In a specific implementation, for different neural network activation functions, the computation of the function can be split into basic operations, which the reconfigurable processing array then implements in sequence by reading input data from the shared memory. For one and the same activation function, the granularity and the scheme of the split can be adjusted, which makes the computation scalable and able to meet different precision and throughput requirements: under low-precision requirements, the function can be split coarsely into fewer basic operations, trading accuracy for throughput; under high-precision requirements, it can be split finely into more basic operations to improve accuracy.

In a specific implementation, the basic operations may include simple operations such as addition, subtraction, multiplication, multiply-accumulate, shift, and selection, so that complex activation functions are realized by executing simple basic operations on the reconfigurable processing array.
In a specific implementation, a linearly piecewise activation function can be computed on the reconfigurable processing array through the following steps.

Splitting the activation function into basic operations includes: for a linearly piecewise activation function, splitting the function into selection operations.

Implementing the basic operations in sequence through the array, according to their computation order, includes:

reading the input data from the shared memory through several memory-access processing units of the array; each memory-access processing unit passes the input data to an arithmetic processing unit in its own row or column for the selection operation; the arithmetic processing unit passes the result of the selection back to a memory-access processing unit in its own row or column, which stores it in the shared memory, wherein the memory-access processing units that read input data and those that store results are different units, and the results of different arithmetic processing units are sent to different memory-access processing units.
In a specific implementation, the linear rectification function (relu), f(x) = max(0, x), is taken as the example of a linearly piecewise activation function; as shown in FIG. 2, its curve is monotonically increasing and easy to differentiate.

To implement relu on the reconfigurable computing architecture, the ASIC-style circuit algorithm for relu must be mapped onto the array: the input x is fetched from the shared memory of the array, and a sel operation judges the sign of the input to choose whether the final output is 0 or x.

The implementation of relu is illustrated below on a 4x4 reconfigurable processing array PEA (one quarter of the full array, which is typically 8x8). The basic operations into which relu is split are listed in Table 1 below. As shown in FIG. 3, processing units PE on the edges of the array (the memory-access processing units) execute Load operations to fetch input data from the shared memory; PEs inside the array (the arithmetic processing units) execute the sel operation to choose between 0 and x; finally, edge PEs execute Save operations to store the results back into the shared memory. The arrangement of the operations over the processing units of the array is shown in FIG. 4. The memory-access processing units that read input data differ from those that store results, enabling pipelined execution, and the results of different arithmetic processing units go to different memory-access processing units, so that they are stored in the shared memory without overwriting one another.
Table 1

Operation  Meaning
Load       Fetch: read data from memory
Sel        Select: inputs a, b, c; according to the value of a, output either b or c
Save       Store: write data into memory
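For illustration, the following is a minimal Python sketch of the Load/Sel/Save dataflow just described. It is a behavioral model only: the memory layout, function names, and sample values are chosen here for illustration and are not taken from the patent's PE configuration.

    # Behavioral sketch of the relu dataflow: edge PEs load, an inner PE
    # selects between 0 and x from the sign of x, and edge PEs save the
    # result to a different memory region so reads and writes do not collide.
    shared_memory = {"in": [3.5, -1.2, 0.0, -7.0, 2.25], "out": []}

    def load(mem, region, i):           # memory-access PE: Load
        return mem[region][i]

    def sel(a, b, c):                   # arithmetic PE: Sel(a, b, c)
        return b if a else c

    def save(mem, region, value):       # memory-access PE: Save
        mem[region].append(value)

    for i in range(len(shared_memory["in"])):
        x = load(shared_memory, "in", i)
        y = sel(x > 0, x, 0.0)          # relu(x) = max(0, x) as a selection
        save(shared_memory, "out", y)

    print(shared_memory["out"])         # [3.5, 0.0, 0.0, 0.0, 2.25]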
In a specific implementation, an activation function that is symmetric and can be fitted by piecewise Taylor expansion can be computed on the array through the following steps.

Splitting the activation function into basic operations includes: splitting the function according to its symmetry into a first symmetric part and a second symmetric part; dividing the input data of the first symmetric part into several data segments; splitting the computation of each segment into subtraction, selection, and multiply-accumulate operations; adding the multiply-accumulate results of the segments; subtracting the accumulated result from the output maximum of the first symmetric part and applying a selection operation to obtain the output data of the first symmetric part; and subtracting the output data of the first symmetric part from that output maximum and applying a selection operation to obtain the output data of the second symmetric part.

Implementing the basic operations in sequence through the array, according to their computation order, includes: one memory-access processing unit of the array reads one value of each data segment at a time from the shared memory; several arithmetic processing units subtract the endpoint values of the data segments from the value read; several arithmetic processing units form a first-stage selector, in which each unit corresponds to one data segment and, based on the subtraction result, outputs the minimum of the value read and the maximum of its segment; several arithmetic processing units form a second-stage selector, in which each unit corresponds to the previous data segment, the first unit outputs the output of the first unit of the first-stage selector, and each other unit outputs the maximum of the corresponding first-stage output and the maximum of the previous data segment; arithmetic processing units then apply multiply-accumulate operations to the outputs of the second-stage selector and add the results together; an arithmetic processing unit subtracts 1 from the sum and applies a selection operation to obtain the output data of the first symmetric part; and an arithmetic processing unit subtracts the output data of the first symmetric part from 1 and applies a selection operation to obtain the output data of the second symmetric part.
In a specific implementation, the S-shaped growth-curve function (the sigmoid function) and the hyperbolic tangent function (the tanh function) are taken as examples of symmetric activation functions that admit piecewise Taylor fitting. The sigmoid function,

    sigmoid(x) = 1 / (1 + e^(-x)),

is an S-shaped function common in biology. It maps the input variable into (0, 1), as shown in FIG. 5, and is monotonically increasing and easy to differentiate. In a neural network, if the output unit handles a binary classification problem, the generalized linear model yields the sigmoid function, and the output follows a Bernoulli distribution.
In a specific implementation, a lookup table is hard to implement in a pipelined way on the reconfigurable array: because the input data varies, the fetch address would vary as well, and since the fetch address of a processing unit is generally formed from a base address and an offset address, a lookup table would stall the pipeline whenever the input changes. This embodiment therefore integrates and accumulates the function piecewise, so that the function is computed in a pipelined manner. The basic operations into which the sigmoid function is split are listed in Table 2.

Table 2 (rendered as an image in the original) lists the basic operations used for the sigmoid computation.
In a specific implementation, first, by the symmetry of sigmoid, only the part of the function for inputs greater than 0 (the first symmetric part) needs to be computed; the other half (the second symmetric part) is then obtained by a rotation. All input data is therefore mapped into the interval [0, ∞).

Next, Taylor expansions are taken on different parts of the sigmoid function to obtain approximating functions. For the reconfigurable processing array, the input range [0, +∞) of the sigmoid function is divided into 4 data segments (in practice the number of segments is chosen by the precision requirement; more segments give higher precision), namely [0, 4), [4, 8), [8, 15), and [15, ∞).
The range of the input data is first determined using the sel operation function. The inputs of the sel function are a, b, and c; according to the value of a, it selects either b or c for output. The input data is first subtracted to judge its range.

A two-stage selection function is built from processing units: the first stage uses three processing units, each outputting the smaller of its two inputs; the second stage uses three processing units, each outputting the larger of its two inputs.
As shown in FIG. 6 and FIG. 7, the input data is subtracted by 4, 8, and 15 (the endpoint values of the data segments above), and the results of the subtractions determine the range of the input. Input data in three of the intervals is analyzed below, taking the values 1, 6, and 18 as examples.

When the input is 1, in the first-stage selector the first selector (inputs 1 and 4) outputs 1, the second selector (inputs 1 and 8) outputs 1, and the third selector (inputs 1 and 15) outputs 1. The outputs of the first stage then pass through the second-stage selector: the output of the first first-stage selector, 1, is forwarded directly through the first second-stage selector by a routing operation; the second selector (inputs 1 and 4) outputs 4; and the third selector (inputs 1 and 8) outputs 8.

Likewise, when the input is 6, the first-stage selectors output 4, 6, and 6. Through the second stage, the first selector outputs 4, the second selector (inputs 6 and 4) outputs 6, and the third selector (inputs 6 and 8) outputs 8.

Likewise, when the input is 18, the first-stage selectors output 4, 8, and 15. Through the second stage, the first selector outputs 4, the second selector (inputs 8 and 4) outputs 8, and the third selector (inputs 8 and 18) outputs 18.
In summary, the sel operation function formed by the two-stage selector can be written as equation (1):

    sel(x, y, z) = max(min(x, y), z),  y = 4, 8, 15;  z = 4, 8        (1)

The three outputs of the second-stage selector are sent along three different paths of processing units for MAC operations, i.e., through the Taylor expansions generated at three different points; accumulating them gives the final output. As shown in FIG. 8, the solid curve is the half sigmoid function; the curve marked "o" is the Taylor expansion at [0, 4); the curve marked "|" is the expansion at [4, 8); the curve marked "*" is the expansion at [8, 15); and the curve marked "x" is the constant 1. Stitching them together, i.e., accumulating them, yields a new function; as FIG. 9 shows, the Taylor-expanded function fits the sigmoid curve well.
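The selector-plus-accumulation scheme can be sketched in Python as follows. This is an illustrative model only: the expansion points (taken here as the segment midpoints) and the coefficients (computed on the fly from the closed-form derivatives of sigmoid) are assumptions standing in for the precomputed constants of Table 3, and the clamping models the two selector chains of equation (1).

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def taylor_coeffs(a):
        # Third-order Taylor coefficients of sigmoid at point a, from
        # s' = s(1-s), s'' = s'(1-2s), s''' = s'(1 - 6s + 6s^2).
        s = sigmoid(a)
        d1 = s * (1 - s)
        return [s, d1, d1 * (1 - 2 * s) / 2.0, d1 * (1 - 6 * s + 6 * s * s) / 6.0]

    def poly(coeffs, a, t):
        return sum(c * (t - a) ** n for n, c in enumerate(coeffs))

    def sel(x, y, z):
        # two-stage selector of equation (1): clamp x into [z, y]
        return max(min(x, y), z)

    SEGMENTS = [(0.0, 4.0), (4.0, 8.0), (8.0, 15.0)]
    POINTS = [2.0, 6.0, 11.5]           # assumed expansion points (midpoints)

    def sigmoid_approx(x):
        t = abs(x)                      # fold onto [0, inf) by symmetry
        y = sigmoid(0.0)                # value at the left end of segment 1
        for (lo, hi), a in zip(SEGMENTS, POINTS):
            c = sel(t, hi, lo)          # each selector chain clamps t into its segment
            k = taylor_coeffs(a)
            y += poly(k, a, c) - poly(k, a, lo)   # accumulate the pieces
        y = min(y, 1.0)                 # beyond 15 the result is taken as 1
        return y if x >= 0 else 1.0 - y # mirror for the second symmetric part

    for x in (1.0, 6.0, 18.0, -2.5):
        print(x, sigmoid_approx(x), sigmoid(x))

Each loop iteration plays the role of one selector chain plus its MAC path; subtracting the polynomial's value at the segment's left endpoint makes the per-segment contributions telescope, which is the accumulation effect shown in FIG. 9.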
In a specific implementation, taking precision loss into account, the segment intervals of the sigmoid function are taken as [0, 4), [4, 8), [8, 15), and [15, ∞). Beyond 15 the result can be taken as 1, with a precision loss of about 10^-7, which is negligible. On [0, 15], piecewise Taylor expansion to the third order is used to obtain the approximating functions. The specific precision losses and Taylor polynomials are given in Table 3; only the interval [0, 30] is shown there, the negative interval following from the central symmetry about x = 0.

Table 3 (rendered as an image in the original) gives the per-segment Taylor polynomials and precision losses for sigmoid.
In a specific implementation, when a PE of the reconfigurable processing array executes an ordinary operation, its inputs are a and b and its output is the function f(a, b) it executes; either a or b can also be selected as the output, the specific choice depending on the positions of a and b in the compilation instructions that configure the PE. The specific operation and output of every PE in the array can therefore be set through configuration.

The sigmoid computation above uses the implementation based on piecewise integration and accumulation and, by the symmetry of the function, is computed in a pipeline; it can be realized with 3 global PEs and 28 processing-unit PEs.
In a specific implementation, the tanh function is

    tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).

As shown in FIG. 10, it is similar to the sigmoid function: monotonically increasing, easy to differentiate, and mapping the input variable into (-1, 1).
The tanh function can be computed in the same way as the sigmoid function, only with different segment intervals, taken here as [0, 1), [1, 2), [2, 4), and [4, ∞). The computation flow of tanh is shown in FIG. 11, and the arrangement of processing units in the array when computing tanh is shown in FIG. 12. The specific precision losses and Taylor polynomials are given in Table 4; only the interval [0, 15] is shown there, the negative interval following from the central symmetry about x = 0.

Table 4 (rendered as an image in the original) gives the per-segment Taylor polynomials and precision losses for tanh.
In a specific implementation, an activation function that includes division is computed on the array through the following steps.

Splitting the activation function into basic operations includes: subtracting the maximum of the input data from the input data of the function to avoid overflow; converting the division in the function into subtraction; and dividing the parameters involved in the computation into different operation terms according to that subtraction.

Implementing the basic operations in sequence through the array, according to their computation order, includes: implementing the operation terms in sequence through the array.
In a specific implementation, softmax is taken as the example of an activation function that includes division; its expression is

    softmax(x_i) = e^(x_i) / Σ_j e^(x_j).

Through anti-overflow processing (replacing the input x with x - x_max), the softmax function becomes

    softmax(x_i) = e^(x_i - x_max) / Σ_j e^(x_j - x_max),

i.e., the input becomes x - x_max, so that the result of the e^x function cannot grow large enough to overflow. Since division is comparatively complex to implement in a circuit, the present invention replaces division with subtraction, reducing the power consumed and the resources used, and thereby improving the speed and efficiency of the computation. Using a logarithmic transformation, softmax becomes

    softmax(x_i) = e^(x_i - x_max - ln(Σ_j e^(x_j - x_max))).

The softmax computation is therefore divided into four parts. The first part is anti-overflow, i.e., solving x - x_max (an operation term). The second part computes e^x (an operation term). The third part accumulates the computed e^x values and computes ln(Σe^x) (an operation term). The fourth part solves e^(x - x_max - ln(Σe^(x - x_max))) (an operation term).
In a specific implementation, to perform the anti-overflow processing, the maximum of the input data is subtracted from the input data; the maximum is found as follows. The input data is divided into several data groups. For each group, a memory-access processing unit reads the input data and an arithmetic processing unit receives it, applies selection operations, and outputs the group maximum; the groups are processed in parallel to obtain each group's maximum. A memory-access processing unit then reads the group maxima, an arithmetic processing unit receives them and applies selection operations, and the largest of the group maxima, i.e., the maximum of the input data, is output.

Concretely, taking the operation term of the first softmax step as an example, with the input data divided into 16 groups, the operations for determining the maximum are listed in Table 5 and can be carried out on the 16 groups in parallel by the comparison operations of the RPU's processing array. As shown in FIG. 13 and FIG. 14, memory-access processing units execute Load operations to read each group's input data from the shared memory; arithmetic processing units execute subtraction and selection operations to pick the maximum within each of the 16 groups; and memory-access processing units write each group maximum into the shared memory. Finally, the 16 group maxima are compared with one another to give the maximum of the input data. Exploiting the RPU's ability to process data in parallel speeds up the processing and improves efficiency.
Table 5

Operation  Meaning
Load       Fetch: read data from memory
Sel        Select: inputs a, b, c; according to the value of a, output either b or c
-          Subtract: inputs a, b; output a - b
Save       Store: write data into memory
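A behavioral Python sketch of this parallel maximum search follows. The group count and the subtract-then-select chain mirror the description above, while the data and helper names are illustrative; on the array the 16 group chains would run concurrently, which is modeled here by a simple loop.

    import numpy as np

    def chain_max(values):
        # one Load/-/Sel chain: repeated subtraction + selection keeps the larger value
        m = values[0]
        for v in values[1:]:
            m = v if (v - m) > 0 else m
        return m

    def group_max(data, num_groups=16):
        groups = [g for g in np.array_split(np.asarray(data, float), num_groups)
                  if len(g)]
        maxima = [chain_max(g) for g in groups]   # independent chains: parallel on the array
        return chain_max(maxima)                  # final reduction of the group maxima

    x = np.random.uniform(-101, 101, size=73)
    assert group_max(x) == x.max()
    print(group_max(x))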
In a specific implementation, for the base-e exponential in an operation term, this embodiment reads the input data through a memory-access processing unit; an arithmetic processing unit subtracts the maximum of the input data from the input; an arithmetic processing unit multiplies the result of the subtraction by log2(e). After the exponential is rewritten with base 2, the product of the multiplication is the input of the base-2 exponential, consisting of an integer part and a fractional part. The base-2 exponential of the fractional part is Taylor-expanded into a polynomial, which the arithmetic processing units evaluate to obtain its output; a shift operation on that output by the integer part then gives the output of the exponential, and the arithmetic processing units accumulate the outputs of the exponential.

Concretely, the operation term of the second softmax step computes e^x. Note that x_max is subtracted from the input data here to prevent overflow. First, by the change-of-base formula, with y_i = x - x_max,

    e^(y_i) = 2^(y_i * log2(e)) = 2^(u_i + v_i),

where u_i is the integer part and v_i the fractional part of the transformed input. By the properties of binary numbers, the expression can be transformed again into

    e^(y_i) = 2^(u_i) * 2^(v_i),

which reduces the range of the data to be expanded to [-1, 0], so that 2^(v_i) can be solved by Taylor expansion:

    2^(v_i) = e^(v_i * ln2) ≈ 1 + (v_i * ln2) + (v_i * ln2)^2 / 2! + (v_i * ln2)^3 / 3!        (7)

Finally, shifting the obtained result by u_i positions gives the value of e^(y_i).
Concretely, the basic operations used in computing e^(y_i) are listed in Table 6 below. As shown in FIG. 15 and FIG. 16, a memory-access processing unit first executes a Load fetch operation to take the input data out of memory, and a subtraction removes the x_max obtained in the previous anti-overflow stage, completing the anti-overflow update of the data. A multiplication then multiplies the anti-overflowed data by log2(e), giving u_i + v_i; an AND operation separates u_i and v_i; u_i is stored, and arithmetic processing units execute multiply-accumulate operations to evaluate the polynomial in v_i according to equation (7). Finally, a fetch operation retrieves u_i, the result of the polynomial computation is shifted accordingly, and the final output is obtained and stored in memory.

All the values of e^x are accumulated by addition to give Σe^x, which is stored in memory for the next part of the computation.
Table 6

Operation  Meaning
Load       Fetch: read data from memory
Sel        Select: inputs a, b, c; according to the value of a, output either b or c
And        AND: inputs a, b; output a & b
>>         Shift: input a; output the shifted a
+          Add: inputs a, b; output a + b
-          Subtract: inputs a, b; output a - b
*          Multiply: inputs a, b; output a * b
MAC        Multiply-accumulate: inputs a, b, c; compute a*b + c
Save       Store: write data into memory
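A Python sketch of this e^x step follows, under stated assumptions: a third-order expansion per equation (7), the integer/fraction split chosen so that v_i falls in (-1, 0], and multiplication by 2^u standing in for the hardware shift (which operates on fixed-point data).

    import math

    LOG2E = math.log2(math.e)
    LN2 = math.log(2.0)

    def exp_shifted(y, order=3):
        # e^y for y = x - x_max <= 0, via e^y = 2^(y*log2 e) = 2^u * 2^v
        t = y * LOG2E
        u = math.ceil(t)                 # integer part (mask/And in hardware)
        v = t - u                        # fractional part, reduced to (-1, 0]
        # Taylor expansion of 2^v = e^(v*ln2), truncated per equation (7)
        p = sum((v * LN2) ** n / math.factorial(n) for n in range(order + 1))
        return p * 2.0 ** u              # shift by u positions

    for y in (0.0, -0.5, -3.2, -10.0):
        print(y, exp_shifted(y), math.exp(y))

Because v_i * ln2 stays within (-0.693, 0], the truncated series keeps the relative error at roughly the percent level, consistent with the error magnitude reported later in the simulation results.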
In a specific implementation, for the base-e logarithm in an operation term, whose input is the accumulation of base-e exponentials, the accumulated value is converted into the product of k and a base-2 exponential with exponent w; an arithmetic processing unit obtains the value of w by a leading-zero operation, and the value of k by a shift operation on the accumulated value; based on the values of w and k, the logarithm is Taylor-expanded into a polynomial, which the arithmetic processing units evaluate to obtain the output of the logarithm.

Concretely, the operation term of the third softmax step accumulates the computed e^x values and computes ln(Σe^x). The accumulation can be carried out concurrently with the second step of the softmax computation: each time a result is produced, it is added into the global register. The central idea for computing ln(Σe^x) is Taylor expansion. Transforming ln(Σe^x) gives

    ln(Σe^x) = ln(2^w * k)        (8)

By the nature of e^x, the value of Σe^x is necessarily positive, so in binary the number is stored in true form. A shift then reduces the data to be computed to the interval [0, 1], so that the Taylor expansion can be evaluated: the value of w is obtained by the leading-zero computation, after which shifting Σe^x gives the value of k. Transforming equation (8) further and Taylor-expanding yields the final computation expression, equation (9):

    ln(Σe^x) = w * ln2 + ln(k),  with ln(k) evaluated by its Taylor expansion.        (9)

The basic operations used in computing ln(Σe^x) are listed in Table 7 below; the computation flow for ln(Σe^x) is shown in FIG. 17, and the arrangement of processing units in the array when computing ln(Σe^x) is shown in FIG. 18.
Table 7

Operation  Meaning
Load       Fetch: read data from memory
Clz        Leading-zero count: count the leading zeros of the input
+          Add: inputs a, b; output a + b
-          Subtract: inputs a, b; output a - b
*          Multiply: inputs a, b; output a * b
MAC        Multiply-accumulate: inputs a, b, c; compute a*b + c
Save       Store: write data into memory
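The ln(Σe^x) step can be sketched in Python as below. math.frexp is used here as a stand-in for the hardware's leading-zero count plus shift — an assumption of this sketch: it yields k in [0.5, 1) and the exponent w with acc = k * 2^w — and the order of the ln(k) expansion is chosen freely.

    import math

    LN2 = math.log(2.0)

    def ln_sum_exp(acc, order=8):
        # ln(acc) for acc = accumulated e^x terms (> 0): acc = 2^w * k
        k, w = math.frexp(acc)          # stand-in for Clz + shift
        z = k - 1.0                     # reduced argument, in [-0.5, 0)
        # Taylor series of ln(1 + z) around 0, evaluated with MAC operations
        ln_k = sum((-1) ** (n + 1) * z ** n / n for n in range(1, order + 1))
        return w * LN2 + ln_k           # equation (9): w*ln2 + ln(k)

    acc = sum(math.exp(v) for v in (-0.1, -2.0, -5.5, 0.0))
    print(ln_sum_exp(acc), math.log(acc))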
In a specific implementation, the operation term of the fourth softmax step solves e^(x - x_max - ln(Σe^(x - x_max))). Since x_max was solved in the first step and ln(Σe^(x - x_max)) in the third step, the quantity to be subtracted is updated to x_max + ln(Σe^(x - x_max)) and fed back into the e^x computation of the second step; the computation flow chart is exactly the same as that of the second step.
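Putting the four parts together, the following sketch chains the helper functions from the preceding sketches (group_max, exp_shifted, ln_sum_exp); it is a behavioral model of the flow above, not the array configuration itself.

    import numpy as np

    def softmax_reconfigurable(x):
        x = np.asarray(x, dtype=float)
        x_max = group_max(x)                       # part 1: anti-overflow
        e = [exp_shifted(v - x_max) for v in x]    # part 2: e^(x - x_max)
        log_sum = ln_sum_exp(sum(e))               # part 3: ln(sum of e^(x - x_max))
        # part 4: update the subtrahend and rerun the e^x step of part 2
        return np.array([exp_shifted(v - (x_max + log_sum)) for v in x])

    x = np.random.uniform(-101, 101, size=10)
    ref = np.exp(x - x.max())
    ref /= ref.sum()
    print(np.max(np.abs(softmax_reconfigurable(x) - ref)))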
In a specific implementation, while the basic operations are implemented in sequence through the reconfigurable processing array, whenever an arithmetic processing unit needs to exchange data with a processing unit that is not in its own row or column, either a processing unit that has a data-transfer interconnection with it performs a routing operation, so that the arithmetic processing unit can exchange data with the processing unit outside its row or column; or the data of the arithmetic processing unit is output to the global register for storage, from which processing units outside its row or column can read the data.
In a specific implementation, the methods above for computing multiple activation functions on the reconfigurable processor were simulated and tested in the Python language, with the input data drawn as random numbers in (-101, 101), the number of inputs drawn as a random number in (1, 100), and 100 rounds of testing. According to the final simulation results, the maximum error is about 0.01, a precision of 6 to 7 binary decimal places. The precision can be raised by increasing the order of the Taylor expansions; here, to reduce power consumption, the expansion order was not increased further.
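A harness in the spirit of that simulation is sketched below, reusing sigmoid_approx from the earlier sketch. The sampling details (uniform draws, fixed seed) are assumptions, and the error it reports depends on the expansion points and order chosen there, so it will not reproduce the 0.01 figure exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    max_err = 0.0
    for _ in range(100):                     # 100 rounds
        n = int(rng.integers(2, 100))        # input count drawn from (1, 100)
        for x in rng.uniform(-101, 101, n):  # inputs drawn from (-101, 101)
            ref = 1.0 / (1.0 + np.exp(-x))
            max_err = max(max_err, abs(sigmoid_approx(x) - ref))
    print("max error:", max_err)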
The methods above realize the computation of neural network activation functions on the reconfigurable architecture mainly by means of Taylor expansion. In the softmax computation, subtraction replaces division, and the change-of-base formula combined with shifting replaces e^x, which reduces the coefficients that must be stored and the computation time, further cutting the hardware-resource overhead and hence the area and power consumption.

In addition, the methods are flexible: the expansion order can be customized for the application to meet the needs of data of various precisions, striking a good balance among power consumption, computational efficiency, and accuracy.
Based on the same inventive concept, an embodiment of the present invention further provides a reconfigurable processor for computing multiple neural network activation functions, as described in the following embodiment. Since the principle by which this reconfigurable processor solves the problem is similar to that of the methods above for computing multiple activation functions on a reconfigurable processor, its implementation may refer to the implementation of those methods, and the repeated parts are not described again. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiment is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and conceivable.
FIG. 19 is a structural block diagram of a reconfigurable processor for computing multiple neural network activation functions according to an embodiment of the present invention. As shown in FIG. 19, it includes:

a shared memory 1902 for storing input data; and

a reconfigurable processing array 1904 for reading the input data from the shared memory, according to the computation order of the basic operations into which the neural network activation function has been split, to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the array are used to perform memory-access operations (memory-access processing units); the other processing units are used to perform arithmetic operations (arithmetic processing units); the edge processing units exchange data with the arithmetic processing units in their own row or column; and each processing unit of the array exchanges data with the adjacent processing units present above, below, to its left, and to its right.
In another embodiment, software is further provided for executing the technical solutions described in the above embodiment and its preferred implementations.

In another embodiment, a storage medium is further provided in which the above software is stored; the storage medium includes but is not limited to an optical disk, a floppy disk, a hard disk, a rewritable memory, and the like.
The embodiments of the present invention achieve the following technical effects: the neural network activation function is split into basic operations, and the reconfigurable processing array reads input data from the shared memory, following the computation order of those operations, to implement them in sequence. The activation function is thus computed on an existing reconfigurable processing array without changing the array structure or adding circuitry to it: different processing units are configured according to the algorithmic requirements of different activation functions, so that complex activation functions are realized on the array from basic operations such as addition, subtraction, multiplication, and shifting. This simplifies the circuit design of the activation-function computation, improves circuit speed and throughput, and, since the operation algorithms of the processing units are flexibly configurable and pipelined input/output is used, accommodates varying activation functions, provides scalability, and improves processing-unit utilization.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that every flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data-processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data-processing device to operate in a particular manner, such that the instructions stored in that memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data-processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing, the instructions executed on the computer or other programmable device providing steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The specific embodiments above further describe the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that they are merely specific embodiments of the invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

  1. A method for computing multiple neural network activation functions on a reconfigurable processor, characterized by comprising:
    splitting a neural network activation function into basic operations;
    according to the computation order of the basic operations in the neural network activation function, reading input data from a shared memory through the reconfigurable processing array of the reconfigurable processor to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the reconfigurable processing array are used to perform memory-access operations and are called memory-access processing units; the processing units of the array other than those on the surrounding edges are used to perform arithmetic operations and are called arithmetic processing units; the processing units on the surrounding edges exchange data with the arithmetic processing units in their own row or column; and each processing unit of the array exchanges data with the adjacent processing units present above, below, to its left, and to its right.
  2. The method for computing multiple neural network activation functions on a reconfigurable processor of claim 1, wherein the basic operations comprise: addition, subtraction, multiplication, multiply-accumulate, and selection operations.
  3. The method for computing multiple neural network activation functions on a reconfigurable processor of claim 1, wherein
    splitting the neural network activation function into basic operations comprises:
    for a linearly piecewise activation function, splitting the function into selection operations;
    and implementing the basic operations in sequence through the reconfigurable processing array, according to their computation order, comprises:
    reading the input data from the shared memory through several memory-access processing units of the array, each memory-access processing unit passing the input data to an arithmetic processing unit in its own row or column for the selection operation; the arithmetic processing unit passing the result of the selection operation to a memory-access processing unit in its own row or column, which stores the result in the shared memory, wherein the memory-access processing units that read input data and those that store results are different memory-access processing units, and the results output by different arithmetic processing units are passed to different memory-access processing units.
  4. The method for computing multiple neural network activation functions on a reconfigurable processor of claim 1, wherein
    splitting the neural network activation function into basic operations comprises:
    for an activation function that is symmetric and can be fitted by piecewise Taylor expansion, splitting the function according to its symmetry into a first symmetric part and a second symmetric part; dividing the input data of the first symmetric part into several data segments; splitting the computation of each segment into subtraction, selection, and multiply-accumulate operations; adding the multiply-accumulate results of the segments; comparing the accumulated result with the output maximum of the first symmetric part and applying a selection operation to obtain the output data of the first symmetric part; and subtracting the output data of the first symmetric part from the output maximum of the first symmetric part and applying a selection operation to obtain the output data of the second symmetric part;
    and implementing the basic operations in sequence through the reconfigurable processing array, according to their computation order, comprises:
    one memory-access processing unit of the array reading one value of each data segment at a time from the shared memory; several arithmetic processing units subtracting the endpoint values of the data segments from the value read; several arithmetic processing units forming a first-stage selector, each unit of which corresponds to one data segment and, based on the subtraction result, outputs the minimum of the value read and the maximum of its segment; several arithmetic processing units forming a second-stage selector, each unit of which corresponds to the previous data segment, the first unit outputting the output of the first unit of the first-stage selector and each other unit outputting the maximum of the corresponding first-stage output and the maximum of the previous data segment; arithmetic processing units applying multiply-accumulate operations to the outputs of the second-stage selector and adding the results of the multiply-accumulate operations; an arithmetic processing unit subtracting the sum from the output maximum of the first symmetric part and applying a selection operation to obtain the output data of the first symmetric part; and an arithmetic processing unit subtracting the output data of the first symmetric part from the output maximum of the first symmetric part and applying a selection operation to obtain the output data of the second symmetric part.
  5. The method for computing multiple neural network activation functions on a reconfigurable processor of any one of claims 1 to 4, wherein
    splitting the neural network activation function into basic operations comprises:
    for an activation function that includes exponential accumulation and exponential division, subtracting the maximum of the input data from the input data of the function to prevent overflow, converting the division in the function into subtraction, and dividing the parameters involved in the computation into different operation terms according to that subtraction;
    and implementing the basic operations in sequence through the reconfigurable processing array, according to their computation order, comprises:
    implementing the operation terms in sequence through the reconfigurable processing array.
  6. The method for computing multiple neural network activation functions on a reconfigurable processor of claim 5, wherein implementing the operation terms in sequence through the reconfigurable processing array comprises:
    dividing the input data into several data groups; for each group, reading the input data through a memory-access processing unit, receiving the input data through an arithmetic processing unit that applies selection operations to it, and outputting the group maximum; processing the groups in parallel to obtain each group's maximum; then reading the group maxima through a memory-access processing unit, receiving them through an arithmetic processing unit that applies selection operations to the received data, and outputting the largest of the group maxima, thereby obtaining the maximum of the input data.
  7. The method for computing multiple neural network activation functions on a reconfigurable processor of claim 5, wherein implementing the operation terms in sequence through the reconfigurable processing array comprises:
    for the base-e exponential in an operation term, reading the input data through a memory-access processing unit; subtracting the maximum of the input data from the input through an arithmetic processing unit; multiplying the result of the subtraction by log2(e) through an arithmetic processing unit; after the exponential is rewritten with base 2, taking the result of the multiplication as the input of the base-2 exponential, the input consisting of an integer part and a fractional part; Taylor-expanding the base-2 exponential of the fractional part into a polynomial and evaluating the polynomial through the arithmetic processing units to obtain the output of the base-2 exponential of the fractional part; performing a shift operation on that output by the integer part through an arithmetic processing unit to obtain the output of the exponential; and accumulating the outputs of the exponential through an arithmetic processing unit.
  8. The method for computing multiple neural network activation functions on a reconfigurable processor of claim 7, wherein implementing the operation terms in sequence through the reconfigurable processing array comprises:
    for the base-e logarithm in an operation term, whose input is the accumulation of base-e exponentials, converting the accumulated value into the product of k and a base-2 exponential with exponent w; obtaining the value of w through a leading-zero operation in an arithmetic processing unit; obtaining the value of k by a shift operation on the accumulated base-e exponentials; Taylor-expanding the logarithm into a polynomial based on the values of w and k; and evaluating the polynomial through the arithmetic processing units to obtain the output of the logarithm.
  9. The method for computing multiple neural network activation functions on a reconfigurable processor of claim 5, wherein implementing the basic operations in sequence through the reconfigurable processing array comprises:
    while the basic operations are implemented in sequence through the array, whenever an arithmetic processing unit needs to exchange data with a processing unit that is not in its own row or column, performing a routing operation through a processing unit that has a data-transfer interconnection with the arithmetic processing unit, so that the arithmetic processing unit can exchange data with the processing unit outside its row or column; or outputting the data of the arithmetic processing unit to a global register for storage, from which processing units outside its row or column read the data.
  10. A reconfigurable processor for computing multiple neural network activation functions, characterized by comprising:
    a shared memory for storing input data; and
    a reconfigurable processing array for reading the input data from the shared memory, according to the computation order of the basic operations into which the neural network activation function has been split, to implement the basic operations in sequence, wherein the processing units on the surrounding edges of the array are used to perform memory-access operations and are called memory-access processing units; the processing units of the array other than those on the surrounding edges are used to perform arithmetic operations and are called arithmetic processing units; the processing units on the surrounding edges exchange data with the arithmetic processing units in their own row or column; and each processing unit of the array exchanges data with the adjacent processing units present above, below, to its left, and to its right.
PCT/CN2020/137702 2020-12-18 2020-12-18 Reconfigurable processor and method for computing multiple neural network activation functions thereon WO2022126630A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/137702 WO2022126630A1 (zh) 2020-12-18 2020-12-18 Reconfigurable processor and method for computing multiple neural network activation functions thereon

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/137702 WO2022126630A1 (zh) 2020-12-18 2020-12-18 Reconfigurable processor and method for computing multiple neural network activation functions thereon

Publications (1)

Publication Number Publication Date
WO2022126630A1 true WO2022126630A1 (zh) 2022-06-23

Family

ID=82058843

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137702 WO2022126630A1 (zh) Reconfigurable processor and method for computing multiple neural network activation functions thereon 2020-12-18 2020-12-18

Country Status (1)

Country Link
WO (1) WO2022126630A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647773A * 2018-04-20 2018-10-12 复旦大学 Hardware interconnection architecture for a reconfigurable convolutional neural network
CN109472356A * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 Acceleration device and method for reconfigurable neural network algorithms
US20190212981A1 (en) * 2018-01-09 2019-07-11 Samsung Electronics Co., Ltd. Neural network processing unit including approximate multiplier and system on chip including the same
US20190272460A1 (en) * 2018-03-05 2019-09-05 Ye Tao Configurable neural network processor for machine learning workloads
CN110516801A * 2019-08-05 2019-11-29 西安交通大学 High-throughput dynamically reconfigurable convolutional neural network accelerator architecture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190212981A1 (en) * 2018-01-09 2019-07-11 Samsung Electronics Co., Ltd. Neural network processing unit including approximate multiplier and system on chip including the same
US20190272460A1 (en) * 2018-03-05 2019-09-05 Ye Tao Configurable neural network processor for machine learning workloads
CN108647773A * 2018-04-20 2018-10-12 复旦大学 Hardware interconnection architecture for a reconfigurable convolutional neural network
CN109472356A * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 Acceleration device and method for reconfigurable neural network algorithms
CN110516801A * 2019-08-05 2019-11-29 西安交通大学 High-throughput dynamically reconfigurable convolutional neural network accelerator architecture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SU CHAOYANG, ET AL.: "Implementation of Configurable Activation Function Module Based on Neural Network ", MICROCONTROLLERS & EMBEDDED SYSTEMS= DANPIANJI YU QIANRUSHI XITONG YINGYONG, BEIJING HANGKONG HANGTIAN DAXUE, CN, no. 4, 30 April 2020 (2020-04-30), CN , pages 6 - 9, XP055942870, ISSN: 1009-623X *

Similar Documents

Publication Publication Date Title
JP7405493B2 (ja) Deep neural network architecture using piecewise linear approximation
US10642578B2 (en) Approximating functions
US11106431B2 (en) Apparatus and method of fast floating-point adder tree for neural networks
US20230221924A1 (en) Apparatus and Method for Processing Floating-Point Numbers
CN113590195B (zh) Compute-in-memory DRAM computing component supporting floating-point multiply-add
Roohi et al. Rnsim: Efficient deep neural network accelerator using residue number systems
CN113887710A (zh) Number format selection in recurrent neural networks
Li et al. An efficient hardware architecture for activation function in deep learning processor
CN112445454A (zh) System for performing unary functions using range-specific coefficient-set fields
US20210034327A1 (en) Apparatus and Method for Processing Floating-Point Numbers
WO2022126630A1 (zh) Reconfigurable processor and method for computing multiple neural network activation functions thereon
CN111178492B (zh) Computing device, related products, and computing method for executing an artificial neural network model
Parameswaran et al. Design and investigation of low-complexity Anurupyena Vedic multiplier for machine learning applications
CN112540946B (zh) Reconfigurable processor and method for computing multiple neural network activation functions thereon
WO2022164678A1 (en) Digital circuitry for normalization functions
CN112540946A (zh) Reconfigurable processor and method for computing multiple neural network activation functions thereon
WO2020008642A1 (ja) Learning device, learning circuit, learning method, and learning program
WO2019127480A1 (zh) Method, device, and computer-readable storage medium for processing numerical data
Esmali Nojehdeh et al. Energy-Efficient Hardware Implementation of Fully Connected Artificial Neural Networks Using Approximate Arithmetic Blocks
Zhang et al. A High Energy Efficiency and Low Resource Consumption FPGA Accelerator for Convolutional Neural Network
US20230100785A1 (en) Priority encoder-based techniques for computing the minimum or the maximum of multiple values
Jeon et al. M3FPU: Multiformat Matrix Multiplication FPU Architectures for Neural Network Computations
US20230205957A1 (en) Information processing circuit and method for designing information processing circuit
US20210264273A1 (en) Neural network processor
US20240111525A1 (en) Multiplication hardware block with adaptive fidelity control system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20965644

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20965644

Country of ref document: EP

Kind code of ref document: A1