CN110807522A - General calculation circuit of neural network accelerator

General calculation circuit of neural network accelerator

Info

Publication number
CN110807522A
Authority
CN
China
Prior art keywords
cascade
adder
general
output
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911055499.5A
Other languages
Chinese (zh)
Other versions
CN110807522B (en)
Inventor
杜高明
任宇翔
曹红芳
张多利
田超
宋宇鲲
李桢旻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN201911055499.5A
Publication of CN110807522A
Application granted
Publication of CN110807522B
Legal status: Active

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a general computation circuit of a neural network accelerator, which consists of m general computation modules PE, where any i-th general computation module PE consists of a RAM, 2^n multipliers, an adder tree, a cascade adder, a bias adder, a first-in first-out queue, and a ReLu activation function module. Different computing circuits of the neural network are built using the single-PE convolution, cascaded-PE convolution, single-PE fully connected, and cascaded-PE fully connected configurations. The invention can configure the general computing circuit according to the variables of the neural network accelerator, so a neural network can be built or modified more simply, conveniently, and quickly, the inference time of the neural network is shortened, and the hardware development time of related deep-learning research is reduced.

Description

General calculation circuit of neural network accelerator
Technical Field
The invention belongs to the technical field of Field Programmable Gate Array (FPGA) design of integrated circuits, and particularly relates to a general computation module circuit of a neural network accelerator.
Background
In 2012, AlexNet won the large-scale visual recognition challenge, and deep neural networks again became a research hotspot. Research on convolutional neural networks in particular has received more and more attention, and they are widely applied in fields such as digital video surveillance, face recognition, and image classification. The learning process of a convolutional neural network involves a large number of iterative operations and data reads, and a CPU, with its limited number of cores, cannot fully exploit the parallelism present in the neural network. To increase the computation speed of convolutional neural networks, researchers have proposed hardware architectures based on GPUs, FPGAs, and ASICs, among which GPU-based development has been widely applied in many fields. Among these platforms, the FPGA suits computationally intensive workloads: it provides many dedicated arithmetic units, logic resources, and storage resources on chip, so the computation units of a convolutional neural network can execute in parallel, which makes the FPGA very suitable as a hardware accelerator for convolutional neural networks. Moreover, the FPGA is flexible and efficient, its power consumption is much lower than that of a GPU, and its chip size and cost are smaller than those of an ASIC, so it can conveniently be applied in electronic products that need online image or sound processing at any time, such as financial prediction, artificial-intelligence robots, and medical diagnosis. FPGAs are also flexible to program, products based on them are easy to upgrade and maintain, and design cycles and time to market are relatively short. Nevertheless, research on accelerating convolutional neural networks on FPGA platforms is still at an early stage and has not been widely applied in commercial fields.
Although current FPGA platforms can support convolutional neural network development, they also have limitations:
1) When an FPGA platform uses a hardware description language to develop a convolutional neural network, modular hardware circuit design is lacking, debugging is cumbersome, and the hardware development cycle of the convolutional neural network is long;
2) Because the traditional FPGA development flow describes circuit behavior in a hardware description language, designing and building a neural network requires accounting for its many variables, such as the convolution kernel size, the number of feature maps, the number of convolutional layers, the number of fully connected layers, and the network output categories. The hardware designs of basic components such as convolutional layers, pooling layers, fully connected layers, and activation function layers are relatively rigid and inflexible; if any variable of the neural network changes, the underlying circuit behavior must be described again in the hardware description language, so generality is poor.
Disclosure of the Invention
In order to overcome the defects in the prior art, the invention provides a general computing circuit of a neural network accelerator, which aims to improve the universality and flexibility of a computing module PE, thereby improving the performance of the neural network accelerator and reducing the hardware development time.
The technical scheme adopted by the invention to achieve the aim is as follows:
the general calculation circuit of the neural network accelerator is characterized by consisting of m general calculation modules PE, wherein any ith general calculation module PE consists of RAM and 2nThe system comprises a multiplier, an adder tree, a cascade adder, an offset adder, a first-in first-out queue and a ReLu activation function module;
at the current cycle, 2nThe multiplier acquires the stored weight data from the RAM, receives and processes externally input calculation data to obtain 2 in the current periodnPassing the product to the adder tree;
the adder tree pair is 2 in the current cyclenThe products are accumulated to obtain the accumulated sum in the current period and then stored in the first-in first-outIn a queue;
the first-in first-out queue reads the accumulated sum in the current period and transmits the accumulated sum to the cascade adder;
the cascade adder receives the accumulated sum in the current period and calculates the accumulated sum with cascade inputs in different configurations to obtain the cascade output of the ith cascade adder in the current period;
the offset adder receives the cascade output of the ith cascade adder in the current period, calculates the cascade output with the offset data input externally in the current period, obtains an addition result and transmits the addition result to the ReLu activation function module;
and processing the addition result by the ReLu activation function module to obtain the output result of the ith general computation module PE in the current period and the output result of the general computation circuit in different configurations.
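To make the datapath concrete, the per-cycle behavior of one general computation module PE can be modeled in software. The following is a minimal sketch, not the patented RTL: the names `relu` and `pe_cycle` are illustrative, and the first-in first-out queue between the adder tree and the cascade adder is collapsed into a plain value for clarity.

```python
def relu(x):
    """ReLu activation function module: negative sums are clamped to 0."""
    return x if x > 0 else 0

def pe_cycle(weights, inputs, cascade_in, bias):
    """Model one cycle of a general computation module PE with 2^n multipliers.

    weights    -- 2^n weight values read from the PE's RAM
    inputs     -- 2^n externally input calculation data
    cascade_in -- cascade input of the cascade adder (configuration-dependent)
    bias       -- externally input bias data
    Returns (cascade_out, pe_output); FIFO buffering is omitted.
    """
    products = [w * x for w, x in zip(weights, inputs)]   # 2^n multipliers
    accumulated_sum = sum(products)                       # adder tree
    cascade_out = accumulated_sum + cascade_in            # cascade adder
    pe_output = relu(cascade_out + bias)                  # bias adder + ReLu
    return cascade_out, pe_output
```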
The general computation circuit of the neural network accelerator is also characterized in that the different configurations are selected according to the following steps (a sketch of this selection logic follows the steps):
Step 1, judge whether the convolution kernel size in the neural network is smaller than the number of multipliers, 2^n; if so, execute the single-PE convolution configuration; otherwise, execute the cascaded-PE convolution configuration;
Step 2, judge whether the number of input feature maps of the fully connected layer in the neural network is smaller than the number of multipliers, 2^n; if so, execute the single-PE fully connected configuration; otherwise, execute the cascaded-PE fully connected configuration.
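As a minimal illustration only (the function name `select_config` and its string labels are not from the specification), the two judgments can be written as:

```python
def select_config(layer_type, size, num_multipliers):
    """Pick a PE configuration following the two steps above.

    layer_type      -- "convolution" or "fully_connected"
    size            -- convolution kernel size (number of weights), or the
                       number of input feature maps of a fully connected layer
    num_multipliers -- the 2**n multipliers of one general computation module PE
    """
    if layer_type == "convolution":
        return ("single-PE convolution" if size < num_multipliers
                else "cascaded-PE convolution")
    return ("single-PE fully connected" if size < num_multipliers
            else "cascaded-PE fully connected")
```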
The single-PE convolution configuration is:
setting the cascade input of the cascade adder in the i-th general computation module PE to 0;
and taking the output results of the m general computation modules PE as the output results of the general computation circuit.
The cascaded-PE convolution configuration is:
taking the cascade output of the cascade adder in the (i-1)-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
when i = 1, setting the cascade input of the cascade adder in the i-th general computation module PE to 0;
and taking the output result of the m-th general computation module PE as the output result of the general computation circuit.
The single-PE fully connected configuration is:
taking the cascade output of the cascade adder in the i-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE in the current cycle;
and taking the output results of the m general computation modules PE as the output results of the general computation circuit.
The cascaded-PE fully connected configuration is:
taking the cascade output of the cascade adder in the (i-1)-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
when i = 1, taking the cascade output of the cascade adder in the m-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
and taking the output result of the m-th general computation module PE as the output result of the general computation circuit.
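The cascade-input wiring of the four configurations above can be condensed into one selection function. This is a sketch under stated assumptions: PEs are 1-indexed, and `prev_cascade_out` is a hypothetical mapping from a PE index to that PE's cascade output in the previous cycle.

```python
def cascade_input(config, i, m, prev_cascade_out):
    """Cascade input of the cascade adder in the i-th of m PEs."""
    if config == "single-PE convolution":
        return 0                                   # always zero
    if config == "cascaded-PE convolution":
        return 0 if i == 1 else prev_cascade_out[i - 1]
    if config == "single-PE fully connected":
        return prev_cascade_out[i]                 # self-feedback across cycles
    if config == "cascaded-PE fully connected":    # ring: PE m feeds PE 1
        return prev_cascade_out[m] if i == 1 else prev_cascade_out[i - 1]
    raise ValueError(config)
```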
Compared with the prior art, the beneficial technical effects of the invention are as follows:
the invention fully utilizes the general characteristics of convolution neural network convolution calculation and full-connection calculation, configures a general calculation circuit according to variables of a neural network accelerator, determines to adopt single PE convolution configuration or cascade PE convolution configuration to build a calculation circuit of the neural network by judging whether the size of convolution kernels in the built neural network is smaller than the number of multipliers of a general calculation module PE, determines to adopt single PE full-connection configuration or cascade PE full-connection configuration to build a calculation circuit of the neural network by judging whether the number of input characteristic graphs of a full-connection layer in the built neural network is smaller than the number of multipliers of the general calculation module PE, thereby being applicable to convolution calculation and full-connection calculation under various conditions, and the configuration method supports various combination variable variables of the built neural network, when one variable in the neural network changes in the design process, the rewriting module of the calculation module is not required to be overturned, and only the general calculation module is required to be reconfigured, so that the circuit design scheme is simplified, the reasoning time of the neural network is shortened, and the complexity of a circuit of a convolutional neural network system and the hardware development time of the convolutional neural network are reduced;
drawings
FIG. 1 is a block diagram of a general computing module PE hardware circuit according to the present invention;
FIG. 2 is a diagram of a convolutional neural network;
FIG. 3 is a diagram of a single PE convolution arrangement in accordance with the present invention;
FIG. 4 is a diagram of a cascaded PE convolution arrangement in accordance with the present invention;
FIG. 5 is a diagram of a single PE full connection configuration according to the present invention;
fig. 6 is a diagram of the configuration of the cascaded PE full connection according to the present invention.
Detailed Description
In this embodiment, as shown in FIG. 1, a general computation circuit of a neural network accelerator is composed of m general computation modules PE, where any general computation module PE consists of a RAM, 2^n multipliers, an adder tree, a cascade adder, a bias adder, a first-in first-out queue, and a ReLu activation function module; in this embodiment, n = 2, giving 4 multipliers.
In the current cycle, the 4 multipliers obtain the stored weight data from the RAM, receive and process externally input calculation data, obtain 4 products in the current cycle, and transmit the products to the adder tree;
the adder tree accumulates the 4 products of the current cycle to obtain the accumulated sum of the current cycle, which is then stored in the first-in first-out queue;
the first-in first-out queue reads out the accumulated sum of the current cycle and transmits it to the cascade adder; the first-in first-out queue acts as a data buffer;
the cascade adder receives the accumulated sum of the current cycle and adds it to the cascade input, which differs between configurations, to obtain the cascade output of the cascade adder in the current cycle;
the bias adder receives the cascade output of the cascade adder in the current cycle, adds it to the externally input bias data of the current cycle, and transmits the addition result to the ReLu activation function module;
and the ReLu activation function module processes the addition result to obtain the output result of the general computation module PE in the current cycle.
As shown in FIG. 2, the convolutional neural network is composed of convolutional layers, pooling layers, activation functions, and fully connected layers; the different configurations of the general computation module PE are suitable for convolution and fully connected computation under various conditions; the different configurations are selected according to the following steps:
Step 1, judge whether the convolution kernel size in the neural network is smaller than the number of multipliers, 4; if so, execute the single-PE convolution configuration; otherwise, execute the cascaded-PE convolution configuration;
Step 2, judge whether the number of input feature maps of the fully connected layer in the neural network is smaller than the number of multipliers, 4; if so, execute the single-PE fully connected configuration; otherwise, execute the cascaded-PE fully connected configuration.
As shown in FIG. 3, the single-PE convolution configuration is suitable for the case where the convolution kernel size in the neural network is smaller than the number of multipliers in a general computation module PE, for example convolution with a 2 × 2 kernel; the general computation circuit in FIG. 3 is a single general computation module PE;
The single-PE convolution configuration is:
setting the cascade input of the cascade adder in the general computation module PE to 0;
taking the output result of the general computation module PE as the output result of the general computation circuit;
the calculation of the single PE convolution configuration is performed as follows:
step 1, defining 4 input data as pixel values of 4 points of an input characteristic diagram, wherein the pixel values are 1, 2, 3 and 4 respectively; defining 4 weight values of 4 points of 4 weight data stored in the RAM from 2 x 2 convolution kernels, wherein the weight values are 1, 1, 1 and 1 respectively; defining bias data input externally as 5; defining the cascade input of the cascade adder to be 0;
step 2, 4 multipliers obtain the stored weight data from the RAM, receive and process the externally input calculation data, transmit the obtained 4 products to the adder tree, wherein the products are 1 × 1, 1 × 2, 1 × 3, 1 × 4; the calculation process corresponds to a matrix inner product multiplication step in the convolution calculation process under the condition that the convolution kernel is 2 x 2;
step 3, the adder tree carries out accumulation processing on the 4 products to obtain an accumulated sum, and then the accumulated sum is stored in a first-in first-out queue, wherein the accumulated sum is (1 × 1+1 × 2+1 × 3+1 × 4); the calculation process corresponds to a full addition step in the convolution calculation process under the condition that the convolution kernel is 2 x 2;
step 4, the cascade adder receives the accumulated sum and calculates with the cascade input, because the cascade input of the cascade adder is set to '0', the cascade output of the cascade adder is still equal to the accumulated sum, and the cascade output of the cascade adder is (1 × 1+1 × 2+1 × 3+1 × 4);
step 5, the offset adder receives the cascade output of the cascade adder, calculates the cascade output and offset data input from outside, and transmits the addition result to the ReLu activation function module after obtaining the addition result, wherein the addition result is (1 × 1+1 × 2+1 × 3+1 × 4) + 5; the calculation process corresponds to the biasing step in the convolution calculation process under the condition that the convolution kernel is 2 x 2;
step 6, processing the addition result by the ReLu activation function module to obtain an output result of the general computation module PE as an output result of the general computation circuit; the output result of the general calculation circuit is equivalent to the result obtained by one 2 x 2 convolution kernel convolution calculation.
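Numerically, this worked example can be replayed with the `pe_cycle` sketch given earlier (inputs 1 to 4, unit weights, bias 5, cascade input 0):

```python
cascade_out, result = pe_cycle(weights=[1, 1, 1, 1],
                               inputs=[1, 2, 3, 4],
                               cascade_in=0,
                               bias=5)
assert cascade_out == 10   # 1*1 + 1*2 + 1*3 + 1*4
assert result == 15        # ReLu(10 + 5), one 2 x 2 convolution result
```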
In this embodiment, as shown in FIG. 4, the cascaded-PE convolution configuration is suitable for the case where the convolution kernel size in the neural network is larger than the number of multipliers in a general computation module PE, for example convolution with a 3 × 3 kernel; the general computation circuit in FIG. 4 is a cascade of 3 general computation modules PE;
The cascaded-PE convolution configuration is:
taking the cascade output of the cascade adder in the 1st general computation module PE in the previous cycle as the cascade input of the cascade adder in the 2nd general computation module PE in the current cycle, and taking the cascade output of the cascade adder in the 2nd general computation module PE in the current cycle as the cascade input of the cascade adder in the 3rd general computation module PE in the next cycle;
setting the cascade input of the cascade adder in the 1st general computation module PE to '0';
taking the output result of the 3rd general computation module PE as the output result of the general computation circuit;
the calculation of the convolution configuration of the cascaded PE is carried out according to the following steps:
step 1, defining 9 input data as pixel values of 9 points of an input characteristic diagram, wherein the pixel values are 1, 2, 3, 4, 5, 6, 7, 8 and 9 respectively, a general calculation circuit is a cascade of 3 general calculation modules PE, total 12 multipliers exist, only 9 multipliers are used in the calculation process, unused multipliers are used, and the input of the multipliers is regarded as 0; defining the weight values of 9 points of 9 weight data written in the RAM from a 3 × 3 convolution kernel, wherein the weight values are 1, 1, 1, 1, 1, 1, 1, 1; defining bias data input externally as 5; defining the cascade input of the 1 st cascade adder to be 0;
step 2, 9 multipliers acquire the stored weight data from the RAM, receive and process externally input calculation data, and transmit 9 products to the adder tree, wherein the products are specifically 1 × 1, 1 × 2, 1 × 3, 1 × 4, 1 × 5, 1 × 6, 1 × 7, 1 × 8 and 1 × 9; the calculation process corresponds to a matrix inner product multiplication step in the convolution calculation process under the condition that the convolution kernel is 3 x 3;
step 3, the adder tree of each general computation module PE performs an accumulation process on the 4 products, and the obtained accumulated sum is stored in a first-in first-out queue, where the accumulated sum is (1 × 1+1 × 2+1 × 3+1 × 4), (1 × 5+1 × 6+ 1+ 7+1 × 8) and 1 × 9;
step 4, in fig. 4, the first fifo queue accumulates, reads and transmits to the first cascade adder in the last cycle, the second fifo queue accumulates, reads and transmits to the second cascade adder in the current cycle, and the third fifo queue accumulates, reads and transmits to the third cascade adder in the next cycle;
step 5, in the previous cycle, the first cascade adder receives the accumulated sum and calculates with the first cascade input, and because the cascade input of the first cascade adder is set to '0', the obtained cascade output of the first cascade adder is still equal to the accumulated sum, and the cascade output of the first cascade adder is specifically (1 × 1+1 × 2+1 × 3+1 × 4); the cascade output of the cascade adder in the first general computation module PE in the previous period is used as the cascade input of the cascade adder in the second general computation module PE in the current period; during the current cycle, the second cascade adder receives the accumulated sum and performs calculation with the second cascade input, and since the cascade input of the second cascade adder is the cascade output of the cascade adder in the first general calculation module PE of the previous cycle, the cascade output of the second cascade adder is obtained, specifically, (1 × 1+1 × 2+1 × 3+1 × 4+1 × 5+ 1+ 6+1 × 7+1 × 8); the cascade output of the cascade adder in the second general computation module PE of the current period is used as the cascade input of the cascade adder in the third general computation module PE of the next period; in the next cycle, the third cascade adder receives the accumulated sum and performs calculation with the third cascade input, and since the cascade input of the third cascade adder is the cascade output of the cascade adder in the second general calculation module PE in the current cycle, the cascade output of the third cascade adder is obtained, specifically, (1 × 1+1 × 2+1 × 3+1 × 4+1 × 5+ 1+ 6+1 × 7+1 × 8+1 × 9); the calculation process corresponds to the full addition step in the convolution calculation process under the condition that the convolution kernel is 3 x 3
Step 6, the third offset adder receives the cascade output of the third cascade adder, calculates with the offset data input from outside, and transmits the addition result to the ReLu activation function module, wherein the addition result is (1 × 1+1 × 2+1 × 3+1 × 4+1 × 5+1 × 6+1 × 7+1 × 8+1 × 9) + 5; the calculation process corresponds to the biasing step in the convolution calculation process under the condition that the convolution kernel is 3 x 3;
step 7, the ReLu activation function module processes the addition result, and the output result of the third general computation module PE is used as the output result of the general computation circuit; the output result of the general calculation circuit is equivalent to the result obtained by one convolution calculation of 3 × 3 convolution kernel.
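The same `pe_cycle` sketch can replay this cascaded example if the cycle-by-cycle staggering of the pipeline is collapsed into a simple loop; the 4 + 4 + 1 split of the nine products across the three PEs matches step 1 above:

```python
inputs  = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # pixel values under the 3 x 3 kernel
weights = [1] * 9
bias = 5

# 3 cascaded PEs with 4 multipliers each; unused multiplier inputs are 0.
stages = [(weights[0:4], inputs[0:4]),
          (weights[4:8], inputs[4:8]),
          (weights[8:9] + [0] * 3, inputs[8:9] + [0] * 3)]

cascade = 0                               # cascade input of the 1st PE
for w, x in stages:                       # PE1 -> PE2 -> PE3, one cycle apart
    cascade, output = pe_cycle(w, x, cascade, bias)

assert cascade == 45                      # 1 + 2 + ... + 9
assert output == 50                       # ReLu(45 + 5), the 3rd PE's output
```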
In this embodiment, as shown in FIG. 5, the single-PE fully connected configuration is suitable for the case where the number of input feature maps of the fully connected layer in the neural network is smaller than the number of multipliers in a general computation module PE; the general computation circuit in FIG. 5 is a single general computation module PE;
The single-PE fully connected configuration is:
taking the cascade output of the cascade adder in the general computation module PE in the previous cycle as the cascade input of the cascade adder in the general computation module PE in the current cycle;
and taking the output result of the general computation module PE as the output result of the general computation circuit.
The calculation of the single-PE fully connected configuration proceeds as follows:
Step 1, define 4 input data that change every cycle: in the 1st cycle the pixel value of the first point of the first row of each input feature map is input, and so on until, in the 9th cycle, the pixel value of the third point of the third row of each input feature map is input; define the 36 weight data stored in the RAM as coming from four 3 × 3 convolution kernels; define the externally input bias data as 5;
Step 2, the 4 multipliers obtain the stored weight data from the RAM; the 4 weight data obtained in the 1st cycle are the weight values of the first point of the first row of each of the four 3 × 3 convolution kernels, and so on until the 4 weight data obtained in the 9th cycle are the weight values of the third point of the third row of each of the four 3 × 3 convolution kernels;
Step 3, in the current cycle, the 4 multipliers obtain the stored weight data from the RAM, receive and process the externally input calculation data, and transmit the 4 resulting products to the adder tree; over the 9 cycles each multiplier sequentially receives the data values 1, 2, 3, 2, 3, 4, 3, 4, and 5, sequentially obtains weight values of 1, and sequentially transmits the products 1 × 1, 1 × 2, 1 × 3, 1 × 2, 1 × 3, 1 × 4, 1 × 3, 1 × 4, and 1 × 5 to the adder tree; this calculation corresponds to the matrix inner product multiplication step of the fully connected calculation;
Step 4, the adder tree accumulates the 4 products of the current cycle to obtain the accumulated sum of the current cycle, which is then stored in the first-in first-out queue;
Step 5, the cascade adder receives the accumulated sum of the current cycle and adds it to the cascade input; since the cascade output of the cascade adder in the previous cycle is the cascade input of the cascade adder in the current cycle, and the cascade output in the current cycle is the cascade input in the next cycle, the cascade output of the cascade adder in the current cycle equals the sum of all accumulated sums from the first cycle to the current cycle, and the cascade output in the 9th cycle is the sum of the accumulated sums of all 9 cycles; this calculation corresponds to the full addition step of the fully connected calculation;
Step 6, in the 9th cycle, the bias adder receives the cascade output of the cascade adder, adds it to the externally input bias data, and transmits the addition result to the ReLu activation function module;
Step 7, the ReLu activation function module processes the addition result to obtain the output result of the general computation module PE in the 9th cycle as the output result of the general computation circuit; the output result of the general computation circuit is equivalent to the result of one fully connected calculation with 4 input feature maps.
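Collapsing the 9-cycle timing into a loop, the single-PE fully connected accumulation can also be replayed with `pe_cycle`. One simplifying assumption is made here: all four multipliers receive the same data value in a given cycle and every weight is 1, matching the per-cycle values listed in step 3.

```python
data_per_cycle = [1, 2, 3, 2, 3, 4, 3, 4, 5]   # 9 cycles of input data
bias = 5

cascade = 0                                    # cascade input of cycle 1
for d in data_per_cycle:
    # the previous cycle's cascade output is this cycle's cascade input
    cascade, output = pe_cycle([1, 1, 1, 1], [d] * 4, cascade, bias)

# Only the 9th-cycle output, after the bias adder and the ReLu module,
# is taken as the fully connected result of the circuit.
assert output == 113   # ReLu(4 * (1+2+3+2+3+4+3+4+5) + 5)
```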
As shown in FIG. 6, the cascaded-PE fully connected configuration is suitable for the case where the number of input feature maps of the fully connected layer in the neural network is larger than the number of multipliers in a general computation module PE, for example fully connected computation with 16 input feature maps; the general computation circuit in FIG. 6 is a cascade of 4 general computation modules PE;
The cascaded-PE fully connected configuration is:
taking the cascade output of the cascade adder in the 1st general computation module PE in the previous cycle as the cascade input of the cascade adder in the 2nd general computation module PE in the current cycle, taking the cascade output of the cascade adder in the 2nd general computation module PE in the current cycle as the cascade input of the cascade adder in the 3rd general computation module PE in the next cycle, and taking the cascade output of the cascade adder in the 3rd general computation module PE in the next cycle as the cascade input of the cascade adder in the 4th general computation module PE in the cycle after that;
the cascade output of the cascade adder in the 4th general computation module PE in the previous cycle is used as the cascade input of the cascade adder in the 1st general computation module PE;
the output result of the 4th general computation module PE is taken as the output result of the general computation circuit.
The calculation of the cascaded-PE fully connected configuration proceeds as follows:
Step 1, define 16 input data that change every cycle: in the 1st cycle the pixel value of the first point of the first row of each input feature map is input, and so on until, in the 9th cycle, the pixel value of the third point of the third row of each input feature map is input; define the 144 weight data stored in the RAMs as coming from sixteen 3 × 3 convolution kernels; define the externally input bias data as 5;
Step 2, the 16 multipliers obtain the stored weight data from the RAMs; the weight data change every cycle: the 16 weight data obtained in the 1st cycle are the weight values of the first point of the first row of each of the sixteen 3 × 3 convolution kernels, and so on until the 16 weight data obtained in the 9th cycle are the weight values of the third point of the third row of each of the sixteen 3 × 3 convolution kernels;
Step 3, in the current cycle, the 16 multipliers obtain the stored weight data from the RAMs, receive and process externally input calculation data, and transmit the 16 resulting products to the adder trees;
Step 4, the adder tree of each general computation module PE accumulates its 4 products, and the accumulated sum is stored in its first-in first-out queue;
Step 5, in FIG. 6, the accumulated sum in the first first-in first-out queue is read out and transmitted to the first cascade adder in the previous cycle, the accumulated sum in the second first-in first-out queue is read out and transmitted to the second cascade adder in the current cycle, the accumulated sum in the third first-in first-out queue is read out and transmitted to the third cascade adder in the next cycle, and the accumulated sum in the fourth first-in first-out queue is read out and transmitted to the fourth cascade adder in the cycle after that;
Step 6, in the previous cycle, the first cascade adder receives its accumulated sum and adds it to the first cascade input; since the cascade input of the first cascade adder is set to '0', the cascade output of the first cascade adder still equals its accumulated sum; the cascade output of the cascade adder in the first general computation module PE in the previous cycle serves as the cascade input of the cascade adder in the second general computation module PE in the current cycle; in the current cycle, the second cascade adder receives its accumulated sum and adds it to the second cascade input, yielding the cascade output of the second cascade adder; the cascade output of the cascade adder in the second general computation module PE in the current cycle serves as the cascade input of the cascade adder in the third general computation module PE in the next cycle; in the next cycle, the third cascade adder receives its accumulated sum and adds it to the third cascade input, yielding the cascade output of the third cascade adder; the cascade output of the cascade adder in the third general computation module PE in the next cycle serves as the cascade input of the cascade adder in the fourth general computation module PE in the cycle after that; in that cycle, the fourth cascade adder receives its accumulated sum and adds it to the fourth cascade input, and the resulting cascade output of the fourth cascade adder is the sum of the accumulated sums of the 4 adder trees, that is, the accumulated sum of all 16 products;
Step 7, since the cascade output of the cascade adder in the fourth general computation module PE in the previous cycle is the cascade input of the cascade adder in the first general computation module PE in the current cycle, and the cascade output of the cascade adder in the fourth general computation module PE in the current cycle is the cascade input of the cascade adder in the first general computation module PE in the next cycle, the cascade output of the cascade adder in the current cycle equals the sum of all accumulated sums from the first cycle to the current cycle, and the cascade output in the 9th cycle is the sum of the accumulated sums of all 9 cycles; this calculation corresponds to the full addition step of the fully connected calculation;
Step 8, in the 9th cycle, the bias adder receives the cascade output of the fourth cascade adder, adds it to the externally input bias data, and transmits the addition result to the ReLu activation function module;
Step 9, the ReLu activation function module processes the addition result to obtain the output result of the fourth general computation module PE in the 9th cycle as the output result of the general computation circuit; the output result of the general computation circuit is equivalent to the result of one fully connected calculation with 16 input feature maps.
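The cascaded fully connected case combines both accumulation paths: within one pass the four PEs chain spatially (16 products), and the 4th PE's cascade output feeds back to the 1st PE for the temporal accumulation over the 9 cycles. A sketch under the same simplifying assumptions as before (one shared data value per cycle, unit weights, pipeline staggering collapsed):

```python
data_per_cycle = [1, 2, 3, 2, 3, 4, 3, 4, 5]   # illustrative 9-cycle input
bias = 5

feedback = 0                      # 4th PE's cascade output, fed back to PE1
for d in data_per_cycle:
    cascade = feedback
    for _pe in range(4):          # PE1 -> PE2 -> PE3 -> PE4 within one pass
        cascade, output = pe_cycle([1, 1, 1, 1], [d] * 4, cascade, bias)
    feedback = cascade

assert output == 437   # ReLu(16 * (1+2+3+2+3+4+3+4+5) + 5)
```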

Claims (6)

1. A general calculation circuit of a neural network accelerator, characterized by consisting of m general computation modules PE, wherein any i-th general computation module PE consists of a RAM, 2^n multipliers, an adder tree, a cascade adder, a bias adder, a first-in first-out queue, and a ReLu activation function module;
in the current cycle, the 2^n multipliers obtain the stored weight data from the RAM, receive and process externally input calculation data to obtain 2^n products in the current cycle, and pass the products to the adder tree;
the adder tree accumulates the 2^n products in the current cycle to obtain the accumulated sum of the current cycle, which is then stored in the first-in first-out queue;
the first-in first-out queue reads out the accumulated sum of the current cycle and transmits it to the cascade adder;
the cascade adder receives the accumulated sum of the current cycle and adds it to the cascade input, which differs between configurations, to obtain the cascade output of the i-th cascade adder in the current cycle;
the bias adder receives the cascade output of the i-th cascade adder in the current cycle, adds it to the externally input bias data of the current cycle, and transmits the addition result to the ReLu activation function module;
and the ReLu activation function module processes the addition result to obtain the output result of the i-th general computation module PE in the current cycle, which serves as the output result of the general computation circuit under the different configurations.
2. The general calculation circuit of a neural network accelerator as claimed in claim 1, characterized in that the different configurations are selected according to the following steps:
step 1, judging whether the convolution kernel size in the neural network is smaller than the number of multipliers, 2^n; if so, executing the single-PE convolution configuration; otherwise, executing the cascaded-PE convolution configuration;
step 2, judging whether the number of input feature maps of the fully connected layer in the neural network is smaller than the number of multipliers, 2^n; if so, executing the single-PE fully connected configuration; otherwise, executing the cascaded-PE fully connected configuration.
3. The general calculation circuit of a neural network accelerator as claimed in claim 2, characterized in that the single-PE convolution configuration is:
setting the cascade input of the cascade adder in the i-th general computation module PE to 0;
and taking the output results of the m general computation modules PE as the output results of the general computation circuit.
4. The general calculation circuit of a neural network accelerator as claimed in claim 2, characterized in that the cascaded-PE convolution configuration is:
taking the cascade output of the cascade adder in the (i-1)-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
when i = 1, setting the cascade input of the cascade adder in the i-th general computation module PE to 0;
and taking the output result of the m-th general computation module PE as the output result of the general computation circuit.
5. The general calculation circuit of a neural network accelerator as claimed in claim 2, characterized in that the single-PE fully connected configuration is:
taking the cascade output of the cascade adder in the i-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE in the current cycle;
and taking the output results of the m general computation modules PE as the output results of the general computation circuit.
6. The general calculation circuit of a neural network accelerator as claimed in claim 2, characterized in that the cascaded-PE fully connected configuration is:
taking the cascade output of the cascade adder in the (i-1)-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
when i = 1, taking the cascade output of the cascade adder in the m-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
and taking the output result of the m-th general computation module PE as the output result of the general computation circuit.
CN201911055499.5A 2019-10-31 2019-10-31 General calculation circuit of neural network accelerator Active CN110807522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911055499.5A CN110807522B (en) 2019-10-31 2019-10-31 General calculation circuit of neural network accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911055499.5A CN110807522B (en) 2019-10-31 2019-10-31 General calculation circuit of neural network accelerator

Publications (2)

Publication Number Publication Date
CN110807522A true CN110807522A (en) 2020-02-18
CN110807522B CN110807522B (en) 2022-05-06

Family

ID=69489925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911055499.5A Active CN110807522B (en) 2019-10-31 2019-10-31 General calculation circuit of neural network accelerator

Country Status (1)

Country Link
CN (1) CN110807522B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN112580787A (en) * 2020-12-25 2021-03-30 北京百度网讯科技有限公司 Data processing method, device and equipment of neural network accelerator and storage medium
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112965931A (en) * 2021-02-22 2021-06-15 北京微芯智通科技合伙企业(有限合伙) Digital integration processing method based on CNN cell neural network structure
CN113095495A (en) * 2021-03-29 2021-07-09 上海西井信息科技有限公司 Control method of convolutional neural network module
WO2021248540A1 (en) * 2020-06-11 2021-12-16 杭州知存智能科技有限公司 Data loading circuit and method
US11977969B2 (en) 2020-06-11 2024-05-07 Hangzhou Zhicun Intelligent Technology Co., Ltd. Data loading

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN108805266A (en) * 2018-05-21 2018-11-13 南京大学 A kind of restructural CNN high concurrents convolution accelerator
CN109543140A (en) * 2018-09-20 2019-03-29 中国科学院计算技术研究所 A kind of convolutional neural networks accelerator
CN109726806A (en) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 Information processing method and terminal device
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN109726806A (en) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 Information processing method and terminal device
CN108805266A (en) * 2018-05-21 2018-11-13 南京大学 A kind of restructural CNN high concurrents convolution accelerator
CN109543140A (en) * 2018-09-20 2019-03-29 中国科学院计算技术研究所 A kind of convolutional neural networks accelerator
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
余子健: "Convolutional Neural Network Accelerator Based on FPGA", China Master's Theses Full-text Database, Information Science and Technology Series *
朱智洋: "Design and Optimization of a CNN Accelerator Based on Approximate Computing and Data Scheduling", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021248540A1 (en) * 2020-06-11 2021-12-16 杭州知存智能科技有限公司 Data loading circuit and method
CN113807506A (en) * 2020-06-11 2021-12-17 杭州知存智能科技有限公司 Data loading circuit and method
US11977969B2 (en) 2020-06-11 2024-05-07 Hangzhou Zhicun Intelligent Technology Co., Ltd. Data loading
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN112580787A (en) * 2020-12-25 2021-03-30 北京百度网讯科技有限公司 Data processing method, device and equipment of neural network accelerator and storage medium
CN112580787B (en) * 2020-12-25 2023-11-17 北京百度网讯科技有限公司 Data processing method, device and equipment of neural network accelerator and storage medium
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112965931A (en) * 2021-02-22 2021-06-15 北京微芯智通科技合伙企业(有限合伙) Digital integration processing method based on CNN cell neural network structure
CN113095495A (en) * 2021-03-29 2021-07-09 上海西井信息科技有限公司 Control method of convolutional neural network module
CN113095495B (en) * 2021-03-29 2023-08-25 上海西井科技股份有限公司 Control Method of Convolutional Neural Network Module

Also Published As

Publication number Publication date
CN110807522B (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN110807522B (en) General calculation circuit of neural network accelerator
CN107862374B (en) Neural network processing system and processing method based on assembly line
CN107844826B (en) Neural network processing unit and processing system comprising same
CN110458279B (en) FPGA-based binary neural network acceleration method and system
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN108090565A (en) Accelerated method is trained in a kind of convolutional neural networks parallelization
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN108629406B (en) Arithmetic device for convolutional neural network
CN110163357A (en) A kind of computing device and method
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN113344179B (en) IP core of binary convolution neural network algorithm based on FPGA
CN111126569B (en) Convolutional neural network device supporting pruning sparse compression and calculation method
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN110163350A (en) A kind of computing device and method
Hao A general neural network hardware architecture on FPGA
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
Domingos et al. An efficient and scalable architecture for neural networks with backpropagation learning
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN114065923A (en) Compression method, system and accelerating device of convolutional neural network
CN110807479A (en) Neural network convolution calculation acceleration method based on Kmeans algorithm
CN110765413A (en) Matrix summation structure and neural network computing platform
Alaeddine et al. A Pipelined Energy-efficient Hardware Accelaration for Deep Convolutional Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant