CN110807522A - General calculation circuit of neural network accelerator

General calculation circuit of neural network accelerator

Info

Publication number
CN110807522A
Authority
CN
China
Prior art keywords
cascade
adder
general
output
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911055499.5A
Other languages
Chinese (zh)
Other versions
CN110807522B (en)
Inventor
杜高明
任宇翔
曹红芳
张多利
田超
宋宇鲲
李桢旻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN201911055499.5A
Publication of CN110807522A
Application granted
Publication of CN110807522B
Legal status: Active

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a general computation circuit of a neural network accelerator, which consists of m general computation modules PE, where any i-th general computation module PE consists of a RAM, 2^n multipliers, an adder tree, a cascade adder, a bias adder, a first-in first-out queue, and a ReLu activation function module. Different computing circuits of the neural network are built using the single-PE convolution, cascaded-PE convolution, single-PE fully connected, and cascaded-PE fully connected configurations. The invention can configure the general computing circuit according to the variables of the neural network accelerator, so a neural network can be built or modified more simply, conveniently, and quickly, the inference time of the neural network is shortened, and the hardware development time of related deep-learning research is reduced.

Description

General calculation circuit of neural network accelerator
Technical Field
The invention belongs to the technical field of Field Programmable Gate Array (FPGA) design of integrated circuits, and particularly relates to a general computation module circuit of a neural network accelerator.
Background
In 2012, AlexNet won the large-scale visual recognition challenge, and deep neural networks again became a research hotspot. Research on convolutional neural networks in particular has received more and more attention, and they are widely applied in fields such as digital video surveillance, face recognition, and image classification. The learning process of a convolutional neural network involves a large number of iterative operations and data reads, and a CPU, with its limited number of cores, cannot fully exploit the parallelism present in the neural network. To increase the computation speed of convolutional neural networks, researchers have proposed hardware architectures based on GPUs, FPGAs, and ASICs, among which GPU-based development has been widely applied in many fields. Among these platforms, the FPGA suits computationally intensive workloads: it provides many dedicated arithmetic units, logic resources, and storage resources on chip, so the computation units of a convolutional neural network can execute in parallel, which makes the FPGA very suitable as a hardware accelerator for convolutional neural networks. Moreover, the FPGA is flexible and efficient, its power consumption is much lower than that of a GPU, and its chip size and cost are smaller than those of an ASIC, so it can conveniently be applied in electronic products that need online image or sound processing at any time, such as financial prediction, artificial-intelligence robots, and medical diagnosis. FPGAs are also flexible to program, products based on them are easy to upgrade and maintain, and design cycles and time to market are relatively short. Nevertheless, research on accelerating convolutional neural networks on FPGA platforms is still at an early stage and has not been widely applied in commercial fields.
Although current FPGA platforms can support convolutional neural network development, they also have limitations:
1) When an FPGA platform uses a hardware description language to develop a convolutional neural network, modular hardware circuit design is lacking, debugging is cumbersome, and the hardware development cycle of the convolutional neural network is long;
2) Because the traditional FPGA development flow describes circuit behavior in a hardware description language, designing and building a neural network requires accounting for its many variables, such as the convolution kernel size, the number of feature maps, the number of convolutional layers, the number of fully connected layers, and the network output categories. The hardware designs of basic components such as convolutional layers, pooling layers, fully connected layers, and activation function layers are relatively rigid and inflexible; if any variable of the neural network changes, the underlying circuit behavior must be described again in the hardware description language, so generality is poor.
Disclosure of the Invention
In order to overcome the defects in the prior art, the invention provides a general computing circuit of a neural network accelerator, which aims to improve the universality and flexibility of a computing module PE, thereby improving the performance of the neural network accelerator and reducing the hardware development time.
The technical scheme adopted by the invention to achieve the aim is as follows:
the general calculation circuit of the neural network accelerator is characterized by consisting of m general calculation modules PE, wherein any ith general calculation module PE consists of RAM and 2nThe system comprises a multiplier, an adder tree, a cascade adder, an offset adder, a first-in first-out queue and a ReLu activation function module;
at the current cycle, 2nThe multiplier acquires the stored weight data from the RAM, receives and processes externally input calculation data to obtain 2 in the current periodnPassing the product to the adder tree;
the adder tree pair is 2 in the current cyclenThe products are accumulated to obtain the accumulated sum in the current period and then stored in the first-in first-outIn a queue;
the first-in first-out queue reads the accumulated sum in the current period and transmits the accumulated sum to the cascade adder;
the cascade adder receives the accumulated sum in the current period and calculates the accumulated sum with cascade inputs in different configurations to obtain the cascade output of the ith cascade adder in the current period;
the offset adder receives the cascade output of the ith cascade adder in the current period, calculates the cascade output with the offset data input externally in the current period, obtains an addition result and transmits the addition result to the ReLu activation function module;
and processing the addition result by the ReLu activation function module to obtain the output result of the ith general computation module PE in the current period and the output result of the general computation circuit in different configurations.
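To make the datapath concrete, the per-cycle behavior of one general computation module PE can be modeled in software. The following is a minimal sketch, not the patented RTL: the names `relu` and `pe_cycle` are illustrative, and the first-in first-out queue between the adder tree and the cascade adder is collapsed into a plain value for clarity.

```python
def relu(x):
    """ReLu activation function module: negative sums are clamped to 0."""
    return x if x > 0 else 0

def pe_cycle(weights, inputs, cascade_in, bias):
    """Model one cycle of a general computation module PE with 2^n multipliers.

    weights    -- 2^n weight values read from the PE's RAM
    inputs     -- 2^n externally input calculation data
    cascade_in -- cascade input of the cascade adder (configuration-dependent)
    bias       -- externally input bias data
    Returns (cascade_out, pe_output); FIFO buffering is omitted.
    """
    products = [w * x for w, x in zip(weights, inputs)]   # 2^n multipliers
    accumulated_sum = sum(products)                       # adder tree
    cascade_out = accumulated_sum + cascade_in            # cascade adder
    pe_output = relu(cascade_out + bias)                  # bias adder + ReLu
    return cascade_out, pe_output
```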
The general computation circuit of the neural network accelerator is also characterized in that the different configurations are selected according to the following steps (a sketch of this selection logic follows the steps):
Step 1, judge whether the convolution kernel size in the neural network is smaller than the number of multipliers, 2^n; if so, execute the single-PE convolution configuration; otherwise, execute the cascaded-PE convolution configuration;
Step 2, judge whether the number of input feature maps of the fully connected layer in the neural network is smaller than the number of multipliers, 2^n; if so, execute the single-PE fully connected configuration; otherwise, execute the cascaded-PE fully connected configuration.
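As a minimal illustration only (the function name `select_config` and its string labels are not from the specification), the two judgments can be written as:

```python
def select_config(layer_type, size, num_multipliers):
    """Pick a PE configuration following the two steps above.

    layer_type      -- "convolution" or "fully_connected"
    size            -- convolution kernel size (number of weights), or the
                       number of input feature maps of a fully connected layer
    num_multipliers -- the 2**n multipliers of one general computation module PE
    """
    if layer_type == "convolution":
        return ("single-PE convolution" if size < num_multipliers
                else "cascaded-PE convolution")
    return ("single-PE fully connected" if size < num_multipliers
            else "cascaded-PE fully connected")
```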
The single-PE convolution configuration is:
setting the cascade input of the cascade adder in the i-th general computation module PE to 0;
and taking the output results of the m general computation modules PE as the output results of the general computation circuit.
The cascaded-PE convolution configuration is:
taking the cascade output of the cascade adder in the (i-1)-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
when i = 1, setting the cascade input of the cascade adder in the i-th general computation module PE to 0;
and taking the output result of the m-th general computation module PE as the output result of the general computation circuit.
The single-PE fully connected configuration is:
taking the cascade output of the cascade adder in the i-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE in the current cycle;
and taking the output results of the m general computation modules PE as the output results of the general computation circuit.
The cascaded-PE fully connected configuration is:
taking the cascade output of the cascade adder in the (i-1)-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
when i = 1, taking the cascade output of the cascade adder in the m-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
and taking the output result of the m-th general computation module PE as the output result of the general computation circuit.
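The cascade-input wiring of the four configurations above can be condensed into one selection function. This is a sketch under stated assumptions: PEs are 1-indexed, and `prev_cascade_out` is a hypothetical mapping from a PE index to that PE's cascade output in the previous cycle.

```python
def cascade_input(config, i, m, prev_cascade_out):
    """Cascade input of the cascade adder in the i-th of m PEs."""
    if config == "single-PE convolution":
        return 0                                   # always zero
    if config == "cascaded-PE convolution":
        return 0 if i == 1 else prev_cascade_out[i - 1]
    if config == "single-PE fully connected":
        return prev_cascade_out[i]                 # self-feedback across cycles
    if config == "cascaded-PE fully connected":    # ring: PE m feeds PE 1
        return prev_cascade_out[m] if i == 1 else prev_cascade_out[i - 1]
    raise ValueError(config)
```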
Compared with the prior art, the beneficial technical effects of the invention are as follows:
the invention fully utilizes the general characteristics of convolution neural network convolution calculation and full-connection calculation, configures a general calculation circuit according to variables of a neural network accelerator, determines to adopt single PE convolution configuration or cascade PE convolution configuration to build a calculation circuit of the neural network by judging whether the size of convolution kernels in the built neural network is smaller than the number of multipliers of a general calculation module PE, determines to adopt single PE full-connection configuration or cascade PE full-connection configuration to build a calculation circuit of the neural network by judging whether the number of input characteristic graphs of a full-connection layer in the built neural network is smaller than the number of multipliers of the general calculation module PE, thereby being applicable to convolution calculation and full-connection calculation under various conditions, and the configuration method supports various combination variable variables of the built neural network, when one variable in the neural network changes in the design process, the rewriting module of the calculation module is not required to be overturned, and only the general calculation module is required to be reconfigured, so that the circuit design scheme is simplified, the reasoning time of the neural network is shortened, and the complexity of a circuit of a convolutional neural network system and the hardware development time of the convolutional neural network are reduced;
drawings
FIG. 1 is a block diagram of a general computing module PE hardware circuit according to the present invention;
FIG. 2 is a diagram of a convolutional neural network;
FIG. 3 is a diagram of a single PE convolution arrangement in accordance with the present invention;
FIG. 4 is a diagram of a cascaded PE convolution arrangement in accordance with the present invention;
FIG. 5 is a diagram of a single PE full connection configuration according to the present invention;
fig. 6 is a diagram of the configuration of the cascaded PE full connection according to the present invention.
Detailed Description
In this embodiment, as shown in FIG. 1, a general computation circuit of a neural network accelerator is composed of m general computation modules PE, where any general computation module PE consists of a RAM, 2^n multipliers, an adder tree, a cascade adder, a bias adder, a first-in first-out queue, and a ReLu activation function module; in this embodiment, n = 2, giving 4 multipliers.
In the current cycle, the 4 multipliers obtain the stored weight data from the RAM, receive and process externally input calculation data, obtain 4 products in the current cycle, and transmit the products to the adder tree;
the adder tree accumulates the 4 products of the current cycle to obtain the accumulated sum of the current cycle, which is then stored in the first-in first-out queue;
the first-in first-out queue reads out the accumulated sum of the current cycle and transmits it to the cascade adder; the first-in first-out queue acts as a data buffer;
the cascade adder receives the accumulated sum of the current cycle and adds it to the cascade input, which differs between configurations, to obtain the cascade output of the cascade adder in the current cycle;
the bias adder receives the cascade output of the cascade adder in the current cycle, adds it to the externally input bias data of the current cycle, and transmits the addition result to the ReLu activation function module;
and the ReLu activation function module processes the addition result to obtain the output result of the general computation module PE in the current cycle.
As shown in FIG. 2, the convolutional neural network is composed of convolutional layers, pooling layers, activation functions, and fully connected layers; the different configurations of the general computation module PE are suitable for convolution and fully connected computation under various conditions; the different configurations are selected according to the following steps:
Step 1, judge whether the convolution kernel size in the neural network is smaller than the number of multipliers, 4; if so, execute the single-PE convolution configuration; otherwise, execute the cascaded-PE convolution configuration;
Step 2, judge whether the number of input feature maps of the fully connected layer in the neural network is smaller than the number of multipliers, 4; if so, execute the single-PE fully connected configuration; otherwise, execute the cascaded-PE fully connected configuration.
As shown in FIG. 3, the single-PE convolution configuration is suitable for the case where the convolution kernel size in the neural network is smaller than the number of multipliers in a general computation module PE, for example convolution with a 2 × 2 kernel; the general computation circuit in FIG. 3 is a single general computation module PE;
The single-PE convolution configuration is:
setting the cascade input of the cascade adder in the general computation module PE to 0;
taking the output result of the general computation module PE as the output result of the general computation circuit;
the calculation of the single PE convolution configuration is performed as follows:
step 1, defining 4 input data as pixel values of 4 points of an input characteristic diagram, wherein the pixel values are 1, 2, 3 and 4 respectively; defining 4 weight values of 4 points of 4 weight data stored in the RAM from 2 x 2 convolution kernels, wherein the weight values are 1, 1, 1 and 1 respectively; defining bias data input externally as 5; defining the cascade input of the cascade adder to be 0;
step 2, 4 multipliers obtain the stored weight data from the RAM, receive and process the externally input calculation data, transmit the obtained 4 products to the adder tree, wherein the products are 1 × 1, 1 × 2, 1 × 3, 1 × 4; the calculation process corresponds to a matrix inner product multiplication step in the convolution calculation process under the condition that the convolution kernel is 2 x 2;
step 3, the adder tree carries out accumulation processing on the 4 products to obtain an accumulated sum, and then the accumulated sum is stored in a first-in first-out queue, wherein the accumulated sum is (1 × 1+1 × 2+1 × 3+1 × 4); the calculation process corresponds to a full addition step in the convolution calculation process under the condition that the convolution kernel is 2 x 2;
step 4, the cascade adder receives the accumulated sum and calculates with the cascade input, because the cascade input of the cascade adder is set to '0', the cascade output of the cascade adder is still equal to the accumulated sum, and the cascade output of the cascade adder is (1 × 1+1 × 2+1 × 3+1 × 4);
step 5, the offset adder receives the cascade output of the cascade adder, calculates the cascade output and offset data input from outside, and transmits the addition result to the ReLu activation function module after obtaining the addition result, wherein the addition result is (1 × 1+1 × 2+1 × 3+1 × 4) + 5; the calculation process corresponds to the biasing step in the convolution calculation process under the condition that the convolution kernel is 2 x 2;
step 6, processing the addition result by the ReLu activation function module to obtain an output result of the general computation module PE as an output result of the general computation circuit; the output result of the general calculation circuit is equivalent to the result obtained by one 2 x 2 convolution kernel convolution calculation.
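Numerically, this worked example can be replayed with the `pe_cycle` sketch given earlier (inputs 1 to 4, unit weights, bias 5, cascade input 0):

```python
cascade_out, result = pe_cycle(weights=[1, 1, 1, 1],
                               inputs=[1, 2, 3, 4],
                               cascade_in=0,
                               bias=5)
assert cascade_out == 10   # 1*1 + 1*2 + 1*3 + 1*4
assert result == 15        # ReLu(10 + 5), one 2 x 2 convolution result
```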
In this embodiment, as shown in FIG. 4, the cascaded-PE convolution configuration is suitable for the case where the convolution kernel size in the neural network is larger than the number of multipliers in a general computation module PE, for example convolution with a 3 × 3 kernel; the general computation circuit in FIG. 4 is a cascade of 3 general computation modules PE;
The cascaded-PE convolution configuration is:
taking the cascade output of the cascade adder in the 1st general computation module PE in the previous cycle as the cascade input of the cascade adder in the 2nd general computation module PE in the current cycle, and taking the cascade output of the cascade adder in the 2nd general computation module PE in the current cycle as the cascade input of the cascade adder in the 3rd general computation module PE in the next cycle;
setting the cascade input of the cascade adder in the 1st general computation module PE to '0';
taking the output result of the 3rd general computation module PE as the output result of the general computation circuit;
the calculation of the convolution configuration of the cascaded PE is carried out according to the following steps:
step 1, defining 9 input data as pixel values of 9 points of an input characteristic diagram, wherein the pixel values are 1, 2, 3, 4, 5, 6, 7, 8 and 9 respectively, a general calculation circuit is a cascade of 3 general calculation modules PE, total 12 multipliers exist, only 9 multipliers are used in the calculation process, unused multipliers are used, and the input of the multipliers is regarded as 0; defining the weight values of 9 points of 9 weight data written in the RAM from a 3 × 3 convolution kernel, wherein the weight values are 1, 1, 1, 1, 1, 1, 1, 1; defining bias data input externally as 5; defining the cascade input of the 1 st cascade adder to be 0;
step 2, 9 multipliers acquire the stored weight data from the RAM, receive and process externally input calculation data, and transmit 9 products to the adder tree, wherein the products are specifically 1 × 1, 1 × 2, 1 × 3, 1 × 4, 1 × 5, 1 × 6, 1 × 7, 1 × 8 and 1 × 9; the calculation process corresponds to a matrix inner product multiplication step in the convolution calculation process under the condition that the convolution kernel is 3 x 3;
step 3, the adder tree of each general computation module PE performs an accumulation process on the 4 products, and the obtained accumulated sum is stored in a first-in first-out queue, where the accumulated sum is (1 × 1+1 × 2+1 × 3+1 × 4), (1 × 5+1 × 6+ 1+ 7+1 × 8) and 1 × 9;
step 4, in fig. 4, the first fifo queue accumulates, reads and transmits to the first cascade adder in the last cycle, the second fifo queue accumulates, reads and transmits to the second cascade adder in the current cycle, and the third fifo queue accumulates, reads and transmits to the third cascade adder in the next cycle;
step 5, in the previous cycle, the first cascade adder receives the accumulated sum and calculates with the first cascade input, and because the cascade input of the first cascade adder is set to '0', the obtained cascade output of the first cascade adder is still equal to the accumulated sum, and the cascade output of the first cascade adder is specifically (1 × 1+1 × 2+1 × 3+1 × 4); the cascade output of the cascade adder in the first general computation module PE in the previous period is used as the cascade input of the cascade adder in the second general computation module PE in the current period; during the current cycle, the second cascade adder receives the accumulated sum and performs calculation with the second cascade input, and since the cascade input of the second cascade adder is the cascade output of the cascade adder in the first general calculation module PE of the previous cycle, the cascade output of the second cascade adder is obtained, specifically, (1 × 1+1 × 2+1 × 3+1 × 4+1 × 5+ 1+ 6+1 × 7+1 × 8); the cascade output of the cascade adder in the second general computation module PE of the current period is used as the cascade input of the cascade adder in the third general computation module PE of the next period; in the next cycle, the third cascade adder receives the accumulated sum and performs calculation with the third cascade input, and since the cascade input of the third cascade adder is the cascade output of the cascade adder in the second general calculation module PE in the current cycle, the cascade output of the third cascade adder is obtained, specifically, (1 × 1+1 × 2+1 × 3+1 × 4+1 × 5+ 1+ 6+1 × 7+1 × 8+1 × 9); the calculation process corresponds to the full addition step in the convolution calculation process under the condition that the convolution kernel is 3 x 3
Step 6, the third offset adder receives the cascade output of the third cascade adder, calculates with the offset data input from outside, and transmits the addition result to the ReLu activation function module, wherein the addition result is (1 × 1+1 × 2+1 × 3+1 × 4+1 × 5+1 × 6+1 × 7+1 × 8+1 × 9) + 5; the calculation process corresponds to the biasing step in the convolution calculation process under the condition that the convolution kernel is 3 x 3;
step 7, the ReLu activation function module processes the addition result, and the output result of the third general computation module PE is used as the output result of the general computation circuit; the output result of the general calculation circuit is equivalent to the result obtained by one convolution calculation of 3 × 3 convolution kernel.
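The same `pe_cycle` sketch can replay this cascaded example if the cycle-by-cycle staggering of the pipeline is collapsed into a simple loop; the 4 + 4 + 1 split of the nine products across the three PEs matches step 1 above:

```python
inputs  = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # pixel values under the 3 x 3 kernel
weights = [1] * 9
bias = 5

# 3 cascaded PEs with 4 multipliers each; unused multiplier inputs are 0.
stages = [(weights[0:4], inputs[0:4]),
          (weights[4:8], inputs[4:8]),
          (weights[8:9] + [0] * 3, inputs[8:9] + [0] * 3)]

cascade = 0                               # cascade input of the 1st PE
for w, x in stages:                       # PE1 -> PE2 -> PE3, one cycle apart
    cascade, output = pe_cycle(w, x, cascade, bias)

assert cascade == 45                      # 1 + 2 + ... + 9
assert output == 50                       # ReLu(45 + 5), the 3rd PE's output
```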
In this embodiment, as shown in FIG. 5, the single-PE fully connected configuration is suitable for the case where the number of input feature maps of the fully connected layer in the neural network is smaller than the number of multipliers in a general computation module PE; the general computation circuit in FIG. 5 is a single general computation module PE;
The single-PE fully connected configuration is:
taking the cascade output of the cascade adder in the general computation module PE in the previous cycle as the cascade input of the cascade adder in the general computation module PE in the current cycle;
and taking the output result of the general computation module PE as the output result of the general computation circuit.
The calculation of the single-PE fully connected configuration proceeds as follows:
Step 1, define 4 input data that change every cycle: in the 1st cycle the pixel value of the first point of the first row of each input feature map is input, and so on until, in the 9th cycle, the pixel value of the third point of the third row of each input feature map is input; define the 36 weight data stored in the RAM as coming from four 3 × 3 convolution kernels; define the externally input bias data as 5;
Step 2, the 4 multipliers obtain the stored weight data from the RAM; the 4 weight data obtained in the 1st cycle are the weight values of the first point of the first row of each of the four 3 × 3 convolution kernels, and so on until the 4 weight data obtained in the 9th cycle are the weight values of the third point of the third row of each of the four 3 × 3 convolution kernels;
Step 3, in the current cycle, the 4 multipliers obtain the stored weight data from the RAM, receive and process the externally input calculation data, and transmit the 4 resulting products to the adder tree; over the 9 cycles each multiplier sequentially receives the data values 1, 2, 3, 2, 3, 4, 3, 4, and 5, sequentially obtains weight values of 1, and sequentially transmits the products 1 × 1, 1 × 2, 1 × 3, 1 × 2, 1 × 3, 1 × 4, 1 × 3, 1 × 4, and 1 × 5 to the adder tree; this calculation corresponds to the matrix inner product multiplication step of the fully connected calculation;
Step 4, the adder tree accumulates the 4 products of the current cycle to obtain the accumulated sum of the current cycle, which is then stored in the first-in first-out queue;
Step 5, the cascade adder receives the accumulated sum of the current cycle and adds it to the cascade input; since the cascade output of the cascade adder in the previous cycle is the cascade input of the cascade adder in the current cycle, and the cascade output in the current cycle is the cascade input in the next cycle, the cascade output of the cascade adder in the current cycle equals the sum of all accumulated sums from the first cycle to the current cycle, and the cascade output in the 9th cycle is the sum of the accumulated sums of all 9 cycles; this calculation corresponds to the full addition step of the fully connected calculation;
Step 6, in the 9th cycle, the bias adder receives the cascade output of the cascade adder, adds it to the externally input bias data, and transmits the addition result to the ReLu activation function module;
Step 7, the ReLu activation function module processes the addition result to obtain the output result of the general computation module PE in the 9th cycle as the output result of the general computation circuit; the output result of the general computation circuit is equivalent to the result of one fully connected calculation with 4 input feature maps.
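Collapsing the 9-cycle timing into a loop, the single-PE fully connected accumulation can also be replayed with `pe_cycle`. One simplifying assumption is made here: all four multipliers receive the same data value in a given cycle and every weight is 1, matching the per-cycle values listed in step 3.

```python
data_per_cycle = [1, 2, 3, 2, 3, 4, 3, 4, 5]   # 9 cycles of input data
bias = 5

cascade = 0                                    # cascade input of cycle 1
for d in data_per_cycle:
    # the previous cycle's cascade output is this cycle's cascade input
    cascade, output = pe_cycle([1, 1, 1, 1], [d] * 4, cascade, bias)

# Only the 9th-cycle output, after the bias adder and the ReLu module,
# is taken as the fully connected result of the circuit.
assert output == 113   # ReLu(4 * (1+2+3+2+3+4+3+4+5) + 5)
```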
As shown in FIG. 6, the cascaded-PE fully connected configuration is suitable for the case where the number of input feature maps of the fully connected layer in the neural network is larger than the number of multipliers in a general computation module PE, for example fully connected computation with 16 input feature maps; the general computation circuit in FIG. 6 is a cascade of 4 general computation modules PE;
The cascaded-PE fully connected configuration is:
taking the cascade output of the cascade adder in the 1st general computation module PE in the previous cycle as the cascade input of the cascade adder in the 2nd general computation module PE in the current cycle, taking the cascade output of the cascade adder in the 2nd general computation module PE in the current cycle as the cascade input of the cascade adder in the 3rd general computation module PE in the next cycle, and taking the cascade output of the cascade adder in the 3rd general computation module PE in the next cycle as the cascade input of the cascade adder in the 4th general computation module PE in the cycle after that;
the cascade output of the cascade adder in the 4th general computation module PE in the previous cycle is used as the cascade input of the cascade adder in the 1st general computation module PE;
the output result of the 4th general computation module PE is taken as the output result of the general computation circuit.
The calculation of the cascaded-PE fully connected configuration proceeds as follows:
Step 1, define 16 input data that change every cycle: in the 1st cycle the pixel value of the first point of the first row of each input feature map is input, and so on until, in the 9th cycle, the pixel value of the third point of the third row of each input feature map is input; define the 144 weight data stored in the RAMs as coming from sixteen 3 × 3 convolution kernels; define the externally input bias data as 5;
Step 2, the 16 multipliers obtain the stored weight data from the RAMs; the weight data change every cycle: the 16 weight data obtained in the 1st cycle are the weight values of the first point of the first row of each of the sixteen 3 × 3 convolution kernels, and so on until the 16 weight data obtained in the 9th cycle are the weight values of the third point of the third row of each of the sixteen 3 × 3 convolution kernels;
Step 3, in the current cycle, the 16 multipliers obtain the stored weight data from the RAMs, receive and process externally input calculation data, and transmit the 16 resulting products to the adder trees;
Step 4, the adder tree of each general computation module PE accumulates its 4 products, and the accumulated sum is stored in its first-in first-out queue;
Step 5, in FIG. 6, the accumulated sum in the first first-in first-out queue is read out and transmitted to the first cascade adder in the previous cycle, the accumulated sum in the second first-in first-out queue is read out and transmitted to the second cascade adder in the current cycle, the accumulated sum in the third first-in first-out queue is read out and transmitted to the third cascade adder in the next cycle, and the accumulated sum in the fourth first-in first-out queue is read out and transmitted to the fourth cascade adder in the cycle after that;
Step 6, in the previous cycle, the first cascade adder receives its accumulated sum and adds it to the first cascade input; since the cascade input of the first cascade adder is set to '0', the cascade output of the first cascade adder still equals its accumulated sum; the cascade output of the cascade adder in the first general computation module PE in the previous cycle serves as the cascade input of the cascade adder in the second general computation module PE in the current cycle; in the current cycle, the second cascade adder receives its accumulated sum and adds it to the second cascade input, yielding the cascade output of the second cascade adder; the cascade output of the cascade adder in the second general computation module PE in the current cycle serves as the cascade input of the cascade adder in the third general computation module PE in the next cycle; in the next cycle, the third cascade adder receives its accumulated sum and adds it to the third cascade input, yielding the cascade output of the third cascade adder; the cascade output of the cascade adder in the third general computation module PE in the next cycle serves as the cascade input of the cascade adder in the fourth general computation module PE in the cycle after that; in that cycle, the fourth cascade adder receives its accumulated sum and adds it to the fourth cascade input, and the resulting cascade output of the fourth cascade adder is the sum of the accumulated sums of the 4 adder trees, that is, the accumulated sum of all 16 products;
Step 7, since the cascade output of the cascade adder in the fourth general computation module PE in the previous cycle is the cascade input of the cascade adder in the first general computation module PE in the current cycle, and the cascade output of the cascade adder in the fourth general computation module PE in the current cycle is the cascade input of the cascade adder in the first general computation module PE in the next cycle, the cascade output of the cascade adder in the current cycle equals the sum of all accumulated sums from the first cycle to the current cycle, and the cascade output in the 9th cycle is the sum of the accumulated sums of all 9 cycles; this calculation corresponds to the full addition step of the fully connected calculation;
Step 8, in the 9th cycle, the bias adder receives the cascade output of the fourth cascade adder, adds it to the externally input bias data, and transmits the addition result to the ReLu activation function module;
Step 9, the ReLu activation function module processes the addition result to obtain the output result of the fourth general computation module PE in the 9th cycle as the output result of the general computation circuit; the output result of the general computation circuit is equivalent to the result of one fully connected calculation with 16 input feature maps.
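The cascaded fully connected case combines both accumulation paths: within one pass the four PEs chain spatially (16 products), and the 4th PE's cascade output feeds back to the 1st PE for the temporal accumulation over the 9 cycles. A sketch under the same simplifying assumptions as before (one shared data value per cycle, unit weights, pipeline staggering collapsed):

```python
data_per_cycle = [1, 2, 3, 2, 3, 4, 3, 4, 5]   # illustrative 9-cycle input
bias = 5

feedback = 0                      # 4th PE's cascade output, fed back to PE1
for d in data_per_cycle:
    cascade = feedback
    for _pe in range(4):          # PE1 -> PE2 -> PE3 -> PE4 within one pass
        cascade, output = pe_cycle([1, 1, 1, 1], [d] * 4, cascade, bias)
    feedback = cascade

assert output == 437   # ReLu(16 * (1+2+3+2+3+4+3+4+5) + 5)
```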

Claims (6)

1. A general calculation circuit of a neural network accelerator, characterized by consisting of m general computation modules PE, wherein any i-th general computation module PE consists of a RAM, 2^n multipliers, an adder tree, a cascade adder, a bias adder, a first-in first-out queue, and a ReLu activation function module;
in the current cycle, the 2^n multipliers obtain the stored weight data from the RAM, receive and process externally input calculation data to obtain 2^n products in the current cycle, and pass the products to the adder tree;
the adder tree accumulates the 2^n products in the current cycle to obtain the accumulated sum of the current cycle, which is then stored in the first-in first-out queue;
the first-in first-out queue reads out the accumulated sum of the current cycle and transmits it to the cascade adder;
the cascade adder receives the accumulated sum of the current cycle and adds it to the cascade input, which differs between configurations, to obtain the cascade output of the i-th cascade adder in the current cycle;
the bias adder receives the cascade output of the i-th cascade adder in the current cycle, adds it to the externally input bias data of the current cycle, and transmits the addition result to the ReLu activation function module;
and the ReLu activation function module processes the addition result to obtain the output result of the i-th general computation module PE in the current cycle, which serves as the output result of the general computation circuit under the different configurations.
2. The general calculation circuit of a neural network accelerator as claimed in claim 1, characterized in that the different configurations are selected according to the following steps:
step 1, judging whether the convolution kernel size in the neural network is smaller than the number of multipliers, 2^n; if so, executing the single-PE convolution configuration; otherwise, executing the cascaded-PE convolution configuration;
step 2, judging whether the number of input feature maps of the fully connected layer in the neural network is smaller than the number of multipliers, 2^n; if so, executing the single-PE fully connected configuration; otherwise, executing the cascaded-PE fully connected configuration.
3. The general calculation circuit of a neural network accelerator as claimed in claim 2, characterized in that the single-PE convolution configuration is:
setting the cascade input of the cascade adder in the i-th general computation module PE to 0;
and taking the output results of the m general computation modules PE as the output results of the general computation circuit.
4. The general calculation circuit of a neural network accelerator as claimed in claim 2, characterized in that the cascaded-PE convolution configuration is:
taking the cascade output of the cascade adder in the (i-1)-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
when i = 1, setting the cascade input of the cascade adder in the i-th general computation module PE to 0;
and taking the output result of the m-th general computation module PE as the output result of the general computation circuit.
5. The general calculation circuit of a neural network accelerator as claimed in claim 2, characterized in that the single-PE fully connected configuration is:
taking the cascade output of the cascade adder in the i-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE in the current cycle;
and taking the output results of the m general computation modules PE as the output results of the general computation circuit.
6. The general calculation circuit of a neural network accelerator as claimed in claim 2, characterized in that the cascaded-PE fully connected configuration is:
taking the cascade output of the cascade adder in the (i-1)-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
when i = 1, taking the cascade output of the cascade adder in the m-th general computation module PE in the previous cycle as the cascade input of the cascade adder in the i-th general computation module PE;
and taking the output result of the m-th general computation module PE as the output result of the general computation circuit.
CN201911055499.5A 2019-10-31 2019-10-31 General calculation circuit of neural network accelerator Active CN110807522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911055499.5A CN110807522B (en) 2019-10-31 2019-10-31 General calculation circuit of neural network accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911055499.5A CN110807522B (en) 2019-10-31 2019-10-31 General calculation circuit of neural network accelerator

Publications (2)

Publication Number Publication Date
CN110807522A true CN110807522A (en) 2020-02-18
CN110807522B CN110807522B (en) 2022-05-06

Family

ID=69489925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911055499.5A Active CN110807522B (en) 2019-10-31 2019-10-31 General calculation circuit of neural network accelerator

Country Status (1)

Country Link
CN (1) CN110807522B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN112580787A (en) * 2020-12-25 2021-03-30 北京百度网讯科技有限公司 Data processing method, device and equipment of neural network accelerator and storage medium
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112965931A (en) * 2021-02-22 2021-06-15 北京微芯智通科技合伙企业(有限合伙) Digital integration processing method based on CNN cell neural network structure
CN113095495A (en) * 2021-03-29 2021-07-09 上海西井信息科技有限公司 Control method of convolutional neural network module
WO2021248540A1 (en) * 2020-06-11 2021-12-16 杭州知存智能科技有限公司 Data loading circuit and method
US11977969B2 (en) 2020-06-11 2024-05-07 Hangzhou Zhicun Intelligent Technology Co., Ltd. Data loading

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN108805266A (en) * 2018-05-21 2018-11-13 南京大学 A kind of restructural CNN high concurrents convolution accelerator
CN109543140A (en) * 2018-09-20 2019-03-29 中国科学院计算技术研究所 A kind of convolutional neural networks accelerator
CN109726806A (en) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 Information processing method and terminal device
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN109726806A (en) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 Information processing method and terminal device
CN108805266A (en) * 2018-05-21 2018-11-13 南京大学 A kind of restructural CNN high concurrents convolution accelerator
CN109543140A (en) * 2018-09-20 2019-03-29 中国科学院计算技术研究所 A kind of convolutional neural networks accelerator
CN109886400A (en) * 2019-02-19 2019-06-14 合肥工业大学 The convolutional neural networks hardware accelerator system and its calculation method split based on convolution kernel

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
余子健: "Convolutional Neural Network Accelerator Based on FPGA", China Master's Theses Full-text Database, Information Science and Technology Series *
朱智洋: "Design and Optimization of a CNN Accelerator Based on Approximate Computing and Data Scheduling", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021248540A1 (en) * 2020-06-11 2021-12-16 杭州知存智能科技有限公司 Data loading circuit and method
CN113807506A (en) * 2020-06-11 2021-12-17 杭州知存智能科技有限公司 Data loading circuit and method
US11977969B2 (en) 2020-06-11 2024-05-07 Hangzhou Zhicun Intelligent Technology Co., Ltd. Data loading
CN111610963A (en) * 2020-06-24 2020-09-01 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
CN112580787A (en) * 2020-12-25 2021-03-30 北京百度网讯科技有限公司 Data processing method, device and equipment of neural network accelerator and storage medium
CN112580787B (en) * 2020-12-25 2023-11-17 北京百度网讯科技有限公司 Data processing method, device and equipment of neural network accelerator and storage medium
CN112862091A (en) * 2021-01-26 2021-05-28 合肥工业大学 Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN112965931A (en) * 2021-02-22 2021-06-15 北京微芯智通科技合伙企业(有限合伙) Digital integration processing method based on CNN cell neural network structure
CN113095495A (en) * 2021-03-29 2021-07-09 上海西井信息科技有限公司 Control method of convolutional neural network module
CN113095495B (en) * 2021-03-29 2023-08-25 上海西井科技股份有限公司 Control Method of Convolutional Neural Network Module

Also Published As

Publication number Publication date
CN110807522B (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN110807522B (en) General calculation circuit of neural network accelerator
CN107862374B (en) Neural network processing system and processing method based on assembly line
CN107844826B (en) Neural network processing unit and processing system comprising same
CN110458279B (en) FPGA-based binary neural network acceleration method and system
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN108090565A (en) Accelerated method is trained in a kind of convolutional neural networks parallelization
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN108629406B (en) Arithmetic device for convolutional neural network
CN110163357A (en) A kind of computing device and method
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN113344179B (en) IP core of binary convolution neural network algorithm based on FPGA
CN111126569B (en) Convolutional neural network device supporting pruning sparse compression and calculation method
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN110163350A (en) A kind of computing device and method
Hao A general neural network hardware architecture on FPGA
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
Domingos et al. An efficient and scalable architecture for neural networks with backpropagation learning
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN114065923A (en) Compression method, system and accelerating device of convolutional neural network
CN110807479A (en) Neural network convolution calculation acceleration method based on Kmeans algorithm
CN110765413A (en) Matrix summation structure and neural network computing platform
Alaeddine et al. A Pipelined Energy-efficient Hardware Accelaration for Deep Convolutional Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant