CN113988280B - Array computing accelerator architecture based on binarization neural network


Info

Publication number
CN113988280B
Authority
CN
China
Prior art keywords
matrix
module
neural network
data
array
Prior art date
Legal status
Active
Application number
CN202111245183.XA
Other languages
Chinese (zh)
Other versions
CN113988280A (en)
Inventor
胡绍刚
李天琛
乔冠超
于奇
刘洋
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202111245183.XA
Publication of CN113988280A
Application granted
Publication of CN113988280B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The invention belongs to the technical field of integrated circuits and neural networks, and particularly relates to an array computing accelerator architecture based on a binarization neural network. In the invention, the processing units in the computing core use a 2-to-1 selector (multiplexer) in place of a multi-bit multiplier to accelerate the computation of the FC layers of a binarized neural network, which greatly reduces the chip's storage and computation area, the computation delay and the power consumption; at the same time, a configurable neural network function module is integrated inside, which satisfies the computation requirements of most current neural network algorithm models and enhances the generality of the accelerator.

Description

Array computing accelerator architecture based on binarization neural network
Technical Field
The invention belongs to the technical field of integrated circuits and neural networks, and particularly relates to an array computing accelerator architecture based on a binarization neural network.
Background
With continuing optimization of integrated circuit design and improvement of integrated circuit process technology, the performance of contemporary processors and memories has been greatly improved, yet the overall performance improvement of computers has now run into a bottleneck. The von Neumann architecture, as the classical computer architecture, limits the data interaction capability between the processor and the memory, a bottleneck that shows up clearly in neural network model computation. Neural network models are characterized by large numbers of parameters and many multiply-add operations; when such a model is computed, the parameters are frequently fetched from memory to the processor, which significantly increases the power consumption overhead of the computer.
As the current mainstream platforms for neural network inference and training, the rapid development of the CPU (Central Processing Unit) and the GPU (Graphics Processing Unit) has driven research on neural network algorithms. A GPU integrates a large number of parallel ALUs (Arithmetic Logic Units) to meet the demand for computing power, but its excessive computing power consumption remains a difficult problem, and the bottleneck caused by its architecture has become a challenge and an obstacle to bringing artificial intelligence to mobile devices. Therefore, some researchers have turned their attention to designing custom accelerator chips dedicated to neural networks. In 2016, Google proposed an ASIC (Application Specific Integrated Circuit) chip for neural network computation, the first-generation TPU (Tensor Processing Unit), to optimize its own machine learning framework. The first-generation TPU supports only deep learning inference, and a systolic array architecture is adopted in its computing core; by arranging the computing modules in an array, this architecture achieves high core throughput and has the advantages of reusing input data, highly parallel pipelining, a simple and regular structure, regular data flow, and the like. Compared with the CPUs and GPUs of the time, the architecture improved the computation speed by roughly 15 to 30 times and the energy efficiency ratio by roughly 30 to 80 times.
In addition, researchers have also begun to start from the neural network algorithm side, selecting lightweight neural network models to reduce hardware overhead. For example, binarized neural networks (in which the input image or the weight parameters are converted to 0 or 1) have achieved high recognition accuracy, and research on SNNs (Spiking Neural Networks) based on biological neurons has attracted wide attention in recent years; these theoretical developments all lay the foundation for designing high-performance dedicated neural network accelerators.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an array computing accelerator architecture based on a binarized neural network, aiming at the problems of high power consumption, low speed and poor reconfigurability in current neural network model computation.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
an array computing accelerator architecture based on a binarized neural network, used for computing two matrices, wherein one matrix is a 0-1 matrix, the architecture being characterized by comprising a matrix operation control module, a first sending module, a second sending module, a selection-accumulation array calculation module, a first receiving module, a second receiving module and a neural network function module;
the matrix operation control module is used for receiving the scale of a calculated matrix input from the outside, recording the address of a matrix block calculated by the current array according to a matrix blocking algorithm, transmitting the matrix scale and the address of the currently calculated matrix block to the first transmitting module, the second transmitting module and the neural network function module, and simultaneously transmitting an enabling signal to the first transmitting module, the second transmitting module and the neural network function module; the matrix operation control module also receives externally input neural network configuration information and sends the information to the neural network function module;
the first transmitting module receives and stores all elements of the externally input 0-1 matrix; correspondingly, the first transmitting module receives the scale and the matrix block address of the 0-1 matrix among the operated matrices sent by the matrix operation control module, transmits the elements of the matrix block at the corresponding address to the selection-accumulation array calculation module under the control of the enable signal, and, after the operation of one matrix block is completed, transmits an identifier to the selection-accumulation array calculation module so that the latter outputs the calculation result;
the second transmitting module receives the externally input real matrix; correspondingly, it receives the scale and the matrix block address of the real matrix among the operated matrices sent by the matrix operation control module, stores the elements within the matrix block at the corresponding address, transmits the matrix block elements of the corresponding address to the selection-accumulation array calculation module under the control of the enable signal, and, after the operation of one matrix block is completed, transmits 0 to the selection-accumulation array calculation module so that the corresponding registers in the selection-accumulation array calculation module are initialized to 0;
the selection-accumulation array calculation module is formed by arraying n rows and n columns of processing units, with the processing units in each row and each column connected end to end in sequence; each processing unit has a row input port and a row output port, defined as the first input port and the first output port, and a column input port and a column output port, defined as the second input port and the second output port; the processing unit comprises a first selector, a second selector, an adder and a register; the first selector first examines the data on the first input port: if the data is the identifier, the accumulated sum in the register is output from the second output port, the data received on the second input port is stored into the register, and at the same time the first output port outputs the identifier; if the data is not the identifier, the second selector examines the single-bit data from the 0-1 matrix: if it is 1, the multi-bit data from the real matrix is sent to the adder, the adder adds it to the data in the register and writes the result back to the register, the register being responsible for storing the accumulated partial sum; if it is 0, the computation is skipped;
the first receiving module is used for receiving the data output by the first output port of the array computing module, and the inside of the first receiving module is composed of n registers, wherein the ith register is connected with the first output port of the ith row processing unit in the nth column in the array. Correspondingly, each register is responsible for reading an element of a 0-1 matrix transmitted by a first input port and an output port of a processing unit in one clock period in the matrix calculation process; after completing a matrix block operation, it is responsible for reading an identifier transferred by the first input and output ports of the processing unit in one clock cycle.
The second receiving module is used for receiving the data output by the second output port of the array computing module, and the inside of the second receiving module is composed of n registers, wherein the ith register is connected with the second output port of the ith processing unit in the nth row and the ith column in the array. Correspondingly, each register is responsible for reading an element of a real matrix transmitted by a second input port and an output port of the processing unit in one clock period in the matrix calculation process; after finishing a matrix block operation, the method is responsible for reading a calculation result transmitted by a second input port and an output port of the processing unit in one clock period, and sending the result data to the neural network functional module for caching.
The neural network function module stores the corresponding calculation result obtained by the second receiving module according to the received matrix scale and the current calculation matrix block address, and after all matrix block calculation is completed, the internal sub-neural network module is called to perform data processing and output on the calculation result according to the received configuration information sent by the matrix operation control module, wherein the sub-neural network module at least comprises a batch normalization module, a pooling module and a ReLU module.
The beneficial effects of the invention are as follows: the processing units in the computing core use a 2-to-1 selector in place of a multi-bit multiplier to accelerate the computation of the FC layers of a binarized neural network, which greatly reduces the chip's storage and computation area, the computation delay and the power consumption; at the same time, a configurable neural network function module is integrated inside, which satisfies the computation requirements of most current neural network algorithm models and enhances the generality of the accelerator.
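As an end-to-end illustration of this acceleration principle, the following self-contained Python sketch computes an FC layer of a binarized neural network as a select-accumulate product of a 0-1 matrix with a real matrix, followed by an assumed ReLU post-processing step (block partitioning and systolic timing are omitted; the function and variable names are illustrative only):

    import numpy as np

    def bnn_fc_layer(A01, W):
        # Software model of the accelerated FC layer: W[t, c] is added only
        # where A01[r, t] == 1, so no multiplication is ever performed;
        # the ReLU at the end stands in for the configurable post-processing.
        i, k = A01.shape
        _, j = W.shape
        C = np.zeros((i, j))
        for r in range(i):
            for c in range(j):
                acc = 0.0
                for t in range(k):
                    if A01[r, t] == 1:   # selection replaces multiplication
                        acc += W[t, c]
                C[r, c] = acc
        return np.maximum(C, 0)          # assumed post-processing (ReLU here)

    A01 = np.random.randint(0, 2, (4, 8))    # binarized activations (0-1 matrix)
    W = np.random.randn(8, 3)                # real-valued weight matrix
    assert np.allclose(bnn_fc_layer(A01, W), np.maximum(A01 @ W, 0))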
Drawings
FIG. 1 is a schematic diagram of a prior art calculation process using a 3×3 scale systolic array for two 3×3 matrices;
FIG. 2 is a schematic diagram of a prior-art calculation of two 3×3 matrices using a 3×3 systolic array, recording the results over 7 clock cycles;
FIG. 3 is a schematic diagram of the operation of a TPU compute core in the prior art;
FIG. 4 is a schematic diagram illustrating a matrix block multiplication algorithm employed in an embodiment of the present invention;
FIG. 5 is a block diagram of an array computing accelerator architecture based on a binarized neural network according to an embodiment of the present invention;
FIG. 6 is a block diagram of a processing unit according to an embodiment of the present invention;
FIG. 7 is a block diagram of the second transmitting module, which receives real matrix elements, according to an embodiment of the present invention;
FIG. 8 is a block diagram of the first transmitting module, which receives 0-1 matrix elements, according to an embodiment of the present invention;
FIG. 9 is a block diagram of the neural network functional module according to an embodiment of the present invention;
FIG. 10 is a timing diagram of part of the input and output data interfaces of the select-accumulate array computing module according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described in detail below with reference to the accompanying drawings:
At present, to perform matrix multiplication, the processing units in a computing array typically need to consist of a multiplier, an adder and a register, which perform multiplication, accumulation of partial sums, and storage respectively. However, implementing the multiplier occupies a large chip area, and performing the multiplication introduces a certain delay and considerable power consumption, all of which limit the performance improvement of the computing array to a certain extent.
The calculation performed by the FC (Fully Connected) layer in a neural network is essentially a matrix multiplication, and for this calculation the computing core operates by block matrix multiplication. The method is as follows: taking as an example the multiplication of an i-row, k-column matrix A_{i×k} by a k-row, j-column matrix B_{k×j} to obtain the result matrix C_{i×j}, A and B can be partitioned into row blocks and column blocks respectively, denoted A_{i×k} = [A_1^T, A_2^T, ..., A_n^T]^T and B_{k×j} = [B_1, B_2, ..., B_m]; at the same time, the width of a matrix block should be as close as possible to the width of the computing array so as to maintain the efficiency of the computing core. The computing core then calls different combinations of operated matrix blocks to obtain the different result matrix blocks, i.e. it completes the process C_{x,y} = A_x × B_y. During the calculation, a given row of processing units in the computing core each read a column of elements of the matrix block A_x and pass them downward, and a given column of processing units each read elements of the matrix block B_y and pass them to the left; the accumulation is performed inside the processing units according to the matrix elements read, until all elements have been read. For the calculation result to be correct, the interfaces of the processing units should block, i.e. a subsequent operation is performed only after two elements, one from matrix A and one from matrix B, have each been completely received. After x and y have been traversed, all the result matrix blocks are stitched together to obtain the result matrix C_{i×j}. During the calculation, if a certain row block or column block of the result matrix is needed, the operated matrix elements called include all elements of one matrix and a certain row block or column block of the other matrix. For example, for the s-th row block C_s of the result matrix, the operated matrix elements called include the elements within the block A_s and all elements of matrix B. From the hardware perspective, opening temporary buffers only for the row blocks of matrix A and the elements of matrix B saves a large amount of storage space compared with buffering all elements; moreover, in the binarized neural network one of the operated matrices is a 0-1 matrix, so if matrix B is assumed to be the 0-1 matrix and stored as such, the reduced storage bit width further lowers the cost of the data buffer.
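For clarity, the block-partitioned multiplication described above can be modelled in software as in the following Python sketch (the function name blocked_matmul and the tile parameter are illustrative only and do not correspond to the hardware interfaces):

    import numpy as np

    def blocked_matmul(A, B, tile=4):
        # Software model of the block-partitioned multiplication: A is split
        # into row blocks A_x, B into column blocks B_y, and each result block
        # C_{x,y} = A_x * B_y is computed independently, the way a computing
        # array of width `tile` would.
        i, k = A.shape
        k2, j = B.shape
        assert k == k2, "inner dimensions must match"
        C = np.zeros((i, j), dtype=float)
        for x in range(0, i, tile):              # row block A_x
            for y in range(0, j, tile):          # column block B_y
                A_x = A[x:x + tile, :]
                B_y = B[:, y:y + tile]
                C[x:x + tile, y:y + tile] = A_x @ B_y   # one result block C_{x,y}
        return C

    # quick check against a direct multiplication
    A = np.random.randint(0, 2, (6, 8))          # 0-1 matrix
    B = np.random.randn(8, 5)                    # real matrix
    assert np.allclose(blocked_matmul(A, B, tile=3), A @ B)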
To perform the multiply-accumulate operations between two real matrices, the processing units in a conventional processing-unit array usually contain multipliers. Implementing a multi-bit multiplier requires a large number of logic gates, occupies a large chip area, and introduces high delay and power consumption, which is unfavorable for hardware implementation. Based on the principle of the binarized neural network, the present accelerator converts the multiplication of two real matrices in a traditional neural network into the multiplication of a 0-1 matrix and a real matrix, so that the multiply-accumulate operation becomes a select-accumulate operation; when the algorithm is mapped to hardware, the multi-bit multiplier in the processing unit can therefore be replaced by a 2-to-1 selector. Compared with a multi-bit multiplier, a 2-to-1 selector has a small area, low power consumption and high speed, so an array computing accelerator using processing units improved according to the binarized neural network algorithm can be optimized in performance, area and power consumption.
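The following minimal Python sketch illustrates why, once one operand is a single bit, the multiply-accumulate operation degenerates into a select-accumulate operation; it is a behavioural illustration, not the hardware circuit:

    def select_accumulate(a_bits, b_vals):
        # Because one operand is a single bit, a*b + acc degenerates into
        # "add b if a == 1, else skip": a 2-to-1 selection followed by an
        # addition, with no multiplier needed.
        acc = 0.0                    # the partial-sum register
        for a, b in zip(a_bits, b_vals):
            if a == 1:               # the second 2-to-1 selector
                acc += b             # the adder updates the register
            # a == 0: the computation is skipped entirely
        return acc

    # dot product of a 0-1 row with a real column, computed without multiplication
    assert select_accumulate([1, 0, 1, 1], [0.5, 2.0, -1.0, 3.0]) == 2.5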
In a neural network model, besides the most common matrix multiplication operation, a series of algorithmic processing steps such as pooling are also required. Traditional neural network accelerators rarely integrate dedicated circuits for these operations and instead rely on a general-purpose processor for the data processing, which affects the computation time and power consumption to a certain extent. The invention integrates a configurable neural network function module in the accelerator architecture, which calls different function blocks inside it according to the coded instructions delivered by the control module and applies different neural network processing to the result matrix. With this architecture, multiple configurable data processing modes are available while the computing speed is improved, an improvement in both performance and generality compared with processing on a traditional general-purpose processor.
The invention provides an array computing accelerator architecture based on a binarized neural network, the general structure of which is shown in fig. 5, comprising:
the system comprises a matrix operation control module, a selection-accumulation array calculation module, a first sending module, a second sending module, a first receiving module, a second receiving module and a neural network function module.
The matrix operation control module is used for reading the matrix scale and the configuration parameters, and adopts block matrix multiplication. According to the operation scale, it records the addresses of the different result matrix blocks and sends the block address being operated on to each module so that the computing position is located; after the matrix operation is completed, it sends the configuration parameters to the configurable neural network function module. The matrix operation control module is composed of the following devices and performs the following tasks: three registers store the scales i, j and k of the two operated matrices sent by the host computer; according to the matrix blocking algorithm, the block addresses x and y currently being computed by the array are recorded, and two comparators compare these addresses with the matrix scale, so that when the computation of one matrix block is completed, the addresses are advanced to those of the next matrix block. The matrix scales i, j, k and the block addresses x, y are sent to the first and second sending modules so that they load the matrix blocks A_x and B_y corresponding to the addresses into their buffers, and an enable signal is sent to make them output the buffered data; at the same time, the addresses are sent to the neural network function module so that the calculation result C_{x,y} is temporarily stored in the corresponding buffer. Finally, the module receives the configuration information sent by the host computer and forwards it to the neural network function module, which then performs the neural network processing on the data.
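A minimal software model of this block-address bookkeeping might look as follows; the traversal order (column blocks first, then row blocks) and the function name advance_block are assumptions made for illustration:

    def advance_block(x, y, n_row_blocks, n_col_blocks):
        # After block C_{x,y} finishes, the two comparators check the block
        # indices against the matrix scale and either step to the next block
        # or report completion.
        if y + 1 < n_col_blocks:     # more column blocks B_y remain for this A_x
            return x, y + 1, False
        if x + 1 < n_row_blocks:     # move on to the next row block A_x
            return x + 1, 0, False
        return x, y, True            # all result blocks have been computed

    x, y, done = 0, 0, False
    while not done:
        # ... broadcast (x, y) to the sending/receiving modules and run one block ...
        x, y, done = advance_block(x, y, n_row_blocks=2, n_col_blocks=3)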
The select-accumulate array computing module is used to receive the input operands sent by the sending modules, perform the matrix multiplication of the binarized neural network, and send the consumed input operands and the computed result matrix elements to the receiving modules.
The select-accumulate array computing module is formed by arraying n rows and n columns of processing units, with the processing units in each row and each column connected end to end in sequence. Each processing unit has a row input port and a row output port, defined as the first input port and the first output port respectively, and a column input port and a column output port, defined as the second input port and the second output port respectively; its general structure is shown in fig. 6 and comprises two 2-to-1 selectors, an adder and a register. The first 2-to-1 selector first examines the data on the first input port: if the data is the identifier, the accumulated sum in the register is output from the second output port, the data received on the second input port is stored into the register, and at the same time the first output port forwards the identifier; if the data is not the identifier, the second 2-to-1 selector examines the single-bit data from the 0-1 matrix: if it is 1, the multi-bit data from the real matrix is sent to the adder, the adder adds it to the data in the register, and the result is written back to the register, which is responsible for storing the accumulated partial sum; if it is 0, the computation is skipped.
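Purely as a behavioural illustration, the per-cycle action of one processing unit can be modelled as follows; the identifier encoding and the method name step are assumptions made for the sketch:

    class ProcessingUnit:
        # Cycle-level sketch of one processing unit.
        IDENTIFIER = "ID"            # assumed marker for the end of a matrix block

        def __init__(self):
            self.reg = 0.0           # partial-sum register

        def step(self, in1, in2):
            # in1: first (row) input port, single bit or identifier
            # in2: second (column) input port, real-matrix element or init data
            # returns (out1, out2) for the first and second output ports
            if in1 == self.IDENTIFIER:        # first selector: output/initialise path
                out2 = self.reg               # push the accumulated sum downstream
                self.reg = in2                # load the initialisation data (0)
                return self.IDENTIFIER, out2  # forward the identifier
            if in1 == 1:                      # second selector: accumulate path
                self.reg += in2               # adder: partial sum += real element
            # in1 == 0: the addition is skipped
            return in1, in2                   # pass both operands to the neighbours

    pe = ProcessingUnit()
    pe.step(1, 0.5); pe.step(0, 2.0); pe.step(1, 3.0)
    assert pe.step("ID", 0.0) == ("ID", 3.5)  # the accumulated sum is flushed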
At least two input-operand storage and sending modules are provided, namely the first sending module and the second sending module, whose general structures are shown in fig. 8 and fig. 7 respectively; they are used to temporarily store the elements of the 0-1 matrix blocks and of the real matrix blocks to be operated on, and to send the matrix elements needed for the calculation to the select-accumulate array computing module for neural network model computation.
and the first sending module internally adopts a RAM to store data. The first transmitting module receives a 0-1 matrix A in two operated matrixes transmitted by the control module i×k Scales i and k of (a) and will be sent from the upper computerAll 0-1 matrix elements sent are stored. According to the enabling signal and address of the control module, a matrix block A of the corresponding address is sent to the array computing module x After completing a matrix block operation, sending an identifier to the array calculation module to enable the array calculation module to output a calculation result.
The second sending module likewise stores its data in a RAM. It receives the scale of the real matrix B_{k×j} among the two operated matrices, sent by the control module, and stores the elements within one matrix block B_y of the real matrix sent by the host computer. According to the enable signal and the address from the control module, it sends the matrix block elements of the corresponding address to the array computing module, and after the operation of one matrix block is completed it sends 0 to the array computing module so that the registers in the processing units of the array computing module are initialized to 0.
The first receiving module uses n registers to continuously read the data output by the array computing module that originated from the first sending module. While the array computing module performs the matrix operation, the first receiving module receives the elements of the 0-1 matrix block A_x, ensuring that the A_x elements can flow through the array computing module in a systolic manner and complete the computation process; while the array computing module is being initialized, the first receiving module receives the identifiers, ensuring that the identifiers can flow through the array computing module in a systolic manner and complete the initialization process.
The second receiving module uses n registers to continuously read the data output by the array computing module that originated from the second sending module. While the array computing module performs the matrix operation, the second receiving module receives the elements of the real matrix block B_y, ensuring that the B_y elements can flow through the array computing module in a systolic manner and complete the computation process; while the array computing module is being initialized, the second receiving module receives the elements of the calculation result C_{x,y} and sends them to the neural network function module, ensuring that the initialization data 0 can flow through the array computing module in a systolic manner and complete the initialization process.
The general structure of the configurable neural network function module is shown in fig. 9. It is used to receive the result matrix block elements sent by the data receiving module and to temporarily store the matrix block in its RAM according to the computation position sent by the matrix operation control module. Specifically, it receives the operated matrix scales i, j and k and the block addresses x and y currently being computed by the array, sent by the matrix operation control module, and temporarily stores the calculation result C_{x,y} sent by the second receiving module into the RAM according to the address of the current matrix block. After the complete matrix operation is finished, neural network computation is performed according to the configuration parameters sent by the matrix operation control module, such as BN layer operations and ReLU layer operations, and finally the result is output.
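The dispatch performed by the configurable neural network function module can be sketched in software as follows; the configuration codes and parameter names are assumptions, standing in for the coded instructions sent by the control module:

    import numpy as np

    def postprocess(C, mode, gamma=1.0, beta=0.0, mean=0.0, var=1.0, eps=1e-5):
        # Illustrative dispatch over the sub-modules of the configurable
        # neural network function module.
        if mode == "none":
            return C
        if mode == "bn":                       # batch-normalization sub-module
            return gamma * (C - mean) / np.sqrt(var + eps) + beta
        if mode == "relu":                     # ReLU sub-module
            return np.maximum(C, 0)
        if mode == "maxpool2x2":               # pooling sub-module (2x2, stride 2)
            i, j = C.shape
            C = C[:i - i % 2, :j - j % 2]
            return C.reshape(i // 2, 2, j // 2, 2).max(axis=(1, 3))
        raise ValueError("unknown configuration code: " + mode)

    assert (postprocess(np.array([[-1.0, 2.0]]), "relu") == np.array([[0.0, 2.0]])).all()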
As shown in fig. 5, the array computing accelerator architecture based on a binarized neural network provided by the invention operates through the following steps:
step 1, the matrix operation control module receives the scales i, j, k of the two operated matrices A_{i×k} and B_{k×j} and the neural network computation configuration parameters, records the block addresses x and y of the two operated matrices based on block matrix multiplication, and broadcasts the addresses to the first and second sending modules and the first and second receiving modules, so that the modules can process the computation of one matrix block C_{x,y} synchronously;
step 2, the first sending module receives all elements of the 0-1 matrix A_{i×k} and stores them in its RAM, and, according to the address x sent by the matrix operation control module, serially sends the elements of the corresponding matrix block A_x to the select-accumulate array computing module for computation through several parallel interfaces;
step 3, the second sending module receives the elements within one matrix block B_m of the real matrix B_{k×j} and stores them in its RAM, and serially sends them to the select-accumulate array computing module for computation through several parallel interfaces; when the operation on one row of matrix blocks is completed, the data in the RAM is updated according to the control instruction sent by the matrix operation control module, i.e. the elements of the next matrix block B_{m+1} are stored;
step 4, the select-accumulate array computing module receives the serial data through several parallel interfaces, and the processing units arranged in its internal array process the data with high parallelism and in a highly pipelined manner;
step 5, a processing unit receives the data sent by the first and second sending modules; the first 2-to-1 selector determines that the data on the first input port is single-bit data rather than the identifier, and the second 2-to-1 selector examines the single-bit data from the 0-1 matrix block A_x: if it is 1, the multi-bit data from the real matrix block B_y on the second input port is sent to the adder and accumulated with the partial sum in the register; if it is 0, the computation is skipped; the single-bit data is output from the first output port and the multi-bit data from the second output port;
step 6, after the first and second sending modules have finished sending all elements, the first sending module sends identifiers to the array computing module and the second sending module sends the initialization data 0 to the array computing module, so that the computation results in the array computing module are output and the array is initialized;
step 7, a processing unit receives the data sent by the first and second sending modules; the first 2-to-1 selector determines that the data on the first input port is the identifier, outputs the identifier from the first output port, outputs the data in the register from the second output port, and initializes the register with the initialization data 0 received on the second input port;
step 8, the first and second receiving modules receive the consumed matrix block A_x and B_y elements output by the array computing module; after the operation of one matrix block C_{x,y} is completed and the array computing module outputs the computation results, the second receiving module sends the computation results to the neural network function module for buffering;
step 9, the matrix operation control module records the matrix block computation addresses and compares them with the operated matrix scale to determine the state of the computation: if all matrix blocks have been computed, step 10 is performed; otherwise the flow returns to step 1;
step 10, after the computation of all matrix blocks is completed, the neural network function module buffers the computation results received from the second receiving module into its RAM in order, according to the addresses x and y sent by the matrix operation control module; finally, the matrix operation control module sends the neural network computation configuration parameters to the neural network function module, which processes the result matrix according to these configuration parameters, the options including, besides no operation, BN layer operations, pooling operations, ReLU layer operations, an IF neuron model and the like, and the result after neural network computation is finally output.

Claims (1)

1. An array computing accelerator architecture based on a binarized neural network, used for computing two matrices, wherein one matrix is a 0-1 matrix, the architecture being characterized by comprising a matrix operation control module, a first sending module, a second sending module, a selection-accumulation array calculation module, a first receiving module, a second receiving module and a neural network function module;
the matrix operation control module is used for receiving the scale of a calculated matrix input from the outside, recording the address of a matrix block calculated by the current array according to a matrix blocking algorithm, transmitting the matrix scale and the address of the currently calculated matrix block to the first transmitting module, the second transmitting module and the neural network function module, and simultaneously transmitting an enabling signal to the first transmitting module, the second transmitting module and the neural network function module; the matrix operation control module also receives externally input neural network configuration information and sends the information to the neural network function module;
the first transmitting module receives and stores all elements of the externally input 0-1 matrix; correspondingly, the first transmitting module receives the scale and the matrix block address of the 0-1 matrix among the operated matrices sent by the matrix operation control module, transmits the elements of the matrix block at the corresponding address to the selection-accumulation array calculation module under the control of the enable signal, the data being received by the first input ports of the processing units; after the operation of one matrix block is completed, it sends an identifier to the selection-accumulation array calculation module so that the latter outputs the calculation result;
the second transmitting module receives an external input real number matrix, correspondingly receives the scale and matrix block address of the real number matrix in the operated matrix transmitted by the matrix operation control module, stores elements in a matrix block in the corresponding address, transmits matrix block elements of the corresponding address to the selection-accumulation array calculation module under the control of an enabling signal, and receives data from a second input port of the processing unit; after finishing a matrix block operation, transmitting 0 to the selection-accumulation array calculation module, and initializing a corresponding register in the selection-accumulation array calculation module to 0;
the selection-accumulation array calculation module is formed by arraying n rows and n columns of processing units, with the processing units in each row and each column connected end to end in sequence; each processing unit has a row input port and a row output port, defined as the first input port and the first output port, and a column input port and a column output port, defined as the second input port and the second output port; the processing unit comprises a first selector, a second selector, an adder and a register; the first selector first examines the data on the first input port: if the data is the identifier, the accumulated sum in the register is output from the second output port, the data received on the second input port is stored into the register, and at the same time the first output port outputs the identifier; if the data is not the identifier, the second selector examines the single-bit data from the 0-1 matrix: if it is 1, the multi-bit data from the real matrix is sent to the adder, the adder adds it to the data in the register and writes the result back to the register, the register being responsible for storing the accumulated partial sum; if it is 0, the computation is skipped;
the first receiving module is used for receiving the data output from the first output ports of the array calculation module; it is internally composed of n registers, the ith register being connected to the first output port of the processing unit in the ith row of the nth column of the array, and correspondingly, during matrix calculation each register is responsible for reading, in one clock cycle, an element of the 0-1 matrix transferred through the first input and output ports of the processing unit; after the operation of one matrix block is completed, it is responsible for reading, in one clock cycle, an identifier transferred through the first input and output ports of the processing unit;
the second receiving module is used for receiving the data output from the second output ports of the array calculation module; it is internally composed of n registers, the ith register being connected to the second output port of the processing unit in the nth row and ith column of the array, and correspondingly, during matrix calculation each register is responsible for reading, in one clock cycle, an element of the real matrix transferred through the second input and output ports of the processing unit; after the operation of one matrix block is completed, it is responsible for reading, in one clock cycle, a calculation result transferred through the second input and output ports of the processing unit and sending the result data to the neural network function module for buffering;
the neural network function module stores the corresponding calculation result obtained by the second receiving module according to the received matrix scale and the current calculation matrix block address, and after all matrix block calculation is completed, the internal sub-neural network module is called to perform data processing and output on the calculation result according to the received configuration information sent by the matrix operation control module, wherein the sub-neural network module at least comprises a batch normalization module, a pooling module and a ReLU module.
CN202111245183.XA 2021-10-26 2021-10-26 Array computing accelerator architecture based on binarization neural network Active CN113988280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111245183.XA CN113988280B (en) 2021-10-26 2021-10-26 Array computing accelerator architecture based on binarization neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111245183.XA CN113988280B (en) 2021-10-26 2021-10-26 Array computing accelerator architecture based on binarization neural network

Publications (2)

Publication Number Publication Date
CN113988280A CN113988280A (en) 2022-01-28
CN113988280B true CN113988280B (en) 2023-05-05

Family

ID=79741340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111245183.XA Active CN113988280B (en) 2021-10-26 2021-10-26 Array computing accelerator architecture based on binarization neural network

Country Status (1)

Country Link
CN (1) CN113988280B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832804A (en) * 2017-10-30 2018-03-23 上海寒武纪信息科技有限公司 A kind of information processing method and Related product

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10831444B2 (en) * 2016-04-04 2020-11-10 Technion Research & Development Foundation Limited Quantized neural network training and inference

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832804A (en) * 2017-10-30 2018-03-23 上海寒武纪信息科技有限公司 A kind of information processing method and Related product

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Guanchao Qiao et al. A Low-bit And Data-conversion-free Memristive Spiking Computing Network. Journal of Physics: Conference Series. 2020, pp. 1-7. *
乔冠超 (Guanchao Qiao). Research on compaction methods for neuromorphic systems. China Doctoral Dissertations Full-text Database (Information Science and Technology). 2023, I135-52. *
殷金伟 (Jinwei Yin). A convolutional neural network acceleration design based on small and medium-scale FPGAs. China Masters' Theses Full-text Database (Information Science and Technology). 2020, I135-841. *

Also Published As

Publication number Publication date
CN113988280A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN111626414B (en) Dynamic multi-precision neural network acceleration unit
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN110738308B (en) Neural network accelerator
CN111242289A (en) Convolutional neural network acceleration system and method with expandable scale
CN111210019B (en) Neural network inference method based on software and hardware cooperative acceleration
CN110674927A (en) Data recombination method for pulse array structure
CN110851779B (en) Systolic array architecture for sparse matrix operations
CN114781632A (en) Deep neural network accelerator based on dynamic reconfigurable pulse tensor operation engine
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
CN111860773B (en) Processing apparatus and method for information processing
CN111582465B (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
CN114356836A (en) RISC-V based three-dimensional interconnected many-core processor architecture and working method thereof
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
Shabani et al. Hirac: A hierarchical accelerator with sorting-based packing for spgemms in dnn applications
CN113516236A (en) VGG16 network parallel acceleration processing method based on ZYNQ platform
CN113988280B (en) Array computing accelerator architecture based on binarization neural network
Guo et al. A high-efficiency fpga-based accelerator for binarized neural network
CN110765413B (en) Matrix summation structure and neural network computing platform
CN113392959A (en) Method for reconstructing architecture in computing system and computing system
CN115719088B (en) Intermediate cache scheduling circuit device supporting in-memory CNN
US20230004385A1 (en) Accelerated processing device and method of sharing data for machine learning
Zhang et al. Machine Learning Hardware Design for Efficiency, Flexibility, and Scalability [Feature]
Xue et al. Customizing CMOS/ReRAM hybrid hardware architecture for spiking CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant