CN115965067A - Neural network accelerator for ReRAM - Google Patents

Neural network accelerator for ReRAM

Info

Publication number
CN115965067A
Authority
CN
China
Prior art keywords
calculation
data
reram
neural network
array
Prior art date
Legal status
Granted
Application number
CN202310049117.8A
Other languages
Chinese (zh)
Other versions
CN115965067B
Inventor
景乃峰
伍骏
董光达
熊大鹏
李涛
Current Assignee
Suzhou Yizhu Intelligent Technology Co ltd
Original Assignee
Suzhou Yizhu Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Yizhu Intelligent Technology Co ltd filed Critical Suzhou Yizhu Intelligent Technology Co ltd
Priority to CN202310049117.8A priority Critical patent/CN115965067B/en
Publication of CN115965067A publication Critical patent/CN115965067A/en
Application granted granted Critical
Publication of CN115965067B publication Critical patent/CN115965067B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a neural network accelerator based on ReRAM (resistive random access memory), belonging to the field of neural network accelerator design. The accelerator comprises a ReRAM in-situ calculation array, an input register, an accumulation buffer, a vector logic unit, a global buffer, a calculation control unit and a feature-data read/write DMA (direct memory access); the input register and the accumulation buffer are connected to the ReRAM in-situ calculation array. The invention adopts a novel neural-network weight-mapping method for inference calculation, together with efficient direct memory access and a flexible data-layout format, which improves the parallelism of the data flow and the calculation flow, solves the problem of data blocking during inference, and raises the data throughput of the architecture.

Description

Neural network accelerator for ReRAM
Technical Field
The invention relates to the field of neural network accelerator design, and in particular to a neural network accelerator based on ReRAM.
Background
With the rapid growth of data scale, computation-intensive neural network applications place ever higher demands on hardware computing power and data storage capacity. To break through the memory-access bandwidth bottleneck caused by the separation of computation and storage in the traditional von Neumann architecture, more and more research has focused on high-density computing-in-memory architectures: by tightly coupling computation and storage, they reduce the extra memory-access energy and bandwidth consumed by frequently moving data between storage and computing components, and thereby maximize the energy-efficiency ratio of the hardware architecture.
Resistive random access memory (ReRAM) is a new type of non-volatile memory that stores data as conductance values and organizes multiply-accumulate computation through Ohm's law, providing an in-situ computing capability that traditional memories do not possess. Traditional accelerators based on the von Neumann architecture must move both the weights (Weight) and the features (Feature) for computation. A ReRAM-based neural network accelerator instead converts the input vector into voltages applied to the word lines (Word-lines) of the ReRAM, while the other operand, the weights, is mapped in advance onto the conductance of each ReRAM cell; the accumulated current on each bit line (Bit-line) then represents the dot product of the two vectors. This realizes in-situ storage and processing and greatly relieves the data-bandwidth pressure, improving performance.
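To make this in-situ computing principle concrete, the following Python sketch gives a simplified behavioural model of a crossbar (all function and variable names are illustrative, and the model ignores device non-idealities such as wire resistance and ADC quantization): weights become cell conductances, the input vector becomes word-line voltages, and each bit-line current is the dot product of the input with one weight column.

```python
import numpy as np

def crossbar_mvm(weights, inputs):
    """Behavioural model of in-situ matrix-vector multiplication on a ReRAM
    crossbar (names and scaling are illustrative assumptions).

    weights: (rows, cols) array mapped to cell conductances G[i][j]
    inputs:  (rows,) vector applied as word-line voltages V[i]
    returns: (cols,) bit-line currents, each I[j] = sum_i V[i] * G[i][j]
    """
    G = np.asarray(weights, dtype=float)   # conductance at each cross point
    V = np.asarray(inputs, dtype=float)    # voltage on each word line
    # Ohm's law per cell (I = V * G) plus Kirchhoff summation along each bit line
    return V @ G

# A 3x2 toy crossbar: two weight columns share the same input vector.
W = np.array([[0.2, 0.5],
              [0.1, 0.3],
              [0.4, 0.0]])
x = np.array([1.0, 2.0, 3.0])
print(crossbar_mvm(W, x))   # [1.6 1.1] = dot products of x with each column
```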
Mainstream computing-in-memory accelerators currently adopt a hierarchical topology, dividing the circuit from top to bottom into functional modules of different scales. The smallest-granularity computation module in such an architecture is a ReRAM crossbar array plus an arithmetic logic unit; these basic units are combined into higher-level operation modules, which can in turn be combined with one another, and so on up to the top level of the architecture. According to how the modules are scheduled and driven, existing accelerator systems can be divided into two types: data-flow driven and instruction-flow driven.
In a data-flow-driven structure, the data access and interaction of each functional module during execution are completed under the control of a state machine, without any instruction intervention. In neural network inference, once the structure and weights of the network are determined, the size of the data and the execution pattern of the whole computation are also determined. A data-flow-driven computing-in-memory structure exploits this characteristic: before execution, the structure, connection relations and weight values of the network layers are mapped onto the corresponding functional arrays, and during execution the inference process is realized purely through input and output control. Typical accelerators of this type include ISAAC and PRIME.
The mainstream accelerator structures in industry currently have the following problems: 1. the weight-mapping scheme is single and fixed, with poor flexibility; 2. in existing computing architectures, data-flow blocking often occurs during network inference.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a neural network accelerator based on ReRAM.
The purpose of the invention is realized by the following technical scheme:
a neural network accelerator based on a ReRAM comprises a ReRAM in-situ calculation array, an input register, an accumulation buffer, a vector logic unit, a global buffer, a calculation control unit and a characteristic data read-write DMA; the input register and the accumulation buffer are connected with a ReRAM in-situ calculation array;
the calculation unit organization form of the ReRAM in-situ calculation array adopts a cross circuit layout, and the cross point of each word line and each bit line is a ReRAM device for mapping multi-bit data; the ReRAM in-situ calculation Array comprises a plurality of groups of calculation arrays, and each group of arrays comprises a plurality of Bank ReRAM calculation units in the column direction;
the input register is a connecting bridge between an input characteristic diagram and array calculation in a calculation process and is used for data caching, layout conversion and data distribution;
the accumulation buffer is used for buffering and accumulating partial sum data output by the ReRAM in-situ calculation array;
the vector logic unit is used for performing corresponding activation function calculation and network pooling calculation according to the requirement of the neural network layer;
the input end of the global buffer is connected with the vector logic unit, and the output end of the global buffer is connected with the input register; the buffer is used for caching intermediate data among the neural network layers;
the calculation control unit is used for managing the whole process of on-chip data access, data movement and calculation;
the feature-data read/write DMA is used for transporting the feature data.
Further, when the ReRAM in-situ calculation array performs the weight mapping algorithm, the Bank is used as the first priority and the Bank Row direction as the second priority, and each weight datum is regularly arranged in the crossbar circuit of the ReRAM in-situ calculation array in a specific bit order, from high to low or from low to high.
Furthermore, the logical bottom end of each bit line of the ReRAM in-situ calculation array is connected to an output register and an adder unit for performing accumulation across the data of multiple bit lines.
Further, the calculation units of the ReRAM in-situ calculation array are interconnected through a NoC.
Further, when the input register performs layout conversion, the data layout is adjusted inside the iReg according to the requirements of the input feature data, and the slice data of each input channel are rearranged to match the calculation positions of the weight data stored in the ReRAM in-situ calculation array.
Further, the buffering mode of the global buffer is ping-pong buffering.
Further, the calculation process of the calculation control unit is as follows: first, the sliding of the convolution kernel over the input features is divided into three directions, namely the W direction, the IC direction and the H direction; during calculation, the sliding window first moves along the W direction, then along the IC direction, and finally along the H direction; after each convolution sliding-window calculation, the corresponding output data are sent to the same position of the accumulation buffer, during which the calculation control unit computes the output address of the array and controls the sending of the data; after the convolution-kernel sliding is finished and the accumulation in the accumulation buffer is complete, the control unit sends the output results in turn to the vector logic unit to execute the vector operations and stores the results in the global buffer.
Further, the feature-data read/write DMA comprises an RDMA and a WDMA; the RDMA is used to carry multiple data items from multiple Banks to the input register in parallel, keeping the input register working at full load; the WDMA is used to transport the calculated data from the buffer of the vector logic unit to the global buffer.
The invention has the following beneficial effects: the proposed ReRAM computing-in-memory neural network accelerator structure exploits the parallelism of convolution and matrix-multiplication calculation to a great extent. Unlike traditional neural network mapping algorithms, the novel weight-mapping algorithm takes the Bank of the ReRAM array as the first priority and the Bank Row direction as the second priority, and regularly arranges each weight datum in the crossbar circuit of the calculation array in a specific bit order, from high to low or from low to high. Based on this algorithm, the architecture adopts a multi-Array, multi-Bank ReRAM calculation-unit organization and performs highly parallel matrix-multiplication calculation on the input feature map through efficient data-transfer and layout-reshaping modules. The output results undergo non-blocking activation and pooling operations after accumulation, which greatly improves the throughput efficiency of the network inference pipeline.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a block diagram of an accelerator architecture of the present invention.
Fig. 2 is a schematic diagram of a control mode of the calculation control unit.
Fig. 3 is a schematic diagram of a typical convolution implementation.
FIG. 4 is an exemplary diagram of convolution weight mapping in Resnet50.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
In this embodiment, an overall architecture of a neural network accelerator based on ReRAM is composed of the following modules:
(1) ReRAM in-situ computing arrays;
(2) An input register;
(3) An accumulation buffer for the array output;
(4) A vector logic unit for activation, pooling and similar operations;
(5) A global buffer for storing results between network layers;
(6) A calculation control unit;
(7) DMA (RDMA/WDMA) for feature data read and write.
As shown in Fig. 1, the specific module design and connections are illustrated. All modules of the architecture are interconnected through a NoC, which supports flexible data flows for different weight-mapping modes.
(1) The ReRAM in-situ calculation array is one of the most important core modules of the accelerator architecture. Unlike the calculation array in a traditional ReRAM neural network accelerator, the calculation array of this architecture not only supports the traditional weight-placement mode of neural networks but is also optimized for the new weight-mapping algorithm. In the field of ReRAM accelerators, the weight mapping for convolution calculation is implemented by unrolling the convolution kernels. Taking a convolution kernel of shape (O, I, H, W) as an example, O and I are the numbers of output and input channels, and H and W are the height and width of the kernel. When the kernels are unrolled, the kernels of different output channels are each expanded, in a fixed order, into a vector of length I x H x W, giving O such column vectors, which are mapped to different rows and columns of the crossbar circuit. Convolution is thereby converted into matrix multiplication, which the in-situ calculation array can accelerate.
In the novel mapping algorithm, the Bank of the ReRAM array is the first priority, and the parallelism between Banks is fully exploited to place more convolution input channels; the Bank Row direction is the second priority, and the kernels of different channels are arranged at different Bank Row positions along the convolution output-channel dimension, until the data-mapping requirements of all input and output channels are satisfied. Meanwhile, each weight datum is regularly arranged in the crossbar circuit of the calculation array in a specific bit order, from high to low or from low to high, to guarantee the correctness of the calculation flow and the data-accumulation process. Taking the (O, I, H, W) convolution kernels above as an example: first, one kernel is unrolled along the input-channel dimension, the (H, W) data slice of each input channel is converted into a vector of length H x W, and the vectors of different input channels are placed on different Banks. Next, along the output-channel dimension, the kernel data are mapped in the Bank Row direction of the array according to the same unrolling scheme, and the data at corresponding positions in adjacent kernels are placed on adjacent Bank Rows.
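As a rough illustration of this mapping order (not the patented circuit layout; the addressing scheme, function name and spill handling are assumptions made for clarity), the sketch below places each input-channel slice of a kernel on a Bank, each output channel on a Bank Row, and stores every weight bit-serially in a fixed high-to-low or low-to-high order.

```python
import numpy as np

def map_weights(kernels, n_banks, bits=8, msb_first=True):
    """Illustrative sketch of the mapping priority (not the patented layout):
    input channels -> Banks (first priority),
    output channels -> Bank Rows (second priority),
    each weight stored bit-serially in a fixed high-to-low or low-to-high order.

    kernels: integer array of shape (O, I, H, W)
    returns: dict {(bank, bank_row): list of bit-plane vectors of length H*W}
    """
    O, I, H, W = kernels.shape
    layout = {}
    for ic in range(I):                      # first priority: one Bank per input channel
        bank = ic % n_banks                  # channels beyond n_banks would spill to
                                             # further Arrays in the real design (assumed)
        for oc in range(O):                  # second priority: one Bank Row per output channel
            flat = kernels[oc, ic].reshape(H * W)
            bit_order = range(bits - 1, -1, -1) if msb_first else range(bits)
            bit_planes = [((flat >> b) & 1) for b in bit_order]   # fixed bit order
            layout[(bank, oc)] = bit_planes
    return layout

# Toy kernel: O=2 output channels, I=4 input channels, 3x3 taps, on 4 Banks.
k = np.arange(2 * 4 * 3 * 3).reshape(2, 4, 3, 3) % 256
table = map_weights(k, n_banks=4)
print(sorted(table.keys()))   # (bank, bank_row) pairs: 4 Banks x 2 Bank Rows
```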
The ReRAM calculation array module in this architecture is designed according to the above mapping-algorithm principle, with its structure and scale adapted to the requirements of the mapping algorithm. A single ReRAM calculation unit is organized in the traditional crossbar circuit layout, where the cross point of each word line and bit line is a ReRAM device that maps multi-bit data; the logical bottom end of each bit line is connected to an output register and an adder unit for accumulating data across multiple bit lines. In the overall layout, the in-situ calculation Array comprises several groups of calculation arrays, each group containing a number of Bank ReRAM calculation units in the column direction, typically 128 Banks. The units are interconnected through a NoC. This design uses the large-scale parallel Bank structure to improve the parallel computing capability of the convolution calculation flow along the input dimension, and the multi-Array scale also improves the deployment capability and generality for large convolution kernels.
(2) The input register is the bridge between the input feature map and the array calculation, performing data caching, layout conversion and data distribution. Among these, layout conversion is one of the most important functions. The mapping algorithm of the invention unrolls the convolution kernels along the input-channel dimension, and each unrolled vector must be multiply-accumulated with the corresponding input feature data. Therefore, the input feature data need a data-layout adjustment in the iReg, which rearranges the layout of the slice data of each input channel to match the calculation positions of the weight data stored in the ReRAM array.
The specific principle is as follows: according to the input data of the convolution layer, the kernel size, padding, stride and other information, the address-generation module calculates the source and target addresses of the data in the convolution process, reads the data from the source addresses in the global buffer and sends them to the target addresses in the input register, completing the corresponding conversion of the data layout.
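The sketch below imitates that address-generation step under assumed layouts (channel-major source addressing; all names are illustrative): from the kernel size, padding and stride it enumerates, for each sliding-window position, the source offsets in the global buffer and the target offsets in the input register so that the transferred data follow the unrolled weight order.

```python
def gen_transfer_addresses(h_in, w_in, c_in, k_h, k_w, stride=1, pad=0):
    """Illustrative address generator (not the patented circuit): for every
    sliding-window position it yields the source offset in the global buffer
    and the target offset in the input register, so that the transferred data
    follow the unrolled weight order (input channel, then kernel row, then
    kernel column). Source layout is assumed channel-major:
    offset = (c * h_in + y) * w_in + x.
    """
    h_out = (h_in + 2 * pad - k_h) // stride + 1
    w_out = (w_in + 2 * pad - k_w) // stride + 1
    for oy in range(h_out):
        for ox in range(w_out):
            dst = 0
            for c in range(c_in):
                for ky in range(k_h):
                    for kx in range(k_w):
                        y = oy * stride + ky - pad
                        x = ox * stride + kx - pad
                        if 0 <= y < h_in and 0 <= x < w_in:   # padding taps are skipped
                            src = (c * h_in + y) * w_in + x
                            yield (oy, ox), src, dst
                        dst += 1

# 4x4 single-channel input, 3x3 kernel, stride 1, no padding -> 2x2 output positions.
transfers = list(gen_transfer_addresses(4, 4, 1, 3, 3))
print(len(transfers))   # 36 = 4 window positions x 9 kernel taps
```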
(3) The accumulation buffer is connected to the ReRAM calculation array through the NoC module and is mainly responsible for buffering and accumulating the partial-sum data output by the calculation array. Partial-sum data (Partial-sum) means that the output data of an array are only a part of the actual output data; the actual data are obtained by accumulating all the partial sums.
In actual neural network deployment and inference, because of the characteristics of the mapping algorithm adopted by the accelerator, the data along the input-channel dimension of a convolution kernel are scattered over different Banks; in addition, when the data size of an input-channel slice or of the kernel weight matrix exceeds the capacity of a single ReRAM calculation unit, several ReRAM calculation units must handle the same block of data. The calculation output of each ReRAM array is therefore usually a partial result that needs to be accumulated.
The partial sums belonging to the same position of the output feature map are forwarded through the NoC from the output ports of the different ReRAM calculation arrays and stored at the same address of the accumulation buffer. The accumulation buffer is designed with 128 Banks; each Bank has a capacity of 2 KB and holds at most 512 32-bit floating-point or fixed-point values, for a total capacity of 256 KB. Ideally, one Bank stores the data of only one channel slice of the output feature map; if the output feature map is large, several Banks are needed to store the data of one channel slice.
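A behavioural sketch of this partial-sum accumulation is given below; the 128-Bank x 2 KB geometry is taken from the description above, while the class and method names are illustrative assumptions.

```python
class AccumBuffer:
    """Behavioural model of the accumulation buffer: 128 Banks, 2 KB each,
    holding up to 512 32-bit values per Bank (sizes from the description;
    the interface itself is an illustrative assumption)."""
    BANKS, WORDS_PER_BANK = 128, 512

    def __init__(self):
        self.mem = [[0.0] * self.WORDS_PER_BANK for _ in range(self.BANKS)]

    def accumulate(self, bank, addr, partial_sum):
        # Partial sums from different ReRAM arrays targeting the same output
        # position land on the same (bank, addr) and are summed in place.
        self.mem[bank][addr] += partial_sum

    def drain(self, bank):
        # Hand a finished Bank to the vector logic unit and clear it.
        data, self.mem[bank] = self.mem[bank], [0.0] * self.WORDS_PER_BANK
        return data

buf = AccumBuffer()
for array_out in (0.25, 0.50, 0.75):      # three arrays contribute to one output point
    buf.accumulate(bank=0, addr=10, partial_sum=array_out)
print(buf.drain(0)[10])                   # 1.5: the fully accumulated output value
```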
(4) The vector logic unit is responsible for executing the vector operations, such as activation and pooling, on the output data in the neural network pipeline, and sits between the accumulation buffer and the global buffer.
After the accumulation buffer completes the accumulation of the partial sums, the result is sent to the vector unit, which performs the corresponding activation-function calculation and network pooling calculation according to the requirements of the neural network layer.
(5) The global buffer is used to buffer the intermediate data between neural network layers; it adopts a ping-pong buffer design with a capacity of 4 MB. Its input is connected to the vector unit and its output to the input register. Logically the global buffer contains two groups of buffer regions: the first group holds the input feature data of the current network layer, and the second group holds the output data of that layer. When the input feature data are no longer needed, the global buffer clears them and shifts the remaining data to fill the freed positions, providing extra space for the network-layer output.
During inference calculation, the DMA unit first fetches the input feature map from DRAM and places it in the buffer region for the network input. During the convolution calculation, the global buffer performs the ping-pong operation, shifting data and buffering the new network output data; after the calculation of the neural network layer is finished, only the output feature data remain in the buffer region, and they serve as the input data for the calculation of the next network layer.
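The ping-pong discipline can be sketched as follows (region names and methods are illustrative assumptions, not the actual buffer controller): one region holds the current layer's input features, the other collects its outputs, and a swap promotes the outputs to become the next layer's inputs.

```python
class PingPongGlobalBuffer:
    """Illustrative model of the 4 MB global buffer split into two logical
    regions: 'in' holds the current layer's input features, 'out' collects
    its outputs; swap() promotes the outputs to be the next layer's inputs."""

    def __init__(self):
        self.regions = {"in": [], "out": []}

    def load_input(self, feature_map):        # e.g. DMA from DRAM before the first layer
        self.regions["in"] = list(feature_map)

    def write_output(self, data):             # results arriving from the vector unit
        self.regions["out"].append(data)

    def swap(self):
        # Layer finished: drop consumed inputs, outputs become next layer's inputs.
        self.regions["in"], self.regions["out"] = self.regions["out"], []

gbuf = PingPongGlobalBuffer()
gbuf.load_input([0.1, 0.2, 0.3])
gbuf.write_output(0.9)
gbuf.swap()
print(gbuf.regions["in"])   # [0.9] is now the input of the next layer
```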
(6) The calculation control unit is responsible for managing the whole process of on-chip data access, data movement and calculation.
First, the control unit determines the calculation flow of the convolution layer. The convolution calculation flow in this architecture is as follows: the sliding of the convolution kernel over the input features is divided into three directions, namely the W direction, the IC direction and the H direction; during calculation, the sliding window first moves along the W direction, then along the IC direction, and so on. Six groups of signals are designed in the control unit to control the calculation process: isFirstSlice, isLastSlice, isSliceDone, isPlaneDone, isCubeDone and isDone. The isFirstSlice and isLastSlice signals mark the position of the current Slice in the W direction; isSliceDone, isPlaneDone and isCubeDone indicate whether the calculation of the current Slice, Plane and Cube has finished; and isDone marks whether the calculation of the whole input feature map is complete. Fig. 2 shows the control mode and control domain of the different signals.
After each convolution sliding-window calculation, the corresponding data are sent to the same position of the accumulation buffer; during this process the control unit computes the output address of the array and controls the sending of the data. After the convolution-kernel sliding is finished and the accumulation in the accumulation buffer is complete, the control unit sends the output results in turn to the vector unit to execute the vector operations and stores the results in the global buffer.
Fig. 3 shows a schematic diagram of the execution model of a convolution layer in a neural network under a typical convolution execution mode. In the present invention the convolution calculation is split into five steps, as indicated in the figure (a loop-nest sketch of this traversal is given after the list). The specific execution flow is as follows:
(1) calculate the data within the sliding window at the granularity of length, width and input channels of 3 × 3 × 16;
(2) switch the calculation row, repeat step (1), and finish the calculation of all rows in the convolution sliding window;
(3) switch the convolution calculation block, repeat step (2), and finish the calculation over all input channels of the feature map;
(4) repeat step (3) to finish the calculation of all columns in the sliding window, then move the sliding window along the row dimension and finish the calculation of the data in the row direction of the input feature map;
(5) move the sliding window along the column dimension of the input feature map and repeat step (4) to finish the data calculation over all column dimensions.
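Read as a loop nest, the traversal above (W innermost, then IC, then H) can be sketched as follows; the way the six status flags are derived from the loop position is an interpretation, and all names besides the flag names are illustrative.

```python
def conv_control_flow(w_out, ic_blocks, h_out):
    """Sketch of the traversal order stated above: the sliding window moves
    along W first (innermost), then across input-channel (IC) blocks, and
    finally along H (outermost). The six status flags are derived from the
    loop position; their exact semantics here are an interpretation.
    Yields ((h, ic, w), flags) for every step.
    """
    for h in range(h_out):                          # outermost: H direction
        for ic in range(ic_blocks):                 # middle: IC direction
            for w in range(w_out):                  # innermost: W direction
                last_w = (w == w_out - 1)
                last_ic = (ic == ic_blocks - 1)
                last_h = (h == h_out - 1)
                flags = {
                    "isFirstSlice": w == 0,
                    "isLastSlice":  last_w,
                    "isSliceDone":  True,                     # one W step finished
                    "isPlaneDone":  last_w,                   # full W sweep finished
                    "isCubeDone":   last_w and last_ic,       # all IC blocks finished
                    "isDone":       last_w and last_ic and last_h,
                }
                yield (h, ic, w), flags

# Tiny example: 2 W positions, 2 IC blocks, 2 H positions.
for pos, f in conv_control_flow(w_out=2, ic_blocks=2, h_out=2):
    if f["isCubeDone"]:
        print("cube finished at (h, ic, w) =", pos, "| whole map done:", f["isDone"])
```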
(7) The feature-data read/write DMA (RDMA/WDMA): the RDMA is used to carry the feature data to the input register and supports carrying multiple data items from multiple Banks to the input register in parallel, which keeps the input register working at full load.
The WDMA is responsible for carrying the computed results from the buffers of the vector logic unit into the global buffer.
As an implementation case, the first convolution layer in the 9th residual block of the classical Resnet50 model is selected. The network structure of this layer is: a convolution kernel of size 3 × 3, with 256 input channels and 256 output channels.
Assume the Array size of the accelerator is 4 and the number of Banks is 64.
Before the calculation starts, the weights are unrolled into one-dimensional data at the granularity of a Cube whose length, width and input channels are 3 × 3 × 16; each Cube is mapped into one row of the crossbar circuit, and data of different input-channel and output-channel dimensions are mapped onto multiple Banks and multiple Arrays. As shown in Fig. 4, the Cube of each color is mapped into a different Array (only the mapping of the first three Cubes is shown), and the weights of different output channels are placed on different Banks. Meanwhile, the input feature map is stored at the first address of the global buffer.
During inference, the input register fetches data from the first address of the global buffer at Cube granularity; the data layout is adjusted by a reshaping circuit to match the layout of the weights in the ReRAM array, and after conversion each Cube of data is unrolled into a one-dimensional vector of size 144. The vector is forwarded through the NoC to the corresponding ReRAM array and multiply-accumulated with the weights to obtain the partial sum of one point on the output feature map, which is then sent to the accumulation buffer. After the received Cube has been computed, the system fetches the next data Cube along the input channels of the input feature map and sends it to the input register to repeat the calculation. Note that the data Cubes along the same input dimension are placed, after the multiply-accumulate operation, at the same position of the accumulation buffer.
When the 16 Cubes along the input-channel dimension have all been computed, the sliding window switches to the next sliding position and the calculation repeats. Meanwhile, the data in the accumulation buffer are sent to the activation unit to execute the ReLU operation, and the buffer is refreshed to await the data of the next round of calculation. The data passing through the activation unit are sent back to the global buffer, which also flushes out the data that are no longer needed through the ping-pong mode.
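The sizes quoted in this walk-through can be checked with a few lines of arithmetic; the figures below follow directly from the layer shape and the 16-channel Cube granularity (the Array and Bank counts are the geometry assumed above).

```python
# Worked numbers for the example layer: 3x3 kernel, 256 input / 256 output channels.
KH, KW, IC, OC = 3, 3, 256, 256
CUBE_IC = 16                                  # input channels covered by one Cube

cube_len = KH * KW * CUBE_IC                  # elements in one unrolled Cube
cubes_per_kernel = IC // CUBE_IC              # Cubes (and partial sums) per output point
weights_per_oc = KH * KW * IC                 # full unrolled kernel length per output channel
total_weights = weights_per_oc * OC           # weights spread over the 4 Arrays / 64 Banks

print(cube_len)          # 144 -> matches the 144-element vector in the text
print(cubes_per_kernel)  # 16  -> partial sums to accumulate per output point
print(weights_per_oc)    # 2304
print(total_weights)     # 589824
```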
When all the data have been calculated, only the output feature map of the current layer remains in the global buffer, and the calculation of the next network layer begins.
Compared with the prior art, the invention has the following advantages:
(1) Non-blocking data flow: complemented by efficient DMA and a flexible data-layout format, the features can be read and the results written efficiently.
(2) A simple and efficient computing structure: the accelerator contains only three buffer structures, namely a ping-pong buffer for data-layout adjustment, an accumulation buffer for partial-sum accumulation, and a global buffer for intermediate-data caching.
(3) The ReRAM array in the accelerator structure has high design flexibility and supports multiple weight-placement modes, so the accelerator can support a variety of convolution operators.
(4) Efficient convolution and vector calculation paths: the convolution calculation results can be output to the coprocessor in a non-blocking, serial manner for activation and similar operations, and the coprocessor can also independently handle activation, pooling, quantization and other operations.
The invention provides a ReRAM-based neural network accelerator structure that uses a novel neural-network weight-mapping method for inference calculation, together with efficient direct memory access (DMA) and a flexible data-layout format; it improves the parallelism of the data flow and the calculation flow, solves the problem of data blocking during inference, and raises the data throughput of the architecture. The cache hierarchy of the architecture is simple in design and highly flexible, which facilitates the hardware-circuit implementation while fully supporting accelerated calculation of the common operators in mainstream neural networks.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a ROM, a RAM, etc.
The above disclosure is only a preferred embodiment of the present invention and certainly cannot be taken to limit the scope of rights of the invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the invention.

Claims (8)

1. A neural network accelerator based on ReRAM, characterized by comprising a ReRAM in-situ calculation array, an input register, an accumulation buffer, a vector logic unit, a global buffer, a calculation control unit and a feature-data read/write DMA; the input register and the accumulation buffer are connected to the ReRAM in-situ calculation array;
the calculation units of the ReRAM in-situ calculation array are organized in a crossbar circuit layout, and the cross point of each word line and bit line is a ReRAM device that maps multi-bit data; the ReRAM in-situ calculation Array comprises a plurality of groups of calculation arrays, and each group of calculation arrays comprises a plurality of Bank ReRAM calculation units in the row direction;
the input register is the bridge between the input feature map and the array calculation during computation and is used for data caching, layout conversion and data distribution;
the accumulation buffer is used for buffering and accumulating partial sum data output by the ReRAM in-situ calculation array;
the vector logic unit is used for performing corresponding activation function calculation and network pooling calculation according to the requirement of the neural network layer;
the input end of the global buffer is connected with the vector logic unit, and the output end of the global buffer is connected with the input register; the buffer is used for caching intermediate data among the neural network layers;
the calculation control unit is used for managing the whole processes of on-chip data access, transportation and calculation;
the feature-data read/write DMA is used for transporting the feature data.
2. The ReRAM-based neural network accelerator according to claim 1, wherein, when the ReRAM in-situ calculation array performs the weight mapping algorithm, the Bank is used as the first priority and the Bank Row direction as the second priority, and each weight datum is regularly arranged in the crossbar circuit of the ReRAM in-situ calculation array in a specific bit order, from high to low or from low to high.
3. The ReRAM-based neural network accelerator according to claim 1, wherein the logical bottom end of each bit line of the ReRAM in-situ calculation array is connected to an output register and an adder unit for performing accumulation across the data of multiple bit lines.
4. The ReRAM-based neural network accelerator according to claim 1, wherein the calculation units of the ReRAM in-situ calculation array are interconnected via a NoC.
5. The ReRAM-based neural network accelerator according to claim 1, wherein, during layout conversion, the input register adjusts the data layout inside the iReg according to the requirements of the input feature data, and the slice data of each input channel are rearranged to match the calculation positions of the weight data stored in the ReRAM in-situ calculation array.
6. The ReRAM-based neural network accelerator according to claim 1, wherein the global buffer adopts ping-pong buffering.
7. The ReRAM-based neural network accelerator according to claim 1, wherein the calculation process of the calculation control unit is: first, the sliding of the convolution kernel over the input features is divided into three directions, namely the W direction, the IC direction and the H direction; during calculation, the sliding window first moves along the W direction, then along the IC direction, and finally along the H direction; after each convolution sliding-window calculation, the corresponding output data are sent to the same position of the accumulation buffer, during which the calculation control unit computes the output address of the array and controls the sending of the data; after the convolution-kernel sliding is finished and the accumulation in the accumulation buffer is complete, the control unit sends the output results in turn to the vector logic unit to execute the vector operations and stores the results in the global buffer.
8. The ReRAM-based neural network accelerator according to claim 1, wherein the feature-data read/write DMA comprises an RDMA and a WDMA; the RDMA is used to carry multiple data items from multiple Banks to the input register in parallel, keeping the input register working at full load; and the WDMA is used to transport the calculated data from the buffer of the vector logic unit to the global buffer.
CN202310049117.8A 2023-02-01 2023-02-01 Neural network accelerator for ReRAM Active CN115965067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310049117.8A CN115965067B (en) 2023-02-01 2023-02-01 Neural network accelerator for ReRAM

Publications (2)

Publication Number Publication Date
CN115965067A true CN115965067A (en) 2023-04-14
CN115965067B CN115965067B (en) 2023-08-25

Family

ID=87359911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310049117.8A Active CN115965067B (en) 2023-02-01 2023-02-01 Neural network accelerator for ReRAM

Country Status (1)

Country Link
CN (1) CN115965067B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120303871A1 (en) * 2011-05-26 2012-11-29 Naoya Tokiwa Semiconductor memory device and method of controlling the same
CN107195321A (en) * 2017-05-15 2017-09-22 Huazhong University of Science and Technology Crossbar-structure resistive memory performance optimization method and system
WO2020258528A1 (en) * 2019-06-25 2020-12-30 Southeast University Configurable universal convolutional neural network accelerator
WO2021004366A1 (en) * 2019-07-08 2021-01-14 Zhejiang University Neural network accelerator based on structured pruning and low-bit quantization, and method
WO2021098821A1 (en) * 2019-11-20 2021-05-27 Huawei Technologies Co., Ltd. Method for data processing in neural network system, and neural network system
CN113126898A (en) * 2020-01-15 2021-07-16 Samsung Electronics Co., Ltd. Memory device, operating method thereof, and operating method of memory controller
KR20210092078A (en) * 2020-01-15 2021-07-23 Samsung Electronics Co., Ltd. Memory device performing parallel calculation process, operating method thereof and operating method of memory controller controlling memory device
US20220019408A1 (en) * 2020-07-17 2022-01-20 Samsung Electronics Co., Ltd. Method and apparatus with neural network processing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Justin M. Correll, "An 8-bit 20.7 TOPS/W multi-level cell ReRAM-based compute engine", 2022 IEEE Symposium on VLSI Technology and Circuits *
Naifeng Jing, "Fast FPGA-based emulation for ReRAM-enabled deep neural network accelerator", 2021 IEEE International Symposium on Circuits and Systems *
Zhang Hang, "Architecture design and optimization based on emerging non-volatile memory technologies", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN115965067B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
Ji et al. ReCom: An efficient resistive accelerator for compressed deep neural networks
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN112149816B (en) Heterogeneous memory-computation fusion system and method supporting deep neural network reasoning acceleration
CN110738308B (en) Neural network accelerator
CN109948774A (en) Neural network accelerator and its implementation based on network layer binding operation
CN110705702A (en) Dynamic extensible convolutional neural network accelerator
CN111783933A (en) Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation
CN109446478B (en) Complex covariance matrix calculation system based on iteration and reconfigurable mode
CN111459552B (en) Method and device for parallelization calculation in memory
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
Iliev et al. Low latency CMOS hardware acceleration for fully connected layers in deep neural networks
CN115965067B (en) Neural network accelerator for ReRAM
CN108920097B (en) Three-dimensional data processing method based on interleaving storage
US20230025068A1 (en) Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements
CN112862079B (en) Design method of running water type convolution computing architecture and residual error network acceleration system
CN112328536B (en) Inter-core structure of multi-core processor array and multi-core processor
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
CN114072778A (en) Memory processing unit architecture
CN109583577B (en) Arithmetic device and method
CN115496190A (en) Efficient reconfigurable hardware accelerator for convolutional neural network training
CN113392959A (en) Method for reconstructing architecture in computing system and computing system
CN115719088B (en) Intermediate cache scheduling circuit device supporting in-memory CNN
CN111709872B (en) Spin memory computing architecture of graph triangle counting algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant