CN115965067A - Neural network accelerator for ReRAM - Google Patents

Neural network accelerator for ReRAM

Info

Publication number
CN115965067A
Authority
CN
China
Prior art keywords
calculation
data
reram
neural network
array
Prior art date
Legal status
Granted
Application number
CN202310049117.8A
Other languages
Chinese (zh)
Other versions
CN115965067B
Inventor
景乃峰
伍骏
董光达
熊大鹏
李涛
Current Assignee
Suzhou Yizhu Intelligent Technology Co ltd
Original Assignee
Suzhou Yizhu Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Yizhu Intelligent Technology Co ltd filed Critical Suzhou Yizhu Intelligent Technology Co ltd
Priority to CN202310049117.8A priority Critical patent/CN115965067B/en
Publication of CN115965067A publication Critical patent/CN115965067A/en
Application granted granted Critical
Publication of CN115965067B publication Critical patent/CN115965067B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a neural network accelerator based on ReRAM (resistive random access memory), belonging to the field of neural network accelerator design. The accelerator comprises a ReRAM in-situ calculation array, an input register, an accumulation buffer, a vector logic unit, a global buffer, a calculation control unit and a feature-data read/write DMA (direct memory access); the input register and the accumulation buffer are connected to the ReRAM in-situ calculation array. The invention adopts a novel neural-network weight-mapping method for inference calculation, together with efficient direct memory access and a flexible data-layout format, which improves the parallelism of the data flow and the calculation flow, solves the problem of data blocking during inference, and raises the data throughput of the architecture.

Description

Neural network accelerator for ReRAM
Technical Field
The invention relates to the field of neural network accelerator design, and in particular to a neural network accelerator based on ReRAM.
Background
With the rapid growth of data scale, computation-intensive neural network applications place ever higher demands on hardware computing power and data storage capacity. To break through the memory-access bandwidth bottleneck caused by the separation of computation and storage in the traditional von Neumann architecture, more and more research has focused on high-density computing-in-memory architectures: by tightly coupling computation and storage, they reduce the extra memory-access energy and bandwidth consumed by frequently moving data between storage and computing components, and thereby maximize the energy-efficiency ratio of the hardware architecture.
Resistive random access memory (ReRAM) is a new type of non-volatile memory that stores data as conductance values and organizes multiply-accumulate computation through Ohm's law, providing an in-situ computing capability that traditional memories do not possess. Traditional accelerators based on the von Neumann architecture must move both the weights (Weight) and the features (Feature) for computation. A ReRAM-based neural network accelerator instead converts the input vector into voltages applied to the word lines (Word-lines) of the ReRAM, while the other operand, the weights, is mapped in advance onto the conductance of each ReRAM cell; the accumulated current on each bit line (Bit-line) then represents the dot product of the two vectors. This realizes in-situ storage and processing and greatly relieves the data-bandwidth pressure, improving performance.
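To make this in-situ computing principle concrete, the following Python sketch gives a simplified behavioural model of a crossbar (all function and variable names are illustrative, and the model ignores device non-idealities such as wire resistance and ADC quantization): weights become cell conductances, the input vector becomes word-line voltages, and each bit-line current is the dot product of the input with one weight column.

```python
import numpy as np

def crossbar_mvm(weights, inputs):
    """Behavioural model of in-situ matrix-vector multiplication on a ReRAM
    crossbar (names and scaling are illustrative assumptions).

    weights: (rows, cols) array mapped to cell conductances G[i][j]
    inputs:  (rows,) vector applied as word-line voltages V[i]
    returns: (cols,) bit-line currents, each I[j] = sum_i V[i] * G[i][j]
    """
    G = np.asarray(weights, dtype=float)   # conductance at each cross point
    V = np.asarray(inputs, dtype=float)    # voltage on each word line
    # Ohm's law per cell (I = V * G) plus Kirchhoff summation along each bit line
    return V @ G

# A 3x2 toy crossbar: two weight columns share the same input vector.
W = np.array([[0.2, 0.5],
              [0.1, 0.3],
              [0.4, 0.0]])
x = np.array([1.0, 2.0, 3.0])
print(crossbar_mvm(W, x))   # [1.6 1.1] = dot products of x with each column
```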
Mainstream computing-in-memory accelerators currently adopt a hierarchical topology, dividing the circuit from top to bottom into functional modules of different scales. The smallest-granularity computation module in such an architecture is a ReRAM crossbar array plus an arithmetic logic unit; these basic units are combined into higher-level operation modules, which can in turn be combined with one another, and so on up to the top level of the architecture. According to how the modules are scheduled and driven, existing accelerator systems can be divided into two types: data-flow driven and instruction-flow driven.
In a data-flow-driven structure, the data access and interaction of each functional module during execution are completed under the control of a state machine, without any instruction intervention. In neural network inference, once the structure and weights of the network are determined, the size of the data and the execution pattern of the whole computation are also determined. A data-flow-driven computing-in-memory structure exploits this characteristic: before execution, the structure, connection relations and weight values of the network layers are mapped onto the corresponding functional arrays, and during execution the inference process is realized purely through input and output control. Typical accelerators of this type include ISAAC and PRIME.
The mainstream accelerator structures in industry currently have the following problems: 1. the weight-mapping scheme is single and fixed, with poor flexibility; 2. in existing computing architectures, data-flow blocking often occurs during network inference.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a neural network accelerator based on ReRAM.
The purpose of the invention is realized by the following technical scheme:
a neural network accelerator based on a ReRAM comprises a ReRAM in-situ calculation array, an input register, an accumulation buffer, a vector logic unit, a global buffer, a calculation control unit and a characteristic data read-write DMA; the input register and the accumulation buffer are connected with a ReRAM in-situ calculation array;
the calculation unit organization form of the ReRAM in-situ calculation array adopts a cross circuit layout, and the cross point of each word line and each bit line is a ReRAM device for mapping multi-bit data; the ReRAM in-situ calculation Array comprises a plurality of groups of calculation arrays, and each group of arrays comprises a plurality of Bank ReRAM calculation units in the column direction;
the input register is a connecting bridge between an input characteristic diagram and array calculation in a calculation process and is used for data caching, layout conversion and data distribution;
the accumulation buffer is used for buffering and accumulating partial sum data output by the ReRAM in-situ calculation array;
the vector logic unit is used for performing corresponding activation function calculation and network pooling calculation according to the requirement of the neural network layer;
the input end of the global buffer is connected with the vector logic unit, and the output end of the global buffer is connected with the input register; the buffer is used for caching intermediate data among the neural network layers;
the calculation control unit is used for managing the whole process of on-chip data access, data movement and calculation;
the feature-data read/write DMA is used for transporting the feature data.
Further, when the ReRAM in-situ calculation array performs the weight mapping algorithm, the Bank is used as the first priority and the Bank Row direction as the second priority, and each weight datum is regularly arranged in the crossbar circuit of the ReRAM in-situ calculation array in a specific bit order, from high to low or from low to high.
Furthermore, the logical bottom end of each bit line of the ReRAM in-situ calculation array is connected to an output register and an adder unit for performing accumulation across the data of multiple bit lines.
Further, the calculation units of the ReRAM in-situ calculation array are interconnected through a NoC.
Further, when the input register performs layout conversion, the data layout is adjusted inside the iReg according to the requirements of the input feature data, and the slice data of each input channel are rearranged to match the calculation positions of the weight data stored in the ReRAM in-situ calculation array.
Further, the buffering mode of the global buffer is ping-pong buffering.
Further, the calculation process of the calculation control unit is as follows: first, the sliding of the convolution kernel over the input features is divided into three directions, namely the W direction, the IC direction and the H direction; during calculation, the sliding window first moves along the W direction, then along the IC direction, and finally along the H direction; after each convolution sliding-window calculation, the corresponding output data are sent to the same position of the accumulation buffer, during which the calculation control unit computes the output address of the array and controls the sending of the data; after the convolution-kernel sliding is finished and the accumulation in the accumulation buffer is complete, the control unit sends the output results in turn to the vector logic unit to execute the vector operations and stores the results in the global buffer.
Further, the feature-data read/write DMA comprises an RDMA and a WDMA; the RDMA is used to carry multiple data items from multiple Banks to the input register in parallel, keeping the input register working at full load; the WDMA is used to transport the calculated data from the buffer of the vector logic unit to the global buffer.
The invention has the following beneficial effects: the proposed ReRAM computing-in-memory neural network accelerator structure exploits the parallelism of convolution and matrix-multiplication calculation to a great extent. Unlike traditional neural network mapping algorithms, the novel weight-mapping algorithm takes the Bank of the ReRAM array as the first priority and the Bank Row direction as the second priority, and regularly arranges each weight datum in the crossbar circuit of the calculation array in a specific bit order, from high to low or from low to high. Based on this algorithm, the architecture adopts a multi-Array, multi-Bank ReRAM calculation-unit organization and performs highly parallel matrix-multiplication calculation on the input feature map through efficient data-transfer and layout-reshaping modules. The output results undergo non-blocking activation and pooling operations after accumulation, which greatly improves the throughput efficiency of the network inference pipeline.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a block diagram of an accelerator architecture of the present invention.
Fig. 2 is a schematic diagram of a control mode of the calculation control unit.
Fig. 3 is a schematic diagram of a typical convolution implementation.
FIG. 4 is an exemplary diagram of convolution weight mapping in Resnet50.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
In this embodiment, an overall architecture of a neural network accelerator based on ReRAM is composed of the following modules:
(1) ReRAM in-situ computing arrays;
(2) An input register;
(3) An accumulation buffer for the array output;
(4) A vector logic unit for activation, pooling and similar operations;
(5) A global buffer for storing results between network layers;
(6) A calculation control unit;
(7) DMA (RDMA/WDMA) for feature data read and write.
As shown in Fig. 1, the specific module design and connections are illustrated. All modules of the architecture are interconnected through a NoC, which supports flexible data flows for different weight-mapping modes.
(1) The ReRAM in-situ calculation array is one of the most important core modules of the accelerator architecture. Unlike the calculation array in a traditional ReRAM neural network accelerator, the calculation array of this architecture not only supports the traditional weight-placement mode of neural networks but is also optimized for the new weight-mapping algorithm. In the field of ReRAM accelerators, the weight mapping for convolution calculation is implemented by unrolling the convolution kernels. Taking a convolution kernel of shape (O, I, H, W) as an example, O and I are the numbers of output and input channels, and H and W are the height and width of the kernel. When the kernels are unrolled, the kernels of different output channels are each expanded, in a fixed order, into a vector of length I x H x W, giving O such column vectors, which are mapped to different rows and columns of the crossbar circuit. Convolution is thereby converted into matrix multiplication, which the in-situ calculation array can accelerate.
In the novel mapping algorithm, the Bank of the ReRAM array is the first priority, and the parallelism between Banks is fully exploited to place more convolution input channels; the Bank Row direction is the second priority, and the kernels of different channels are arranged at different Bank Row positions along the convolution output-channel dimension, until the data-mapping requirements of all input and output channels are satisfied. Meanwhile, each weight datum is regularly arranged in the crossbar circuit of the calculation array in a specific bit order, from high to low or from low to high, to guarantee the correctness of the calculation flow and the data-accumulation process. Taking the (O, I, H, W) convolution kernels above as an example: first, one kernel is unrolled along the input-channel dimension, the (H, W) data slice of each input channel is converted into a vector of length H x W, and the vectors of different input channels are placed on different Banks. Next, along the output-channel dimension, the kernel data are mapped in the Bank Row direction of the array according to the same unrolling scheme, and the data at corresponding positions in adjacent kernels are placed on adjacent Bank Rows.
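As a rough illustration of this mapping order (not the patented circuit layout; the addressing scheme, function name and spill handling are assumptions made for clarity), the sketch below places each input-channel slice of a kernel on a Bank, each output channel on a Bank Row, and stores every weight bit-serially in a fixed high-to-low or low-to-high order.

```python
import numpy as np

def map_weights(kernels, n_banks, bits=8, msb_first=True):
    """Illustrative sketch of the mapping priority (not the patented layout):
    input channels -> Banks (first priority),
    output channels -> Bank Rows (second priority),
    each weight stored bit-serially in a fixed high-to-low or low-to-high order.

    kernels: integer array of shape (O, I, H, W)
    returns: dict {(bank, bank_row): list of bit-plane vectors of length H*W}
    """
    O, I, H, W = kernels.shape
    layout = {}
    for ic in range(I):                      # first priority: one Bank per input channel
        bank = ic % n_banks                  # channels beyond n_banks would spill to
                                             # further Arrays in the real design (assumed)
        for oc in range(O):                  # second priority: one Bank Row per output channel
            flat = kernels[oc, ic].reshape(H * W)
            bit_order = range(bits - 1, -1, -1) if msb_first else range(bits)
            bit_planes = [((flat >> b) & 1) for b in bit_order]   # fixed bit order
            layout[(bank, oc)] = bit_planes
    return layout

# Toy kernel: O=2 output channels, I=4 input channels, 3x3 taps, on 4 Banks.
k = np.arange(2 * 4 * 3 * 3).reshape(2, 4, 3, 3) % 256
table = map_weights(k, n_banks=4)
print(sorted(table.keys()))   # (bank, bank_row) pairs: 4 Banks x 2 Bank Rows
```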
The ReRAM calculation array module in this architecture is designed according to the above mapping-algorithm principle, with its structure and scale adapted to the requirements of the mapping algorithm. A single ReRAM calculation unit is organized in the traditional crossbar circuit layout, where the cross point of each word line and bit line is a ReRAM device that maps multi-bit data; the logical bottom end of each bit line is connected to an output register and an adder unit for accumulating data across multiple bit lines. In the overall layout, the in-situ calculation Array comprises several groups of calculation arrays, each group containing a number of Bank ReRAM calculation units in the column direction, typically 128 Banks. The units are interconnected through a NoC. This design uses the large-scale parallel Bank structure to improve the parallel computing capability of the convolution calculation flow along the input dimension, and the multi-Array scale also improves the deployment capability and generality for large convolution kernels.
(2) The input register is the bridge between the input feature map and the array calculation, performing data caching, layout conversion and data distribution. Among these, layout conversion is one of the most important functions. The mapping algorithm of the invention unrolls the convolution kernels along the input-channel dimension, and each unrolled vector must be multiply-accumulated with the corresponding input feature data. Therefore, the input feature data need a data-layout adjustment in the iReg, which rearranges the layout of the slice data of each input channel to match the calculation positions of the weight data stored in the ReRAM array.
The specific principle is as follows: according to the input data of the convolution layer, the kernel size, padding, stride and other information, the address-generation module calculates the source and target addresses of the data in the convolution process, reads the data from the source addresses in the global buffer and sends them to the target addresses in the input register, completing the corresponding conversion of the data layout.
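The sketch below imitates that address-generation step under assumed layouts (channel-major source addressing; all names are illustrative): from the kernel size, padding and stride it enumerates, for each sliding-window position, the source offsets in the global buffer and the target offsets in the input register so that the transferred data follow the unrolled weight order.

```python
def gen_transfer_addresses(h_in, w_in, c_in, k_h, k_w, stride=1, pad=0):
    """Illustrative address generator (not the patented circuit): for every
    sliding-window position it yields the source offset in the global buffer
    and the target offset in the input register, so that the transferred data
    follow the unrolled weight order (input channel, then kernel row, then
    kernel column). Source layout is assumed channel-major:
    offset = (c * h_in + y) * w_in + x.
    """
    h_out = (h_in + 2 * pad - k_h) // stride + 1
    w_out = (w_in + 2 * pad - k_w) // stride + 1
    for oy in range(h_out):
        for ox in range(w_out):
            dst = 0
            for c in range(c_in):
                for ky in range(k_h):
                    for kx in range(k_w):
                        y = oy * stride + ky - pad
                        x = ox * stride + kx - pad
                        if 0 <= y < h_in and 0 <= x < w_in:   # padding taps are skipped
                            src = (c * h_in + y) * w_in + x
                            yield (oy, ox), src, dst
                        dst += 1

# 4x4 single-channel input, 3x3 kernel, stride 1, no padding -> 2x2 output positions.
transfers = list(gen_transfer_addresses(4, 4, 1, 3, 3))
print(len(transfers))   # 36 = 4 window positions x 9 kernel taps
```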
(3) The accumulation buffer is connected to the ReRAM calculation array through the NoC module and is mainly responsible for buffering and accumulating the partial-sum data output by the calculation array. Partial-sum data (Partial-sum) means that the output data of an array are only a part of the actual output data; the actual data are obtained by accumulating all the partial sums.
In actual neural network deployment and inference, because of the characteristics of the mapping algorithm adopted by the accelerator, the data along the input-channel dimension of a convolution kernel are scattered over different Banks; in addition, when the data size of an input-channel slice or of the kernel weight matrix exceeds the capacity of a single ReRAM calculation unit, several ReRAM calculation units must handle the same block of data. The calculation output of each ReRAM array is therefore usually a partial result that needs to be accumulated.
The partial sums belonging to the same position of the output feature map are forwarded through the NoC from the output ports of the different ReRAM calculation arrays and stored at the same address of the accumulation buffer. The accumulation buffer is designed with 128 Banks; each Bank has a capacity of 2 KB and holds at most 512 32-bit floating-point or fixed-point values, for a total capacity of 256 KB. Ideally, one Bank stores the data of only one channel slice of the output feature map; if the output feature map is large, several Banks are needed to store the data of one channel slice.
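A behavioural sketch of this partial-sum accumulation is given below; the 128-Bank x 2 KB geometry is taken from the description above, while the class and method names are illustrative assumptions.

```python
class AccumBuffer:
    """Behavioural model of the accumulation buffer: 128 Banks, 2 KB each,
    holding up to 512 32-bit values per Bank (sizes from the description;
    the interface itself is an illustrative assumption)."""
    BANKS, WORDS_PER_BANK = 128, 512

    def __init__(self):
        self.mem = [[0.0] * self.WORDS_PER_BANK for _ in range(self.BANKS)]

    def accumulate(self, bank, addr, partial_sum):
        # Partial sums from different ReRAM arrays targeting the same output
        # position land on the same (bank, addr) and are summed in place.
        self.mem[bank][addr] += partial_sum

    def drain(self, bank):
        # Hand a finished Bank to the vector logic unit and clear it.
        data, self.mem[bank] = self.mem[bank], [0.0] * self.WORDS_PER_BANK
        return data

buf = AccumBuffer()
for array_out in (0.25, 0.50, 0.75):      # three arrays contribute to one output point
    buf.accumulate(bank=0, addr=10, partial_sum=array_out)
print(buf.drain(0)[10])                   # 1.5: the fully accumulated output value
```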
(4) The vector logic unit is responsible for executing the vector operations, such as activation and pooling, on the output data in the neural network pipeline, and sits between the accumulation buffer and the global buffer.
After the accumulation buffer completes the accumulation of the partial sums, the result is sent to the vector unit, which performs the corresponding activation-function calculation and network pooling calculation according to the requirements of the neural network layer.
(5) The global buffer is used to buffer the intermediate data between neural network layers; it adopts a ping-pong buffer design with a capacity of 4 MB. Its input is connected to the vector unit and its output to the input register. Logically the global buffer contains two groups of buffer regions: the first group holds the input feature data of the current network layer, and the second group holds the output data of that layer. When the input feature data are no longer needed, the global buffer clears them and shifts the remaining data to fill the freed positions, providing extra space for the network-layer output.
During inference calculation, the DMA unit first fetches the input feature map from DRAM and places it in the buffer region for the network input. During the convolution calculation, the global buffer performs the ping-pong operation, shifting data and buffering the new network output data; after the calculation of the neural network layer is finished, only the output feature data remain in the buffer region, and they serve as the input data for the calculation of the next network layer.
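The ping-pong discipline can be sketched as follows (region names and methods are illustrative assumptions, not the actual buffer controller): one region holds the current layer's input features, the other collects its outputs, and a swap promotes the outputs to become the next layer's inputs.

```python
class PingPongGlobalBuffer:
    """Illustrative model of the 4 MB global buffer split into two logical
    regions: 'in' holds the current layer's input features, 'out' collects
    its outputs; swap() promotes the outputs to be the next layer's inputs."""

    def __init__(self):
        self.regions = {"in": [], "out": []}

    def load_input(self, feature_map):        # e.g. DMA from DRAM before the first layer
        self.regions["in"] = list(feature_map)

    def write_output(self, data):             # results arriving from the vector unit
        self.regions["out"].append(data)

    def swap(self):
        # Layer finished: drop consumed inputs, outputs become next layer's inputs.
        self.regions["in"], self.regions["out"] = self.regions["out"], []

gbuf = PingPongGlobalBuffer()
gbuf.load_input([0.1, 0.2, 0.3])
gbuf.write_output(0.9)
gbuf.swap()
print(gbuf.regions["in"])   # [0.9] is now the input of the next layer
```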
(6) The calculation control unit is responsible for managing the whole process of on-chip data access, data movement and calculation.
First, the control unit determines the calculation flow of the convolution layer. The convolution calculation flow in this architecture is as follows: the sliding of the convolution kernel over the input features is divided into three directions, namely the W direction, the IC direction and the H direction; during calculation, the sliding window first moves along the W direction, then along the IC direction, and so on. Six groups of signals are designed in the control unit to control the calculation process: isFirstSlice, isLastSlice, isSliceDone, isPlaneDone, isCubeDone and isDone. The isFirstSlice and isLastSlice signals mark the position of the current Slice in the W direction; isSliceDone, isPlaneDone and isCubeDone indicate whether the calculation of the current Slice, Plane and Cube has finished; and isDone marks whether the calculation of the whole input feature map is complete. Fig. 2 shows the control mode and control domain of the different signals.
After each convolution sliding-window calculation, the corresponding data are sent to the same position of the accumulation buffer; during this process the control unit computes the output address of the array and controls the sending of the data. After the convolution-kernel sliding is finished and the accumulation in the accumulation buffer is complete, the control unit sends the output results in turn to the vector unit to execute the vector operations and stores the results in the global buffer.
Fig. 3 shows a schematic diagram of the execution model of a convolution layer in a neural network under a typical convolution execution mode. In the present invention the convolution calculation is split into five steps, as indicated in the figure (a loop-nest sketch of this traversal is given after the list). The specific execution flow is as follows:
(1) calculate the data within the sliding window at the granularity of length, width and input channels of 3 × 3 × 16;
(2) switch the calculation row, repeat step (1), and finish the calculation of all rows in the convolution sliding window;
(3) switch the convolution calculation block, repeat step (2), and finish the calculation over all input channels of the feature map;
(4) repeat step (3) to finish the calculation of all columns in the sliding window, then move the sliding window along the row dimension and finish the calculation of the data in the row direction of the input feature map;
(5) move the sliding window along the column dimension of the input feature map and repeat step (4) to finish the data calculation over all column dimensions.
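Read as a loop nest, the traversal above (W innermost, then IC, then H) can be sketched as follows; the way the six status flags are derived from the loop position is an interpretation, and all names besides the flag names are illustrative.

```python
def conv_control_flow(w_out, ic_blocks, h_out):
    """Sketch of the traversal order stated above: the sliding window moves
    along W first (innermost), then across input-channel (IC) blocks, and
    finally along H (outermost). The six status flags are derived from the
    loop position; their exact semantics here are an interpretation.
    Yields ((h, ic, w), flags) for every step.
    """
    for h in range(h_out):                          # outermost: H direction
        for ic in range(ic_blocks):                 # middle: IC direction
            for w in range(w_out):                  # innermost: W direction
                last_w = (w == w_out - 1)
                last_ic = (ic == ic_blocks - 1)
                last_h = (h == h_out - 1)
                flags = {
                    "isFirstSlice": w == 0,
                    "isLastSlice":  last_w,
                    "isSliceDone":  True,                     # one W step finished
                    "isPlaneDone":  last_w,                   # full W sweep finished
                    "isCubeDone":   last_w and last_ic,       # all IC blocks finished
                    "isDone":       last_w and last_ic and last_h,
                }
                yield (h, ic, w), flags

# Tiny example: 2 W positions, 2 IC blocks, 2 H positions.
for pos, f in conv_control_flow(w_out=2, ic_blocks=2, h_out=2):
    if f["isCubeDone"]:
        print("cube finished at (h, ic, w) =", pos, "| whole map done:", f["isDone"])
```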
(7) The feature-data read/write DMA (RDMA/WDMA): the RDMA is used to carry the feature data to the input register and supports carrying multiple data items from multiple Banks to the input register in parallel, which keeps the input register working at full load.
The WDMA is responsible for carrying the computed results from the buffers of the vector logic unit into the global buffer.
As an implementation case, the first convolution layer in the 9th residual block of the classical Resnet50 model is selected. The network structure of this layer is: a convolution kernel of size 3 × 3, with 256 input channels and 256 output channels.
Assume the Array size of the accelerator is 4 and the number of Banks is 64.
Before the calculation starts, the weights are unrolled into one-dimensional data at the granularity of a Cube whose length, width and input channels are 3 × 3 × 16; each Cube is mapped into one row of the crossbar circuit, and data of different input-channel and output-channel dimensions are mapped onto multiple Banks and multiple Arrays. As shown in Fig. 4, the Cube of each color is mapped into a different Array (only the mapping of the first three Cubes is shown), and the weights of different output channels are placed on different Banks. Meanwhile, the input feature map is stored at the first address of the global buffer.
During inference, the input register fetches data from the first address of the global buffer at Cube granularity; the data layout is adjusted by a reshaping circuit to match the layout of the weights in the ReRAM array, and after conversion each Cube of data is unrolled into a one-dimensional vector of size 144. The vector is forwarded through the NoC to the corresponding ReRAM array and multiply-accumulated with the weights to obtain the partial sum of one point on the output feature map, which is then sent to the accumulation buffer. After the received Cube has been computed, the system fetches the next data Cube along the input channels of the input feature map and sends it to the input register to repeat the calculation. Note that the data Cubes along the same input dimension are placed, after the multiply-accumulate operation, at the same position of the accumulation buffer.
When the 16 Cubes along the input-channel dimension have all been computed, the sliding window switches to the next sliding position and the calculation repeats. Meanwhile, the data in the accumulation buffer are sent to the activation unit to execute the ReLU operation, and the buffer is refreshed to await the data of the next round of calculation. The data passing through the activation unit are sent back to the global buffer, which also flushes out the data that are no longer needed through the ping-pong mode.
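The sizes quoted in this walk-through can be checked with a few lines of arithmetic; the figures below follow directly from the layer shape and the 16-channel Cube granularity (the Array and Bank counts are the geometry assumed above).

```python
# Worked numbers for the example layer: 3x3 kernel, 256 input / 256 output channels.
KH, KW, IC, OC = 3, 3, 256, 256
CUBE_IC = 16                                  # input channels covered by one Cube

cube_len = KH * KW * CUBE_IC                  # elements in one unrolled Cube
cubes_per_kernel = IC // CUBE_IC              # Cubes (and partial sums) per output point
weights_per_oc = KH * KW * IC                 # full unrolled kernel length per output channel
total_weights = weights_per_oc * OC           # weights spread over the 4 Arrays / 64 Banks

print(cube_len)          # 144 -> matches the 144-element vector in the text
print(cubes_per_kernel)  # 16  -> partial sums to accumulate per output point
print(weights_per_oc)    # 2304
print(total_weights)     # 589824
```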
When all the data have been calculated, only the output feature map of the current layer remains in the global buffer, and the calculation of the next network layer begins.
Compared with the prior art, the invention has the following advantages:
(1) Non-blocking data flow: complemented by efficient DMA and a flexible data-layout format, the features can be read and the results written efficiently.
(2) A simple and efficient computing structure: the accelerator contains only three buffer structures, namely a ping-pong buffer for data-layout adjustment, an accumulation buffer for partial-sum accumulation, and a global buffer for intermediate-data caching.
(3) The ReRAM array in the accelerator structure has high design flexibility and supports multiple weight-placement modes, so the accelerator can support a variety of convolution operators.
(4) Efficient convolution and vector calculation paths: the convolution calculation results can be output to the coprocessor in a non-blocking, serial manner for activation and similar operations, and the coprocessor can also independently handle activation, pooling, quantization and other operations.
The invention provides a ReRAM-based neural network accelerator structure that uses a novel neural-network weight-mapping method for inference calculation, together with efficient direct memory access (DMA) and a flexible data-layout format; it improves the parallelism of the data flow and the calculation flow, solves the problem of data blocking during inference, and raises the data throughput of the architecture. The cache hierarchy of the architecture is simple in design and highly flexible, which facilitates the hardware-circuit implementation while fully supporting accelerated calculation of the common operators in mainstream neural networks.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a ROM, a RAM, etc.
The above disclosure is only a preferred embodiment of the present invention and certainly cannot be taken to limit the scope of rights of the invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the invention.

Claims (8)

1. A neural network accelerator based on ReRAM, characterized by comprising a ReRAM in-situ calculation array, an input register, an accumulation buffer, a vector logic unit, a global buffer, a calculation control unit and a feature-data read/write DMA; the input register and the accumulation buffer are connected to the ReRAM in-situ calculation array;
the calculation units of the ReRAM in-situ calculation array are organized in a crossbar circuit layout, and the cross point of each word line and bit line is a ReRAM device that maps multi-bit data; the ReRAM in-situ calculation Array comprises a plurality of groups of calculation arrays, and each group of calculation arrays comprises a plurality of Bank ReRAM calculation units in the row direction;
the input register is the bridge between the input feature map and the array calculation during computation and is used for data caching, layout conversion and data distribution;
the accumulation buffer is used for buffering and accumulating partial sum data output by the ReRAM in-situ calculation array;
the vector logic unit is used for performing corresponding activation function calculation and network pooling calculation according to the requirement of the neural network layer;
the input end of the global buffer is connected with the vector logic unit, and the output end of the global buffer is connected with the input register; the buffer is used for caching intermediate data among the neural network layers;
the calculation control unit is used for managing the whole processes of on-chip data access, transportation and calculation;
the feature-data read/write DMA is used for transporting the feature data.
2. The ReRAM-based neural network accelerator according to claim 1, wherein, when the ReRAM in-situ calculation array performs the weight mapping algorithm, the Bank is used as the first priority and the Bank Row direction as the second priority, and each weight datum is regularly arranged in the crossbar circuit of the ReRAM in-situ calculation array in a specific bit order, from high to low or from low to high.
3. The ReRAM-based neural network accelerator according to claim 1, wherein the logical bottom end of each bit line of the ReRAM in-situ calculation array is connected to an output register and an adder unit for performing accumulation across the data of multiple bit lines.
4. The ReRAM-based neural network accelerator according to claim 1, wherein the calculation units of the ReRAM in-situ calculation array are interconnected via a NoC.
5. The ReRAM-based neural network accelerator according to claim 1, wherein, during layout conversion, the input register adjusts the data layout inside the iReg according to the requirements of the input feature data, and the slice data of each input channel are rearranged to match the calculation positions of the weight data stored in the ReRAM in-situ calculation array.
6. The ReRAM-based neural network accelerator according to claim 1, wherein the global buffer adopts ping-pong buffering.
7. The ReRAM-based neural network accelerator according to claim 1, wherein the calculation process of the calculation control unit is: first, the sliding of the convolution kernel over the input features is divided into three directions, namely the W direction, the IC direction and the H direction; during calculation, the sliding window first moves along the W direction, then along the IC direction, and finally along the H direction; after each convolution sliding-window calculation, the corresponding output data are sent to the same position of the accumulation buffer, during which the calculation control unit computes the output address of the array and controls the sending of the data; after the convolution-kernel sliding is finished and the accumulation in the accumulation buffer is complete, the control unit sends the output results in turn to the vector logic unit to execute the vector operations and stores the results in the global buffer.
8. The ReRAM-based neural network accelerator according to claim 1, wherein the feature-data read/write DMA comprises an RDMA and a WDMA; the RDMA is used to carry multiple data items from multiple Banks to the input register in parallel, keeping the input register working at full load; and the WDMA is used to transport the calculated data from the buffer of the vector logic unit to the global buffer.
CN202310049117.8A 2023-02-01 2023-02-01 Neural network accelerator for ReRAM Active CN115965067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310049117.8A CN115965067B (en) 2023-02-01 2023-02-01 Neural network accelerator for ReRAM

Publications (2)

Publication Number Publication Date
CN115965067A true CN115965067A (en) 2023-04-14
CN115965067B CN115965067B (en) 2023-08-25

Family

ID=87359911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310049117.8A Active CN115965067B (en) 2023-02-01 2023-02-01 Neural network accelerator for ReRAM

Country Status (1)

Country Link
CN (1) CN115965067B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120303871A1 (en) * 2011-05-26 2012-11-29 Naoya Tokiwa Semiconductor memory device and method of controlling the same
CN107195321A (en) * 2017-05-15 2017-09-22 Huazhong University of Science and Technology Crossbar-structure resistive memory performance optimization method and system
WO2020258528A1 (en) * 2019-06-25 2020-12-30 Southeast University Configurable universal convolutional neural network accelerator
WO2021004366A1 (en) * 2019-07-08 2021-01-14 Zhejiang University Neural network accelerator based on structured pruning and low-bit quantization, and method
WO2021098821A1 (en) * 2019-11-20 2021-05-27 Huawei Technologies Co., Ltd. Method for data processing in neural network system, and neural network system
CN113126898A (en) * 2020-01-15 2021-07-16 Samsung Electronics Co., Ltd. Memory device, operating method thereof, and operating method of memory controller
KR20210092078A (en) * 2020-01-15 2021-07-23 Samsung Electronics Co., Ltd. Memory device performing parallel calculation process, operating method thereof and operating method of memory controller controlling memory device
US20220019408A1 (en) * 2020-07-17 2022-01-20 Samsung Electronics Co., Ltd. Method and apparatus with neural network processing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Justin M. Correll, "An 8-bit 20.7 TOPS/W multi-level cell ReRAM-based compute engine", 2022 IEEE Symposium on VLSI Technology and Circuits *
Naifeng Jing, "Fast FPGA-based emulation for ReRAM-enabled deep neural network accelerator", 2021 IEEE International Symposium on Circuits and Systems *
Zhang Hang, "Architecture design and optimization based on emerging non-volatile memory technologies", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN115965067B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
Ji et al. ReCom: An efficient resistive accelerator for compressed deep neural networks
CN111897579B (en) Image data processing method, device, computer equipment and storage medium
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN112149816B (en) Heterogeneous memory-computation fusion system and method supporting deep neural network reasoning acceleration
CN110738308B (en) Neural network accelerator
CN109948774A (en) Neural network accelerator and its implementation based on network layer binding operation
CN110705702A (en) Dynamic extensible convolutional neural network accelerator
CN111783933A (en) Hardware circuit design and method for data loading device combining main memory and accelerating deep convolution neural network calculation
CN109446478B (en) Complex covariance matrix calculation system based on iteration and reconfigurable mode
CN111459552B (en) Method and device for parallelization calculation in memory
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
Iliev et al. Low latency CMOS hardware acceleration for fully connected layers in deep neural networks
CN115965067B (en) Neural network accelerator for ReRAM
CN108920097B (en) Three-dimensional data processing method based on interleaving storage
US20230025068A1 (en) Hybrid machine learning architecture with neural processing unit and compute-in-memory processing elements
CN112862079B (en) Design method of running water type convolution computing architecture and residual error network acceleration system
CN112328536B (en) Inter-core structure of multi-core processor array and multi-core processor
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
CN114072778A (en) Memory processing unit architecture
CN109583577B (en) Arithmetic device and method
CN115496190A (en) Efficient reconfigurable hardware accelerator for convolutional neural network training
CN113392959A (en) Method for reconstructing architecture in computing system and computing system
CN115719088B (en) Intermediate cache scheduling circuit device supporting in-memory CNN
CN111709872B (en) Spin memory computing architecture of graph triangle counting algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant