CN113723044A - Data-sparsity-based excess row activation computing-in-memory accelerator design - Google Patents

Data-sparsity-based excess row activation computing-in-memory accelerator design

Info

Publication number
CN113723044A
Authority
CN
China
Prior art keywords
data
prediction
excess
sparsity
design
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111061410.3A
Other languages
Chinese (zh)
Other versions
CN113723044B (en)
Inventor
景乃锋
郭梦裕
张子涵
蒋剑飞
王琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202111061410.3A
Publication of CN113723044A
Application granted
Publication of CN113723044B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C 7/10 Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C 7/1006 Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/30 Circuit design
    • G06F 30/39 Circuit design at the physical level
    • G06F 30/392 Floor-planning or layout, e.g. partitioning or placement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C 11/54 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 8/00 Arrangements for selecting an address in a digital store
    • G11C 8/08 Word line control circuits, e.g. drivers, boosters, pull-up circuits, pull-down circuits, precharging circuits, for word lines
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 13/00 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
    • G11C 13/0002 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 8/00 Arrangements for selecting an address in a digital store
    • G11C 8/10 Decoders
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Architecture (AREA)
  • Geometry (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data-sparsity-based excess row activation computing-in-memory accelerator design, relating to the field of neural network accelerator design for computing-in-memory architectures. The design comprises three parts: a prediction mechanism based on row activation data is constructed, which models peripheral circuit device limitations and computational parallelism and solves the problem of matching the peripheral circuits to the computational parallelism; a row activation oversubscription mechanism is constructed, which adaptively adjusts the computational parallelism and resource usage and addresses the low utilization of the compute array and peripheral circuits and the resource redundancy that arise under sparse data; and the control flow and data flow are re-planned for the data sparsity characteristic of neural networks, solving the complex circuit design problem introduced by exploiting data sparsity. The invention models the relationship between peripheral circuit device limitations and computational parallelism by predicting the scale of the output data, and adaptively adjusts the computational parallelism and resource usage according to the prediction so as to utilize the peripheral circuit resources to the maximum extent.

Description

Data-sparsity-based excess row activation computing-in-memory accelerator design
Technical Field
The invention relates to the field of neural network accelerator design for computing-in-memory architectures, and in particular to a data-sparsity-based excess row activation computing-in-memory accelerator design.
Background
In recent years, with the rapid development of convolutional neural network applications, the demand for dedicated accelerators has grown steadily. Accelerators based on conventional architectures face great challenges in improving acceleration performance because of the significant cost of data movement during computation, and neural network accelerators based on resistive random access memory (ReRAM) have become a new paradigm for addressing the memory-wall problem. Such an accelerator programs the weight data into the crossbar array, thereby reducing the amount of data movement; by exploiting the continuously tunable resistance of ReRAM cells, multiply-accumulate operations can be computed in the analog domain on the crossbar with massive parallelism. Compared with computation on conventional CMOS architectures, this greatly reduces the need for offloading intermediate data and moving data around, improving energy efficiency by more than 100 times.
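As a functional illustration of the in-crossbar multiply-accumulate just described, the following minimal sketch models each bit-line output as the sum of input values weighted by programmed cell conductances; it is a software approximation for exposition only (the function name crossbar_mvm is invented here), not an electrical model from the patent.

```python
# Functional sketch of the analog MAC on a ReRAM crossbar: each column (bit line)
# accumulates input * conductance over all activated rows (word lines).

def crossbar_mvm(inputs, conductances):
    """inputs[i]: input value applied on row i; conductances[i][j]: weight of
    row i, column j. Returns the per-column accumulated results."""
    num_cols = len(conductances[0])
    return [sum(x * row[j] for x, row in zip(inputs, conductances))
            for j in range(num_cols)]

print(crossbar_mvm([1, 0, 2], [[1, 0], [1, 1], [0, 1]]))  # -> [1, 2]
```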
However, ReRAM-based accelerators take on a significant burden when they move computation into the analog domain. Expensive analog/digital conversion units incur huge area and energy overhead; in recent designs, peripheral circuits such as analog-to-digital converters (ADCs) account for 95.5% of the area and 55.9% of the power consumption. As computational demands increase, the area and energy limitations of these conversion units have also inhibited the development of ReRAM-based accelerators.
To reduce the area and power consumption of the ADC, one conventional optimization is the low-precision interface. This approach converts the original high-precision data into low-precision data, reducing the data scale and data range and thereby relieving the limits on computing resources and the pressure on peripheral circuits. However, such methods are tuned only for specific networks, lack generality, and for broader and more complex models incur an intolerable loss of accuracy. Optimization can also be achieved by pruning unnecessary computation requests. This approach analyzes, from the weight data, how much each computation request contributes to the result and prunes requests with small contributions, thereby reducing the demand for computing resources, but the complicated control logic and the associated placement and routing introduce extra area overhead.
On the other hand, compared with the sparsity of neural network weight data, the characteristics of the input data have been underestimated and largely ignored. For example, there may be many zeros in the input feature map, because the widely used ReLU activation clamps all negative activations to zero. There are also many zero bits within the non-zero values. When the inputs are time-sliced into bit slices and fed to the ReRAM crossbar, these introduce a large number of zero bits and wasted computations, degrading performance and power consumption.
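The following minimal sketch, included only for illustration and not taken from the patent, shows how bit-slicing (here assumed to use 2-bit slices of 8-bit activations, with the helper name bit_slices invented for the example) turns even non-zero activations into many all-zero input slices, which is exactly the wasted work described above.

```python
# Illustrative bit-slicing of activations: zeros from ReLU plus the high-order
# zero bits of small values yield many all-zero slices on the crossbar inputs.

def bit_slices(value, slice_bits=2, total_bits=8):
    """Split an unsigned activation into little-endian bit slices."""
    mask = (1 << slice_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, slice_bits)]

for activation in [0, 0, 3, 17]:
    print(activation, bit_slices(activation))  # 3 -> [3, 0, 0, 0]; 17 -> [1, 0, 1, 0]
```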
Conversely, given the presence of these zeros, the accumulated result on the bit lines may well be smaller than the pre-designed ADC range. In other words, the same output values can still be captured while activating more rows, without changing the resolution of the ADC, which yields higher computational parallelism. This is referred to herein as row activation oversubscription (RAOS). Unlike weight sparsity, input data sparsity and small values are difficult to detect at runtime. The main challenge of RAOS is therefore how to learn from the dynamic data to find a sufficient oversubscription rate.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the technical problem to be solved by the present invention is how to utilize the resources of the peripheral circuit to the maximum extent.
In order to achieve this aim, the invention provides a data-sparsity-based excess row activation computing-in-memory accelerator design, characterized by comprising three parts: a prediction mechanism based on row activation data is constructed, which models peripheral circuit device limitations and computational parallelism and solves the problem of matching the peripheral circuits to the computational parallelism; a row activation oversubscription mechanism is constructed, which adaptively adjusts the computational parallelism and resource usage and addresses the low utilization of the compute array and peripheral circuits and the resource redundancy under sparse data; and the control flow and data flow are re-planned for the data sparsity characteristic of neural networks, solving the complex circuit design problem introduced by exploiting data sparsity.
Further, constructing the prediction mechanism based on row activation data and constructing the row activation oversubscription mechanism include: analyzing the output data size from the input data and adjusting the computational parallelism so as to utilize the peripheral circuit design to the maximum extent.
Further, the prediction on the input data is computed with ReRAM prediction columns, which reduces area and power consumption.
Further, the prediction result is analyzed with a voltage comparator, reducing the overhead caused by analog-to-digital conversion.
Further, the re-planning of the control flow and data flow includes: analyzing the data characteristics of the neural network and providing a binary (dichotomy) method and a sliding method to improve the device's ability to exploit sparsity.
Further, in the binary method the prediction pattern is generated according to a bisection scheme: the computation data is predicted all at once, and row activation with a smaller oversubscription rate is used for data that fails the prediction.
Further, in the sliding method each prediction is made at the maximum oversubscription rate, and the data that fails the prediction is merged with the subsequent computation data for a new round of prediction, iterating until the computation requirement is completed.
Further, constructing the row activation oversubscription mechanism includes: providing an oversubscription mechanism design and its circuit implementation according to the prediction on the row activation data.
Further, the prediction result of the row activation data is decoded by a leading-zero circuit to select an oversubscription scheme, and subsequent circuit control is carried out through two masks.
Further, the subsequent circuit control includes a compute mask that controls the input data fed to the ReRAM compute core, and a prediction mask that controls the input data fed to the ReRAM prediction unit.
In order to solve the imbalance between peripheral circuit design and computing resource requirements, the invention designs a row activation oversubscription mechanism that combines the data characteristics and structural characteristics of the neural network. The method models the relationship between peripheral circuit device limitations and computational parallelism by predicting the scale of the output data, and adaptively adjusts the computational parallelism and resource usage according to the prediction so as to utilize the peripheral circuit resources to the maximum extent.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is a ReRAM accelerator architecture design in accordance with a preferred embodiment of the present invention;
FIG. 2 is a comparison of different designs at a 4X RAOS rate according to a preferred embodiment of the present invention;
FIG. 3 is a comparison of the normalized energy consumption of ISAAC, SRE, and the present invention according to a preferred embodiment of the present invention;
FIG. 4 is a graph of the performance improvement using different RAOS rates for a preferred embodiment of the present invention;
FIG. 5 is a computational core design of a preferred embodiment of the present invention;
FIG. 6 is a row activation oversubscription circuit implementation of a preferred embodiment of the present invention;
FIG. 7 is a ReRAM accelerator pipeline design in accordance with a preferred embodiment of the present invention;
FIG. 8 is a block diagram of binary prediction and sliding prediction in accordance with a preferred embodiment of the present invention.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
In the drawings, structurally identical elements are represented by like reference numerals, and structurally or functionally similar elements are represented by like reference numerals throughout the several views. The size and thickness of each component shown in the drawings are arbitrarily illustrated, and the present invention is not limited to the size and thickness of each component. The thickness of the components may be exaggerated where appropriate in the figures to improve clarity.
ReRAM-based neural network accelerator designs perform computation in the analog domain; the development of such accelerators is hampered by the huge overhead introduced by digital/analog conversion, and although some low-precision methods have been implemented, they sacrifice certain benefits. On the other hand, data sparsity in neural networks is severely underestimated, resulting in low utilization of computational resources and substantial waste. In order to solve the imbalance between peripheral circuit design and computing resource requirements, the invention designs a row activation oversubscription mechanism that combines the data characteristics and structural characteristics of the neural network. The method models the relationship between peripheral circuit device limitations and computational parallelism by predicting the scale of the output data, and adaptively adjusts the computational parallelism and resource usage according to the prediction so as to utilize the peripheral circuit resources to the maximum extent.
First, we estimate the output data size to determine the oversubscription rate. The idea is as follows:

$$\hat{O} = \sum_{i}\Big(\max_{j} w_{i,j}\Big)\, x_i \;\ge\; \sum_{i} w_{i,j}\, x_i = O_j \quad \text{for every column } j,$$

i.e., every output value can be bounded by accumulating, over the active rows, the maximum weight of each row multiplied by the corresponding input data. When this predicted bound is smaller than the pre-designed ADC range, RAOS is applied to select a larger computational parallelism and complete the computation without affecting accuracy. Otherwise, normal row activation is used to guarantee the normal operation of the ADC.
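As a concrete illustration of this decision rule, here is a minimal Python sketch written under the assumptions above (row-wise maximum weights, a fixed ADC range); the names predict_bound, can_oversubscribe, and adc_max are invented for the example and do not appear in the patent.

```python
# Sketch of the RAOS decision: bound every column's accumulated result by
# sum_i( max_j(w_ij) * x_i ) and compare the bound against the ADC range.

def predict_bound(inputs, weight_matrix):
    """inputs[i]: input slice value x_i of active row i;
    weight_matrix[i][j]: weight w_ij of row i, column j."""
    return sum(max(row) * x for row, x in zip(weight_matrix, inputs))

def can_oversubscribe(inputs, weight_matrix, adc_max):
    """True if the predicted bound still fits the pre-designed ADC range."""
    return predict_bound(inputs, weight_matrix) <= adc_max

# Example: 2-bit input slices, 1-bit weights, ADC range [0, 4]
x = [0, 1, 0, 2]                       # sparse inputs after ReLU / bit-slicing
w = [[1, 0], [1, 1], [0, 1], [1, 0]]   # 4 rows x 2 columns
print(predict_bound(x, w))             # 3 -> fits within the ADC range
print(can_oversubscribe(x, w, adc_max=4))  # True: all 4 rows can be activated
```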
We use two methods to obtain a sufficient oversubscription rate. In binary prediction, the prediction pattern is generated according to a bisection scheme: the computation data is predicted all at once, and data that fails the prediction is activated with a smaller oversubscription rate. In sliding prediction, each prediction is made at the maximum oversubscription rate, and data that fails the prediction is merged with the subsequent computation data for a new round of prediction, iterating until the computation requirement is completed.
RAOS is mainly built on the crossbar to increase its computational parallelism, and the hierarchical structure is shown in FIG. 1. FIG. 5 shows the detailed implementation of a computing core. Compared with the traditional architecture, the RAOS architecture adds three blocks, namely a ReRAM prediction column, a prediction decoder, and a row activation controller, as shown in FIG. 6. The details are as follows:
1. Prediction column
Two prediction columns equip the crossbar with 4X sliding prediction capability. Since a prediction column only checks whether the output data exceeds the pre-designed ADC range, the accumulated current is converted into a voltage and analyzed with a voltage comparator. The two prediction columns give a prediction vector containing one pass/fail signal per column.
2. Prediction decoder
The prediction decoder converts the prediction vector from the prediction columns into the selected RAOS rate. It consists of a leading detector that selects the position of the first "pass" signal in the vector.
3. Row activation controller
The row activation controller simultaneously produces a prediction row activation mask and a compute row activation mask, which are used to activate rows of the prediction columns and of the crossbar array, respectively. It consists of 8-bit masks, each bit controlling one of 8 independent row activation groups. The controller has a built-in completion register, initialized to 0, that records the number of row groups whose computation has been completed. The prediction row activation mask and the compute row activation mask can be derived from the number of completed row groups and the prediction result, and hold and complete signals are issued to the input and output buffers for data updating.
Based on these three modules, the RAOS crossbar operates as follows. First, in the prediction phase, the prediction columns take input data from the input buffer under the prediction row activation mask; the prediction result is detected by the leading-zero circuit, which outputs the selected RAOS rate to the row activation controller. Then, in the compute phase, the crossbar array performs the multiply-accumulate computation at the selected RAOS rate using the input data gated by the compute row activation mask. Both masks are updated with the selected RAOS rate in every computation cycle. The prediction and compute stages are pipelined. After the computation is complete, the sliced data are combined in the shift-and-accumulate unit as before. The input and output buffers can be organized as ping-pong buffers to overlap data loading with prediction and computation, enabling pipelining, as shown in FIG. 7.
Based on the ReRAM prediction columns, we use two methods to obtain a sufficient oversubscription rate. The basic idea is shown in FIG. 8. Assume that, with 2-bit input slices and 1-bit weight slices, normal row activation (1X) activates 1 row at a time and the pre-designed ADC can detect an output range of [0, 4]. When a 4X RAOS rate is applied, three prediction columns are needed to find the appropriate rate. The cells in the first prediction column are each programmed with the maximum weight of the corresponding row over the 4 rows. In the second column, the first two rows are programmed with the maximum weights and the last two rows are programmed to "0"; the third column reverses the pattern of the second column. The three prediction columns work simultaneously. That is, if the first column detects that the output data is out of range, the second and third columns are checked for the 2X oversubscription rate. If either of them succeeds, we can still oversubscribe by 2X. Otherwise, we fall back to activating 1 row at a time and complete the 4 rows of the MVM computation one by one. Since the prediction columns work on a binary partition, we refer to this herein as binary prediction.
In contrast to binary prediction, sliding prediction always performs the next prediction at the highest permitted RAOS rate, even after a prediction fails. For input0 in the example of FIG. 8(b), even if the first 4X prediction fails and the subsequent computation uses the 2X rate, the next sliding prediction is still performed at the 4X rate.
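To make the contrast concrete, the following sketch simulates both schemes under simplifying assumptions: a fits() predicate stands in for the prediction-column check, and a failed check falls back to the next lower oversubscription rate (half the group size, i.e. 4X to 2X to 1X); the helper names binary_schedule and sliding_schedule are invented for this illustration and are not part of the patent.

```python
# Illustrative comparison of binary (dichotomy) and sliding prediction.
# fits(group) stands in for the prediction-column check.

def binary_schedule(rows, fits, max_rate=4):
    """Binary prediction: split each fixed max_rate window by halving on failure."""
    def split(group):
        if len(group) == 1 or fits(group):
            return [group]
        half = len(group) // 2
        return split(group[:half]) + split(group[half:])
    schedule = []
    for start in range(0, len(rows), max_rate):
        schedule += split(rows[start:start + max_rate])
    return schedule

def sliding_schedule(rows, fits, max_rate=4):
    """Sliding prediction: always predict the next max_rate unfinished rows;
    on failure, compute a smaller group and slide the window forward."""
    schedule, start = [], 0
    while start < len(rows):
        group = rows[start:start + max_rate]
        while len(group) > 1 and not fits(group):
            group = group[:len(group) // 2]
        schedule.append(group)
        start += len(group)
    return schedule

# Example: rows 0-7, where any group containing row 2 exceeds the ADC range
fits = lambda g: 2 not in g
print(binary_schedule(list(range(8)), fits))   # [[0, 1], [2], [3], [4, 5, 6, 7]]
print(sliding_schedule(list(range(8)), fits))  # [[0, 1], [2], [3, 4, 5, 6], [7]]
```

In the printed example, sliding prediction recovers a full 4X group (rows 3 to 6) that binary prediction cannot form, because binary prediction only partitions within fixed 4-row windows.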
FIG. 6 shows the row activation oversubscription process after prediction is completed. Assume that the input data has passed through the prediction columns, that the output data exceeds the range at the 4X oversubscription rate, and that it does not exceed the range at the 2X oversubscription rate; the leading detection circuit in the prediction decoder then recognizes the 2X oversubscription rate and passes the result downstream. The row activation controller simultaneously produces the prediction row activation mask and the compute row activation mask to activate rows of the prediction columns and of the crossbar array according to the selected oversubscription rate. Assume that 1 row group has already completed computation, i.e., the completion register holds 1, and the prediction decoder decides on the 2X oversubscription rate for the next computation. The compute row activation mask is then 4'b0011 << 1 = 8'b00000110, where 4'b0011 is the standard mask at the 2X oversubscription rate. The prediction row activation mask is 4'b1111 << (1 + 2) = 8'b01111000, where 4'b1111 is the standard mask at the 4X oversubscription rate. At the same time, the completion register updates itself by adding 2, since 2 new row groups have just completed computation. Note that when the completion register reaches 8, the current round of computation is complete and a new input buffer should be swapped in. In addition, the two masks must not overflow past the input and output buffer index range.
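For reference, the mask arithmetic in this example can be reproduced with the short sketch below, written under the assumptions stated in the text (8 row groups, a 4X maximum RAOS rate); the function name row_activation_masks is invented here.

```python
# Sketch of the row activation controller's mask generation for the example:
# completion register = 1, selected RAOS rate = 2X, 8 row groups in total.

def row_activation_masks(completed, raos_rate, max_rate=4, groups=8):
    """Return (compute_mask, predict_mask) as `groups`-bit integers."""
    compute_std = (1 << raos_rate) - 1          # 2X -> 4'b0011
    predict_std = (1 << max_rate) - 1           # 4X -> 4'b1111
    limit = (1 << groups) - 1                   # keep masks within buffer range
    compute_mask = (compute_std << completed) & limit
    predict_mask = (predict_std << (completed + raos_rate)) & limit
    return compute_mask, predict_mask

cm, pm = row_activation_masks(completed=1, raos_rate=2)
print(f"{cm:08b}")  # 00000110, i.e. 4'b0011 << 1
print(f"{pm:08b}")  # 01111000, i.e. 4'b1111 << (1 + 2)
```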
This patent mainly provides a data-sparsity-based excess row activation computing-in-memory accelerator design. Common neural network models are deployed on the proposed ReRAM-based row activation oversubscription accelerator, on ISAAC, and on the Sparse ReRAM Engine (SRE), and the performance, area, and power consumption of the different models under the different accelerator designs are obtained and analyzed to demonstrate the technical effect of this patent. The experiments use several popular DNN models, namely ResNet50, InceptionV3, MobileNetV2, ShuffleNetV2, and SqueezeNet, with the dataset taken from ImageNet 2012. The experiments use the original weights and input data, without artificially increasing sparsity.
In the experiments, performance was assessed by running real data cycle by cycle using the NeuroSim simulator, and a row activation oversubscription module was added to each crossbar cell to enhance the crossbar architecture. For a fair evaluation, we unified the array size to 128x128 and used 2-bit inputs and 1-bit weights. We use ADCs consistent with SRE, with a resolution of 6 bits, which supports up to 16 rows being active at a time without oversubscription.
FIG. 4 reports the performance of different RAOS rates and prediction schemes. The 1X structure without RAOS activates 16 rows at a time to stay within the ADC range. It can be seen that higher RAOS rates provide higher performance thanks to the increased computational parallelism. In lightweight networks such as MobileNetV2, ShuffleNetV2, and SqueezeNet, the performance gains are relatively low because these networks exhibit less sparsity. As the RAOS rate increases, the performance improvement levels off: the 2X rate yields nearly a 1.97x improvement, while the 8X rate yields only 5.1x. This is because, at higher oversubscription, the accumulated result becomes large and cannot be compressed into a single computation cycle. The results also show that sliding prediction brings a larger performance advantage than binary prediction: sliding prediction always tries to compress more data at a time, while binary prediction must spend multiple computation cycles to finish the rows of a failed prediction.
The RAOS of the present invention also outperforms ISAAC and SRE. FIG. 2 compares the present design at a 4X rate with ISAAC and SRE. RAOS with sliding prediction improves performance by about 3.1x to 3.8x over the ISAAC design, which computes on dense data. Unlike SRE, which can only skip zero values, the RAOS of the present invention can also compress small accumulated result values into a single computation cycle, so with sliding prediction the present invention further improves performance by about 23% to 31%.
Taking ResNet50 as an example, an energy consumption evaluation was performed with a 128x128 crossbar array size and compared against the ISAAC and SRE designs; the results are shown in FIG. 3. Although the present invention spends extra energy on prediction, it still achieves a much lower total energy. The main energy reduction comes from the ADC and DAC parts. Although the ISAAC design uses a small ADC, the present invention relieves the pressure on the ADC and significantly reduces ADC execution time, so the energy on the ADC and crossbar is reduced proportionally compared with ISAAC. Because the RAOS of the present invention exploits both zero and small values, it utilizes the ADC better than the SRE design, which can only skip zero values, thereby further reducing energy consumption.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A data-sparsity-based excess row activation computing-in-memory accelerator design, characterized by comprising three parts: a prediction mechanism based on row activation data is constructed, which models peripheral circuit device limitations and computational parallelism and solves the problem of matching the peripheral circuits to the computational parallelism; a row activation oversubscription mechanism is constructed, which adaptively adjusts the computational parallelism and resource usage and addresses the low utilization of the compute array and peripheral circuits and the resource redundancy under sparse data; and the control flow and data flow are re-planned for the data sparsity characteristic of the neural network, solving the complex circuit design problem introduced by exploiting data sparsity.
2. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 1, wherein constructing the prediction mechanism based on row activation data and constructing the row activation oversubscription mechanism comprise: analyzing the output data size from the input data and adjusting the computational parallelism so as to utilize the peripheral circuit design to the maximum extent.
3. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 2, wherein the prediction on the input data is computed with ReRAM prediction columns, reducing area and power consumption.
4. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 3, wherein the prediction result is analyzed with a voltage comparator, reducing the overhead caused by analog-to-digital conversion.
5. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 4, wherein re-planning the control flow and data flow comprises: analyzing the data characteristics of the neural network and providing a binary (dichotomy) method and a sliding method to improve the device's ability to exploit sparsity.
6. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 5, wherein in the binary method the prediction pattern is generated according to a bisection scheme, the computation data is predicted all at once, and row activation with a smaller oversubscription rate is used for data that fails the prediction.
7. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 6, wherein in the sliding method each prediction is made at the maximum oversubscription rate, and the data that fails the prediction is merged with the subsequent computation data for a new round of prediction, iterating until the computation requirement is completed.
8. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 7, wherein constructing the row activation oversubscription mechanism comprises: providing an oversubscription mechanism design and its circuit implementation according to the prediction on the row activation data.
9. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 8, wherein the prediction result of the row activation data is decoded by a leading-zero circuit to select an oversubscription scheme, and subsequent circuit control is carried out through two masks.
10. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 9, wherein the subsequent circuit control comprises a compute mask controlling the input data fed to the ReRAM compute core, and a prediction mask controlling the input data fed to the ReRAM prediction unit.
CN202111061410.3A 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity Active CN113723044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111061410.3A CN113723044B (en) 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111061410.3A CN113723044B (en) 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity

Publications (2)

Publication Number Publication Date
CN113723044A true CN113723044A (en) 2021-11-30
CN113723044B CN113723044B (en) 2024-04-05

Family

ID=78683183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111061410.3A Active CN113723044B (en) 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity

Country Status (1)

Country Link
CN (1) CN113723044B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190370645A1 (en) * 2018-06-05 2019-12-05 Nvidia Corp. Deep neural network accelerator with fine-grained parallelism discovery
CN110647983A (en) * 2019-09-30 2020-01-03 南京大学 Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN111026700A (en) * 2019-11-21 2020-04-17 清华大学 Memory computing architecture for realizing acceleration and acceleration method thereof
CN111062472A (en) * 2019-12-11 2020-04-24 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
US20200285949A1 (en) * 2017-04-04 2020-09-10 Hailo Technologies Ltd. Structured Activation Based Sparsity In An Artificial Neural Network
CN211554991U (en) * 2020-04-28 2020-09-22 南京宁麒智能计算芯片研究院有限公司 Convolutional neural network reasoning accelerator

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200285949A1 (en) * 2017-04-04 2020-09-10 Hailo Technologies Ltd. Structured Activation Based Sparsity In An Artificial Neural Network
US20190370645A1 (en) * 2018-06-05 2019-12-05 Nvidia Corp. Deep neural network accelerator with fine-grained parallelism discovery
CN110647983A (en) * 2019-09-30 2020-01-03 南京大学 Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN111026700A (en) * 2019-11-21 2020-04-17 清华大学 Memory computing architecture for realizing acceleration and acceleration method thereof
CN111062472A (en) * 2019-12-11 2020-04-24 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN211554991U (en) * 2020-04-28 2020-09-22 南京宁麒智能计算芯片研究院有限公司 Convolutional neural network reasoning accelerator

Also Published As

Publication number Publication date
CN113723044B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
Lee et al. UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision
Zhu et al. A configurable multi-precision CNN computing framework based on single bit RRAM
US11194549B2 (en) Matrix multiplication system, apparatus and method
US10241971B2 (en) Hierarchical computations on sparse matrix rows via a memristor array
CN109543816B (en) Convolutional neural network calculation method and system based on weight kneading
Ueyoshi et al. Diana: An end-to-end energy-efficient digital and analog hybrid neural network soc
CN113220630B (en) Reconfigurable array optimization method and automatic optimization method for hardware accelerator
Chen et al. A high-throughput and energy-efficient RRAM-based convolutional neural network using data encoding and dynamic quantization
Zheng et al. MobiLatice: a depth-wise DCNN accelerator with hybrid digital/analog nonvolatile processing-in-memory block
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
KR102541461B1 (en) Low power high performance deep-neural-network learning accelerator and acceleration method
CN112529171B (en) In-memory computing accelerator and optimization method thereof
Shabani et al. Hirac: A hierarchical accelerator with sorting-based packing for spgemms in dnn applications
Wang et al. SPCIM: Sparsity-Balanced Practical CIM Accelerator With Optimized Spatial-Temporal Multi-Macro Utilization
Yang et al. GQNA: Generic quantized DNN accelerator with weight-repetition-aware activation aggregating
CN113723044B (en) Excess row activation and calculation integrated accelerator design method based on data sparsity
Rhe et al. VWC-SDK: Convolutional weight mapping using shifted and duplicated kernel with variable windows and channels
Guo et al. Boosting reram-based DNN by row activation oversubscription
Yuan et al. VLSI architectures for the restricted Boltzmann machine
Liu et al. Enabling efficient ReRAM-based neural network computing via crossbar structure adaptive optimization
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN113705794A (en) Neural network accelerator design method based on dynamic activation bit sparsity
Shin et al. Low complexity gradient computation techniques to accelerate deep neural network training
US11954580B2 (en) Spatial tiling of compute arrays with shared control
Shao et al. An FPGA-based reconfigurable accelerator for low-bit DNN training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant