CN113723044B - Excess row activation and calculation integrated accelerator design method based on data sparsity - Google Patents

Excess row activation and calculation integrated accelerator design method based on data sparsity

Info

Publication number
CN113723044B
Authority
CN
China
Prior art keywords
data
prediction
calculation
row
activation
Prior art date
Legal status
Active
Application number
CN202111061410.3A
Other languages
Chinese (zh)
Other versions
CN113723044A (en)
Inventor
景乃锋
郭梦裕
张子涵
蒋剑飞
王琴
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111061410.3A
Publication of CN113723044A
Application granted
Publication of CN113723044B
Legal status: Active


Classifications

    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C7/00 - Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10 - Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1006 - Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/30 - Circuit design
    • G06F30/39 - Circuit design at the physical level
    • G06F30/392 - Floor-planning or layout, e.g. partitioning or placement
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C11/00 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/54 - Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C8/00 - Arrangements for selecting an address in a digital store
    • G11C8/08 - Word line control circuits, e.g. drivers, boosters, pull-up circuits, pull-down circuits, precharging circuits, for word lines
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C13/00 - Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
    • G11C13/0002 - Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11C - STATIC STORES
    • G11C8/00 - Arrangements for selecting an address in a digital store
    • G11C8/10 - Decoders
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Architecture (AREA)
  • Geometry (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an excess row activation and calculation integrated accelerator design method based on data sparsity, relating to the field of neural network accelerator design for compute-in-memory architectures. The method comprises three parts. First, a prediction mechanism based on row activation data is constructed, which models the limits of the peripheral circuit devices against the computational parallelism and solves the matching problem between the peripheral circuits and the computational parallelism. Second, a row activation oversubscription mechanism is constructed, which adaptively adjusts the computational parallelism and resource usage and solves the problems of low utilization of the computing array and peripheral circuits and of resource redundancy under sparse data. Third, the control flow and data flow are re-planned for the data-sparsity characteristic of neural networks, which solves the problem of complex circuit design caused by exploiting data sparsity. The invention models the relation between the peripheral circuit device limits and the computational parallelism by predicting the output data scale, and adaptively adjusts the computational parallelism and resource usage according to the prediction, so that the peripheral circuit resources are utilized to the greatest extent.

Description

Excess row activation and calculation integrated accelerator design method based on data sparsity
Technical Field
The invention relates to the field of neural network accelerator design for compute-in-memory (storage and computation integrated) architectures, and in particular to an excess row activation compute-in-memory accelerator design based on data sparsity.
Background
In recent years, with the rapid development of convolutional neural network applications, the demand for dedicated accelerators keeps increasing. Accelerators based on traditional architectures face significant challenges in improving acceleration performance because of the heavy cost of data movement during computation, while neural network accelerators based on resistive random access memory (Resistive Random Access Memory, ReRAM) have become a new paradigm for solving the memory-wall problem. Such an accelerator programs the weight data into a crossbar, thereby reducing data movement, and through the continuous resistance states of the ReRAM cells the multiply-accumulate operations can be computed in the analog domain on the crossbar with massive parallelism. Compared with a traditional CMOS architecture, this greatly reduces intermediate data transfer and data handling and improves energy efficiency by more than 100 times.
However, ReRAM-based accelerators introduce a significant burden along with analog-domain computation. The expensive analog/digital conversion interface brings huge area and energy costs; in the latest designs, peripherals such as the analog-to-digital converter (Analog-to-Digital Converter, ADC) account for 95.5% of the area and 55.9% of the power consumption. As computational demands grow, the area and energy limitations of the conversion interface have also suppressed the development of ReRAM-based accelerators.
In order to reduce the area and power consumption of the ADC, the traditional optimization approach uses a low-precision interface. This approach converts the original complex data into low-precision data, reducing the data scale and data range and thereby relieving the limits on computing resources and the pressure on the peripheral circuits. However, such methods are optimized only for particular networks, lack generality, and for larger and more complex models suffer from high and intolerable accuracy loss. Optimization can also be achieved by pruning unnecessary computation requests. This approach analyzes, from the weight data, the influence of each computation request on the result and prunes the requests with small influence, reducing the computing resource requirement, but the complicated region control logic and the layout and wiring bring extra overhead.
On the other hand, compared with the sparse weight data of neural networks, the characteristics of the input data are underestimated and largely ignored. For example, there may be many zeros in the input feature maps, because the most common ReLU activation clamps all negative activations to zero. There are also many zero bits inside the non-zero values. When these values are fed to the ReRAM crossbar as time-sliced inputs, they can introduce a large number of zero-bit or idle computations, degrading performance and wasting power.
Conversely, given these zero bits, the accumulated result on a bit line may be smaller than the pre-designed ADC range. In other words, with the ADC resolution unchanged, the same output values can still be obtained while activating more rows, yielding higher computational parallelism. The present invention refers to this as row activation oversubscription (RAOS). Unlike weight sparsity, the sparsity and small values of the input data are difficult to detect at run time. Thus, the main challenge of RAOS is how to learn the dynamic data to find a sufficient oversubscription rate.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the technical problem to be solved by the present invention is how to utilize the peripheral circuit resources to the maximum extent.
In order to achieve the above purpose, the invention provides an excess row activation and calculation integrated accelerator design based on data sparsity, characterized by comprising three parts: constructing a prediction mechanism based on row activation data, which models the limits of the peripheral circuit devices against the computational parallelism and solves the matching problem between the peripheral circuits and the computational parallelism; constructing a row activation oversubscription mechanism, which adaptively adjusts the computational parallelism and resource usage and solves the problems of low utilization of the computing array and peripheral circuits and of resource redundancy under sparse data; and re-planning the control flow and data flow for the data-sparsity characteristic of neural networks, which solves the problem of complex circuit design caused by exploiting data sparsity.
Further, constructing the prediction mechanism based on row activation data and constructing the row activation oversubscription mechanism include: analyzing the output data scale according to the input data, and adjusting the computational parallelism to make maximal use of the peripheral circuit design.
Further, the prediction over the input data is computed with ReRAM columns to reduce area and power consumption.
Further, the prediction result is analyzed by a voltage comparator, reducing the cost of analog-to-digital conversion.
Further, planning the control flow and data flow includes: analyzing the data characteristics of the neural network, and providing a dichotomy method and a sliding method to improve the device's ability to exploit sparsity.
Further, in the dichotomy method, the pattern of each prediction is generated according to a binary-splitting scheme, the data to be calculated are predicted all at once, and the data that fail the prediction are activated with a smaller oversubscription rate.
Further, in the sliding method, each prediction is made at the maximum oversubscription rate, and the data that fail the prediction are combined with subsequent calculation data for a new prediction iteration until the calculation requirement is completed.
Further, constructing the row activation oversubscription mechanism includes: providing an oversubscription mechanism design and a circuit implementation according to the prediction on the row activation data.
Further, the prediction result of the row activation data is decoded by a leading-zero circuit to select the oversubscription rate, and subsequent circuit control is performed by two masks.
Further, the subsequent circuit control includes a compute core mask that controls the input data feed of the ReRAM compute core, and a predict core mask that controls the input data feed of the ReRAM prediction unit.
In order to solve the imbalance between peripheral circuit design and computing resource requirements, the invention combines the data characteristics and structural characteristics of the neural network to design a row activation oversubscription mechanism. The relation between peripheral circuit device limits and computational parallelism is modeled by predicting the output data scale, and the computational parallelism and resource usage are adaptively adjusted according to the prediction so as to utilize the peripheral circuit resources to the greatest extent.
The conception, specific structure, and technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, features, and effects of the present invention.
Drawings
FIG. 1 is a ReRAM accelerator architecture design of a preferred embodiment of the invention;
FIG. 2 is a performance comparison of different designs at a 4X RAOS rate according to a preferred embodiment of the invention;
FIG. 3 is a normalized energy consumption comparison of the present invention with ISAAC and SRE for a preferred embodiment of the present invention;
FIG. 4 is a performance improvement using different RAOS rates in accordance with a preferred embodiment of the present invention;
FIG. 5 is a computational core design of a preferred embodiment of the present invention;
FIG. 6 is a row-activated oversubscription circuit implementation of a preferred embodiment of the present invention;
FIG. 7 is a ReRAM accelerator pipeline design in accordance with a preferred embodiment of the present invention;
FIG. 8 is a binary prediction and a sliding prediction of a preferred embodiment of the present invention.
Detailed Description
The following description of the preferred embodiments of the present invention refers to the accompanying drawings, which make the technical contents thereof more clear and easy to understand. The present invention may be embodied in many different forms of embodiments and the scope of the present invention is not limited to only the embodiments described herein.
In the drawings, like structural elements are referred to by like reference numerals and components having similar structure or function are referred to by like reference numerals. The dimensions and thickness of each component shown in the drawings are arbitrarily shown, and the present invention is not limited to the dimensions and thickness of each component. The thickness of the components is exaggerated in some places in the drawings for clarity of illustration.
ReRAM-based neural network accelerators compute in the analog domain, and the huge cost of the required analog/digital conversion hampers their development; some low-precision methods mitigate this cost but sacrifice part of the benefit. On the other hand, data sparsity in neural networks is severely underestimated, leading to low computing resource utilization and significant waste. In order to solve the imbalance between peripheral circuit design and computing resource requirements, the invention combines the data characteristics and structural characteristics of the neural network to design a row activation oversubscription mechanism. The relation between peripheral circuit device limits and computational parallelism is modeled by predicting the output data scale, and the computational parallelism and resource usage are adaptively adjusted according to the prediction so as to utilize the peripheral circuit resources to the greatest extent.
First, we evaluate the output data scale to determine the oversubscription rate. The idea is that all output data can be bounded by the maximum weight of each row together with the corresponding input data: accumulating each row's input against that row's maximum weight gives an upper bound on every column's output. When this predicted result is smaller than the pre-designed ADC range, we apply RAOS to select a larger computational parallelism to complete the calculation requirement without affecting accuracy. Otherwise, we use normal row activation to ensure normal operation of the ADC.
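As a concrete illustration of this decision rule (a minimal software sketch, not part of the claimed hardware), the following Python model bounds the accumulation with the per-row maximum weights and picks the largest rate that still fits the ADC range; the names GROUP, ADC_RANGE, upper_bound, and choose_raos_rate, and the specific constants, are assumptions for illustration only.

```python
import numpy as np

# Illustrative constants only: they echo the experimental setup described later
# (one normal activation covers 16 rows; a 6-bit ADC detects values below 64).
GROUP = 16        # rows covered by one normal (1X) activation
ADC_RANGE = 64    # exclusive upper bound of the pre-designed ADC range

def upper_bound(inputs, row_max_weights):
    """Worst-case bit-line accumulation: every row contributes its input slice
    multiplied by the largest weight stored anywhere in that row."""
    return int(np.dot(inputs, row_max_weights))

def choose_raos_rate(inputs, row_max_weights, rates=(4, 2, 1)):
    """Return the largest oversubscription rate whose predicted accumulation
    still fits the ADC range; 1 means normal row activation."""
    for rate in rates:
        rows = GROUP * rate
        if upper_bound(inputs[:rows], row_max_weights[:rows]) < ADC_RANGE:
            return rate
    return 1
```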
We use two methods to obtain a sufficient oversubscription rate. In the dichotomy method, the pattern of each prediction is generated according to a binary-splitting scheme, the data to be calculated are predicted all at once, and rows whose data fail the prediction are activated with a smaller oversubscription rate. In the sliding method, each prediction is made at the maximum oversubscription rate, and the data that fail the prediction are combined with subsequent calculation data for a new prediction iteration until the calculation requirement is completed.
RAOS is built mainly on the crossbar to increase its computational parallelism, and its hierarchical structure is shown in FIG. 1. FIG. 5 illustrates the detailed computational core implementation, and FIG. 6 the row activation oversubscription circuit. Compared with the traditional architecture, the RAOS architecture adds three modules, namely a ReRAM prediction column, a prediction decoder, and a row activation controller, detailed as follows:
1. Prediction column
Both prediction columns are equipped with 4X sliding prediction capability. Since the prediction columns only check whether the output data exceed the pre-designed ADC range, the accumulated current is converted into a voltage and then analyzed with a voltage comparator. The two prediction columns give a prediction vector of "pass/fail" signals, one per column.
2. Prediction decoder
The prediction decoder converts the prediction vector from the prediction columns into the selected RAOS rate. It consists of a leading detector that selects the position of the first "pass" signal in the vector (a minimal software sketch of this selection is given after the three modules below).
3. Row activation controller
The row activation controller simultaneously gives the predicted row activation mask and the calculated row activation mask for row activation of the prediction columns and the crossbar array. It consists of 8-bit masks that respectively control 8 groups of independent activation controls. The controller has a built-in completion register for holding the number of row groups whose calculation has completed, with an initial value of 0. The corresponding predicted row activation mask and calculated row activation mask can be inferred from the number of completed row groups and the prediction results, and hold and complete signals are provided to the input and output buffers for data updating.
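Before turning to the overall operating principle, the rate selection performed by the prediction decoder can be sketched in software as follows; the ordering of the prediction vector (highest candidate rate first), the default candidate rates, and the name decode_raos_rate are illustrative assumptions rather than the patent's circuit.

```python
def decode_raos_rate(prediction_vector, candidate_rates=(4, 2)):
    """prediction_vector[i] is True when the i-th prediction column reports
    'pass' (the accumulation fits the ADC range) for candidate_rates[i],
    ordered from the highest candidate rate to the lowest. The leading
    detector picks the first passing rate; if none pass, fall back to 1X."""
    for rate, passed in zip(candidate_rates, prediction_vector):
        if passed:
            return rate
    return 1

# Example: the 4X check fails but the 2X check passes -> 2X is selected.
print(decode_raos_rate([False, True]))   # -> 2
```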
Based on these three modules, the operating principle of the RAOS crossbar is as follows. First, in the prediction phase, the prediction columns acquire input data from the input buffer according to the predicted row activation mask, the prediction result is detected by a leading-zero circuit, and the selected RAOS rate is output to the row activation controller. Then, in the computation phase, the crossbar array performs multiply-accumulate computations at the selected RAOS rate using the input data processed by the calculated row activation mask. Both masks are updated by the selected RAOS rate in each computation cycle. The prediction and computation phases are pipelined. After the computation is completed, the slice data are combined in a shift-accumulate unit as before. The input and output buffers may be organized as ping-pong buffers to overlap data loading with prediction and computation, enabling the pipelining shown in FIG. 7.
Based on the ReRAM prediction columns, we use two methods to obtain a sufficient oversubscription rate. The basic idea is shown in FIG. 8. Assume that normal (1X) row activation activates 1 row using a 2-bit input slice and a 1-bit weight slice, so the pre-designed ADC can detect the output range [0, 4). When we apply a 4X RAOS rate, three prediction columns are needed to find the appropriate rate. The cells of the first prediction column are programmed with the maximum weight of each of the 4 rows. In the second column, the first two rows are programmed with the maximum weights and the last two rows are programmed to "0"; the third column inverts the pattern of the second column. The three prediction columns work simultaneously. That is, if the first column detects that the output data exceed the span, we check the second and third columns for the 2X oversubscription rate. If either of them passes, we can still proceed with 2X oversubscription. Otherwise, we fall back to 1 row at a time and complete the 4 rows of the MVM calculation one by one. Since the prediction columns split the rows in a binary fashion, we refer to this as dichotomy prediction.
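A toy software model of this dichotomy scheme, under the same example assumptions (four rows, 2-bit input slices, a 1-bit weight slice, ADC range [0, 4)); the per-half fallback and the names fits and dichotomy_plan reflect one reading of the scheme and are not the circuit itself.

```python
ADC_RANGE = 4   # the toy ADC detects [0, 4) for one normal 1-row activation
MAX_W = 1       # 1-bit weight slice: the per-cell maximum weight is 1

def fits(inputs, rows):
    """Pass when the worst-case accumulation over the given rows stays in range."""
    return sum(inputs[r] * MAX_W for r in rows) < ADC_RANGE

def dichotomy_plan(inputs):
    """inputs: four 2-bit input slices of one 4X candidate window.
    Returns the row groups activated together, mirroring the three prediction
    columns (all 4 rows / first 2 rows / last 2 rows)."""
    if fits(inputs, [0, 1, 2, 3]):             # column 1: the full 4X pattern passes
        return [[0, 1, 2, 3]]
    plan = []
    for half in ([0, 1], [2, 3]):              # columns 2 and 3: the two 2X patterns
        if fits(inputs, half):
            plan.append(half)                   # this half still runs at 2X
        else:
            plan.extend([[r] for r in half])    # fall back to one row at a time
    return plan

print(dichotomy_plan([0, 1, 0, 2]))   # -> [[0, 1, 2, 3]]       (4X passes)
print(dichotomy_plan([3, 2, 1, 0]))   # -> [[0], [1], [2, 3]]   (only the second half fits 2X)
```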
In contrast to dichotomy prediction, sliding prediction always makes the next prediction at the highest allowed RAOS rate, even when a prediction fails. For input0 in the example of FIG. 8(b), even though the first 4X prediction fails and the subsequent calculation falls back to the 2X rate, the next sliding prediction is still performed at the 4X rate.
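Sliding prediction can be sketched the same way, under the same toy assumptions; the scheduling loop and the name sliding_schedule are illustrative only.

```python
ADC_RANGE, MAX_W = 4, 1   # same toy example: 2-bit inputs, 1-bit weights, ADC range [0, 4)

def sliding_schedule(inputs, max_rate=4):
    """inputs: a stream of 2-bit input slices, one per row, in activation order.
    Every prediction is made at the maximum RAOS rate; when it fails, the window
    is computed at the largest passing sub-rate and the leftover rows slide into
    the next (again maximum-rate) prediction window."""
    schedule, start = [], 0
    while start < len(inputs):
        rate = min(max_rate, len(inputs) - start)
        # shrink the window until the worst-case accumulation fits the ADC range
        while rate > 1 and sum(inputs[start:start + rate]) * MAX_W >= ADC_RANGE:
            rate //= 2
        schedule.append(list(range(start, start + rate)))   # rows computed this cycle
        start += rate
    return schedule

# Rows with many zero inputs get packed into larger activation groups.
print(sliding_schedule([0, 1, 0, 2, 3, 0, 0, 1]))   # -> [[0, 1, 2, 3], [4, 5], [6, 7]]
```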
FIG. 6 shows the row activation oversubscription implementation flow. Assume that the input data pass through the prediction columns and the output data exceed the range at the 4X oversubscription rate but do not exceed it at the 2X oversubscription rate; the leading detection circuit in the prediction decoder then identifies the 2X oversubscription rate and passes the result downstream. The row activation controller simultaneously produces the predicted row activation mask and the calculated row activation mask to activate rows for the prediction columns and the crossbar array according to the selected oversubscription rate. Assume that 1 row group has already completed its computation, i.e., the completion register records 1, and the prediction decoder decides on the 2X oversubscription rate for the next computation. The calculated row activation mask is then 4'b0011 << 1 = 8'b00000110, where 4'b0011 is the standard mask at the 2X oversubscription rate. The predicted row activation mask is 4'b1111 << (1 + 2) = 8'b01111000, where 4'b1111 is the standard mask at the 4X oversubscription rate. At the same time, the completion register updates itself by adding 2, since 2 new row groups will have just completed their computation. Note that when the completion register reaches 8, the current round of computation is complete and new input buffers should be swapped in. In addition, the two masks must not overflow beyond the input/output buffer index.
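The mask arithmetic of this worked example can be reproduced with the short sketch below; the function name activation_masks, the fixed 4X prediction window, and the 8-bit clamping are illustrative assumptions.

```python
def activation_masks(completed, rate, groups=8, max_rate=4):
    """Row-group activation masks for one cycle, returned as 8-bit integers.
    completed : row groups already finished (the completion register value)
    rate      : RAOS rate chosen by the prediction decoder for this cycle
    The calculated mask covers the `rate` groups starting at `completed`; the
    predicted mask looks one window further ahead at the maximum (4X) rate.
    Clamping with `& full` models the rule that the masks must not run past
    the input/output buffer index."""
    full = (1 << groups) - 1
    calc_mask = (((1 << rate) - 1) << completed) & full
    pred_mask = (((1 << max_rate) - 1) << (completed + rate)) & full
    return calc_mask, pred_mask

# The worked example above: 1 row group done, the decoder selects the 2X rate.
calc, pred = activation_masks(completed=1, rate=2)
print(f"calculated mask = {calc:08b}")   # 00000110, i.e. 4'b0011 << 1
print(f"predicted mask  = {pred:08b}")   # 01111000, i.e. 4'b1111 << (1 + 2)
```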
This patent mainly provides an excess row activation and calculation integrated accelerator design based on data sparsity. To demonstrate its technical effect, common neural network models are deployed on the ReRAM-based row activation neural network accelerator of this work, on ISAAC, and on the Sparse ReRAM Engine (SRE), and the performance, area, and power consumption of the different models under the different accelerator designs are obtained and analyzed. The experiments use several popular DNN models, namely ResNet50, InceptionV3, MobileNetV2, ShuffleNetV2, and SqueezeNet, with data from ImageNet 2012. The experiments use the original weights and input data without adding sparsity.
The experiments evaluate performance by running real data cycle by cycle in the NeuroSim simulator, and a row activation oversubscription module is added inside each crossbar unit to enhance the crossbar structure. For a fair performance assessment, we unify the array size at 128x128 and apply 2-bit inputs and 1-bit weights. We use an ADC consistent with that of SRE, with a resolution of 6 bits, which supports up to 16 rows activated at a time without oversubscription.
FIG. 4 reports the performance using different RAOS rates and prediction schemes. The 1X architecture without RAOS activates 16 rows at a time to stay within the ADC range. It can be seen that higher RAOS rates provide higher performance due to the increased computational parallelism. In lightweight networks such as MobileNetV2, ShuffleNetV2, and SqueezeNet, the performance gains are lower because they have less sparsity. As the RAOS rate increases, the performance gains flatten: the 2X rate yields roughly a 1.97x speedup, while the 8X rate yields only 5.1x. This is because, under higher oversubscription, the accumulated results become larger and can no longer be compressed into a single calculation cycle. The results also show that sliding prediction provides a greater performance advantage than dichotomy prediction, because sliding prediction always tries to compress more data at a time, whereas dichotomy prediction may need multiple calculation cycles to finish after a failed prediction.
The RAOS of the present invention also outperforms ISAAC and SRE. FIG. 2 compares the 4X-rate design of the present invention with ISAAC and SRE. RAOS with sliding prediction improves performance by about 3.1 to 3.8 times over ISAAC, which is based on dense-data computation. Compared with SRE, which can only squeeze out zero values, the RAOS of the present invention can also compress small accumulated result values into one calculation cycle, so the present invention can use sliding prediction to further improve performance by about 23% to 31%.
Taking ResNet50 as an example, energy consumption was evaluated with a 128x128 crossbar array size and compared with the ISAAC and SRE designs; the results are shown in FIG. 3. Although the present invention spends additional energy on prediction, it still achieves a much lower total energy. The main energy reduction comes from the ADC and DAC parts. Although the ISAAC design uses a small ADC, the present invention relieves the ADC pressure and greatly reduces the ADC execution time, so the energy of the ADC and the crossbar is reduced proportionally relative to ISAAC. And since the RAOS of the present invention considers both zero and small values, it can utilize the ADC better than the SRE design, which can only squeeze out zero values, thereby further reducing energy consumption.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention without requiring creative effort by one of ordinary skill in the art. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (10)

1. A design method for an excess row activation and calculation integrated accelerator, characterized by comprising two parts, namely a prediction and oversubscription mechanism based on row activation data, and a planning mechanism for the control flow and data flow based on neural network data; the prediction and oversubscription mechanism based on row activation data analyzes the output data scale according to the input data and adjusts the computational parallelism so as to maximally utilize the peripheral circuit design; the planning of the control flow and data flow based on neural network data comprises analyzing the characteristics of the neural network data and improving the device's sparsity-exploiting capability through a dichotomy method and a sliding method;
in the prediction and oversubscription mechanism based on row activation data, the output data scale is evaluated to determine the oversubscription rate, the idea being that all output data can be predicted from the maximum weight of each row and the corresponding sum of input data; when the prediction result is smaller than the pre-designed ADC range, row activation oversubscription (RAOS) is applied to select a larger computational parallelism to complete the calculation requirement without affecting accuracy; otherwise, normal row activation is used to ensure normal operation of the ADC;
in the dichotomy method, the pattern of each prediction is generated according to a binary-splitting scheme, the data to be calculated are predicted all at once, and the data that fail the prediction are activated with a smaller oversubscription rate; in the sliding method, each prediction is made at the maximum oversubscription rate, and the data that fail the prediction are combined with subsequent calculation data for a new round of prediction iteration until the calculation requirement is completed;
the RAOS is built on the crossbar to increase its computational parallelism; compared with the traditional architecture, the RAOS architecture adds three modules, namely a ReRAM prediction column, a prediction decoder, and a row activation controller, whose details are as follows:
the prediction columns are provided with 4X sliding prediction capability; since the prediction columns only check whether the output data exceed the pre-designed ADC range, the accumulated current is converted into a voltage and analyzed with a voltage comparator, and the two prediction columns give a prediction vector of "pass/fail" signals, one per column;
the prediction decoder converts the prediction vector from the prediction columns into the selected RAOS rate, and consists of a leading detector that selects the position of the first "pass" signal in the vector;
the row activation controller simultaneously gives a predicted row activation mask and a calculated row activation mask for row activation of the prediction columns and the crossbar array; it consists of 8-bit masks that respectively control 8 groups of independent activation controls; the controller has a built-in completion register for storing the number of row groups whose calculation has completed, with an initial value of 0; the corresponding predicted row activation mask and calculated row activation mask can be inferred from the number of completed row groups and the prediction result, and hold and complete signals are given to the input and output buffers for data updating;
based on the three modules, the working principle of the RAOS crossbar comprises the following steps:
first, in the prediction stage, the prediction columns acquire input data from the input buffer using the predicted row activation mask, the prediction result is detected through a leading-zero circuit, and the selected RAOS rate is output to the row activation controller; then, in the computation stage, the crossbar array performs multiply-accumulate computations at the selected RAOS rate using the input data processed by the calculated row activation mask; both masks are updated by the selected RAOS rate in each computation cycle; the prediction and computation stages are pipelined; after the computation is completed, the slice data are combined in a shift-accumulate unit as before, and the input and output buffers can be organized as ping-pong buffers to overlap data loading with prediction and computation, thereby realizing pipelining.
2. The excess row activation and calculation integrated accelerator design method of claim 1, wherein the prediction and oversubscription mechanism based on row activation data is constructed to model the peripheral circuit device limits and the computational parallelism.
3. The excess row activation and calculation integrated accelerator design method of claim 2, wherein the prediction over the input data is computed with ReRAM columns to reduce area and power consumption.
4. The excess row activation and calculation integrated accelerator design method of claim 3, wherein the prediction results are analyzed by a voltage comparator to reduce the cost of analog-to-digital conversion.
5. The excess row activation and calculation integrated accelerator design method of claim 4, wherein the control flow and data flow are planned based on the neural network data according to the data-sparsity characteristic of the neural network, so as to solve the problem of complex circuit design caused by utilizing data sparsity.
6. The excess row activation and calculation integrated accelerator design method of claim 5, wherein, in the dichotomy method, the pattern of each prediction is generated according to a binary-splitting scheme, the data to be calculated are predicted all at once, and a smaller oversubscription rate is used for row activation of the data that fail the prediction.
7. The excess row activation and calculation integrated accelerator design method of claim 6, wherein, in the sliding method, each prediction is made at the maximum oversubscription rate, and the data that fail the prediction are combined with subsequent calculation data to perform a new round of prediction iteration until the calculation requirement is completed.
8. The excess row activation and calculation integrated accelerator design method of claim 7, wherein the prediction and oversubscription mechanism based on row activation data comprises: providing an oversubscription mechanism design and a circuit implementation according to the prediction on the row activation data.
9. The excess row activation and calculation integrated accelerator design method of claim 8, wherein the prediction result of the row activation data is decoded by a leading-zero circuit to select the oversubscription scheme, and subsequent circuit control is performed by two masks.
10. The excess row activation and calculation integrated accelerator design method of claim 9, wherein the subsequent circuit control comprises a compute core mask that controls the input data feed of a ReRAM compute core, and a prediction core mask that controls the input data feed of a ReRAM prediction unit.
CN202111061410.3A 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity Active CN113723044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111061410.3A CN113723044B (en) 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111061410.3A CN113723044B (en) 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity

Publications (2)

Publication Number Publication Date
CN113723044A (en) 2021-11-30
CN113723044B (en) 2024-04-05

Family

ID=78683183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111061410.3A Active CN113723044B (en) 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity

Country Status (1)

Country Link
CN (1) CN113723044B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647983A (en) * 2019-09-30 2020-01-03 南京大学 Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN111026700A (en) * 2019-11-21 2020-04-17 清华大学 Memory computing architecture for realizing acceleration and acceleration method thereof
CN111062472A (en) * 2019-12-11 2020-04-24 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN211554991U (en) * 2020-04-28 2020-09-22 南京宁麒智能计算芯片研究院有限公司 Convolutional neural network reasoning accelerator

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11544545B2 (en) * 2017-04-04 2023-01-03 Hailo Technologies Ltd. Structured activation based sparsity in an artificial neural network
US11966835B2 (en) * 2018-06-05 2024-04-23 Nvidia Corp. Deep neural network accelerator with fine-grained parallelism discovery

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110647983A (en) * 2019-09-30 2020-01-03 南京大学 Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN111026700A (en) * 2019-11-21 2020-04-17 清华大学 Memory computing architecture for realizing acceleration and acceleration method thereof
CN111062472A (en) * 2019-12-11 2020-04-24 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN211554991U (en) * 2020-04-28 2020-09-22 南京宁麒智能计算芯片研究院有限公司 Convolutional neural network reasoning accelerator

Also Published As

Publication number Publication date
CN113723044A (en) 2021-11-30

Similar Documents

Publication Publication Date Title
Lee et al. UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision
JP3096387B2 (en) Numerical processing unit
CN111260048B (en) Implementation method of activation function in neural network accelerator based on memristor
CN110442323B (en) Device and method for performing floating point number or fixed point number multiply-add operation
CN113220630B (en) Reconfigurable array optimization method and automatic optimization method for hardware accelerator
Yue et al. A 28nm 16.9-300TOPS/W computing-in-memory processor supporting floating-point NN inference/training with intensive-CIM sparse-digital architecture
Sun et al. A high-performance accelerator for large-scale convolutional neural networks
CN112181895A (en) Reconfigurable architecture, accelerator, circuit deployment and data flow calculation method
CN115018062A (en) Convolutional neural network accelerator based on FPGA
CN115145536A (en) Adder tree unit with low bit width input and low bit width output and approximate multiply-add method
CN113055060B (en) Coarse-grained reconfigurable architecture system for large-scale MIMO signal detection
Yang et al. GQNA: Generic quantized DNN accelerator with weight-repetition-aware activation aggregating
Chen et al. RIMAC: An array-level ADC/DAC-free ReRAM-based in-memory DNN processor with analog cache and computation
CN113723044B (en) Excess row activation and calculation integrated accelerator design method based on data sparsity
CN112529171A (en) Memory computing accelerator and optimization method thereof
Guo et al. Boosting reram-based DNN by row activation oversubscription
CN113705794B (en) Neural network accelerator design method based on dynamic activation bit sparseness
Liu et al. Enabling efficient ReRAM-based neural network computing via crossbar structure adaptive optimization
Shao et al. An FPGA-based reconfigurable accelerator for low-bit DNN training
CN115170381A (en) Visual SLAM acceleration system and method based on deep learning
Moon et al. Multipurpose Deep-Learning Accelerator for Arbitrary Quantization With Reduction of Storage, Logic, and Latency Waste
CN112508174A (en) Pre-calculation column-by-column convolution calculation unit for weight binary neural network
CN112949830B (en) Intelligent inference network system and addition unit and pooling unit circuitry
CN113031916A (en) Multiplier, data processing method, device and chip
CN114625691B (en) Memory computing device and method based on ping-pong structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant