CN113723044A - Data-sparsity-based excess row activation computing-in-memory accelerator design - Google Patents

Data-sparsity-based excess row activation computing-in-memory accelerator design

Info

Publication number
CN113723044A
Authority
CN
China
Prior art keywords
data
prediction
excess
sparsity
design
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111061410.3A
Other languages
Chinese (zh)
Other versions
CN113723044B (en)
Inventor
景乃锋
郭梦裕
张子涵
蒋剑飞
王琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202111061410.3A
Publication of CN113723044A
Application granted
Publication of CN113723044B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C 7/10 Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C 7/1006 Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/30 Circuit design
    • G06F 30/39 Circuit design at the physical level
    • G06F 30/392 Floor-planning or layout, e.g. partitioning or placement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C 11/54 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 8/00 Arrangements for selecting an address in a digital store
    • G11C 8/08 Word line control circuits, e.g. drivers, boosters, pull-up circuits, pull-down circuits, precharging circuits, for word lines
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 13/00 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
    • G11C 13/0002 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 8/00 Arrangements for selecting an address in a digital store
    • G11C 8/10 Decoders
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Architecture (AREA)
  • Geometry (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data-sparsity-based excess row activation computing-in-memory accelerator design, relating to the field of neural network accelerator design for computing-in-memory architectures. The design comprises three parts: a prediction mechanism based on row activation data is constructed, which models peripheral circuit device limitations and computational parallelism and solves the problem of matching the peripheral circuits to the computational parallelism; a row activation oversubscription mechanism is constructed, which adaptively adjusts the computational parallelism and resource usage and addresses the low utilization of the compute array and peripheral circuits and the resource redundancy that arise under sparse data; and the control flow and data flow are re-planned for the data sparsity characteristic of neural networks, solving the complex circuit design problem introduced by exploiting data sparsity. The invention models the relationship between peripheral circuit device limitations and computational parallelism by predicting the scale of the output data, and adaptively adjusts the computational parallelism and resource usage according to the prediction so as to utilize the peripheral circuit resources to the maximum extent.

Description

Data-sparsity-based excess row activation computing-in-memory accelerator design
Technical Field
The invention relates to the field of neural network accelerator design for computing-in-memory architectures, and in particular to a data-sparsity-based excess row activation computing-in-memory accelerator design.
Background
In recent years, with the rapid development of convolutional neural network applications, the demand for dedicated accelerators has grown steadily. Accelerators based on conventional architectures face great challenges in improving acceleration performance because of the significant cost of data movement during computation, and neural network accelerators based on resistive random access memory (ReRAM) have become a new paradigm for addressing the memory-wall problem. Such an accelerator programs the weight data into the crossbar array, thereby reducing the amount of data movement; by exploiting the continuously tunable resistance of ReRAM cells, multiply-accumulate operations can be computed in the analog domain on the crossbar with massive parallelism. Compared with computation on conventional CMOS architectures, this greatly reduces the need for offloading intermediate data and moving data around, improving energy efficiency by more than 100 times.
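As a functional illustration of the in-crossbar multiply-accumulate just described, the following minimal sketch models each bit-line output as the sum of input values weighted by programmed cell conductances; it is a software approximation for exposition only (the function name crossbar_mvm is invented here), not an electrical model from the patent.

```python
# Functional sketch of the analog MAC on a ReRAM crossbar: each column (bit line)
# accumulates input * conductance over all activated rows (word lines).

def crossbar_mvm(inputs, conductances):
    """inputs[i]: input value applied on row i; conductances[i][j]: weight of
    row i, column j. Returns the per-column accumulated results."""
    num_cols = len(conductances[0])
    return [sum(x * row[j] for x, row in zip(inputs, conductances))
            for j in range(num_cols)]

print(crossbar_mvm([1, 0, 2], [[1, 0], [1, 1], [0, 1]]))  # -> [1, 2]
```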
However, ReRAM-based accelerators take on a significant burden when they move computation into the analog domain. Expensive analog/digital conversion units incur huge area and energy overhead; in recent designs, peripheral circuits such as analog-to-digital converters (ADCs) account for 95.5% of the area and 55.9% of the power consumption. As computational demands increase, the area and energy limitations of these conversion units have also inhibited the development of ReRAM-based accelerators.
To reduce the area and power consumption of the ADC, one conventional optimization is the low-precision interface. This approach converts the original high-precision data into low-precision data, reducing the data scale and data range and thereby relieving the limits on computing resources and the pressure on peripheral circuits. However, such methods are tuned only for specific networks, lack generality, and for broader and more complex models incur an intolerable loss of accuracy. Optimization can also be achieved by pruning unnecessary computation requests. This approach analyzes, from the weight data, how much each computation request contributes to the result and prunes requests with small contributions, thereby reducing the demand for computing resources, but the complicated control logic and the associated placement and routing introduce extra area overhead.
On the other hand, compared with the sparsity of neural network weight data, the characteristics of the input data have been underestimated and largely ignored. For example, there may be many zeros in the input feature map, because the widely used ReLU activation clamps all negative activations to zero. There are also many zero bits within the non-zero values. When the inputs are time-sliced into bit slices and fed to the ReRAM crossbar, these introduce a large number of zero bits and wasted computations, degrading performance and power consumption.
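The following minimal sketch, included only for illustration and not taken from the patent, shows how bit-slicing (here assumed to use 2-bit slices of 8-bit activations, with the helper name bit_slices invented for the example) turns even non-zero activations into many all-zero input slices, which is exactly the wasted work described above.

```python
# Illustrative bit-slicing of activations: zeros from ReLU plus the high-order
# zero bits of small values yield many all-zero slices on the crossbar inputs.

def bit_slices(value, slice_bits=2, total_bits=8):
    """Split an unsigned activation into little-endian bit slices."""
    mask = (1 << slice_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, slice_bits)]

for activation in [0, 0, 3, 17]:
    print(activation, bit_slices(activation))  # 3 -> [3, 0, 0, 0]; 17 -> [1, 0, 1, 0]
```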
Conversely, given the presence of these zeros, the accumulated result on the bit lines may well be smaller than the pre-designed ADC range. In other words, the same output values can still be captured while activating more rows, without changing the resolution of the ADC, which yields higher computational parallelism. This is referred to herein as row activation oversubscription (RAOS). Unlike weight sparsity, input data sparsity and small values are difficult to detect at runtime. The main challenge of RAOS is therefore how to learn from the dynamic data to find a sufficient oversubscription rate.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the technical problem to be solved by the present invention is how to utilize the resources of the peripheral circuit to the maximum extent.
In order to achieve this aim, the invention provides a data-sparsity-based excess row activation computing-in-memory accelerator design, characterized by comprising three parts: a prediction mechanism based on row activation data is constructed, which models peripheral circuit device limitations and computational parallelism and solves the problem of matching the peripheral circuits to the computational parallelism; a row activation oversubscription mechanism is constructed, which adaptively adjusts the computational parallelism and resource usage and addresses the low utilization of the compute array and peripheral circuits and the resource redundancy under sparse data; and the control flow and data flow are re-planned for the data sparsity characteristic of neural networks, solving the complex circuit design problem introduced by exploiting data sparsity.
Further, constructing the prediction mechanism based on row activation data and constructing the row activation oversubscription mechanism include: analyzing the output data size from the input data and adjusting the computational parallelism so as to utilize the peripheral circuit design to the maximum extent.
Further, the prediction on the input data is computed with ReRAM prediction columns, which reduces area and power consumption.
Further, the prediction result is analyzed with a voltage comparator, reducing the overhead caused by analog-to-digital conversion.
Further, the re-planning of the control flow and data flow includes: analyzing the data characteristics of the neural network and providing a binary (dichotomy) method and a sliding method to improve the device's ability to exploit sparsity.
Further, in the binary method the prediction pattern is generated according to a bisection scheme: the computation data is predicted all at once, and row activation with a smaller oversubscription rate is used for data that fails the prediction.
Further, in the sliding method each prediction is made at the maximum oversubscription rate, and the data that fails the prediction is merged with the subsequent computation data for a new round of prediction, iterating until the computation requirement is completed.
Further, constructing the row activation oversubscription mechanism includes: providing an oversubscription mechanism design and its circuit implementation according to the prediction on the row activation data.
Further, the prediction result of the row activation data is decoded by a leading-zero circuit to select an oversubscription scheme, and subsequent circuit control is carried out through two masks.
Further, the subsequent circuit control includes a compute mask that controls the input data fed to the ReRAM compute core, and a prediction mask that controls the input data fed to the ReRAM prediction unit.
In order to solve the imbalance between peripheral circuit design and computing resource requirements, the invention designs a row activation oversubscription mechanism that combines the data characteristics and structural characteristics of the neural network. The method models the relationship between peripheral circuit device limitations and computational parallelism by predicting the scale of the output data, and adaptively adjusts the computational parallelism and resource usage according to the prediction so as to utilize the peripheral circuit resources to the maximum extent.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is a ReRAM accelerator architecture design in accordance with a preferred embodiment of the present invention;
FIG. 2 is a comparison of different designs at a 4X RAOS rate according to a preferred embodiment of the present invention;
FIG. 3 is a comparison of the normalized energy consumption of ISAAC, SRE, and the present invention according to a preferred embodiment of the present invention;
FIG. 4 is a graph of the performance improvement using different RAOS rates for a preferred embodiment of the present invention;
FIG. 5 is a computational core design of a preferred embodiment of the present invention;
FIG. 6 is a row activation oversubscription circuit implementation of a preferred embodiment of the present invention;
FIG. 7 is a ReRAM accelerator pipeline design in accordance with a preferred embodiment of the present invention;
FIG. 8 is a block diagram of binary prediction and sliding prediction in accordance with a preferred embodiment of the present invention.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
In the drawings, structurally identical elements are represented by like reference numerals, and structurally or functionally similar elements are represented by like reference numerals throughout the several views. The size and thickness of each component shown in the drawings are arbitrarily illustrated, and the present invention is not limited to the size and thickness of each component. The thickness of the components may be exaggerated where appropriate in the figures to improve clarity.
ReRAM-based neural network accelerator designs perform computation in the analog domain; the development of such accelerators is hampered by the huge overhead introduced by digital/analog conversion, and although some low-precision methods have been implemented, they sacrifice certain benefits. On the other hand, data sparsity in neural networks is severely underestimated, resulting in low utilization of computational resources and substantial waste. In order to solve the imbalance between peripheral circuit design and computing resource requirements, the invention designs a row activation oversubscription mechanism that combines the data characteristics and structural characteristics of the neural network. The method models the relationship between peripheral circuit device limitations and computational parallelism by predicting the scale of the output data, and adaptively adjusts the computational parallelism and resource usage according to the prediction so as to utilize the peripheral circuit resources to the maximum extent.
First, we estimate the output data size to determine the oversubscription rate. The idea is as follows:

$$\hat{O} = \sum_{i}\Big(\max_{j} w_{i,j}\Big)\, x_i \;\ge\; \sum_{i} w_{i,j}\, x_i = O_j \quad \text{for every column } j,$$

i.e., every output value can be bounded by accumulating, over the active rows, the maximum weight of each row multiplied by the corresponding input data. When this predicted bound is smaller than the pre-designed ADC range, RAOS is applied to select a larger computational parallelism and complete the computation without affecting accuracy. Otherwise, normal row activation is used to guarantee the normal operation of the ADC.
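As a concrete illustration of this decision rule, here is a minimal Python sketch written under the assumptions above (row-wise maximum weights, a fixed ADC range); the names predict_bound, can_oversubscribe, and adc_max are invented for the example and do not appear in the patent.

```python
# Sketch of the RAOS decision: bound every column's accumulated result by
# sum_i( max_j(w_ij) * x_i ) and compare the bound against the ADC range.

def predict_bound(inputs, weight_matrix):
    """inputs[i]: input slice value x_i of active row i;
    weight_matrix[i][j]: weight w_ij of row i, column j."""
    return sum(max(row) * x for row, x in zip(weight_matrix, inputs))

def can_oversubscribe(inputs, weight_matrix, adc_max):
    """True if the predicted bound still fits the pre-designed ADC range."""
    return predict_bound(inputs, weight_matrix) <= adc_max

# Example: 2-bit input slices, 1-bit weights, ADC range [0, 4]
x = [0, 1, 0, 2]                       # sparse inputs after ReLU / bit-slicing
w = [[1, 0], [1, 1], [0, 1], [1, 0]]   # 4 rows x 2 columns
print(predict_bound(x, w))             # 3 -> fits within the ADC range
print(can_oversubscribe(x, w, adc_max=4))  # True: all 4 rows can be activated
```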
We use two methods to obtain a sufficient oversubscription rate. In binary prediction, the prediction pattern is generated according to a bisection scheme: the computation data is predicted all at once, and data that fails the prediction is activated with a smaller oversubscription rate. In sliding prediction, each prediction is made at the maximum oversubscription rate, and data that fails the prediction is merged with the subsequent computation data for a new round of prediction, iterating until the computation requirement is completed.
RAOS is mainly built on the crossbar to increase its computational parallelism, and the hierarchical structure is shown in FIG. 1. FIG. 5 shows the detailed implementation of a computing core. Compared with the traditional architecture, the RAOS architecture adds three blocks, namely a ReRAM prediction column, a prediction decoder, and a row activation controller, as shown in FIG. 6. The details are as follows:
1. Prediction column
Two prediction columns equip the crossbar with 4X sliding prediction capability. Since a prediction column only checks whether the output data exceeds the pre-designed ADC range, the accumulated current is converted into a voltage and analyzed with a voltage comparator. The two prediction columns give a prediction vector containing one pass/fail signal per column.
2. Prediction decoder
The prediction decoder converts the prediction vector from the prediction columns into the selected RAOS rate. It consists of a leading detector that selects the position of the first "pass" signal in the vector.
3. Row activation controller
The row activation controller simultaneously produces a prediction row activation mask and a compute row activation mask, which are used to activate rows of the prediction columns and of the crossbar array, respectively. It consists of 8-bit masks, each bit controlling one of 8 independent row activation groups. The controller has a built-in completion register, initialized to 0, that records the number of row groups whose computation has been completed. The prediction row activation mask and the compute row activation mask can be derived from the number of completed row groups and the prediction result, and hold and complete signals are issued to the input and output buffers for data updating.
Based on these three modules, the RAOS crossbar operates as follows. First, in the prediction phase, the prediction columns take input data from the input buffer under the prediction row activation mask; the prediction result is detected by the leading-zero circuit, which outputs the selected RAOS rate to the row activation controller. Then, in the compute phase, the crossbar array performs the multiply-accumulate computation at the selected RAOS rate using the input data gated by the compute row activation mask. Both masks are updated with the selected RAOS rate in every computation cycle. The prediction and compute stages are pipelined. After the computation is complete, the sliced data are combined in the shift-and-accumulate unit as before. The input and output buffers can be organized as ping-pong buffers to overlap data loading with prediction and computation, enabling pipelining, as shown in FIG. 7.
Based on the ReRAM prediction columns, we use two methods to obtain a sufficient oversubscription rate. The basic idea is shown in FIG. 8. Assume that, with 2-bit input slices and 1-bit weight slices, normal row activation (1X) activates 1 row at a time and the pre-designed ADC can detect an output range of [0, 4]. When a 4X RAOS rate is applied, three prediction columns are needed to find the appropriate rate. The cells in the first prediction column are each programmed with the maximum weight of the corresponding row over the 4 rows. In the second column, the first two rows are programmed with the maximum weights and the last two rows are programmed to "0"; the third column reverses the pattern of the second column. The three prediction columns work simultaneously. That is, if the first column detects that the output data is out of range, the second and third columns are checked for the 2X oversubscription rate. If either of them succeeds, we can still oversubscribe by 2X. Otherwise, we fall back to activating 1 row at a time and complete the 4 rows of the MVM computation one by one. Since the prediction columns work on a binary partition, we refer to this herein as binary prediction.
In contrast to binary prediction, sliding prediction always performs the next prediction at the highest permitted RAOS rate, even after a prediction fails. For input0 in the example of FIG. 8(b), even if the first 4X prediction fails and the subsequent computation uses the 2X rate, the next sliding prediction is still performed at the 4X rate.
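To make the contrast concrete, the following sketch simulates both schemes under simplifying assumptions: a fits() predicate stands in for the prediction-column check, and a failed check falls back to the next lower oversubscription rate (half the group size, i.e. 4X to 2X to 1X); the helper names binary_schedule and sliding_schedule are invented for this illustration and are not part of the patent.

```python
# Illustrative comparison of binary (dichotomy) and sliding prediction.
# fits(group) stands in for the prediction-column check.

def binary_schedule(rows, fits, max_rate=4):
    """Binary prediction: split each fixed max_rate window by halving on failure."""
    def split(group):
        if len(group) == 1 or fits(group):
            return [group]
        half = len(group) // 2
        return split(group[:half]) + split(group[half:])
    schedule = []
    for start in range(0, len(rows), max_rate):
        schedule += split(rows[start:start + max_rate])
    return schedule

def sliding_schedule(rows, fits, max_rate=4):
    """Sliding prediction: always predict the next max_rate unfinished rows;
    on failure, compute a smaller group and slide the window forward."""
    schedule, start = [], 0
    while start < len(rows):
        group = rows[start:start + max_rate]
        while len(group) > 1 and not fits(group):
            group = group[:len(group) // 2]
        schedule.append(group)
        start += len(group)
    return schedule

# Example: rows 0-7, where any group containing row 2 exceeds the ADC range
fits = lambda g: 2 not in g
print(binary_schedule(list(range(8)), fits))   # [[0, 1], [2], [3], [4, 5, 6, 7]]
print(sliding_schedule(list(range(8)), fits))  # [[0, 1], [2], [3, 4, 5, 6], [7]]
```

In the printed example, sliding prediction recovers a full 4X group (rows 3 to 6) that binary prediction cannot form, because binary prediction only partitions within fixed 4-row windows.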
FIG. 6 shows the row activation oversubscription process after prediction is completed. Assume that the input data has passed through the prediction columns, that the output data exceeds the range at the 4X oversubscription rate, and that it does not exceed the range at the 2X oversubscription rate; the leading detection circuit in the prediction decoder then recognizes the 2X oversubscription rate and passes the result downstream. The row activation controller simultaneously produces the prediction row activation mask and the compute row activation mask to activate rows of the prediction columns and of the crossbar array according to the selected oversubscription rate. Assume that 1 row group has already completed computation, i.e., the completion register holds 1, and the prediction decoder decides on the 2X oversubscription rate for the next computation. The compute row activation mask is then 4'b0011 << 1 = 8'b00000110, where 4'b0011 is the standard mask at the 2X oversubscription rate. The prediction row activation mask is 4'b1111 << (1 + 2) = 8'b01111000, where 4'b1111 is the standard mask at the 4X oversubscription rate. At the same time, the completion register updates itself by adding 2, since 2 new row groups have just completed computation. Note that when the completion register reaches 8, the current round of computation is complete and a new input buffer should be swapped in. In addition, the two masks must not overflow past the input and output buffer index range.
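For reference, the mask arithmetic in this example can be reproduced with the short sketch below, written under the assumptions stated in the text (8 row groups, a 4X maximum RAOS rate); the function name row_activation_masks is invented here.

```python
# Sketch of the row activation controller's mask generation for the example:
# completion register = 1, selected RAOS rate = 2X, 8 row groups in total.

def row_activation_masks(completed, raos_rate, max_rate=4, groups=8):
    """Return (compute_mask, predict_mask) as `groups`-bit integers."""
    compute_std = (1 << raos_rate) - 1          # 2X -> 4'b0011
    predict_std = (1 << max_rate) - 1           # 4X -> 4'b1111
    limit = (1 << groups) - 1                   # keep masks within buffer range
    compute_mask = (compute_std << completed) & limit
    predict_mask = (predict_std << (completed + raos_rate)) & limit
    return compute_mask, predict_mask

cm, pm = row_activation_masks(completed=1, raos_rate=2)
print(f"{cm:08b}")  # 00000110, i.e. 4'b0011 << 1
print(f"{pm:08b}")  # 01111000, i.e. 4'b1111 << (1 + 2)
```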
This patent mainly provides a data-sparsity-based excess row activation computing-in-memory accelerator design. Common neural network models are deployed on the proposed ReRAM-based row activation oversubscription accelerator, on ISAAC, and on the Sparse ReRAM Engine (SRE), and the performance, area, and power consumption of the different models under the different accelerator designs are obtained and analyzed to demonstrate the technical effect of this patent. The experiments use several popular DNN models, namely ResNet50, InceptionV3, MobileNetV2, ShuffleNetV2, and SqueezeNet, with the dataset taken from ImageNet 2012. The experiments use the original weights and input data, without artificially increasing sparsity.
In the experiments, performance was assessed by running real data cycle by cycle using the NeuroSim simulator, and a row activation oversubscription module was added to each crossbar cell to enhance the crossbar architecture. For a fair evaluation, we unified the array size to 128x128 and used 2-bit inputs and 1-bit weights. We use ADCs consistent with SRE, with a resolution of 6 bits, which supports up to 16 rows being active at a time without oversubscription.
FIG. 4 reports the performance of different RAOS rates and prediction schemes. The 1X structure without RAOS activates 16 rows at a time to stay within the ADC range. It can be seen that higher RAOS rates provide higher performance thanks to the increased computational parallelism. In lightweight networks such as MobileNetV2, ShuffleNetV2, and SqueezeNet, the performance gains are relatively low because these networks exhibit less sparsity. As the RAOS rate increases, the performance improvement levels off: the 2X rate yields nearly a 1.97x improvement, while the 8X rate yields only 5.1x. This is because, at higher oversubscription, the accumulated result becomes large and cannot be compressed into a single computation cycle. The results also show that sliding prediction brings a larger performance advantage than binary prediction: sliding prediction always tries to compress more data at a time, while binary prediction must spend multiple computation cycles to finish the rows of a failed prediction.
The RAOS of the present invention also outperforms ISAAC and SRE. FIG. 2 compares the present design at a 4X rate with ISAAC and SRE. RAOS with sliding prediction improves performance by about 3.1x to 3.8x over the ISAAC design, which computes on dense data. Unlike SRE, which can only skip zero values, the RAOS of the present invention can also compress small accumulated result values into a single computation cycle, so with sliding prediction the present invention further improves performance by about 23% to 31%.
Taking ResNet50 as an example, an energy consumption evaluation was performed with a 128x128 crossbar array size and compared against the ISAAC and SRE designs; the results are shown in FIG. 3. Although the present invention spends extra energy on prediction, it still achieves a much lower total energy. The main energy reduction comes from the ADC and DAC parts. Although the ISAAC design uses a small ADC, the present invention relieves the pressure on the ADC and significantly reduces ADC execution time, so the energy on the ADC and crossbar is reduced proportionally compared with ISAAC. Because the RAOS of the present invention exploits both zero and small values, it utilizes the ADC better than the SRE design, which can only skip zero values, thereby further reducing energy consumption.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A data-sparsity-based excess row activation computing-in-memory accelerator design, characterized by comprising three parts: a prediction mechanism based on row activation data is constructed, which models peripheral circuit device limitations and computational parallelism and solves the problem of matching the peripheral circuits to the computational parallelism; a row activation oversubscription mechanism is constructed, which adaptively adjusts the computational parallelism and resource usage and addresses the low utilization of the compute array and peripheral circuits and the resource redundancy under sparse data; and the control flow and data flow are re-planned for the data sparsity characteristic of the neural network, solving the complex circuit design problem introduced by exploiting data sparsity.
2. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 1, wherein constructing the prediction mechanism based on row activation data and constructing the row activation oversubscription mechanism comprise: analyzing the output data size from the input data and adjusting the computational parallelism so as to utilize the peripheral circuit design to the maximum extent.
3. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 2, wherein the prediction on the input data is computed with ReRAM prediction columns, reducing area and power consumption.
4. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 3, wherein the prediction result is analyzed with a voltage comparator, reducing the overhead caused by analog-to-digital conversion.
5. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 4, wherein re-planning the control flow and data flow comprises: analyzing the data characteristics of the neural network and providing a binary (dichotomy) method and a sliding method to improve the device's ability to exploit sparsity.
6. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 5, wherein in the binary method the prediction pattern is generated according to a bisection scheme, the computation data is predicted all at once, and row activation with a smaller oversubscription rate is used for data that fails the prediction.
7. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 6, wherein in the sliding method each prediction is made at the maximum oversubscription rate, and the data that fails the prediction is merged with the subsequent computation data for a new round of prediction, iterating until the computation requirement is completed.
8. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 7, wherein constructing the row activation oversubscription mechanism comprises: providing an oversubscription mechanism design and its circuit implementation according to the prediction on the row activation data.
9. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 8, wherein the prediction result of the row activation data is decoded by a leading-zero circuit to select an oversubscription scheme, and subsequent circuit control is carried out through two masks.
10. The data-sparsity-based excess row activation computing-in-memory accelerator design of claim 9, wherein the subsequent circuit control comprises a compute mask controlling the input data fed to the ReRAM compute core, and a prediction mask controlling the input data fed to the ReRAM prediction unit.
CN202111061410.3A 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity Active CN113723044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111061410.3A CN113723044B (en) 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111061410.3A CN113723044B (en) 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity

Publications (2)

Publication Number Publication Date
CN113723044A true CN113723044A (en) 2021-11-30
CN113723044B CN113723044B (en) 2024-04-05

Family

ID=78683183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111061410.3A Active CN113723044B (en) 2021-09-10 2021-09-10 Excess row activation and calculation integrated accelerator design method based on data sparsity

Country Status (1)

Country Link
CN (1) CN113723044B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190370645A1 (en) * 2018-06-05 2019-12-05 Nvidia Corp. Deep neural network accelerator with fine-grained parallelism discovery
CN110647983A (en) * 2019-09-30 2020-01-03 南京大学 Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN111026700A (en) * 2019-11-21 2020-04-17 清华大学 Memory computing architecture for realizing acceleration and acceleration method thereof
CN111062472A (en) * 2019-12-11 2020-04-24 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
US20200285949A1 (en) * 2017-04-04 2020-09-10 Hailo Technologies Ltd. Structured Activation Based Sparsity In An Artificial Neural Network
CN211554991U (en) * 2020-04-28 2020-09-22 南京宁麒智能计算芯片研究院有限公司 Convolutional neural network reasoning accelerator

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200285949A1 (en) * 2017-04-04 2020-09-10 Hailo Technologies Ltd. Structured Activation Based Sparsity In An Artificial Neural Network
US20190370645A1 (en) * 2018-06-05 2019-12-05 Nvidia Corp. Deep neural network accelerator with fine-grained parallelism discovery
CN110647983A (en) * 2019-09-30 2020-01-03 南京大学 Self-supervision learning acceleration system and method based on storage and calculation integrated device array
CN111026700A (en) * 2019-11-21 2020-04-17 清华大学 Memory computing architecture for realizing acceleration and acceleration method thereof
CN111062472A (en) * 2019-12-11 2020-04-24 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN211554991U (en) * 2020-04-28 2020-09-22 南京宁麒智能计算芯片研究院有限公司 Convolutional neural network reasoning accelerator

Also Published As

Publication number Publication date
CN113723044B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
Lee et al. UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision
Zhu et al. A configurable multi-precision CNN computing framework based on single bit RRAM
US11194549B2 (en) Matrix multiplication system, apparatus and method
US10241971B2 (en) Hierarchical computations on sparse matrix rows via a memristor array
CN109543816B (en) Convolutional neural network calculation method and system based on weight kneading
Ueyoshi et al. Diana: An end-to-end energy-efficient digital and analog hybrid neural network soc
CN113220630B (en) Reconfigurable array optimization method and automatic optimization method for hardware accelerator
Chen et al. A high-throughput and energy-efficient RRAM-based convolutional neural network using data encoding and dynamic quantization
Zheng et al. MobiLatice: a depth-wise DCNN accelerator with hybrid digital/analog nonvolatile processing-in-memory block
Shu et al. High energy efficiency FPGA-based accelerator for convolutional neural networks using weight combination
KR102541461B1 (en) Low power high performance deep-neural-network learning accelerator and acceleration method
CN112529171B (en) In-memory computing accelerator and optimization method thereof
Shabani et al. Hirac: A hierarchical accelerator with sorting-based packing for spgemms in dnn applications
Wang et al. SPCIM: Sparsity-Balanced Practical CIM Accelerator With Optimized Spatial-Temporal Multi-Macro Utilization
Yang et al. GQNA: Generic quantized DNN accelerator with weight-repetition-aware activation aggregating
CN113723044B (en) Excess row activation and calculation integrated accelerator design method based on data sparsity
Rhe et al. VWC-SDK: Convolutional weight mapping using shifted and duplicated kernel with variable windows and channels
Guo et al. Boosting reram-based DNN by row activation oversubscription
Yuan et al. VLSI architectures for the restricted Boltzmann machine
Liu et al. Enabling efficient ReRAM-based neural network computing via crossbar structure adaptive optimization
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN113705794A (en) Neural network accelerator design method based on dynamic activation bit sparsity
Shin et al. Low complexity gradient computation techniques to accelerate deep neural network training
US11954580B2 (en) Spatial tiling of compute arrays with shared control
Shao et al. An FPGA-based reconfigurable accelerator for low-bit DNN training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant