FI130137B

FI130137B - A method for increase of energy efficiency through leveraging fault tolerant algorithms into undervolted digital systems

Info

Publication number: FI130137B
Application number: FI20215475A
Authority: FI
Inventors: Mehdi Safarpour; Olli Silvén
Original assignee: Univ Of Oulu
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2023-03-09
Also published as: WO2022223881A1; FI20215475A1

Abstract

The present invention concerns an approach for achieving energy savings from reduced-voltage operation. The solution detects timing errors by integrating algorithm-based fault tolerance (ABFT) into a digital architecture. The approach has been studied with a systolic matrix accelerator operating at low voltage, which detects errors in real time and avoids energy-hungry round trips to memory. The solution was analyzed using analog-digital co-simulation to extract the transient behavior at different voltages and clock frequencies. HSPICE simulations using 90 nm CMOS transistor models were carried out, together with experiments in which the operating voltage of a Xilinx Zynq FPGA device was lowered. In the FPGA experiments, a voltage reduction of 22 percent from the nominal value was recorded without a single error, corresponding to a 1.8-fold increase in energy efficiency. The HSPICE simulations indicated the possibility of a 10-fold increase in energy efficiency when approaching the near-threshold region.

Description

A METHOD FOR INCREASE OF ENERGY EFFICIENCY THROUGH LEVERAGING FAULT TOLERANT ALGORITHMS INTO UNDERVOLTED DIGITAL SYSTEMS

Technical field

The present invention is related to digital circuits and systems, systolic arrays, and error detection in such circuits and systems.

Background

With the ultra-densification of wireless infrastructure through 5G technologies, and with the aim already set on 6G solutions, much of the computing is expected to take place in the "edge", at short latency from the sensor and actuator nodes. This development is fueled by the increasing reliance on machine learning applications. With disappearing data communications bandwidth and latency constraints, "intelligence" can be implemented and coordinated in the edge computing resources. Neural inference and massive MU-MIMO (multi-user Multiple-Input-Multiple-Output) radios are among the technological enablers that are both computationally demanding and a challenge for energy-efficient digital design.

Power consumption of digital circuits has a quadratic relation to the supply voltage (Eq. 1). Therefore, a straightforward approach to gain in efficiency is to reduce the voltage.

$$P = \alpha f C V^{2} + I_{\text{leak}} V \qquad \text{(Eq. 1)}$$
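As a rough illustration of Eq. 1, the sketch below evaluates the power model at a fixed clock frequency for a few supply voltages. The symbols are read here as switching activity (alpha), clock frequency (f), switched capacitance (C) and leakage current (I_leak); the numeric defaults are arbitrary placeholders chosen only for illustration, not figures from this disclosure.

```python
def power(v, f, alpha=0.1, c=1e-9, i_leak=1e-3):
    """Eq. 1: P = alpha*f*C*V^2 + I_leak*V (all defaults are illustrative)."""
    return alpha * f * c * v ** 2 + i_leak * v


def dynamic_ratio(v_new, v_nominal):
    """Ratio of the quadratic (dynamic) term at two voltages, same frequency."""
    return (v_new / v_nominal) ** 2


if __name__ == "__main__":
    f_clk = 400e6  # Hz, assumed for the example
    for v in (1.0, 0.9, 0.78, 0.7):
        print(f"V = {v:.2f} V  P = {power(v, f_clk) * 1e3:.2f} mW  "
              f"dynamic term x{dynamic_ratio(v, 1.0):.2f}")
```

Because the dynamic term scales with the square of the voltage, lowering the supply from 1.0 V to 0.7 V at an unchanged clock roughly halves that term in this model.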

The first proposal to operate digital logic near and below the threshold voltage of the transistors was in 1972. In theory, these operating regions offer potential to improve the energy efficiency by 10x to 20x, while in experiments improvements up to 8x have been demonstrated by adopting Near-Threshold voltage regimes. Although the nominal operating voltages have been reducing, recent research [1] has shown that the vendor specifications for off-the-shelf components such as Field-Programmable-Gate-Arrays (FPGA) and Graphical-Processing-Units (GPU) are pessimistic to accommodate for process variations, such as inconsistencies in transistor geometry, oxide thickness and doping. It has been demonstrated that the supply voltage can be scaled down by around 12%, 20%, and even 30% in case of CPUs, GPUs, and FPGAs, respectively. In most cases the voltage margin to where errors start appearing is large; up to 60% energy savings have been reported with FPGAs without observing a single fault and without performance loss.

Unfortunately, the impact of variations cannot be modeled deterministically in reduced voltage settings. This uncertainty discourages the manufacturers from adopting aggressive voltage reduction schemes to utilize near-threshold and sub-threshold regions. In reduced voltage settings the effects of process variations are exacerbated, even up to 100x differences between Fast-Fast (FF) and Slow-Slow (SS) process corners. A challenge is that lowering the supply voltage without setting the clock frequency to the optimum results in either loss of performance or loss of reliability. Too slow clocking results in the loss of energy efficiency and performance, while too fast clocking results in timing errors, hence loss of reliability. Reliable low-voltage implementations often require significant investments into development time and may reduce the fabrication yield [1,2].

In the present invention, we propose an algorithm-dependent technique for the detection of errors from aggressive voltage reductions. Similar approaches, e.g., Error-Correction-Code (ECC) algorithms, are practical in detection and correction of memory errors. However, they are not applicable for error detection in data and control paths. Traditional fault-tolerance approaches such as Triple-Module-Redundancy (TMR) or Double-Module-Redundancy (DMR) and their different flavors, while providing error-resiliency, increase the gate-count and gate-activity substantially. This translates into fault-resiliency at the cost of energy-efficiency, while our approach is to achieve energy-efficiency through leveraging fault-tolerance. The proposed Algorithm Based Fault Tolerance (ABFT) approach is demonstrated through transistor and system level co-simulation and FPGA implementation of a systolic matrix multiplier. The solution enables utilizing reduced voltage operating regions at minor development overheads. The targets are low-power high-performance applications that allow for error correction by recalculation after voltage/frequency adjustment.

Prior Approaches

A simple technique to optimize the operation voltage is to embed the design with a delay chain, e.g., a sequence of inverters that mimics the timing behavior of the longest delay path in the circuit. While this enables taking into account global variations, supply voltage drops, and temperature fluctuations, local variations and cross-coupled noise cannot be effectively modeled. In practice, the longest delay path is stimulated less frequently than the shorter ones, rendering delay chain based voltage tuning either too optimistic or too pessimistic. This translates to loss of reliability or reduced efficiency gains, respectively.

Another state-of-the-art solution that is commercially used is to arm the critical paths with Timing Error Detection (TED) circuits [2], shown in Fig. 1. The TED method was originally proposed to mitigate susceptibility to ambient and internal variations to increase manufacturing yield, while in reduced voltage cases TED circuits aid in setting an improved operation point to maximise energy efficiency. Based on detections by TED circuits the voltage and clock frequency can be adjusted in an adaptive manner. Nonetheless, TED schemes are not readily applicable with off-the-shelf platforms and require significant development investment when adopted to custom ASIC designs. Being clock synchronized digital circuits, TEDs also add non-trivial power consumption overheads.

Due to increased impacts of process and ambient variations at lower voltages, deterministic operation may be possible only with performance or energy losses, that is, by lowering the clock frequency, or using error correction. Solutions such as shortening the critical paths or TED circuits are either costly in silicon real-estate or overly conservative.

Borrowing concepts from the supercomputing community, we propose a simple yet effective algorithm level solution. It enables low-voltage integrated circuit design, with sub-linear overhead from error detection in terms of power and circuit complexity.

Earlier mentioned references are listed in the following.

[1] G. Papadimitriou, A. Chatzidimitriou, D. Gizopoulos, V. J. Reddi, J. Leng, B. Salami, O. S. Unsal, and A. C. Kestelman, "Exceeding conservative limits: A consolidated analysis on modern hardware margins," IEEE Transactions on Device and Materials Reliability, 2020.

[2] Mudge, Trevor Nigel, Todd Michael Austin, David Theodore Blaauw, and Krisztian Flautner. "Systematic and random error detection and recovery within processing stages of an integrated circuit." U.S. Patent 7,162,661, issued January 9, 2007.

[3] K.-H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, vol. 100, no. 6, pp. 518-528, 1984.

[4] H.-T. Kung, "Why systolic architectures?" Computer, vol. 15, no. 1, pp. 37-46, 1982.

Furthermore, concerning patent publications, the following prior art is briefly discussed.

CN 108733628 ("Dai") discloses a reinforcement method of a parallel matrix multiplication algorithm. The method is used for lowering the ABFT reinforcement overhead of the matrix algorithm. The method comprises the following steps: (1) encoding the input and output of the matrix multiplication, checking a calculation result according to an encoding value and storing all possible error lists; (2) preprocessing the error lists, eliminating some misjudgment errors and avoiding unnecessary correction, wherein the method of eliminating errors is a relative error law, error detection is performed before error correction, and the remaining errors are then corrected; if one or more errors are corrected, the error list is updated, and most errors are corrected after many iterations; and (3) adopting a re-calculation policy for the remaining errors that cannot be corrected by the algorithm. The reinforcement method improves both system reliability and execution efficiency. The application area of this disclosure is GPUs (Graphics Processing Units).

US 3,604,619 ("Abbiati") from the late 1960s discloses a biquinary calculating machine, i.e. a machine for calculating with decimal numbers in which each decimal order of two operands is stored in accordance with a biquinary code, and it also concerns a store for use in such a machine. This disclosure does not apply the above presented ABFT from the 1980s.

CN 101369241 ("Huo") is about a "cluster fault-tolerance system, apparatus and method", but it does not discuss arrays which would be systolic. Huo mentions the ABFT only in its background as one known option, but the actual description seems to disclose a totally different concept on fault tolerant processes.

US 2012/0221884 ("Carter") discusses an error management method, taking both HW and some SW into account. In other words, this disclosure provides error management across hardware and software layers to enable hardware and software to deliver reliable operation in the face of errors and hardware variation due to aging, manufacturing tolerances, etc. In one embodiment, an error management module is provided that gathers information from the hardware and software layers, and detects and diagnoses errors. A hardware or software recovery technique may be selected to provide efficient operation, and, in some embodiments, the hardware device may be reconfigured to prevent future errors and to permit the hardware device to operate despite a permanent error.

One problem in the prior art is that, concerning TED circuits, redesigning or re-fabrication of the circuit is required. If the TED circuits are used as additional circuit elements, they act as an extra source of power consumption. Thus, design times will get longer, and manufacturing costs will also increase.

Generally, high voltages in accelerator devices lead directly to high energy consumption and thus to high costs.

A cost- and energy-efficient, yet properly working accelerator device is thus desired to be produced.

Summary

The present invention introduces an integrated circuit for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the matrix accelerator processing system (10) is configured to operate systolic arrays. The integrated circuit is characterized in that

- the matrix accelerator (11) is configured to output data to an Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,

- the ABFT-based error detection module (12) is configured to forward possible detected errors to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,

- the matrix accelerator processing system (10) is configured to reperform the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage, and

- the matrix accelerator processing system (10) is configured to find a lowest operational voltage where the number of the detected errors is zero.

In an embodiment of the integrated circuit, the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module (13).

In an embodiment of the integrated circuit, the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA.

In an embodiment of the integrated circuit, the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step.

In an embodiment of the integrated circuit, a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11).

In an embodiment of the integrated circuit, the input data and output data of the matrix accelerator (11) are augmented for Algorithm-Based Fault Tolerance error detection.

In an embodiment of the integrated circuit, the matrix accelerator processing system (10) does not require roundtrips for the data from the matrix accelerator (11) to a memory and back to the matrix accelerator (11).

According to a second aspect of the present invention, it introduces a method for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the method comprises the step of operating systolic arrays in the matrix accelerator processing system (10). The method is characterized in that it further comprises the steps of:

- outputting data from the matrix accelerator (11) to an Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,

- forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,

- reperforming the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage by the matrix accelerator processing system (10), and

- finding, by the matrix accelerator processing system (10), a lowest operational voltage where the number of the detected errors is zero.

In an embodiment of the method, the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module (13).

In an embodiment of the method, the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA.

In an embodiment of the method, the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step.

In an embodiment of the method, a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11).

In an embodiment of the method, the input data and output data of the matrix accelerator (11) are augmented for Algorithm-Based Fault Tolerance error detection.

In an embodiment of the method, the matrix accelerator processing system (10) does not require roundtrips for the data from the matrix accelerator (11) to a memory and back to the matrix accelerator (11).

According to a third aspect of the present invention, it introduces a computer program product for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the matrix accelerator processing system (10) is configured to operate systolic arrays. The computer program product is characterized in that the computer program product comprises program code which is executable by a processor, wherein the computer program product is configured to execute the steps of:

- outputting data from the matrix accelerator (11) to an Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,

- forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,

- reperforming the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage by the matrix accelerator processing system (10), and

- finding, by the matrix accelerator processing system (10), a lowest operational voltage where the number of the detected errors is zero.

The embodiments concerning the method are applicable to the computer program product as well.

Brief description of the drawings

FIG. 1 illustrates a simplified concept of a Timing-Error-Detection circuit in the prior art. The shadow register is clocked by a delayed version of the original clock.

FIG. 2 illustrates a checksum row and column which help to detect the errors.

FIG. 3 illustrates a checksum row and column which help to detect the errors and correct up to one error per row/column.

FIG. 4 illustrates a co-simulation paradigm of reduced voltage processing platform.

FIG. 5 illustrates a processing element of the systolic array.

FIG. 6 illustrates an energy efficient point which can be achieved using error detections.

FIG. 7 illustrates, at the top, error rates of PE output bits and, at the bottom, total error and silent error rates for 32 point row-column and 32-by-32 matrix multiplication. The horizontal axis represents clock period variations (σt, nanoseconds).

FIG. 8 illustrates that ABFT detects errors in different operation points and temperatures.

FIG. 9 discloses a main block diagram according to an embodiment of the present invention.

Detailed description

The present invention introduces a solution which achieves substantial energy benefits from reduced voltage operation in a matrix accelerator. The matrix accelerator may be a matrix multiplier designed as a systolic array.

A main embodiment of the present invention is discussed at first. In this embodiment, it is proposed to adjust the voltage and frequency according to errors detected at the data output of the computing logic. For detections, an ABFT scheme [3] is integrated into the design. The objective is to enable pushing the supply voltage down, while maintaining reliability with minimal overheads.

The restriction of the approach is that ABFT methods are algorithm specific. In our study they are for matrix operations that are a worthwhile target, representing the most energy hungry computations in many applications. For instance, more than 98% of computations in a neural network inference episode, and the majority of the computations of a 5G radio, involve matrix operations.

One of the most energy efficient and high performance architectures for matrix operations is the systolic array introduced by Kung in the 1980s [4]. Systolic designs have recently regained attention, being utilized in Google Tensor-Processor-Units (TPU) for acceleration of neural network computations.

Our contribution in the embodiments of the present invention is demonstrating an ABFT solution for error detection in a matrix multiplication targeted systolic array and verifying its performance. The objective is to support sustained operation at reduced voltage, compensating, e.g., for temperature dependency of the threshold voltage.

Algorithm Based Fault Tolerance is discussed first.

The idea of ABFT was originally conceived by Huang and Abraham in [3]. They initially described a low-overhead technique to detect and correct computational errors striking multiplication operations. Subsequently, ABFT was extended to other linear algebra operations, including transposition, QR decomposition, FFT, and 2-D convolutions [3].

The fundamental idea of ABFT for matrix operations is to augment input matrices with a checksum property. This enables detecting computational errors by inspecting the final result, while correction can be done in limited cases provided that hardware support has been included in the design. In our proposed approach according to an embodiment of the invention, correction is done by recalculating after a clock frequency or voltage change.

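Assuming A is an N x N matrix, then a row checksum matrix $A^{r}$ is defined as an N x (N + 1) matrix, as below:

$$A^{r} = \begin{bmatrix} A & Ae \end{bmatrix} \qquad \text{(Eq. 2)}$$

where e is the column vector $e_{N \times 1} = [1, 1, \ldots, 1]^{T}$, hence the n-th element of the appended column vector $Ae$ is the sum of the n-th row of A.

Similarly, a column checksum matrix $A^{c}$, which appends the row $e^{T}A$, and a full checksum matrix are defined. For the product of a column checksum matrix $A^{c}$ and a row checksum matrix $B^{r}$, the full checksum matrix is

$$C^{f} = A^{c} \times B^{r} = \begin{bmatrix} AB & ABe \\ e^{T}AB & e^{T}ABe \end{bmatrix} \qquad \text{(Eq. 3)}$$

The checksums can be used to detect errors. Instead of multiplying the matrices $A_{N \times N}$ and $B_{N \times N}$ as such,

$$C = A \times B \qquad \text{(Eq. 4)}$$

we multiply the column checksum matrix $A^{c}$ and the row checksum matrix $B^{r}$ to obtain the full checksum matrix $C^{f}$. This outcome is depicted in Fig. 2. Provided that there are no errors in the checksum calculations, the location of a single row or column error in the result matrix C can be detected.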
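The checksum construction above can be sketched in a few lines of plain Python. This is a minimal illustration on small integer matrices, not the hardware implementation; the function names are hypothetical and chosen only for this example.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_checksum(a):
    """A^c: append a row holding the column sums (e^T A)."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """B^r: append a column holding the row sums (B e)."""
    return [row[:] + [sum(row)] for row in b]


def check_full_checksum(cf):
    """Return (bad_rows, bad_cols) of a full checksum matrix C^f."""
    n = len(cf) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:n]) != cf[i][n]]
    bad_cols = [j for j in range(n) if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    return bad_rows, bad_cols


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    cf = matmul(column_checksum(a), row_checksum(b))
    print(check_full_checksum(cf))  # no inconsistent row or column
    cf[0][1] += 4                   # inject a single error into C
    print(check_full_checksum(cf))  # row 0 and column 1 are flagged
```

The intersection of the flagged row and the flagged column gives the location of a single erroneous element, which is the property the full checksum matrix is used for.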

In the current study the interests are in controlling the operating voltage and frequency, and understanding the energy costs of the checksum logic. We assume that the occasionally added latencies from recalculations, and the silent error rate, fit the application constraints.

Next, systolic array for matrix multiplication with ABFT is discussed.

We limit our treatment to the 2-D structured systolic array [4] shown in Fig. 3 for matrix multiplication, equipped with ABFT logic, in this embodiment of the invention. The array consists of a grid of identical Processing Elements (PE) that each perform a multiply-accumulate (MAC) operation on data received from the adjacent top and left PEs and pass the result or the input data to the next neighbouring PEs on the right and below. The architecture exhibits a high degree of parallelism and reduces memory bandwidth requirements. It should be noticed that the size of the array doesn't need to match that of the matrices to be multiplied, as the calculations can be broken down to sub-matrices.
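To make the data movement concrete, the following is a small cycle-by-cycle software model of such an array. It assumes an output-stationary dataflow (each PE keeps its own accumulator, operands of A move to the right and operands of B move down, with skewed input timing); this dataflow is an assumption of the sketch, as the text above does not fix it.

```python
def systolic_matmul(a, b):
    """Cycle-level model of an N x N output-stationary systolic array."""
    n = len(a)
    acc = [[0] * n for _ in range(n)]    # accumulator inside each PE
    a_reg = [[0] * n for _ in range(n)]  # operand register passed to the right
    b_reg = [[0] * n for _ in range(n)]  # operand register passed downwards
    for t in range(3 * n - 2):           # cycles until the last PE has finished
        # Visit PEs from the far corner so that neighbours still hold the
        # values of the previous cycle when they are read.
        for i in reversed(range(n)):
            for j in reversed(range(n)):
                if j == 0:   # left edge: row i of A enters, delayed by i cycles
                    a_in = a[i][t - i] if 0 <= t - i < n else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:   # top edge: column j of B enters, delayed by j cycles
                    b_in = b[t - j][j] if 0 <= t - j < n else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # multiply-accumulate
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each PE only ever communicates with its direct neighbours, which is what keeps the memory bandwidth requirement low.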

In the current design according to an embodiment of the invention, ABFT is merged with an N x N systolic array structure by adding a column of N PEs for checksums and another column consisting of N digital integrators and comparators for error detection. We recognize that by not including column checksum logic, the silent error rate increases; however, it stays low as we show later. In a similar manner, we could check just a fraction of the bits in the checksums.

Except for the added checksum and error detection columns, the system design is the same as proposed in [4] in this embodiment of the invention. An advantage of the scheme is that errors are detected on-the-fly as the result matrix is clocked out from the array. No memory round-trips are needed. The added overhead from columns for checksum and error checking is only O(1/N), targeting transient errors in reduced voltage settings rather than tackling permanent faults.
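The row-checksum-only arrangement described above can be modeled as follows. This is a behavioral sketch under the assumption that each result row is streamed out element by element into an integrator and then compared against the output of the checksum PE; the function names are hypothetical.

```python
def row_checksum_product(a, b):
    """Compute A x [B | Be]: N result columns plus one checksum column."""
    b_r = [row + [sum(row)] for row in b]
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b_r)] for row in a]


def detect_rows(c_r):
    """Integrate each streamed-out row and compare it with its checksum element."""
    flagged = []
    for i, row in enumerate(c_r):
        integrator = 0
        for value in row[:-1]:     # results leave the array one by one
            integrator += value
        if integrator != row[-1]:  # comparator at the end of the row
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    c_r = row_checksum_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(detect_rows(c_r))  # nothing flagged for an error-free result
    c_r[1][0] ^= 1           # flip the least significant bit of one result
    print(detect_rows(c_r))  # the affected row is flagged
```

Without the column checksum only the erroneous row is identified, not the element, which is sufficient when correction is done by recalculation.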

Next, co-simulation and experimentation are discussed.

Simulation models for the design according to the above embodiment were created for MATLAB and HSPICE to investigate the functionality and implementation characteristics. The analog-digital co-simulation setting is depicted in Fig. 4. As no standard cell libraries exist that are characterized for reduced voltage operations, the main focus was on HSPICE analyses of transient behavior under different operation voltages, while the MATLAB model was used for error detection.

As a partial confirmation of the scheme in the real world, reduced voltage experiments were carried out on an FPGA.

In the HSPICE model of the processing elements, an 8-bit Wallace-tree multiplier is followed by a 24-bit ripple-carry adder. The structure is shown in Fig. 5. While the HSPICE simulations are time consuming, detailed results were desired rather than probabilistic models to estimate the energy impacts. The digital behavior was modeled using MATLAB, closing the loop from analog simulations.

HSPICE was used to model variations in W/L, oxide thickness, temperature, etc. Signal transitions were obtained from system level simulations with random data inputs to compile sets of test vectors.

Overhead Analysis is discussed next.

The number of arithmetic operations of an N by N matrix multiplication is $2N^{3} - N^{2}$. Extending the input matrices with checksums according to Fig. 2, the operation count increases to $2N^{3} + 3N^{2}$. Detection of errors through only the row checksum vector requires an additional $N^{2}$ summation operations plus N comparisons. The outcome is $2N^{3} + 4N^{2} + 3N$ operations.

With large enough matrices the added overhead is small, 8.0% for N=32 and 2.5% for N=100. In contrast to similar methods such as Result Checking (RC), whose overhead rate is O(N), ABFT has an overhead rate of O(1/N). Moreover, ABFT can be adapted to operate on-the-fly.
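The operation counts above can be evaluated directly. The sketch below computes the relative overhead against the plain multiplication count; the choice of baseline is an assumption here, and the result lands at roughly 8% for N=32 and 2.5% for N=100, with the exact rounding depending on that choice.

```python
def ops_plain(n):
    """Arithmetic operations of a plain N x N matrix multiplication."""
    return 2 * n**3 - n**2


def ops_abft(n):
    """Operations with checksum-extended inputs plus row checksum detection."""
    return 2 * n**3 + 4 * n**2 + 3 * n


def overhead(n):
    """Relative overhead of ABFT, using the plain count as the baseline."""
    return (ops_abft(n) - ops_plain(n)) / ops_plain(n)


if __name__ == "__main__":
    for n in (8, 32, 100, 256):
        print(f"N = {n:4d}  overhead = {overhead(n) * 100:.2f} %  "
              f"N * overhead = {n * overhead(n):.2f}")
```

The last printed column stays close to a constant as N grows, which is the O(1/N) behavior stated above.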

The results according to the above disclosed embodiment are discussed next. Figures 6, 7 and 8 are referred to in that regard.

The co-simulation model was used to investigate the potential of ABFT in operating a systolic array at reduced voltages. The scheme in our simulations was to keep the clock frequency and reduce the operating voltage until errors start appearing. Then, the clock frequency was reduced by predetermined steps until the errors disappear.

Matrix size 32-by-32 was selected due to its application relevance. The dots in the upper part of Fig. 6 show the power dissipation of the ABFT logic augmented systolic array when the operating voltage is gradually reduced in 0.01 V steps, and the frequency is adjusted after error detection. On average, the highest power consumption over all types of PEs is around 744 µW at Vdd = 0.9 V and fclk = 400 MHz. The recalculations can be observed as momentary power dissipation peaks taking place after a frequency drop.

The lower part of Fig. 6 shows the correct throughput rates ("goodput") for the 32-by-32 matrix multiplication systolic array with ABFT logic. These results are for long runs at each voltage after frequency adjustment. The impacts of process, voltage and temperature variations were simulated by adding timing variations, following a previously presented approach and with a fixed relative variance of the clock period.

Voltage reduction from 1 V to 0.7 V saves half of the energy for a matrix multiplication without compromising the throughput. When the near-threshold region is approached at 0.5 V, the energy use is reduced by a further 70%, while the goodput drops to half. Aiming at higher energy efficiency sacrifices the goodput: in the near-threshold region in the vicinity of 0.4 V the goodput is only 9% of that at 0.7 V, while the energy per 32-by-32 matrix multiplication is only 8%.

The silent error rate or the share of errors that can find their way into the output without being detected is a relevant parameter for applications. It can be impacted by the design of error detection logic that in our case is fairly sensitive.

The worst case takes place when only 1 bit in the output of PEs is affected, and the erroneous results "neutralize" each other. In this arguably rare scenario, the probability of a silent error in the final matrix multiplication result is 50%. This is the upper bound with only one erroneous row. However, when multiplications are carried out one after another by the PEs, such errors are most likely to become detected, triggering an operating point change. At the application level, such as in telecommunications, a silent error might result in a re-transmission due to packet errors.
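One possible reading of this worst case, offered here only as an assumption, is that two results in the same row each suffer a flip of the same bit position. A flip adds the bit weight when the original bit was 0 and subtracts it when it was 1, so the two errors cancel in the row sum exactly when the original bits differ. The sketch below enumerates the four equally likely cases.

```python
from itertools import product


def silent_fraction(bit=0):
    """Fraction of cases where two same-position bit flips cancel in a row sum."""
    silent = 0
    total = 0
    for b0, b1 in product((0, 1), repeat=2):  # original values of the flipped bit
        x0, x1 = b0 << bit, b1 << bit         # contribution before the flips
        y0, y1 = x0 ^ (1 << bit), x1 ^ (1 << bit)  # contribution after the flips
        total += 1
        if y0 + y1 == x0 + x1:                # row checksum unchanged: silent
            silent += 1
    return silent / total


if __name__ == "__main__":
    print(silent_fraction())
```

Under this reading half of the cases escape a row checksum comparison, which is consistent with the 50% upper bound mentioned above.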

Fig. 7 shows the error rate for different variations and the error rate in the final output of a row of PEs. To estimate the silent error rate we first modeled the bit error rate of a PE using a fixed voltage (Vdd = 0.7 V) and clock frequency (fclk = 280 MHz) under different random variations in HSPICE. The variations were introduced as clock jitter and were considered to represent sources for timing errors collectively. The resulting error probability density was then used within PEs of the systolic array to flip their output bits.

In the worst case for ABFT error detection, only a single row is struck by errors, whereas in a more realistic case the PEs in all rows undergo similar variations. This lowers the chances of a silent error passing through to the output.

Experimentation using an FPGA design was carried out to demonstrate aggressive elimination of guard band voltages without incurring erroneous results. Using a 32-by-32 matrix multiplier synthesized on a Xilinx Zynq-7000 SoC ZC702 (XC7Z020) evaluation board, different operational temperature/voltage points were explored as depicted in Fig. 8.

The programmable logic, BRAM and auxiliary circuit voltage rails were adjusted by sending Power Management Bus (PMBus) commands to the voltage regulator (UCD9248). According to the timing analysis tool of the vendor, the maximum clocking for the programmable logic design was 120 MHz, while we overclocked to 250 MHz to force the FPGA to produce errors to be detected in experimentation. To investigate the sensitivity of ABFT, a few thousand trials per operation point were carried out. Based on the experiments, in the low-voltage setting the error rate was reduced at higher temperatures, which is explained by Inverse Temperature Dependence. The ABFT approach can utilize this opportunity to increase the clock frequency and to improve energy efficiency.

Now discussing the main invented concept in the form of a main block diagram according to an embodiment of the present invention, we refer to Fig. 9.

A matrix accelerator processing system 10 is illustrated, and it comprises a matrix accelerator 11. The matrix accelerator processing system 10 is configured to operate systolic arrays (see the Input and Output). The matrix accelerator 11 is configured to output data to an Algorithm-Based Fault Tolerance (ABFT) based error detection module 12, which is applied to compute checksums and detect errors online within the systolic array. Thereafter, the ABFT-based error detection module 12 is configured to forward possible detected errors (i.e. error feedback) to a dynamic voltage and frequency controlling module 13, which is configured to reduce an operational voltage of the matrix accelerator 11 in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator 11 in case of at least one detected error. Thereafter, the matrix accelerator processing system 10 is configured to reperform the ABFT based error detection for the matrix accelerator 11 output data with an adjusted operational voltage. In this way, the matrix accelerator processing system 10 is configured to find a lowest operational voltage where the number of the detected errors is zero.
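The feedback loop described above can be sketched in software as follows. This is a simplified model under stated assumptions, not the implementation of the embodiment: `run_with_abft` is a hypothetical callback that runs the accelerator at a given voltage and returns the number of errors flagged by the ABFT module, and `toy_accelerator` is a made-up stand-in whose errors appear below 0.80 V.

```python
from typing import Callable, Optional


def find_lowest_clean_voltage(
    run_with_abft: Callable[[float], int],
    v_start: float,
    v_min: float,
    v_max: float,
    v_step: float,
) -> Optional[float]:
    """Lower the voltage while no errors are detected, raise it on errors.

    Returns the lowest voltage at which zero errors were detected, or None
    if no error-free voltage was found inside [v_min, v_max].
    """
    v = v_start
    lowest_clean = None
    while v_min <= v <= v_max:
        errors = run_with_abft(v)
        if errors == 0:
            lowest_clean = v
            v = round(v - v_step, 6)  # no detected errors: reduce the voltage
        else:
            v = round(v + v_step, 6)  # detected errors: increase the voltage
            if lowest_clean is not None:
                break  # the step above was clean, so the search is finished
    return lowest_clean


def toy_accelerator(v: float) -> int:
    # ASSUMPTION: toy model in which timing errors appear below 0.80 V.
    return 0 if v >= 0.80 else 3


if __name__ == "__main__":
    print(find_lowest_clean_voltage(toy_accelerator, 1.0, 0.6, 1.0, 0.05))
```

The sketch evaluates each voltage once. With real hardware, where errors are probabilistic, each operating point would presumably be evaluated over many trials before it is accepted.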


As shown in Fig. 9, the dynamic voltage and frequency controlling module 13 is capable of controlling the frequency of the clock 14, and the voltage Vdd (i.e. the operational voltage of the matrix accelerator 11) via the voltage controller 15. In the present invention, either the frequency or the voltage, or both the frequency and the voltage, can be adjusted based on the error feedback.

In an embodiment of the present invention, the matrix accelerator processing system 10 is configured to keep the clock 14 frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module 13.

In brief, the proposed ABFT approach of the embodiments presented above enables removing the extra voltage guard bands determined by the circuit manufacturers, without compromising performance. Furthermore, efficient near-threshold operation points can be reached when the clock rate is adjusted as well. Another interesting use case is approximate computing, where the solution can be used to ensure that the error probabilities are confined within certain bounds.

Summarizing the above, the energy savings potential indicated by transistor-level HSPICE simulations was experimentally confirmed with an FPGA. Integration of the proposed ABFT scheme into a systolic array according to the embodiment of the invention allows on-the-fly error detection, and can be employed to find low-power operation points. With its low silent error rate, the scheme fits a wide field of applications from wireless communications to artificial intelligence. These are notable advantages for the invented concept.
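To illustrate the checksum-based error detection referred to above, the following is a minimal software sketch of the classic Huang-Abraham checksum scheme for matrix multiplication. It is offered as an assumption about the general technique only; the hardware scheme of the embodiment computes its checksums online within the systolic array and may differ in detail. All function names are hypothetical, and integer matrices are used so that the checks are exact.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksum(a):
    """Append a row that holds the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def inconsistent_rows_and_cols(c_full):
    """Return the data rows and data columns whose checksum does not match."""
    bad_rows = [i for i, row in enumerate(c_full[:-1])
                if sum(row[:-1]) != row[-1]]
    data_cols = list(zip(*c_full))[:-1]
    bad_cols = [j for j, col in enumerate(data_cols)
                if sum(col[:-1]) != col[-1]]
    return bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# The product of the augmented inputs carries its own row and column checksums.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))

# Emulate a single timing error in one output element.
faulty = [row[:] for row in c_full]
faulty[1][0] += 1

if __name__ == "__main__":
    print(inconsistent_rows_and_cols(c_full))  # no mismatch expected
    print(inconsistent_rows_and_cols(faulty))  # mismatch in row 1, column 0
```

A single corrupted element shows up as exactly one inconsistent row and one inconsistent column, which is what makes the check usable as error feedback for voltage control.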

Furthermore, the invented concept of applying ABFT in systolic arrays for low-voltage operation has been implemented in a real device (an FPGA) for demonstration purposes, rather than only simulated. As above, the focus is on the energy savings achievable through voltage reductions. Matrix multiplication was selected as an example case because it is at the core of key operations in wireless communications and deep neural networks. The matrix multiplier design used on the FPGA is a run-of-the-mill scheme generated using a vendor-provided HLS tool.


Further summarizing the invented concept, verified both by simulations and by an FPGA implementation, and comparing it to the earlier cited prior art, we note the following.

To the best of our knowledge, this appears to be the first contribution in which a low-overhead algorithmic error detection technique is employed to realize a low-voltage processing solution. The disclosed ABFT based voltage control scheme is independent of architectural optimizations, and can therefore be used with any matrix multiplier design. Implementations of other algorithms can be foreseen to benefit from matching ABFT schemes.

Designing chips to operate at optimum near-threshold points is notoriously difficult, due to exacerbated process parameter and temperature dependencies that need to be modeled. The demonstrated ABFT based error feedback mechanism, however, is an alternative adaptive approach that fits applications where occasional errors can be tolerated.

While its utility was now demonstrated with an FPGA, it is possible to envision applications with ASIC designs.

Furthermore, the objective of the present invention was to present and demonstrate safe voltage reduction of a matrix multiplication accelerator using ABFT. The results demonstrate that the energy efficiency of FPGA based matrix multiplication can be improved substantially by cutting the voltage margins. Moreover, the utility of an ABFT based low-overhead feedback mechanism was shown in controlling the voltages based on errors originating from the internal logic, Block RAM, and auxiliary circuit.
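As a rough plausibility check of the savings from cutting voltage margins, the standard CMOS approximation that dynamic energy per operation scales with the square of the supply voltage can be applied to the 22 percent voltage reduction reported in the abstract. The assumptions here are that the clock frequency is unchanged and that only dynamic energy is considered; the symbols are introduced for illustration.

```latex
\[
E_{\mathrm{dyn}} \propto C\,V_{dd}^{2}
\quad\Longrightarrow\quad
\frac{E_{\mathrm{dyn}}(0.78\,V_{\mathrm{nom}})}{E_{\mathrm{dyn}}(V_{\mathrm{nom}})}
= 0.78^{2} \approx 0.61,
\qquad
\frac{1}{0.78^{2}} \approx 1.64 .
\]
```

Under these assumptions the dynamic term alone accounts for a gain of roughly 1.64 times. The reported 1.8-fold improvement is somewhat higher, which could be due to static (leakage) power also falling with voltage, although the source does not break the figure down.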

The invented concept comprises an integrated circuit, a method and a computer program product for reliable low-voltage operation of a matrix accelerator processing system.

The present invention is not restricted merely to the embodiments discussed above, but may vary within the scope of the claims.


Claims

1. An integrated circuit for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein
- the matrix accelerator processing system (10) is configured to operate systolic arrays,
characterized in that
- the matrix accelerator (11) is configured to output data to Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,
- the ABFT-based error detection module (12) is configured to forward possible detected errors to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,
- the matrix accelerator processing system (10) is configured to reperform the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage, and
- the matrix accelerator processing system (10) is configured to find a lowest operational voltage where the number of the detected errors is zero.

2. The integrated circuit according to claim 1, characterized in that the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module (13).

3. The integrated circuit according to claim 1, characterized in that the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA.


4. The integrated circuit according to claim 1, characterized in that the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step.

5. The integrated circuit according to any of claims 1-4, characterized in that a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11).

6. The integrated circuit according to any of claims 1-5, characterized in that the input matrices of the matrix accelerator (11) are augmented with checksum property in the Algorithm-Based Fault Tolerance based error detection.

7. A method for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the method comprises the step of:
- operating systolic arrays in the matrix accelerator processing system (10),
characterized in that the method further comprises the steps of:
- outputting data from the matrix accelerator (11) to Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,
- forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,
- reperforming the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage by the matrix accelerator processing system (10), and
- finding, by the matrix accelerator processing system (10), a lowest operational voltage where the number of the detected errors is zero.

8. The method according to claim 7, characterized in that the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module (13).

9. The method according to claim 7, characterized in that the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA.

10. The method according to claim 7, characterized in that the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step.

11. The method according to any of claims 7-10, characterized in that a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11).

12. The method according to any of claims 7-11, characterized in that the input matrices of the matrix accelerator (11) are augmented with checksum property in the Algorithm-Based Fault Tolerance based error detection.

13. A computer program product for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein
- the matrix accelerator processing system (10) is configured to operate systolic arrays,
characterized in that the computer program product comprises program code which is executable by a processor, wherein the computer program product is configured to execute the steps of:
- outputting data from the matrix accelerator (11) to Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,
- forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,
- reperforming the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage by the matrix accelerator processing system (10), and
- finding, by the matrix accelerator processing system (10), a lowest operational voltage where the number of the detected errors is zero.