CN108733628B - Parallel matrix multiplication algorithm reinforcing method - Google Patents
- Publication number
- CN108733628B
- Authority
- CN
- China
- Prior art keywords
- errors
- error
- corrected
- matrix multiplication
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Theoretical Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Detection And Prevention Of Errors In Transmission (AREA)
- Detection And Correction Of Errors (AREA)
Abstract
The invention discloses a reinforcement method for a parallel matrix multiplication algorithm that reduces the overhead of algorithm-based fault tolerance (ABFT) reinforcement for matrix multiplication. The method comprises the following steps: (1) encode the inputs and output of the matrix multiplication, check the computed result against the encoded values, and store a list of all possible errors; (2) preprocess the error list to eliminate misjudged errors and avoid unnecessary corrections, using a relative-error method that adds an error check before correction, and then correct the remaining errors; whenever one or more errors are corrected, the error list is updated, so that most errors can be corrected over multiple iterations; (3) recompute the remaining errors that the algorithm cannot correct. The reinforcement method can improve both system reliability and execution efficiency.
Description
Technical Field
The invention relates to a reinforcement technique for parallel matrix multiplication algorithms, which can be applied in any technical field that uses matrix multiplication, such as image processing and data statistics.
Background
At present, the parallel computing architecture of the graphics processing unit (GPU) greatly increases the speed of large-scale computation and shows great potential in high-performance computing. GPUs are used in a variety of areas, such as image processing, data statistics, and other high-performance computing applications, and are becoming increasingly common in modern industry. In recent years, GPU manufacturers such as NVIDIA have been developing GPU computing platforms for autonomous driving applications.
High-energy particles can flip bits in memory elements or induce transient voltage pulses in other logic circuits, such as computational units. As CMOS process feature sizes continue to shrink, logic circuits become more sensitive to soft errors caused by high-energy particles. Numerous experimental results indicate that GPUs have a higher error rate than other integrated circuit devices under high-energy particle strikes. Reliability requirements, however, depend on the application. GPU reliability is critical in applications such as spacecraft, satellites, or autonomous driving, where soft errors can have extremely serious consequences. In personal entertainment applications such as audio or video, a certain number of soft errors can be tolerated.
Error correction code (ECC) mechanisms are among the most common reinforcement techniques for memory and can also be applied in GPUs to reduce soft error rates. However, this approach incurs high costs in time, space, and power consumption, and only certain families of high-end GPUs are equipped with ECC. Other common reinforcement methods, such as redundancy and checkpointing, mainly rely on recomputation after an error is detected. One redundancy-based technique is triple modular redundancy (TMR), which has been shown experimentally to improve system reliability. However, although TMR handles soft errors effectively, it triples resource consumption, and in some applications resources are limited.
Therefore, an algorithm-based reinforcement technique for matrix multiplication is proposed that improves both system reliability and execution efficiency.
Disclosure of Invention
The invention aims to provide a matrix multiplication reinforcement method based on algorithm-based fault tolerance (ABFT) that achieves reinforcement with fewer resources; error injection simulation results show that its execution efficiency is higher than that of the prior art. The invention focuses on correcting multiple randomly distributed errors. In that case, conventional checksum verification may flag more error locations than there are actual errors, which wastes time during error correction, even though in most cases only a few of the flagged locations are real errors. To solve this problem, the invention provides a new ABFT-based reinforcement technique that further reduces the overhead.
The technical scheme of the invention is as follows:
A reinforcement method for a parallel matrix multiplication algorithm, used to reduce the ABFT overhead of matrix multiplication and FFT, comprises the following steps:
(1) Encode the inputs and output of the matrix multiplication, check the computed result against the encoded values, and store a list of all possible errors.
(2) Preprocess the error list to eliminate misjudged errors and avoid unnecessary corrections. Misjudged errors are eliminated with a relative-error method: an error check is added before correction. The remaining errors are then corrected. Whenever one or more errors are corrected, the error information is updated, so that most errors can be corrected over multiple iterations.
(3) Recompute the remaining errors that the algorithm cannot correct.
The invention has the following beneficial effects:
The reinforcement method can improve both system reliability and execution efficiency.
Drawings
FIG. 1 is a schematic diagram of a specific encoding process;
FIG. 2 is a schematic diagram of a typical error distribution;
FIG. 3 is an example of randomly distributed errors in Example 1, in which black dots are real errors and gray dots are potential errors;
FIG. 4 is an example of the procedure for correcting the randomly distributed errors in Example 1;
FIG. 5 is a runtime comparison between the algorithm of the invention and the existing EXABFT algorithm.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
(1) First, encode the inputs and output of the matrix multiplication, check the computed result against the encoded values, and store all possible error positions. The specific encoding process is shown in FIG. 1.
The column sums of matrix A are computed and appended to A to form Ac; likewise, the row sums of matrix B are appended to B to form Br. Multiplying Ac by Br yields the result matrix M together with Mc and Mr, the column sums and row sums of M. Because Mc and Mr were never affected by radiation effects in a large number of experiments, they are assumed to be the correct column and row sums. After the computation finishes, the column sums Mc' and row sums Mr' are recomputed from M itself. Comparing Mc with Mc' and Mr with Mr' locates the errors: the integers #Err_row and #Err_col store the number of rows and columns that contain errors, the arrays fault_rows and fault_cols hold the indices of the faulty rows and columns, and the intersections of faulty rows and faulty columns are the possible positions of calculation errors.
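The encoding and detection step can be sketched in plain Python as follows. This is a minimal illustration, not the patented GPU implementation: the function names `encoded_multiply`, `detect`, and `differs`, the tolerance `rel_tol`, and the small test matrices are all assumptions introduced here.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def encoded_multiply(a, b):
    """Multiply Ac by Br and split the result into M, Mc and Mr."""
    n, p = len(a), len(b[0])
    ac = a + [[sum(col) for col in zip(*a)]]      # A plus a row of column sums
    br = [row + [sum(row)] for row in b]          # B plus a column of row sums
    full = matmul(ac, br)
    m = [row[:p] for row in full[:n]]
    mc = full[n][:p]                              # expected column sums of M
    mr = [full[i][p] for i in range(n)]           # expected row sums of M
    return m, mc, mr


def differs(x, y, rel_tol=1e-9):
    """Relative-error test; rel_tol is an illustrative value."""
    return abs(x - y) > rel_tol * max(abs(x), abs(y), 1.0)


def detect(m, mc, mr):
    """Recompute Mc' and Mr' from M and list the rows/columns that disagree."""
    mc_prime = [sum(col) for col in zip(*m)]
    mr_prime = [sum(row) for row in m]
    fault_rows = [i for i in range(len(m)) if differs(mr[i], mr_prime[i])]
    fault_cols = [j for j in range(len(m[0])) if differs(mc[j], mc_prime[j])]
    return fault_rows, fault_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
m, mc, mr = encoded_multiply(a, b)
m[0][1] += 5                                      # simulate one soft error in M
fault_rows, fault_cols = detect(m, mc, mr)
err_row, err_col = len(fault_rows), len(fault_cols)   # #Err_row and #Err_col
```

The last row and last column of the encoded product carry Mc and Mr, so the checksums cost one extra row and one extra column of multiplication rather than a second full product.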
(2) Preprocess the error list to eliminate misjudged errors and avoid unnecessary corrections. Misjudged errors are eliminated with a relative-error method. A typical error distribution is shown in FIG. 2, where the gray area is the misjudged area. Before an error is corrected, an additional check determines whether equation (1) holds for the intersection of row i and column j. If it holds, the intersection is the only error in row i and column j; otherwise, the intersection is a misjudged location.
Mc[j] - Mc'[j] = Mr[i] - Mr'[i]    (1)
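The reasoning behind equation (1) is that if element (i, j) is off by some amount e and is the only error in both its row and its column, then Mr'[i] and Mc'[j] are both off by e, so the two differences are equal. A hedged sketch of the check follows; `same_discrepancy`, `split_candidates`, and the tolerance value are hypothetical names and choices made for this illustration. Note that two unrelated errors of exactly equal magnitude could also satisfy the equation by coincidence, so the sketch assumes distinct error magnitudes.

```python
def same_discrepancy(dc, dr, rel_tol=1e-6):
    """Compare two checksum discrepancies by relative error (illustrative tolerance)."""
    scale = max(abs(dc), abs(dr))
    return scale > 0 and abs(dc - dr) <= rel_tol * scale


def split_candidates(mc, mc_prime, mr, mr_prime, fault_rows, fault_cols):
    """Split row/column intersections into single errors and misjudged ones."""
    singles, misjudged = [], []
    for i in fault_rows:
        for j in fault_cols:
            dc = mc[j] - mc_prime[j]          # left side of equation (1)
            dr = mr[i] - mr_prime[i]          # right side of equation (1)
            target = singles if same_discrepancy(dc, dr) else misjudged
            target.append((i, j))
    return singles, misjudged


# Checksums of the 3x3 matrix holding 1..9, with +5 injected at (0, 0)
# and +7 injected at (1, 2).
mr, mc = [6, 15, 24], [12, 15, 18]
mr_prime, mc_prime = [11, 22, 24], [17, 15, 25]
singles, misjudged = split_candidates(mc, mc_prime, mr, mr_prime, [0, 1], [0, 2])
```

Here two faulty rows and two faulty columns give four candidate positions, of which only the two on the diagonal of the candidate grid pass the check.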
The remaining real errors are corrected using equation (2) or (3). Whenever one or more errors are corrected, the error information fault_rows and fault_cols is updated, and encoding and error detection are performed again; most errors can be corrected after several iterations.
Mcorrect[i,j] = Merror[i,j] - (Mr'[i] - Mr[i])    (2)
Mcorrect[i,j] = Merror[i,j] - (Mc'[j] - Mc[j])    (3)
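Equations (2) and (3) translate almost directly into code. The sketch below uses hypothetical helper names (`correct_by_row`, `correct_by_col`) and a small hand-made example; equation (2) is only valid when (i, j) is the sole error in row i, and equation (3) when it is the sole error in column j.

```python
def correct_by_row(m, i, j, mr, mr_prime):
    """Equation (2): subtract the row-sum discrepancy from element (i, j)."""
    m[i][j] -= mr_prime[i] - mr[i]


def correct_by_col(m, i, j, mc, mc_prime):
    """Equation (3): subtract the column-sum discrepancy from element (i, j)."""
    m[i][j] -= mc_prime[j] - mc[j]


# Correct product [[19, 22], [43, 50]] with element (0, 1) corrupted to 27.
mr, mc = [41, 93], [62, 72]            # trusted row and column sums
mr_prime, mc_prime = [46, 93], [62, 77]  # sums recomputed from the faulty M

m_row = [[19, 27], [43, 50]]
correct_by_row(m_row, 0, 1, mr, mr_prime)

m_col = [[19, 27], [43, 50]]
correct_by_col(m_col, 0, 1, mc, mc_prime)
```

Both routes restore the same value, which is why the text allows either equation once equation (1) has confirmed a single error.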
(3) Elements whose errors cannot be corrected by equation (2) or (3) are recalculated.
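Steps (1) to (3) can be combined into one iterative loop, sketched below under stated assumptions: Mc and Mr are trusted, `recompute` is a hypothetical callback that recalculates a single element of the product, `max_iters` is an arbitrary bound, and at most one correction per row and per column is applied in each pass so that stale checksum differences are never reused.

```python
def sums(m):
    """Return (row sums, column sums) of a list-of-lists matrix."""
    return [sum(r) for r in m], [sum(c) for c in zip(*m)]


def differs(x, y, rel_tol=1e-9):
    """Relative-error comparison; the tolerance is an illustrative choice."""
    return abs(x - y) > rel_tol * max(abs(x), abs(y), 1.0)


def locate(m, mc, mr):
    """Return fault_rows, fault_cols and the recomputed sums Mr', Mc'."""
    mr_p, mc_p = sums(m)
    fault_rows = [i for i, s in enumerate(mr_p) if differs(s, mr[i])]
    fault_cols = [j for j, s in enumerate(mc_p) if differs(s, mc[j])]
    return fault_rows, fault_cols, mr_p, mc_p


def harden(m, mc, mr, recompute, max_iters=8):
    """Repair m in place; return the positions that had to be recomputed."""
    for _ in range(max_iters):
        fault_rows, fault_cols, mr_p, mc_p = locate(m, mc, mr)
        used_rows, used_cols = set(), set()
        for i in fault_rows:
            for j in fault_cols:
                if i in used_rows or j in used_cols:
                    continue
                dr, dc = mr_p[i] - mr[i], mc_p[j] - mc[j]
                if not differs(dr, dc):      # equation (1) holds
                    m[i][j] -= dr            # equation (2)
                    used_rows.add(i)
                    used_cols.add(j)
        if not used_rows:                    # nothing corrected: stop iterating
            break
    fault_rows, fault_cols, _, _ = locate(m, mc, mr)
    leftovers = [(i, j) for i in fault_rows for j in fault_cols]
    for i, j in leftovers:
        m[i][j] = recompute(i, j)            # step (3): recalculation
    return leftovers


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0, 2], [0, 1, 1], [3, 1, 0]]
truth = [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
mr, mc = sums(truth)                         # assumed-correct checksums

m = [row[:] for row in truth]
m[0][0] += 5                                 # isolated error
m[2][1] += 3                                 # two errors sharing row 2
m[2][2] += 4
recomputed = harden(
    m, mc, mr, lambda i, j: sum(a[i][k] * b[k][j] for k in range(3)))
```

In this run the isolated error is repaired from the checksums, while the two errors that share a row fail equation (1) and fall through to recalculation.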
Example 1:
As shown in FIG. 3, the errors are randomly distributed. Checksum verification detects deviations in the checksums of 3 rows and 4 columns, giving 12 possible error locations, although there are in fact only 4 errors. Equation (1) identifies two of these locations as single errors, and they are corrected first. Once they have been corrected using equation (2) or (3), fault_rows and fault_cols are updated; the remaining 2 errors lie in the same row and can therefore be corrected with very few operations. The correction process is shown in FIG. 4. In this example, the proposed scheme avoids unnecessary correction of "false" errors and reduces the number of potential errors within very few iterations.
In error injection simulations, the method is more efficient than the existing method. Error injection simulation tests were performed on matrix multiplications of different sizes. FIG. 5 compares the running time of the proposed algorithm with that of the existing EXABFT algorithm when 10 errors are injected; the proposed algorithm executes more efficiently than EXABFT.
The above is only a preferred embodiment of the present invention. Those skilled in the art can make modifications and variations without departing from the technical principle of the invention, and such modifications and variations also fall within the protection scope of the invention.
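The counts in Example 1 can be reproduced with a small script. The error layout below is hypothetical (the actual positions in FIG. 3 are not recoverable from the text), and the final step reflects one reading of "corrected using very few operations": once only one faulty row remains, every faulty column's discrepancy must come from that row, so equation (3) applies directly. Integer data is used so that exact comparison stands in for the relative-error test.

```python
def sums(m):
    """Return (row sums, column sums) of a list-of-lists matrix."""
    return [sum(r) for r in m], [sum(c) for c in zip(*m)]


truth = [[4 * i + j for j in range(4)] for i in range(4)]   # stand-in for M
mr, mc = sums(truth)                                        # trusted Mr and Mc

m = [row[:] for row in truth]
injected = {(0, 0): 1, (1, 1): 2, (2, 2): 4, (2, 3): 8}     # 4 real errors
for (i, j), e in injected.items():
    m[i][j] += e


def faults(m):
    """List faulty rows and columns along with the recomputed sums."""
    mr_p, mc_p = sums(m)
    rows = [i for i in range(4) if mr_p[i] != mr[i]]
    cols = [j for j in range(4) if mc_p[j] != mc[j]]
    return rows, cols, mr_p, mc_p


fault_rows, fault_cols, mr_p, mc_p = faults(m)
candidates = [(i, j) for i in fault_rows for j in fault_cols]   # 3 x 4 = 12

# Equation (1) keeps only the intersections that are single errors.
singles = [(i, j) for i, j in candidates
           if mc[j] - mc_p[j] == mr[i] - mr_p[i]]
for i, j in singles:
    m[i][j] -= mr_p[i] - mr[i]                                  # equation (2)

# Update the error information; the two remaining errors share one row.
fault_rows2, fault_cols2, mr_p, mc_p = faults(m)
if len(fault_rows2) == 1:
    i = fault_rows2[0]
    for j in fault_cols2:
        m[i][j] -= mc_p[j] - mc[j]                              # equation (3)
```

With this layout the 12 candidates shrink to 2 corrections in the first pass and 2 more in the second, so 8 misjudged positions are never touched.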
Claims (1)
1. A reinforcement method for a parallel matrix multiplication algorithm, used to reduce the ABFT overhead of matrix multiplication, characterized by comprising the following steps:
(1) encoding the inputs and output of the matrix multiplication, checking the computed result against the encoded values, and storing a list of all possible errors;
(2) preprocessing the error list to eliminate misjudged errors and avoid unnecessary corrections, wherein misjudged errors are eliminated with a relative-error method in which an error check is added before correction, and the error list is preprocessed according to whether equation (1) holds:
Mc[j] - Mc'[j] = Mr[i] - Mr'[i]    (1);
if it holds, the intersection is the only error in row i and column j; otherwise, the intersection is a misjudged location;
the remaining errors are then corrected using equation (2) or (3):
Mcorrect[i,j] = Merror[i,j] - (Mr'[i] - Mr[i])    (2);
Mcorrect[i,j] = Merror[i,j] - (Mc'[j] - Mc[j])    (3);
wherein Mc and Mr are the correct column and row sums, and Mc' and Mr' are the column sums and row sums computed from the M matrix after the calculation is finished; whenever one or more errors are corrected, the error information is updated, and most errors are corrected after multiple iterations;
(3) recomputing the remaining errors that cannot be corrected by the algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810502409.1A CN108733628B (en) | 2018-05-23 | 2018-05-23 | Parallel matrix multiplication algorithm reinforcing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810502409.1A CN108733628B (en) | 2018-05-23 | 2018-05-23 | Parallel matrix multiplication algorithm reinforcing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108733628A (en) | 2018-11-02 |
CN108733628B (en) | 2020-01-03 |
Family
ID=63934982
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810502409.1A Active CN108733628B (en) | 2018-05-23 | 2018-05-23 | Parallel matrix multiplication algorithm reinforcing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108733628B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FI130137B (en) | 2021-04-22 | 2023-03-09 | Univ Of Oulu | A method for increase of energy efficiency through leveraging fault tolerant algorithms into undervolted digital systems |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104133738A (en) * | 2014-07-11 | 2014-11-05 | 中国人民解放军信息工程大学 | SEU-resistant method for satellite-borne MIMO detector based on SEC-DED |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9362953B2 (en) * | 2013-08-02 | 2016-06-07 | Infineon Technologies Ag | Efficient error correction of multi-bit errors |
Non-Patent Citations (4)
Title |
---|
A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units; Claus Braun et al.; 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks; 2014-09-22; pp. 443-454 * |
Algorithm-Based Fault Tolerance for Matrix Operations; Kuang-Hua Huang et al.; IEEE Transactions on Computers; 1984-06-30; Vol. C-33, No. 6; pp. 518-528 * |
Evaluating error detection mechanisms with a software-implemented fault injection tool; 王建莹 et al.; 《小型微型计算机系统》; 2000-05-31; Vol. 21, No. 5; pp. 497-499 * |
Fault-tolerant matrix computation and its error detection and correction performance; 隋厚堂 et al.; 《中国空间科学学会空间探测专业委员会第十次学术会议论文集》; 1997-10-31; pp. 128-130 * |
Also Published As
Publication number | Publication date |
---|---|
CN108733628A (en) | 2018-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109542668B (en) | NAND FLASH memory-based verification method, terminal equipment and storage medium | |
CN111338840B (en) | Space data protection method, storage medium, computer program, system and terminal | |
US8010875B2 (en) | Error correcting code with chip kill capability and power saving enhancement | |
US9092349B2 (en) | Storage of codeword portions | |
KR20180086816A (en) | Memory device and electronic device performing adaptive error correction with pre-checking error rate and method of operating the memory device | |
US9189327B2 (en) | Error-correcting code distribution for memory systems | |
CN104409103A (en) | Novel two-dimensional coding reinforcing method and circuit arrangement for aerospace memory | |
US10291258B2 (en) | Error correcting code for correcting single symbol errors and detecting double bit errors | |
CN108733628B (en) | Parallel matrix multiplication algorithm reinforcing method | |
US9043683B2 (en) | Error protection for integrated circuits | |
US9041428B2 (en) | Placement of storage cells on an integrated circuit | |
US10108486B2 (en) | Error protection | |
Schöll et al. | Low-overhead fault-tolerance for the preconditioned conjugate gradient solver | |
Silva et al. | Extended matrix region selection code: An ECC for adjacent multiple cell upset in memory arrays | |
US9959166B2 (en) | Error correction for non-volatile memory | |
US8661320B2 (en) | Independent orthogonal error correction and detection | |
Venkataraman et al. | Multi-directional error correction schemes for SRAM-based FPGAs | |
US7904786B2 (en) | Assisted memory system | |
Jang et al. | MATE: Memory-and retraining-free error correction for convolutional neural network weights | |
CN107168817B (en) | Data restoration method and device applied to storage array and storage equipment | |
US20140201599A1 (en) | Error protection for integrated circuits in an insensitive direction | |
US8499224B2 (en) | Redundant code generation method and device, data restoration method and device, and raid storage device | |
Kustov et al. | Efficiency Estimation of Single Error Correction, Double Error Detection and Double-Adjacent-Error Correction Codes | |
Singh et al. | Ram error detection and correction using HVD implementation | |
Hui et al. | Optimized software-based hardening strategies for matrix multiplication and fast fourier transform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | |
Application publication date: 2018-11-02; Assignee: Changzhou Xinsheng Semiconductor Technology Co.,Ltd.; Assignor: CHANGZHOU CAMPUS OF HOHAI University; Contract record no.: X2023980034321; Denomination of invention: A Reinforcement Method for Parallel Matrix Multiplication Algorithm; Granted publication date: 2020-01-03; License type: Common License; Record date: 2023-04-04 |