CN108733628B - Parallel matrix multiplication algorithm reinforcing method - Google Patents

Parallel matrix multiplication algorithm reinforcing method


Publication number
CN108733628B
Authority
CN
China
Prior art keywords
errors
error
corrected
matrix multiplication
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810502409.1A
Other languages
Chinese (zh)
Other versions
CN108733628A (en)
Inventor
王海滨
王杨圣
戴茜茜
惠志坚
叶静
孙洪文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Campus of Hohai University
Original Assignee
Changzhou Campus of Hohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Campus of Hohai University
Priority to CN201810502409.1A
Publication of CN108733628A
Application granted
Publication of CN108733628B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Detection And Prevention Of Errors In Transmission (AREA)
  • Detection And Correction Of Errors (AREA)

Abstract

The invention discloses a reinforcing method for a parallel matrix multiplication algorithm, used to reduce the ABFT reinforcement overhead of matrix multiplication, comprising the following steps: (1) first, the inputs and output of the matrix multiplication are encoded, the calculation result is checked against the encoded values, and the list of all possible errors is stored; (2) the error list is preprocessed to eliminate misjudged errors and avoid unnecessary correction, wherein a relative error method is used to eliminate errors, an additional error detection is performed before errors are corrected, and the remaining errors are then corrected; if one or more errors are corrected, the error list is updated, and most errors can be corrected over multiple iterations; (3) a recalculation strategy is adopted for the remaining errors that the algorithm cannot correct. The reinforcement method can improve system reliability and execution efficiency.

Description

Parallel matrix multiplication algorithm reinforcing method
Technical Field
The invention relates to a reinforcement technique for parallel matrix multiplication algorithms, applicable to the various technical fields that use matrix multiplication, such as image processing and data statistics.
Background
At present, the parallel computing architecture of the Graphics Processing Unit (GPU) greatly improves the speed of large-scale computation and shows great potential in high-performance computing applications. GPUs are used in a variety of areas, such as image processing, data statistics, and other high-performance computing applications, and are becoming increasingly popular in modern industry. In recent years, GPU manufacturers such as NVIDIA have been developing GPU computing platforms for autonomous driving applications.
High-energy particles may cause bit flips in memory elements or transient voltage pulses in other logic circuits, such as computational units. As CMOS process dimensions continue to shrink, logic circuits become more sensitive to soft errors caused by high-energy particles. Numerous experimental results indicate that GPUs have a higher error rate than other integrated circuit devices under high-energy particle strikes. It should be noted that reliability requirements are application dependent. GPU reliability is critical in applications such as spacecraft, satellites, or autonomous driving, where soft errors can have extremely serious consequences. In personal entertainment applications such as audio or video, a certain number of soft errors can be tolerated.
Error Correction Code (ECC) mechanisms are among the most common reinforcement techniques for memory and can also be applied in GPUs to reduce the soft error rate. However, this approach incurs high costs in time, space, and power consumption, and only certain families of high-end GPUs are equipped with ECC. Other common reinforcement methods, such as redundancy and checkpointing, mainly rely on recomputation after an error is detected. One redundancy-based reinforcement technique is TMR (triple modular redundancy), which has been experimentally shown to improve system reliability. However, although TMR can effectively handle soft errors, it triples resource consumption, and in some applications resources are limited.
Therefore, an algorithm-based reinforcement technique for matrix multiplication is provided, which can improve both system reliability and execution efficiency.
Disclosure of Invention
The invention aims to provide a reinforcement method for matrix multiplication based on ABFT (algorithm-based fault tolerance) that achieves algorithm reinforcement with fewer resources; error-injection simulation results show that its execution efficiency is higher than that of the prior art. The main content of the invention is the correction of multiple randomly distributed errors. In that case, conventional checksum verification algorithms may flag more error locations than there are actual errors, which wastes time in error correction, because in most cases only a few of the flagged locations are real errors. To solve this problem, the invention provides a new reinforcement technique based on ABFT that further reduces the overhead.
The technical scheme of the invention is as follows:
A reinforcing method for a parallel matrix multiplication algorithm, used to reduce the ABFT overhead of matrix multiplication and FFT, comprising the following steps:
(1) First, the inputs and output of the matrix multiplication are encoded, the calculation result is checked against the encoded values, and the list of all possible errors is stored.
(2) The error list is preprocessed to eliminate misjudged errors and avoid unnecessary correction, wherein a relative error method is used to eliminate errors and an additional error detection is performed before errors are corrected. The remaining errors are then corrected. If one or more errors are corrected, the error information is updated, and most errors can be corrected over multiple iterations.
(3) A recalculation strategy is adopted for the remaining errors that the algorithm cannot correct.
The invention has the following beneficial effects:
The reinforcement method can improve system reliability and execution efficiency.
Drawings
FIG. 1 is a schematic diagram of the specific encoding process;
FIG. 2 is a schematic diagram of a typical error distribution;
FIG. 3 is an example of randomly distributed errors in Example 1, in which black dots are real errors and gray dots are locations regarded as potential errors;
FIG. 4 is an example of the procedure for correcting the randomly distributed errors in Example 1;
FIG. 5 is a comparison of the running time of the algorithm of the invention and the existing EXABFT algorithm.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
(1) First, the inputs and output of the matrix multiplication are encoded, the calculation result is checked against the encoded values, and all possible error positions are stored. The specific encoding process is shown in FIG. 1.
The column sum of each column of matrix A is calculated and appended to the matrix as an extra row, giving Ac; likewise, the row sum of each row of matrix B is appended as an extra column, giving Br. Multiplying Ac by Br yields M together with Mc and Mr, where Mc holds the column sums and Mr the row sums of M. Since Mc and Mr were never affected by radiation effects in a large number of experiments, they are assumed to be the correct column and row sums. After the calculation is finished, the column sums Mc' and row sums Mr' are recomputed from M itself. Comparing Mc with Mc' and Mr with Mr' locates the error positions: the integers #Err_row and #Err_col store the number of rows and the number of columns containing errors, the arrays fault_rows and fault_cols contain the indexes of the erroneous rows and columns, and the intersections of the erroneous rows and columns are the possible positions of calculation errors.
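The encoding and detection step can be sketched as follows. This is a minimal pure-Python illustration on small integer matrices, not the patented GPU implementation; the helper names `matmul`, `encode_and_multiply`, and `detect` are assumptions made for the example, while `fault_rows` and `fault_cols` follow the description.

```python
def matmul(X, Y):
    """Plain matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def encode_and_multiply(A, B):
    """Return M = A*B plus reference column sums Mc and row sums Mr."""
    Ac = A + [[sum(col) for col in zip(*A)]]   # A with a column-sum row appended
    Br = [row + [sum(row)] for row in B]       # B with a row-sum column appended
    full = matmul(Ac, Br)
    M = [row[:-1] for row in full[:-1]]        # the actual product
    Mc = full[-1][:-1]                         # reference sum of each column j
    Mr = [row[-1] for row in full[:-1]]        # reference sum of each row i
    return M, Mc, Mr

def detect(M, Mc, Mr):
    """Recompute the sums of M (Mc', Mr') and list rows/columns that disagree."""
    Mc_new = [sum(col) for col in zip(*M)]
    Mr_new = [sum(row) for row in M]
    fault_rows = [i for i in range(len(Mr)) if Mr_new[i] != Mr[i]]
    fault_cols = [j for j in range(len(Mc)) if Mc_new[j] != Mc[j]]
    return fault_rows, fault_cols

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
M, Mc, Mr = encode_and_multiply(A, B)
M[0][1] += 5                                   # inject one error into M
fault_rows, fault_cols = detect(M, Mc, Mr)
print(fault_rows, fault_cols)                  # [0] [1]
```

With a single injected error, exactly one row and one column are flagged, and their intersection is the faulty element. Exact integer comparison is used here for clarity; floating-point data would need a tolerance.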
(2) The error list is preprocessed to eliminate misjudged errors and avoid unnecessary correction. Errors are eliminated with a relative error method. A typical error distribution is shown in FIG. 2, in which the gray area is the misjudged area. Before an error is corrected, an additional detection step checks whether equation (1) holds. If it does, the intersection is the only error in the i-th row and the j-th column; otherwise, the intersection is a misjudged area.
Mc[j] - Mc'[j] = Mr[i] - Mr'[i]    (1)
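The check in equation (1) can be illustrated with a short sketch. The function name `is_single_error` and the `rel_tol` parameter are assumptions: the source names a relative error method without defining it, so the tolerance below is only one plausible reading, and with `rel_tol=0` the test reduces to the exact equality of equation (1). The checksum values are made up for the example.

```python
def is_single_error(i, j, Mc, Mc_new, Mr, Mr_new, rel_tol=0.0):
    """Equation (1): the column-j mismatch equals the row-i mismatch."""
    d_col = Mc[j] - Mc_new[j]
    d_row = Mr[i] - Mr_new[i]
    scale = max(abs(d_col), abs(d_row))
    return abs(d_col - d_row) <= rel_tol * scale

# Reference sums (Mr, Mc) and recomputed sums (Mr', Mc') for a 3x3 result
# in which element (0, 0) is off by +5 and element (1, 2) is off by -3.
Mr, Mr_new = [10, 20, 30], [15, 17, 30]
Mc, Mc_new = [15, 25, 20], [20, 25, 17]

fault_rows = [i for i in range(3) if Mr[i] != Mr_new[i]]
fault_cols = [j for j in range(3) if Mc[j] != Mc_new[j]]
candidates = [(i, j) for i in fault_rows for j in fault_cols]
real = [(i, j) for (i, j) in candidates
        if is_single_error(i, j, Mc, Mc_new, Mr, Mr_new)]
print(candidates)   # [(0, 0), (0, 2), (1, 0), (1, 2)]
print(real)         # [(0, 0), (1, 2)]
```

Two of the four intersections are discarded as misjudged before any correction is attempted. Note that two errors of exactly equal magnitude would also satisfy the equality at their cross intersections, a case this sketch does not handle.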
The remaining real errors are corrected using equation (2) or (3). If one or more errors are corrected, the error information fault_rows and fault_cols is updated, and encoding and error detection are performed again; most errors can be corrected after several iterations.
Mcorrect[i,j] = Merror[i,j] - (Mr'[i] - Mr[i])    (2)
Mcorrect[i,j] = Merror[i,j] - (Mc'[j] - Mc[j])    (3)
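As a small worked instance of equations (2) and (3), the sketch below repairs one element in place. The function name `correct_element` and its `use_row` switch are illustrative assumptions; the numbers are a 2x2 product whose true value is [[19, 22], [43, 50]].

```python
def correct_element(M, i, j, Mr, Mc, use_row=True):
    """Subtract the checksum mismatch from M[i][j] and return it."""
    if use_row:
        delta = sum(M[i]) - Mr[i]                  # Mr'[i] - Mr[i], equation (2)
    else:
        delta = sum(row[j] for row in M) - Mc[j]   # Mc'[j] - Mc[j], equation (3)
    M[i][j] -= delta
    return delta

M = [[19, 27], [43, 50]]     # element (0, 1) should be 22
Mr = [41, 93]                # reference row sums
Mc = [62, 72]                # reference column sums

delta = correct_element(M, 0, 1, Mr, Mc)
print(delta, M)              # 5 [[19, 22], [43, 50]]
```

For an isolated error both equations give the same result, so either the row or the column mismatch can be used.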
(3) Elements whose errors cannot be corrected by equation (2) or (3) are recalculated.
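The recalculation fallback can be sketched as recomputing only the suspect elements from the inputs, which costs one dot product per element rather than a full re-multiplication. The function name `recompute` is an assumption, and the sketch assumes that A and B are still available and uncorrupted.

```python
def recompute(M, A, B, positions):
    """Recompute the listed elements of M = A*B in place."""
    inner = len(B)                       # shared dimension of A and B
    for i, j in positions:
        M[i][j] = sum(A[i][k] * B[k][j] for k in range(inner))
    return M

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
M = [[0, 22], [43, 99]]                  # elements (0, 0) and (1, 1) are wrong
recompute(M, A, B, [(0, 0), (1, 1)])
print(M)                                 # [[19, 22], [43, 50]]
```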
Example 1:
As shown in FIG. 3, the errors are randomly distributed. The checksum verification process detects mismatches in the checksums of 3 rows and 4 columns, i.e. 12 possible error points, although there are in fact only 4 errors. Equation (1) identifies two of these locations as single errors, each the only error in its row and column, and these are corrected first. Once they have been corrected using equation (2) or equation (3), fault_rows and fault_cols are updated; the remaining 2 errors lie in the same row and can therefore be corrected with very few operations. The correction process for this example is shown in FIG. 4. In this example, the proposed scheme avoids unnecessary correction of "false" errors and reduces the total number of potential errors in very few iterations.
In error-injection simulation, the efficiency of the method is superior to that of the existing method. Error-injection simulation tests were performed on matrix multiplications of different sizes. FIG. 5 compares the running time of the algorithm of the invention with that of the existing EXABFT algorithm when 10 errors are injected; the execution efficiency of the algorithm of the invention is better than that of the existing EXABFT algorithm.
The above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the invention, and these modifications and variations should also be regarded as falling within the protection scope of the invention.
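The iteration in this example can be reproduced with a small sketch. The error layout below is invented to match the counts in the text (3 faulty rows, 4 faulty columns, 4 real errors) and is not taken from FIG. 3. The rule for a single remaining faulty row or column, which applies equation (3) or (2) to each flagged element, is one reading of "the remaining 2 errors in the same row", not a confirmed detail of the patent. All function names are assumptions.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def mismatches(M, Mc, Mr):
    dr = [sum(row) - r for row, r in zip(M, Mr)]          # Mr'[i] - Mr[i]
    dc = [sum(col) - c for col, c in zip(zip(*M), Mc)]    # Mc'[j] - Mc[j]
    fault_rows = [i for i, d in enumerate(dr) if d != 0]
    fault_cols = [j for j, d in enumerate(dc) if d != 0]
    return dr, dc, fault_rows, fault_cols

def select_fixes(dr, dc, fault_rows, fault_cols):
    fixes, used_cols = [], set()
    for i in fault_rows:                   # equation (1): equal mismatches
        for j in fault_cols:
            if j not in used_cols and dr[i] == dc[j]:
                fixes.append((i, j, dr[i]))
                used_cols.add(j)
                break                      # at most one fix per row per pass
    if fixes:
        return "equation (1)", fixes
    if len(fault_rows) == 1:               # all remaining errors share a row
        i = fault_rows[0]
        return "equation (3)", [(i, j, dc[j]) for j in fault_cols]
    if len(fault_cols) == 1:               # all remaining errors share a column
        j = fault_cols[0]
        return "equation (2)", [(i, j, dr[i]) for i in fault_rows]
    return "none", []

def harden(M, A, B, Mc, Mr, max_passes=10):
    """Correct M in place; return a log of (rule, positions) per pass."""
    log = []
    for _ in range(max_passes):
        dr, dc, fault_rows, fault_cols = mismatches(M, Mc, Mr)
        if not fault_rows and not fault_cols:
            return log
        rule, fixes = select_fixes(dr, dc, fault_rows, fault_cols)
        if not fixes:
            break
        for i, j, delta in fixes:
            M[i][j] -= delta
        log.append((rule, [(i, j) for i, j, _ in fixes]))
    _, _, fault_rows, fault_cols = mismatches(M, Mc, Mr)
    if fault_rows or fault_cols:           # step (3): recompute what is left
        rows = fault_rows or range(len(M))
        cols = fault_cols or range(len(M[0]))
        rest = [(i, j) for i in rows for j in cols]
        for i, j in rest:
            M[i][j] = sum(A[i][k] * B[k][j] for k in range(len(B)))
        log.append(("recompute", rest))
    return log

A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2, 1], [0, 1, 1, 3]]
M = matmul(A, B)
Mr = [sum(row) for row in M]               # stands in for sums from Ac x Br
Mc = [sum(col) for col in zip(*M)]
for (i, j), e in {(0, 0): 5, (1, 1): -3, (2, 2): 7, (2, 3): 11}.items():
    M[i][j] += e                           # inject 4 errors
_, _, fr, fc = mismatches(M, Mc, Mr)
print(len(fr) * len(fc))                   # 12 candidate positions
log = harden(M, A, B, Mc, Mr)
print(log)
```

The first pass corrects the two isolated errors through equation (1), the second pass repairs the two errors left in the same row, and no recomputation is needed for this layout.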

Claims (1)

1. A reinforcing method for a parallel matrix multiplication algorithm, used to reduce the ABFT overhead of matrix multiplication, characterized by comprising the following steps:
(1) first, encoding the inputs and output of the matrix multiplication, checking the calculation result against the encoded values, and storing the list of all possible errors;
(2) preprocessing the error list to eliminate misjudged errors and avoid unnecessary correction, wherein a relative error method is used to eliminate errors, an additional error detection is performed before errors are corrected, and the error list is preprocessed according to whether equation (1) holds:
Mc[j] - Mc'[j] = Mr[i] - Mr'[i]    (1);
if it holds, the intersection is the only error in the i-th row and the j-th column; otherwise, the intersection is a misjudged area;
the remaining errors are then corrected using equation (2) or (3):
Mcorrect[i,j] = Merror[i,j] - (Mr'[i] - Mr[i])    (2);
Mcorrect[i,j] = Merror[i,j] - (Mc'[j] - Mc[j])    (3);
wherein Mc and Mr are the correct column and row sums, and Mc' and Mr' are the column sums and row sums of the M matrix computed after the calculation is finished; if one or more errors are corrected, the error information is updated, and most errors are corrected after multiple iterations;
(3) adopting a recalculation strategy for the remaining errors that the algorithm cannot correct.
CN201810502409.1A 2018-05-23 2018-05-23 Parallel matrix multiplication algorithm reinforcing method Active CN108733628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810502409.1A CN108733628B (en) 2018-05-23 2018-05-23 Parallel matrix multiplication algorithm reinforcing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810502409.1A CN108733628B (en) 2018-05-23 2018-05-23 Parallel matrix multiplication algorithm reinforcing method

Publications (2)

Publication Number Publication Date
CN108733628A CN108733628A (en) 2018-11-02
CN108733628B CN108733628B (en) 2020-01-03

Family

ID=63934982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810502409.1A Active CN108733628B (en) 2018-05-23 2018-05-23 Parallel matrix multiplication algorithm reinforcing method

Country Status (1)

Country Link
CN (1) CN108733628B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FI130137B (en) 2021-04-22 2023-03-09 Univ Of Oulu A method for increase of energy efficiency through leveraging fault tolerant algorithms into undervolted digital systems

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133738A (en) * 2014-07-11 2014-11-05 中国人民解放军信息工程大学 SEU-resistant method for satellite-borne MIMO detector based on SEC-DED

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9362953B2 (en) * 2013-08-02 2016-06-07 Infineon Technologies Ag Efficient error correction of multi-bit errors


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units; Claus Braun et al.; 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks; 2014-09-22; pp. 443-454 *
Algorithm-Based Fault Tolerance for Matrix Operations; Kuang-Hua Huang et al.; IEEE Transactions on Computers; 1984-06-30; vol. C-33, no. 6; pp. 518-528 *
Evaluating error detection mechanisms with a software-implemented fault injection tool; 王建莹 et al.; 《小型微型计算机系统》; 2000-05-31; vol. 21, no. 5; pp. 497-499 *
Fault-tolerant matrix computation and its error detection and correction performance; 隋厚堂 et al.; 《中国空间科学学会空间探测专业委员会第十次学术会议论文集》; 1997-10-31; pp. 128-130 *

Also Published As

Publication number Publication date
CN108733628A (en) 2018-11-02

Similar Documents

Publication Publication Date Title
CN109542668B (en) NAND FLASH memory-based verification method, terminal equipment and storage medium
CN111338840B (en) Space data protection method, storage medium, computer program, system and terminal
US8010875B2 (en) Error correcting code with chip kill capability and power saving enhancement
US9092349B2 (en) Storage of codeword portions
KR20180086816A (en) Memory device and electronic device performing adaptive error correction with pre-checking error rate and method of operating the memory device
US9189327B2 (en) Error-correcting code distribution for memory systems
CN104409103A (en) Novel two-dimensional coding reinforcing method and circuit arrangement for aerospace memory
US10291258B2 (en) Error correcting code for correcting single symbol errors and detecting double bit errors
CN108733628B (en) Parallel matrix multiplication algorithm reinforcing method
US9043683B2 (en) Error protection for integrated circuits
US9041428B2 (en) Placement of storage cells on an integrated circuit
US10108486B2 (en) Error protection
Schöll et al. Low-overhead fault-tolerance for the preconditioned conjugate gradient solver
Silva et al. Extended matrix region selection code: An ECC for adjacent multiple cell upset in memory arrays
US9959166B2 (en) Error correction for non-volatile memory
US8661320B2 (en) Independent orthogonal error correction and detection
Venkataraman et al. Multi-directional error correction schemes for SRAM-based FPGAs
US7904786B2 (en) Assisted memory system
Jang et al. MATE: Memory-and retraining-free error correction for convolutional neural network weights
CN107168817B (en) Data restoration method and device applied to storage array and storage equipment
US20140201599A1 (en) Error protection for integrated circuits in an insensitive direction
US8499224B2 (en) Redundant code generation method and device, data restoration method and device, and raid storage device
Kustov et al. Efficiency Estimation of Single Error Correction, Double Error Detection and Double-Adjacent-Error Correction Codes
Singh et al. Ram error detection and correction using HVD implementation
Hui et al. Optimized software-based hardening strategies for matrix multiplication and fast fourier transform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181102

Assignee: Changzhou Xinsheng Semiconductor Technology Co.,Ltd.

Assignor: CHANGZHOU CAMPUS OF HOHAI University

Contract record no.: X2023980034321

Denomination of invention: A Reinforcement Method for Parallel Matrix Multiplication Algorithm

Granted publication date: 20200103

License type: Common License

Record date: 20230404