CN108733628A - A hardening method for a parallel matrix multiplication algorithm - Google Patents
- Publication number
- CN108733628A (application CN201810502409.1A)
- Authority
- CN
- China
- Prior art keywords
- mistake
- error
- matrix multiplication
- corrected
- reinforcement means
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Theoretical Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Detection And Correction Of Errors (AREA)
- Detection And Prevention Of Errors In Transmission (AREA)
Abstract
The invention discloses a hardening method for a parallel matrix multiplication algorithm that reduces the overhead of algorithm-based fault tolerance (ABFT) for matrix multiplication. The method comprises the following steps: (1) the inputs and outputs of the matrix multiplication are first encoded, the computed result is verified against the encoded values, and a list of all possible errors is saved; (2) the error list is preprocessed to exclude misjudged errors and avoid unnecessary corrections, where errors are excluded using a relative error method and an error check is added before any correction; the remaining errors are then corrected. If one or more errors have been corrected, the error list is updated, and most errors can be corrected after several iterations; (3) the remaining errors that the algorithm cannot correct are handled by recomputation. The hardening method of the invention improves execution efficiency while improving system reliability.
Description
Technical field
The present invention relates to a hardening technique for parallel matrix multiplication algorithms and can be applied in the various technical fields that use matrix multiplication, such as image processing and data statistics.
Background art
At present, the parallel computing architecture of the graphics processing unit (GPU) greatly increases the speed of large-scale computation and shows great potential in high-performance computing applications. GPUs are used in many fields, such as image processing, data statistics and other high-performance computing applications, and are increasingly popular in modern industry. In recent years, GPU manufacturers such as NVIDIA have been developing GPU computing platforms for automotive driving applications.
High-energy particles can flip bits in memory elements or cause transient voltage pulses in other logic circuits such as computing units. As CMOS process dimensions continue to shrink, logic circuits become more sensitive to soft errors caused by high-energy particles. Numerous experimental results show that, under high-energy particle strikes, GPUs have a higher error rate than other integrated circuit devices. It should be noted that reliability requirements depend on the application. GPU reliability is critical in some applications: in spacecraft, artificial satellites or autonomous driving, for example, a soft error may lead to extremely serious consequences. In personal entertainment applications such as audio or video, by contrast, a certain number of soft errors can be tolerated.
Error-correcting code (ECC) is one of the most common hardening techniques for memory and can also be applied in GPUs to reduce the soft error rate. However, this scheme incurs a high cost in time, space and power consumption, and only certain series of high-end GPUs are equipped with ECC. Other common hardening methods, such as redundancy and checkpointing, mainly rely on recomputation after an error is detected. One redundancy-based hardening technique is triple modular redundancy (TMR), which has been shown experimentally to improve system reliability. However, although TMR can effectively address soft errors, it triples resource consumption, and in some applications resources are limited.
Therefore, we propose an algorithm-based hardening technique for matrix multiplication that improves execution efficiency while improving system reliability.
Summary of the invention
The object of the invention is to design a matrix multiplication hardening method based on the ABFT algorithm that achieves algorithm hardening with fewer resources; fault injection simulation results show that this technique has higher execution efficiency than existing techniques. The invention mainly concerns error correction when multiple errors are randomly distributed. In that case, the conventional checksum verification algorithm flags more error positions than there are actual errors, which causes unnecessary time overhead during error correction, yet in most cases only a few of the flagged positions are true errors. To solve this problem, the invention provides a new hardening technique based on the ABFT algorithm that further reduces the overhead.
The technical solution of the invention is as follows:
A hardening method for a parallel matrix multiplication algorithm, for reducing the ABFT overhead of matrix multiplication and FFT, comprising the following steps:
(1) The inputs and outputs of the matrix multiplication are first encoded; the computed result is verified against the encoded values and a list of all possible errors is saved.
(2) The error list is preprocessed to exclude misjudged errors and avoid unnecessary corrections. Errors are excluded using a relative error method, and an error check is added before any correction. The remaining errors are then corrected. If one or more errors have been corrected, the error information is updated; most errors can be corrected after several iterations.
(3) The remaining errors that the algorithm cannot correct are handled by recomputation.
The beneficial effects of the invention are as follows:
The hardening method of the invention improves execution efficiency while improving system reliability.
Brief description of the drawings
Fig. 1 is a schematic diagram of the encoding process;
Fig. 2 is a schematic diagram of a typical error distribution;
Fig. 3 is an example of randomly distributed errors in Embodiment 1, where black dots are true errors and grey dots are positions regarded as potential errors;
Fig. 4 is an example of the correction process for the randomly distributed errors in Embodiment 1;
Fig. 5 compares the running time of the algorithm of the invention with that of the existing EXABFT algorithm.
Detailed description
The invention is further described below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution of the invention more clearly and are not intended to limit its scope of protection.
(1) The inputs and outputs of the matrix multiplication are first encoded; the computed result is verified against the encoded values and all possible error positions are saved. The encoding process is shown in Fig. 1.
The sum of each column of matrix A is computed and appended to the matrix to form Ac; likewise, Br is matrix B with the sum of each row appended. Multiplying the encoded matrices yields Mc and Mr, the column sums and row sums of the result matrix M. Because Mc and Mr were never affected by radiation effects in repeated experiments, they are taken to be the correct column sums and row sums. Mc' and Mr' are the column sums and row sums computed from M after the multiplication has finished. Comparing Mc with Mc' and Mr with Mr' therefore locates the erroneous positions. The integers #Err_row (number of erroneous rows) and #Err_col (number of erroneous columns) store the numbers of rows and columns that contain errors, and the arrays Faulty_rows (index list of erroneous rows) and Faulty_cols (index list of erroneous columns) hold the indices of those rows and columns. The intersections of erroneous rows and erroneous columns are the positions where a computation error may have occurred.
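The encoding and detection of step (1) can be sketched in a few lines of plain Python. This is only an illustrative, sequential sketch and not the patented GPU implementation: the helper names `matmul`, `encode` and `detect`, the small integer matrices and the injected fault are all assumptions made for the example, and exact integer comparison stands in for whatever tolerance a floating-point version would need.

```python
def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def encode(A, B):
    """Append a column-sum row to A (Ac) and a row-sum column to B (Br)."""
    Ac = [row[:] for row in A] + [[sum(col) for col in zip(*A)]]
    Br = [row[:] + [sum(row)] for row in B]
    return Ac, Br

def detect(M, Mc, Mr):
    """Compare the reference checksums with checksums recomputed from M."""
    Mc_p = [sum(col) for col in zip(*M)]   # Mc': column sums of the computed M
    Mr_p = [sum(row) for row in M]         # Mr': row sums of the computed M
    faulty_rows = [i for i in range(len(M)) if Mr[i] != Mr_p[i]]
    faulty_cols = [j for j in range(len(M[0])) if Mc[j] != Mc_p[j]]
    return faulty_rows, faulty_cols

A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2], [0, 1, 3]]
Ac, Br = encode(A, B)
full = matmul(Ac, Br)                  # product of the encoded matrices
M = [row[:-1] for row in full[:-1]]    # data part of the result
Mc = full[-1][:-1]                     # reference column sums of M
Mr = [row[-1] for row in full[:-1]]    # reference row sums of M

M[1][2] += 7                           # hypothetical fault injected into M
faulty_rows, faulty_cols = detect(M, Mc, Mr)
candidates = [(i, j) for i in faulty_rows for j in faulty_cols]
print(candidates)                      # [(1, 2)]
```

Here `len(faulty_rows)` and `len(faulty_cols)` play the role of #Err_row and #Err_col, and `candidates` is the list of row and column intersections described above.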
(2) The error list is preprocessed to exclude misjudged errors and avoid unnecessary corrections. Errors are excluded using a relative error method. A typical error distribution is shown in Fig. 2, where the grey regions are misjudged regions. Before an error is corrected, an error check is performed to test whether equation (1) holds. If it does, the intersection is the only error in row i and column j; otherwise it is a misjudged region.
Mc[j] - Mc'[j] = Mr[i] - Mr'[i]    (1)
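A minimal sketch of the check in equation (1) follows. The text does not spell out the "relative error method", so comparing the two discrepancies with a relative tolerance via `math.isclose` is an assumption here, as are the function name `is_single_error`, the default `rel_tol` and the made-up checksum values.

```python
import math

def is_single_error(i, j, Mc, Mc_p, Mr, Mr_p, rel_tol=1e-9):
    """Equation (1): column-j and row-i checksum discrepancies agree."""
    d_col = Mc[j] - Mc_p[j]
    d_row = Mr[i] - Mr_p[i]
    return math.isclose(d_col, d_row, rel_tol=rel_tol)

# Hypothetical checksums: true errors of +5 at (0, 0) and -3 at (2, 1).
Mc, Mr = [10, 20, 30], [12, 18, 30]        # reference column / row sums
Mc_p, Mr_p = [15, 17, 30], [17, 18, 27]    # sums recomputed from faulty M
faulty_rows, faulty_cols = [0, 2], [0, 1]

candidates = [(i, j) for i in faulty_rows for j in faulty_cols]
singles = [(i, j) for i, j in candidates
           if is_single_error(i, j, Mc, Mc_p, Mr, Mr_p)]
print(singles)   # [(0, 0), (2, 1)]
```

Of the four flagged intersections, only the two true errors pass the check; the other two are the misjudged positions that the pre-check is meant to filter out.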
The remaining true errors are corrected using equation (2) or (3). If one or more errors have been corrected, the error information Faulty_rows (index list of erroneous rows) and Faulty_cols (index list of erroneous columns) is updated, and encoding and error detection are performed again. Most errors can be corrected after several iterations.
Mcorrect[i, j] = Merror[i, j] - (Mr'[i] - Mr[i])    (2)
Mcorrect[i, j] = Merror[i, j] - (Mc'[j] - Mc[j])    (3)
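Equations (2) and (3) translate directly into code. The sketch below uses hypothetical helper names (`correct_by_row`, `correct_by_col`) and a made-up 2 x 2 example with one faulty element, and shows that both equations recover the same value when the element is the only error in its row and column.

```python
def correct_by_row(M, i, j, Mr, Mr_p):
    """Equation (2): subtract the row-i checksum discrepancy from M[i][j]."""
    M[i][j] -= Mr_p[i] - Mr[i]

def correct_by_col(M, i, j, Mc, Mc_p):
    """Equation (3): subtract the column-j checksum discrepancy from M[i][j]."""
    M[i][j] -= Mc_p[j] - Mc[j]

# The fault-free result would be [[1, 2], [3, 4]]; element (1, 0) is off by +10.
Mc, Mr = [4, 6], [3, 7]                    # reference column / row sums
M_row = [[1, 2], [13, 4]]
M_col = [[1, 2], [13, 4]]
Mc_p = [sum(col) for col in zip(*M_row)]   # recomputed column sums
Mr_p = [sum(row) for row in M_row]         # recomputed row sums

correct_by_row(M_row, 1, 0, Mr, Mr_p)
correct_by_col(M_col, 1, 0, Mc, Mc_p)
print(M_row, M_col)   # [[1, 2], [3, 4]] [[1, 2], [3, 4]]
```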
(3) The remaining errors that cannot be corrected using equation (2) or (3) are handled by recomputing those elements.
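Recomputation in step (3) only needs to touch the elements that are still flagged. As a rough sketch, with the function name `recompute` and the example data being assumptions, each leftover element is rebuilt as the dot product of a row of A and a column of B:

```python
def recompute(M, A, B, positions):
    """Recompute only the listed elements of M = A x B, in place."""
    for i, j in positions:
        M[i][j] = sum(A[i][k] * B[k][j] for k in range(len(B)))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
M = [[0, 99], [-1, 50]]            # three corrupted elements, one intact
recompute(M, A, B, [(0, 0), (0, 1), (1, 0)])
print(M)                           # [[19, 22], [43, 50]]
```

Each recomputed element costs one inner product, so the fewer positions that reach this step, the lower the overhead compared with recomputing the whole matrix.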
Embodiment 1:
As shown in Fig. 3, the errors are randomly distributed. The checksum verification detects checksum deviations in 3 rows and 4 columns, giving 12 possible error positions, although there are actually only 4 errors. The check with equation (1) identifies positions 3 and 4 as single errors, and these are corrected first. Once they have been corrected using equation (2) or equation (3), Faulty_rows and Faulty_cols are updated. The remaining 2 errors are in the same row and can therefore be corrected with very few operations. The correction process for this example is shown in Fig. 4. In this example, the proposed scheme avoids unnecessary corrections of "false" errors and reduces the total number of potential errors within very few iterations.
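A scenario with the same shape as Embodiment 1 can be replayed with a short script. Since Fig. 3 is not reproduced here, the matrix, the error positions and the error magnitudes below are invented; only the counts (3 rows, 4 columns, 12 candidates, 4 true errors) follow the text. The rule used for the last two errors, namely applying equation (3) column by column once a single faulty row remains, is one plausible reading of "can be corrected with very few operations" and not something the text states explicitly. Integer data keeps the comparisons exact; two errors of equal magnitude could still fool equation (1), which this sketch does not guard against.

```python
def checksums(M):
    """Return (column sums, row sums) of M."""
    return [sum(col) for col in zip(*M)], [sum(row) for row in M]

def harden(M, Mc, Mr):
    """Correct M in place; return (positions left for recomputation, history)."""
    history = []                      # number of candidate positions per pass
    while True:
        Mc_p, Mr_p = checksums(M)     # Mc', Mr' recomputed from the current M
        rows = [i for i in range(len(Mr)) if Mr[i] != Mr_p[i]]   # Faulty_rows
        cols = [j for j in range(len(Mc)) if Mc[j] != Mc_p[j]]   # Faulty_cols
        candidates = [(i, j) for i in rows for j in cols]
        if not candidates:
            return [], history        # no row/column intersection is flagged
        history.append(len(candidates))
        # Pre-check with equation (1); accept at most one hit per row and column.
        singles, used_rows, used_cols = [], set(), set()
        for i, j in candidates:
            if i in used_rows or j in used_cols:
                continue
            if Mc[j] - Mc_p[j] == Mr[i] - Mr_p[i]:
                singles.append((i, j))
                used_rows.add(i)
                used_cols.add(j)
        if singles:
            for i, j in singles:
                M[i][j] -= Mr_p[i] - Mr[i]            # equation (2)
        elif len(rows) == 1:
            # Assumption: with one faulty row left, each flagged column holds
            # exactly one error, so equation (3) applies column by column.
            for j in cols:
                M[rows[0]][j] -= Mc_p[j] - Mc[j]      # equation (3)
        elif len(cols) == 1:
            # Symmetric assumption for a single remaining faulty column.
            for i in rows:
                M[i][cols[0]] -= Mr_p[i] - Mr[i]      # equation (2)
        else:
            return candidates, history                # step (3): recompute

M0 = [[5 * r + c for c in range(5)] for r in range(4)]   # fault-free result
Mc, Mr = checksums(M0)                                   # reference checksums
M = [row[:] for row in M0]
for (i, j), delta in {(0, 0): 3, (0, 2): 7, (1, 1): 11, (3, 4): 13}.items():
    M[i][j] += delta                                     # four injected errors

leftover, history = harden(M, Mc, Mr)
print(history, leftover, M == M0)   # [12, 2] [] True
```

The first pass sees 12 candidates and corrects the two isolated errors; the second pass sees only the 2 errors that share a row and removes them, so nothing is left for recomputation in this particular case.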
In fault injection simulations, this method is more efficient than the existing method. We carried out fault injection simulation tests on matrix multiplications of different sizes. Fig. 5 compares the running time of this algorithm with that of the existing EXABFT algorithm when 10 errors are injected; it can be seen that the algorithm of the invention executes more efficiently than the existing EXABFT algorithm.
The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art may make several improvements and modifications without departing from the technical principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.
Claims (1)
1. A hardening method for a parallel matrix multiplication algorithm, for reducing the ABFT overhead of matrix multiplication, characterized in that it comprises the following steps:
(1) first encoding the inputs and outputs of the matrix multiplication, verifying the computed result against the encoded values and saving a list of all possible errors;
(2) preprocessing the error list to exclude misjudged errors and avoid unnecessary corrections, wherein errors are excluded using a relative error method and an error check is added before any correction; then correcting the remaining errors; if one or more errors have been corrected, updating the error information, most errors being corrected after several iterations;
(3) handling the remaining errors that the algorithm cannot correct by recomputation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810502409.1A CN108733628B (en) | 2018-05-23 | 2018-05-23 | Parallel matrix multiplication algorithm reinforcing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810502409.1A CN108733628B (en) | 2018-05-23 | 2018-05-23 | Parallel matrix multiplication algorithm reinforcing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108733628A true CN108733628A (en) | 2018-11-02 |
CN108733628B CN108733628B (en) | 2020-01-03 |
Family
ID=63934982
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810502409.1A Active CN108733628B (en) | 2018-05-23 | 2018-05-23 | Parallel matrix multiplication algorithm reinforcing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108733628B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022223881A1 (en) | 2021-04-22 | 2022-10-27 | University Of Oulu | A method for increase of energy efficiency through leveraging fault tolerant algorithms into undervolted digital systems |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104133738A (en) * | 2014-07-11 | 2014-11-05 | PLA Information Engineering University | SEU-resistant method for satellite-borne MIMO detector based on SEC-DED |
CN104348588A (en) * | 2013-08-02 | 2015-02-11 | Infineon Technologies AG | Efficient Error Correction of Multi-Bit Errors |
Non-Patent Citations (4)
Title |
---|
CLAUS BRAUN et al.: "A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units", 《2014 44TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS》 * |
KUANG-HUA HUANG et al.: "Algorithm-Based Fault Tolerance for Matrix Operations", 《IEEE TRANSACTIONS ON COMPUTERS》 * |
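WANG JIANYING et al.: "Evaluating error detection mechanisms with a software-implemented fault injection tool", 《Journal of Chinese Computer Systems》 * |
SUI HOUTANG et al.: "Fault-tolerant matrix computation and error detection and correction performance", 《Proceedings of the 10th Academic Conference of the Space Exploration Committee of the Chinese Society of Space Science》 * |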
Also Published As
Publication number | Publication date |
---|---|
CN108733628B (en) | 2020-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9800271B2 (en) | Error correction and decoding | |
US9092349B2 (en) | Storage of codeword portions | |
CN107436821B (en) | Apparatus and method for generating error codes for blocks comprising a plurality of data bits and address bits | |
CN106708655B (en) | Memory reinforcing method and circuit based on two-dimensional error correcting code | |
CN111338840B (en) | Space data protection method, storage medium, computer program, system and terminal | |
CN103890732B (en) | Numeric error is corrected | |
CN105320579B (en) | Towards the selfreparing dual redundant streamline and fault-tolerance approach of SPARC V8 processors | |
US9189327B2 (en) | Error-correcting code distribution for memory systems | |
Rahman et al. | Soft error tolerance using horizontal-vertical-double-bit diagonal parity method | |
US10108486B2 (en) | Error protection | |
Schöll et al. | Low-overhead fault-tolerance for the preconditioned conjugate gradient solver | |
CN108733628A (en) | A kind of reinforcement means of parallel matrix multiplication algorithm | |
US20090249174A1 (en) | Fault Tolerant Self-Correcting Non-Glitching Low Power Circuit for Static and Dynamic Data Storage | |
CN105320575A (en) | Self-checking and recovering device and method for dual-modular redundancy assembly lines | |
CN104378120B (en) | A kind of Hsiao coding checkout matrix generating methods detected for continuous N BU | |
CN109753369A (en) | The data encoding and method of calibration of sequence array in a kind of register and memory | |
Pereira-Santos et al. | Exploring redundancy granularities to repair real-time FPGA-based systems | |
CN205193785U (en) | Self -check and recovery device of duplication redundancy assembly line | |
CN107168817B (en) | Data restoration method and device applied to storage array and storage equipment | |
Loh et al. | Fault tolerance through invariant checking for iterative solvers | |
TWI503833B (en) | A method of detecting and correcting errors with bch engines for flash storage system | |
Kustov et al. | Efficiency Estimation of Single Error Correction, Double Error Detection and Double-Adjacent-Error Correction Codes | |
Singh et al. | Ram error detection and correction using HVD implementation | |
TWI501083B (en) | A method of detecting and correcting errors with bch and ldpc engines for flash storage system | |
CN109271282B (en) | Single-particle multi-dislocation autonomous repair triple-redundancy assembly line and design method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | ||
Application publication date: 20181102 Assignee: Changzhou Xinsheng Semiconductor Technology Co.,Ltd. Assignor: CHANGZHOU CAMPUS OF HOHAI University Contract record no.: X2023980034321 Denomination of invention: A Reinforcement Method for Parallel Matrix Multiplication Algorithm Granted publication date: 20200103 License type: Common License Record date: 20230404 |