CN108733628A - A reinforcement method for a parallel matrix multiplication algorithm - Google Patents

A reinforcement method for a parallel matrix multiplication algorithm

Info

Publication number
CN108733628A
Authority
CN
China
Prior art keywords
mistake
error
matrix multiplication
corrected
reinforcement method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810502409.1A
Other languages
Chinese (zh)
Other versions
CN108733628B (en)
Inventor
王海滨
王杨圣
戴茜茜
惠志坚
叶静
孙洪文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Campus of Hohai University
Original Assignee
Changzhou Campus of Hohai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-05-23
Filing date
2018-05-23
Publication date
2018-11-02
Application filed by Changzhou Campus of Hohai University
Priority to CN201810502409.1A
Publication of CN108733628A
Application granted
Publication of CN108733628B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Detection And Prevention Of Errors In Transmission (AREA)
  • Detection And Correction Of Errors (AREA)

Abstract

The invention discloses a reinforcement method for a parallel matrix multiplication algorithm that reduces the overhead of ABFT (algorithm-based fault tolerance) reinforcement of matrix multiplication. It comprises the following steps: (1) first, the inputs and outputs of the matrix multiplication are encoded, the computed result is verified against the encoded values, and a list of all possible errors is saved; (2) the error list is preprocessed to exclude false positives and avoid unnecessary corrections, where false positives are excluded using a relative error method and an error check is added before each correction; the remaining errors are then corrected. If one or more errors have been corrected, the error list is updated, and most errors can be corrected after several iterations; (3) the remaining errors that the algorithm cannot correct are handled by recomputation. The reinforcement method of the invention improves execution efficiency while increasing system reliability.

Description

A reinforcement method for a parallel matrix multiplication algorithm
Technical field
The present invention relates to a reinforcement technique for parallel matrix multiplication algorithms. It can be applied in any technical field that uses matrix multiplication algorithms, such as image processing and data statistics.
Background art
At present, the parallel computing architecture of the graphics processing unit (GPU) greatly increases the speed of large-scale computation and shows great potential in high-performance computing applications. GPUs are used in many fields, such as image processing, data statistics and other high-performance computing applications, and are becoming increasingly common in modern industry. In recent years, GPU manufacturers such as NVIDIA have been developing GPU computing platforms for automotive driving applications.
High-energy particles can flip bits in memory elements or cause transient voltage pulses in other logic circuits such as computing units. As CMOS process feature sizes continue to shrink, logic circuits become more sensitive to soft errors caused by high-energy particles. Numerous experimental results show that, under high-energy particle strikes, GPUs have a higher error rate than other integrated circuit devices. Note that reliability requirements depend on the application. In some applications GPU reliability is critical: in spacecraft, artificial satellites or autonomous driving, a soft error may lead to extremely serious consequences. In personal entertainment applications such as audio or video, a certain number of soft errors can be tolerated.
Error-correcting code (ECC) is one of the most common reinforcement techniques for memory and can also be applied in GPUs to reduce the soft error rate. However, this scheme incurs a high cost in time, space and power consumption, and only certain series of high-end GPUs are equipped with ECC. Other common reinforcement methods, such as redundancy and checkpointing, mainly rely on recomputation after an error is detected. One redundancy-based technique is TMR (triple modular redundancy), which has been shown experimentally to improve system reliability. However, although TMR handles soft errors effectively, it triples resource consumption, and in some applications resources are limited.
Therefore, we propose an algorithm-based reinforcement technique for matrix multiplication, which improves execution efficiency while increasing system reliability.
Summary of the invention
The object of the invention is to design a matrix multiplication reinforcement method based on the ABFT algorithm that achieves algorithm-level reinforcement while consuming fewer resources; fault-injection simulation results show that this technique executes more efficiently than existing techniques. The invention mainly addresses error correction when multiple errors are randomly distributed. In that case, the traditional checksum verification algorithm flags more error positions than there are actual errors, which causes unnecessary time cost during correction; in most cases only a few of the flagged positions are true errors. To solve this problem, the invention provides a new ABFT-based reinforcement technique that further reduces overhead.
The technical solution of the invention is as follows:
A reinforcement method for a parallel matrix multiplication algorithm, for reducing the ABFT overhead of matrix multiplication and FFT, comprising the following steps:
(1) First, the inputs and outputs of the matrix multiplication are encoded, the computed result is verified against the encoded values, and a list of all possible errors is saved.
(2) The error list is preprocessed to exclude false positives and avoid unnecessary corrections, where false positives are excluded using a relative error method and an error check is added before each correction. The remaining errors are then corrected. If one or more errors have been corrected, the error information is updated; most errors can be corrected after several iterations.
(3) The remaining errors that the algorithm cannot correct are handled by recomputation.
The beneficial effects of the invention are as follows:
The reinforcement method of the invention improves execution efficiency while increasing system reliability.
Description of the drawings
Fig. 1 is a schematic diagram of the specific encoding process;
Fig. 2 is a schematic diagram of a typical fault distribution;
Fig. 3 is an example of randomly distributed errors in Embodiment 1, where black dots are true errors and grey dots are positions regarded as potential errors;
Fig. 4 is an example of the correction process for the randomly distributed errors in Embodiment 1;
Fig. 5 compares the running time of the algorithm of the invention with that of the existing EXABFT algorithm.
Detailed description
The invention is further described below with reference to the accompanying drawings. The following embodiment is only intended to illustrate the technical solution of the invention more clearly and does not limit its protection scope.
(1) First, the inputs and outputs of the matrix multiplication are encoded, the computed result is verified against the encoded values, and all possible error positions are saved. The specific encoding process is shown in Fig. 1;
The column sums of matrix A are computed and appended to the matrix as Ac; likewise, Br holds the row sums of matrix B. Multiplying these yields Mc and Mr, the column sums and row sums of the result matrix M. Because Mc and Mr were never affected by radiation effects across many experiments, we can take Mc and Mr to be the correct column sums and row sums. Mc' and Mr' are the column sums and row sums computed from matrix M after the multiplication has finished. Comparing Mc with Mc' and Mr with Mr' therefore locates the error positions. The integers #Err_row (the number of erroneous rows) and #Err_col (the number of erroneous columns) store how many rows and columns contain errors, and the arrays Faulty_rows (index list of erroneous rows) and Faulty_cols (index list of erroneous columns) contain the indices of those rows and columns. The intersections of erroneous rows and erroneous columns are the positions of computation errors;
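The encoding and detection step described above can be sketched in plain Python. This is a minimal illustration under assumptions, not the patented implementation: the function names `encode_and_multiply` and `detect`, the list-of-rows matrix layout and the tolerance values are hypothetical, and everything runs on the host rather than on a GPU.

```python
import math

def encode_and_multiply(A, B):
    """Return the product M = A*B together with reference checksums Mc and Mr."""
    Ac = [sum(col) for col in zip(*A)]   # column sums of A (checksum row)
    Br = [sum(row) for row in B]         # row sums of B (checksum column)
    cols_B = list(zip(*B))
    M = [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]
    Mc = [sum(a * b for a, b in zip(Ac, col)) for col in cols_B]  # Ac*B: column sums of M
    Mr = [sum(a * b for a, b in zip(row, Br)) for row in A]       # A*Br: row sums of M
    return M, Mc, Mr

def detect(M, Mc, Mr, rel_tol=1e-9, abs_tol=1e-12):
    """Recompute Mc' and Mr' from M and list the rows and columns that deviate."""
    Mc_p = [sum(col) for col in zip(*M)]
    Mr_p = [sum(row) for row in M]
    faulty_rows = [i for i, (x, y) in enumerate(zip(Mr, Mr_p))
                   if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)]
    faulty_cols = [j for j, (x, y) in enumerate(zip(Mc, Mc_p))
                   if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)]
    return faulty_rows, faulty_cols

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
M, Mc, Mr = encode_and_multiply(A, B)
M[0][1] += 5                                   # simulate a soft error in one element
faulty_rows, faulty_cols = detect(M, Mc, Mr)
# Every intersection of a faulty row and a faulty column is a candidate error.
candidates = [(i, j) for i in faulty_rows for j in faulty_cols]
```

In this sketch `len(faulty_rows)` and `len(faulty_cols)` play the role of #Err_row and #Err_col.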
(2) The error list is preprocessed to exclude false positives and avoid unnecessary corrections. False positives are excluded using a relative error method. A typical fault distribution is shown in Fig. 2, where the grey area is the false-positive region. Before an error is corrected, an error check is added that tests whether equation (1) holds. If it does, the intersection is the only error in row i and column j; otherwise it lies in the false-positive region.
Mc[j] - Mc'[j] = Mr[i] - Mr'[i]    (1)
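As a hedged illustration of the check in equation (1), the sketch below tests each candidate intersection against the checksum deviations. The function name `is_single_error` and the tolerances are assumptions; comparing with a relative tolerance is one possible reading of the "relative error method", not a detail confirmed by the text.

```python
import math

def is_single_error(i, j, Mc, Mc_p, Mr, Mr_p, rel_tol=1e-9, abs_tol=1e-12):
    """Equation (1): the deviation of column j equals the deviation of row i."""
    dc = Mc[j] - Mc_p[j]
    dr = Mr[i] - Mr_p[i]
    return math.isclose(dc, dr, rel_tol=rel_tol, abs_tol=abs_tol)

# Reference sums of a 3x3 result and the sums recomputed after two injected
# errors: +10 at position (0, 0) and +3 at position (1, 2).
Mr, Mc = [6, 15, 24], [12, 15, 18]
Mr_p, Mc_p = [16, 18, 24], [22, 15, 21]
faulty_rows, faulty_cols = [0, 1], [0, 2]

# Four candidate intersections, of which only two satisfy equation (1).
real = [(i, j) for i in faulty_rows for j in faulty_cols
        if is_single_error(i, j, Mc, Mc_p, Mr, Mr_p)]
```

Here `real` keeps the two injected positions and drops the two false positives (0, 2) and (1, 0).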
The remaining real errors are corrected using equation (2) or (3). If one or more errors have been corrected, the error information Faulty_rows (index list of erroneous rows) and Faulty_cols (index list of erroneous columns) is updated, and encoding and error detection are performed again; most errors can be corrected after several iterations.
M_correct[i, j] = M_error[i, j] - (Mr'[i] - Mr[i])    (2)
M_correct[i, j] = M_error[i, j] - (Mc'[j] - Mc[j])    (3)
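The iterative correction can be sketched as follows. This is an assumption-laden illustration rather than the patented implementation: `correct_iteratively`, `close`, `max_iters` and the tolerances are hypothetical names and values, and the sketch applies equation (2) only where equation (1) holds.

```python
import math

def close(x, y):
    return math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)

def faulty_indices(M, Mc, Mr):
    """Return faulty rows, faulty columns and the recomputed sums Mc', Mr'."""
    Mc_p = [sum(col) for col in zip(*M)]
    Mr_p = [sum(row) for row in M]
    rows = [i for i in range(len(Mr)) if not close(Mr[i], Mr_p[i])]
    cols = [j for j in range(len(Mc)) if not close(Mc[j], Mc_p[j])]
    return rows, cols, Mc_p, Mr_p

def correct_iteratively(M, Mc, Mr, max_iters=10):
    """Correct M in place; return the faulty rows and columns that remain."""
    for _ in range(max_iters):
        rows, cols, Mc_p, Mr_p = faulty_indices(M, Mc, Mr)
        if not rows and not cols:
            return [], []
        progress = False
        for i in rows:
            for j in cols:
                dr = Mr_p[i] - Mr[i]
                dc = Mc_p[j] - Mc[j]
                if not close(dr, 0) and close(dr, dc):   # equation (1)
                    M[i][j] -= dr                        # equation (2)
                    Mr_p[i] -= dr                        # keep the sums in sync
                    Mc_p[j] -= dr
                    progress = True
        if not progress:
            return rows, cols                            # left for recomputation
    rows, cols, _, _ = faulty_indices(M, Mc, Mr)
    return rows, cols

truth = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Mr = [sum(row) for row in truth]
Mc = [sum(col) for col in zip(*truth)]
M = [row[:] for row in truth]
M[0][0] += 10                                            # two injected errors
M[1][2] += 3
remaining = correct_iteratively(M, Mc, Mr)
```

One caveat of this sketch: equation (1) can also hold by coincidence, for example when two errors in different rows and columns have exactly the same magnitude, and the sketch does not guard against that case.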
(3) The remaining errors that cannot be corrected using equation (2) or (3) are handled by recomputing those elements.
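The recomputation fallback amounts to one dot product per leftover element. A minimal sketch, assuming a hypothetical helper `recompute` and a list of leftover positions supplied by the earlier steps:

```python
def recompute(M, A, B, positions):
    """Overwrite M[i][j] with the dot product of row i of A and column j of B."""
    for i, j in positions:
        M[i][j] = sum(A[i][k] * B[k][j] for k in range(len(B)))
    return M

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
# A 2x2 block of errors: every faulty row and column holds more than one
# error, so no intersection can be isolated from the checksums alone.
M = [[20, 25], [40, 51]]
recompute(M, A, B, [(0, 0), (0, 1), (1, 0), (1, 1)])
```

Only the listed elements are recomputed, so the cost grows with the number of leftover positions rather than with the size of the matrix.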
Embodiment 1:
As shown in Fig. 3, the errors are randomly distributed. The checksum verification detects deviations in the encoding of 3 rows and 4 columns, giving 12 possible error points, although there are actually only 4 errors. Using formula (1), positions 3 and 4 are identified as single errors, and these are corrected first. Once they have been corrected using equation (2) or equation (3), Faulty_rows and Faulty_cols are updated; the remaining 2 errors lie in the same row and can therefore be corrected with very few operations. The correction process for this example is shown in Fig. 4. In this example, the proposed scheme avoids unnecessary correction of "false" errors and reduces the total number of potential errors within very few iterations.
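A scenario with the same shape as this embodiment (4 errors spread over 3 rows and 4 columns) can be traced with the sketch below. The matrix values, error positions and magnitudes are made up for illustration, and the rule used in the second pass (when a single faulty row remains, fix each faulty column with equation (3)) is an assumption about how the two leftover errors are handled, not a detail stated in the text.

```python
import math

def close(x, y):
    return math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)

def sums(M):
    """Return (column sums, row sums) of a matrix stored as a list of rows."""
    return [sum(c) for c in zip(*M)], [sum(r) for r in M]

def faulty(ref, got):
    """Indices where a recomputed checksum deviates from its reference."""
    return [k for k, (x, y) in enumerate(zip(ref, got)) if not close(x, y)]

def correction_pass(M, Mc, Mr):
    """Run one correction pass in place; return the positions it corrected."""
    Mc_p, Mr_p = sums(M)
    rows, cols = faulty(Mr, Mr_p), faulty(Mc, Mc_p)
    fixed = []
    if len(rows) == 1:                       # assumed rule: one faulty row left
        i = rows[0]
        for j in cols:
            M[i][j] -= Mc_p[j] - Mc[j]       # equation (3)
            fixed.append((i, j))
        return fixed
    for i in rows:
        for j in cols:
            dr, dc = Mr_p[i] - Mr[i], Mc_p[j] - Mc[j]
            if not close(dr, 0) and close(dr, dc):   # equation (1)
                M[i][j] -= dr                        # equation (2)
                Mr_p[i] -= dr
                Mc_p[j] -= dr
                fixed.append((i, j))
    return fixed

truth = [[r * 5 + c for c in range(5)] for r in range(5)]
Mc, Mr = sums(truth)
M = [row[:] for row in truth]
for (i, j), e in {(0, 1): 5, (2, 3): 7, (1, 0): 2, (1, 2): 11}.items():
    M[i][j] += e                             # inject the 4 errors

Mc_p, Mr_p = sums(M)
n_candidates = len(faulty(Mr, Mr_p)) * len(faulty(Mc, Mc_p))
first = correction_pass(M, Mc, Mr)           # single errors found via equation (1)
second = correction_pass(M, Mc, Mr)          # the two errors sharing row 1
```

The first pass touches only the two isolated errors out of 12 candidates; the second pass repairs the two errors that share a row, after which the matrix matches the original.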
In fault-injection simulation, this method is more efficient than existing methods. We carried out fault-injection tests on matrix multiplications of different sizes. Fig. 5 compares the running time of this algorithm with that of the existing EXABFT algorithm when 10 errors are injected; it can be seen that the algorithm of the invention executes more efficiently than the existing EXABFT algorithm.
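The text does not describe the fault-injection tool itself. As a purely hypothetical sketch of how 10 random errors might be injected into a result matrix for such a test (the helper `inject_errors`, the matrix size, the seed and the error magnitudes are all assumptions):

```python
import random

def inject_errors(M, k, rng, max_delta=100):
    """Add a nonzero integer offset to k distinct, randomly chosen elements of M."""
    n_rows, n_cols = len(M), len(M[0])
    flat = rng.sample(range(n_rows * n_cols), k)        # k distinct flat indices
    positions = sorted(divmod(p, n_cols) for p in flat)  # (row, column) pairs
    for i, j in positions:
        M[i][j] += rng.randint(1, max_delta)
    return positions

rng = random.Random(0)                                   # fixed seed for repeatability
clean = [[r * 8 + c for c in range(8)] for r in range(8)]
M = [row[:] for row in clean]
positions = inject_errors(M, 10, rng)
changed = sorted((i, j) for i in range(8) for j in range(8)
                 if M[i][j] != clean[i][j])
```

The returned positions give the ground truth against which a detection and correction routine can be checked; the sketch does not reproduce the timing comparison of Fig. 5.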
The above is only a preferred embodiment of the invention. It should be noted that a person of ordinary skill in the art can make several improvements and modifications without departing from the technical principles of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims (1)

1. A reinforcement method for a parallel matrix multiplication algorithm, for reducing the ABFT overhead of matrix multiplication, characterized by comprising the following steps:
(1) first, the inputs and outputs of the matrix multiplication are encoded, the computed result is verified against the encoded values, and a list of all possible errors is saved;
(2) the error list is preprocessed to exclude false positives and avoid unnecessary corrections, where false positives are excluded using a relative error method and an error check is added before each correction; the remaining errors are then corrected; if one or more errors have been corrected, the error information is updated, and most errors are corrected after several iterations;
(3) the remaining errors that the algorithm cannot correct are handled by recomputation.
CN201810502409.1A, priority date 2018-05-23, filing date 2018-05-23: Parallel matrix multiplication algorithm reinforcing method, Active, CN108733628B (en)

Priority Applications (1)

Application Number    Priority Date    Filing Date    Title
CN201810502409.1A (CN108733628B, en)    2018-05-23    2018-05-23    Parallel matrix multiplication algorithm reinforcing method

Applications Claiming Priority (1)

Application Number    Priority Date    Filing Date    Title
CN201810502409.1A (CN108733628B, en)    2018-05-23    2018-05-23    Parallel matrix multiplication algorithm reinforcing method

Publications (2)

Publication Number    Publication Date
CN108733628A (en)    2018-11-02
CN108733628B (en)    2020-01-03

Family

ID=63934982

Family Applications (1)

Application Number    Status    Publication    Priority Date    Filing Date    Title
CN201810502409.1A    Active    CN108733628B (en)    2018-05-23    2018-05-23    Parallel matrix multiplication algorithm reinforcing method

Country Status (1)

Country Link
CN (1) CN108733628B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022223881A1 (en) 2021-04-22 2022-10-27 University Of Oulu A method for increase of energy efficiency through leveraging fault tolerant algorithms into undervolted digital systems

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133738A (en) * 2014-07-11 2014-11-05 中国人民解放军信息工程大学 SEU-resistant method for satellite-borne MIMO detector based on SEC-DED
CN104348588A (en) * 2013-08-02 2015-02-11 英飞凌科技股份有限公司 Efficient Error Correction of Multi-Bit Errors

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CLAUS BRAUN et al.: "A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units", 《2014 44TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS》 *
KUANG-HUA HUANG et al.: "Algorithm-Based Fault Tolerance for Matrix Operations", 《IEEE TRANSACTIONS ON COMPUTERS》 *
王建莹 et al.: "用软件实现的故障注入工具评估错误检测机制", 《小型微型计算机系统》 *
隋厚堂 et al.: "矩阵的容错计算和纠查错性能", 《中国空间科学学会空间探测专业委员会第十次学术会议论文集》 *

Also Published As

Publication number Publication date
CN108733628B (en) 2020-01-03

Legal Events

Code    Title / Description
PB01    Publication
SE01    Entry into force of request for substantive examination
GR01    Patent grant
EE01    Entry into force of recordation of patent licensing contract

Application publication date: 20181102

Assignee: Changzhou Xinsheng Semiconductor Technology Co.,Ltd.

Assignor: CHANGZHOU CAMPUS OF HOHAI University

Contract record no.: X2023980034321

Denomination of invention: A Reinforcement Method for Parallel Matrix Multiplication Algorithm

Granted publication date: 20200103

License type: Common License

Record date: 20230404