CN108733628A

CN108733628A - A hardening method for a parallel matrix multiplication algorithm

Info

Publication number: CN108733628A
Application number: CN201810502409.1A
Authority: CN
Inventors: 王海滨; 王杨圣; 戴茜茜; 惠志坚; 叶静; 孙洪文
Original assignee: Changzhou Campus of Hohai University
Current assignee: Changzhou Campus of Hohai University
Priority date: 2018-05-23
Filing date: 2018-05-23
Publication date: 2018-11-02
Anticipated expiration: 2038-05-23
Also published as: CN108733628B

Abstract

The invention discloses a hardening method for a parallel matrix multiplication algorithm that reduces the overhead of algorithm-based fault tolerance (ABFT) for matrix multiplication. It comprises the following steps: (1) the inputs and outputs of the matrix multiplication are first encoded, the computed result is verified against the encoded values, and a list of all possible errors is saved; (2) the error list is preprocessed to exclude misjudged errors and avoid unnecessary corrections, where errors are excluded using a relative error method and an additional error check is performed before correction; the remaining errors are then corrected. If one or more errors have been corrected, the error list is updated, and most errors can be corrected after several iterations; (3) the remaining errors that the algorithm cannot correct are handled by recomputation. The hardening method of the invention improves execution efficiency while increasing system reliability.

Description

A hardening method for a parallel matrix multiplication algorithm

Technical field

The present invention relates to a hardening technique for parallel matrix multiplication algorithms, applicable to technical fields that use matrix multiplication, such as image processing and data statistics.

Background

Currently, the parallel computing architecture of the graphics processing unit (GPU) greatly increases the speed of large-scale computation and shows great potential in high-performance computing applications. GPUs are used in many fields, such as image processing, data statistics, and other high-performance computing applications, and are increasingly popular in modern industry. In recent years, GPU manufacturers such as NVIDIA have been developing GPU computing platforms for automotive driving applications.

High-energy particles can cause bit flips in memory elements, or transient voltage pulses in other logic circuits such as computing units. As CMOS process sizes continue to shrink, logic circuits become more sensitive to soft errors caused by high-energy particles. Numerous experimental results show that, under high-energy particle strikes, GPUs have a higher error rate than other integrated circuit devices. It should be noted that reliability requirements depend on the application. In some applications the reliability of the GPU is critical: in spacecraft, artificial satellites, or autonomous driving, for example, a soft error may lead to extremely serious consequences. In personal entertainment applications such as audio or video, by contrast, a certain number of soft errors can be tolerated.

Error-correcting code (ECC) is one of the most common hardening techniques for memory and can also be applied in GPUs to reduce the soft error rate. However, this scheme incurs a high cost in time, space, and power consumption, and only certain series of high-end GPUs are equipped with ECC. Other common hardening methods, such as redundancy and checkpointing, mainly rely on recomputation after an error is detected. One redundancy-based hardening technique is TMR (triple modular redundancy), which has been shown experimentally to improve system reliability. But although TMR can effectively address soft errors, it triples resource consumption, and in some applications resources are limited.

Therefore, we propose an algorithm-based hardening technique for matrix multiplication that improves execution efficiency while increasing system reliability.

Summary of the invention

The object of the invention is to design a matrix multiplication hardening method based on ABFT that achieves algorithm-level hardening while consuming fewer resources; fault-injection simulation results show that this technique has higher execution efficiency than existing techniques. The invention mainly addresses error correction when multiple errors are randomly distributed. In that case, the traditional checksum verification algorithm flags more error positions than there are actual errors, which causes unnecessary time cost during correction, even though in most cases only a few of the flagged positions are true errors. To solve this problem, the invention provides a new hardening technique based on ABFT that further reduces overhead.

The technical solution of the invention is as follows:

A hardening method for a parallel matrix multiplication algorithm, for reducing the ABFT overhead of matrix multiplication and FFT, comprising the following steps:

(1) First, the inputs and outputs of the matrix multiplication are encoded; the computed result is verified against the encoded values, and a list of all possible errors is saved.

(2) The error list is preprocessed to exclude misjudged errors and avoid unnecessary corrections; errors are excluded using a relative error method, and an additional error check is performed before correction. The remaining errors are then corrected. If one or more errors have been corrected, the error information is updated; after several iterations most errors can be corrected.

(3) The remaining errors that cannot be corrected by the algorithm are handled by recomputation.

The beneficial effects of the invention are as follows:

The hardening method of the invention improves execution efficiency while increasing system reliability.

Description of the drawings

Fig. 1 is a schematic diagram of the encoding process;

Fig. 2 is a schematic diagram of a typical fault distribution;

Fig. 3 is an example of randomly distributed errors in Embodiment 1, where black dots are true errors and grey dots are positions regarded as potential errors;

Fig. 4 is an example of the correction process for the randomly distributed errors in Embodiment 1;

Fig. 5 compares the running time of the algorithm of the invention with that of the existing EXABFT algorithm.

Detailed description

The invention is further described below with reference to the accompanying drawings. The following embodiments serve only to illustrate the technical solution of the invention more clearly and do not limit its scope of protection.

(1) First, the inputs and outputs of the matrix multiplication are encoded; the computed result is verified against the encoded values, and all possible error positions are saved. The encoding process is shown in Fig. 1.

The sum of each column of matrix A is computed and appended to the matrix as Ac; likewise, Br holds the sum of each row of matrix B. Multiplication then yields Mc and Mr, the column sums and row sums of the result matrix M. Because Mc and Mr were never affected by radiation effects in many experiments, they can be taken as the correct column sums and row sums. Mc' and Mr' are the column sums and row sums computed from M after the multiplication has completed. Comparing Mc with Mc' and Mr with Mr' therefore locates the errors. The integers #Err_row (number of faulty rows) and #Err_col (number of faulty columns) store the counts of rows and columns that contain errors, and the arrays Faulty_rows (index list of faulty rows) and Faulty_cols (index list of faulty columns) hold the indices of those rows and columns. The intersections of faulty rows and faulty columns are the possible positions of computation errors.
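The encoding and detection step can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the GPU implementation of the invention: the function names `matmul` and `detect_faults`, the tolerance values, and the use of `math.isclose` as the relative error comparison are hypothetical choices made for the example.

```python
import math

def matmul(A, B):
    """Plain matrix product; a stand-in for the parallel GPU kernel."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def detect_faults(A, B, M, rel_tol=1e-9):
    """Return (Faulty_rows, Faulty_cols) for a computed product M of A and B."""
    Ac = [sum(col) for col in zip(*A)]   # column sums of A
    Br = [sum(row) for row in B]         # row sums of B
    # Reference checksums obtained by multiplication.
    Mc = [sum(a * b for a, b in zip(Ac, col)) for col in zip(*B)]  # Ac x B
    Mr = [sum(a * b for a, b in zip(row, Br)) for row in A]        # A x Br
    # Checksums recomputed from the finished result (Mc' and Mr').
    Mc_p = [sum(col) for col in zip(*M)]
    Mr_p = [sum(row) for row in M]

    def differs(x, y):
        # Assumed form of the relative error comparison.
        return not math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-12)

    faulty_rows = [i for i in range(len(Mr)) if differs(Mr[i], Mr_p[i])]
    faulty_cols = [j for j in range(len(Mc)) if differs(Mc[j], Mc_p[j])]
    return faulty_rows, faulty_cols

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
M = matmul(A, B)
M[1][0] += 5.0                      # inject one error
print(detect_faults(A, B, M))       # ([1], [0])
```

Here #Err_row and #Err_col correspond to the lengths of the two returned lists, and the candidate error positions are all pairs (i, j) drawn from them.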

(2) The error list is preprocessed to exclude misjudged errors and avoid unnecessary corrections. Errors are excluded using a relative error method. A typical fault distribution is shown in Fig. 2, where the grey areas are misjudged regions. Before an error is corrected, an additional check tests whether equation (1) holds. If it does, the intersection is the only error in row i and column j; otherwise it is a misjudged region.

M_c[j]-M_c′[j]=M_r[i]-M_r′[i] (1)
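The check in equation (1) can be illustrated with a short Python sketch. The function name `is_single_error`, the tolerance values, and the sample checksums are hypothetical; the comparison is done with a relative tolerance on the assumption that the checksums are floating-point values.

```python
import math

def is_single_error(Mc, Mc_p, Mr, Mr_p, i, j, rel_tol=1e-9):
    """Equation (1): True if the column and row checksum deviations agree."""
    return math.isclose(Mc[j] - Mc_p[j], Mr[i] - Mr_p[i],
                        rel_tol=rel_tol, abs_tol=1e-12)

# Hypothetical checksums for a 2 x 2 result with errors of +5 at (0, 0)
# and -3 at (1, 1): both rows and both columns are flagged.
Mc, Mc_p = [62.0, 72.0], [67.0, 69.0]
Mr, Mr_p = [41.0, 93.0], [46.0, 90.0]
candidates = [(i, j) for i in (0, 1) for j in (0, 1)]
true_errors = [(i, j) for i, j in candidates
               if is_single_error(Mc, Mc_p, Mr, Mr_p, i, j)]
print(true_errors)  # [(0, 0), (1, 1)]
```

Of the four candidate intersections, only two satisfy equation (1); the other two are misjudged positions and need no correction. Note that two unrelated deviations could coincide by chance, so this test is a filter rather than a proof.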

The remaining true errors are corrected using equation (2) or (3). If one or more errors have been corrected, the error information Faulty_rows (index list of faulty rows) and Faulty_cols (index list of faulty columns) is updated, and encoding and error detection are performed again; after several iterations most errors can be corrected.

M_correct[i, j]=M_error[i, j]-(M_r′[i]-M_r[i]) (2)

M_correct[i, j]=M_error[i, j]-(M_c′[j]-M_c[j]) (3)
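As a small worked example of equations (2) and (3), the sketch below corrects one corrupted element from either checksum. The helper names `correct_by_row` and `correct_by_col` and the sample values are hypothetical.

```python
def correct_by_row(M, Mr, Mr_p, i, j):
    """Equation (2): remove the row checksum deviation from M[i][j]."""
    return M[i][j] - (Mr_p[i] - Mr[i])

def correct_by_col(M, Mc, Mc_p, i, j):
    """Equation (3): remove the column checksum deviation from M[i][j]."""
    return M[i][j] - (Mc_p[j] - Mc[j])

# Hypothetical data: the correct product is [[19, 22], [43, 50]], and the
# element at (1, 0) was corrupted from 43 to 48.
M = [[19.0, 22.0], [48.0, 50.0]]
Mr, Mc = [41.0, 93.0], [62.0, 72.0]          # reference checksums
Mr_p = [sum(row) for row in M]               # [41.0, 98.0]
Mc_p = [sum(col) for col in zip(*M)]         # [67.0, 72.0]

print(correct_by_row(M, Mr, Mr_p, 1, 0))     # 43.0
print(correct_by_col(M, Mc, Mc_p, 1, 0))     # 43.0
```

Both equations give the same value here because the element is the only error in its row and in its column; when a row holds several errors, only the column form applies to each of them, and vice versa.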

(3) The remaining errors that cannot be corrected using equation (2) or (3) are handled by recomputing those elements.
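The recomputation fallback can be sketched as below. The function name `recompute` and the list of leftover positions are hypothetical; the sketch assumes the inputs A and B are still intact, so that each leftover element can be rebuilt from one row of A and one column of B.

```python
def recompute(A, B, M, positions):
    """Recompute the listed elements of M = A x B in place and return M."""
    for i, j in positions:
        M[i][j] = sum(A[i][k] * B[k][j] for k in range(len(B)))
    return M

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
M = [[19.0, 99.0], [43.0, -1.0]]    # two elements left uncorrected
recompute(A, B, M, [(0, 1), (1, 1)])
print(M)                             # [[19.0, 22.0], [43.0, 50.0]]
```

Only the listed elements are recomputed, which is the point of the scheme: the cost scales with the number of leftover positions rather than with the size of the whole product.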

Embodiment 1：

As shown in Fig. 3, the errors are randomly distributed. The checksum verification detects deviations in the encoding of 3 rows and 4 columns, giving 12 possible error points, although there are actually only 4 errors. Using formula (1), positions 3 and 4 are identified as single errors, and these are corrected first. Once they have been corrected with equation (2) or equation (3), Faulty_rows and Faulty_cols are updated; the remaining 2 errors lie in the same row and can therefore be corrected with very few operations. The correction process for this example is shown in Fig. 4. In this example, the proposed scheme avoids unnecessary correction of "false" errors and reduces the total number of potential errors within very few iterations.
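The iteration described in this embodiment can be sketched in Python as follows. Fig. 3 is not reproduced here, so the error positions and magnitudes are invented to match the stated counts (4 errors spread over 3 rows and 4 columns, two of them in the same row). The function names, the tolerance, and the rule used once a single faulty row or column remains are assumptions about one possible reading of the text, not the invention's definitive procedure.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def close(x, y, tol=1e-9):
    return math.isclose(x, y, rel_tol=tol, abs_tol=1e-12)

def locate(M, Mc, Mr):
    """Return Faulty_rows, Faulty_cols and the recomputed checksums."""
    Mc_p = [sum(col) for col in zip(*M)]
    Mr_p = [sum(row) for row in M]
    rows = [i for i in range(len(Mr)) if not close(Mr[i], Mr_p[i])]
    cols = [j for j in range(len(Mc)) if not close(Mc[j], Mc_p[j])]
    return rows, cols, Mc_p, Mr_p

def repair(M, Mc, Mr, max_iter=10):
    """Iteratively correct M in place; return rows and columns still faulty."""
    for _ in range(max_iter):
        rows, cols, Mc_p, Mr_p = locate(M, Mc, Mr)
        fixed = False
        if len(rows) == 1 and cols:
            # Assumed rule: one faulty row left, so apply equation (3) per column.
            for j in cols:
                M[rows[0]][j] -= Mc_p[j] - Mc[j]
            fixed = True
        elif len(cols) == 1 and rows:
            # Assumed rule: one faulty column left, so apply equation (2) per row.
            for i in rows:
                M[i][cols[0]] -= Mr_p[i] - Mr[i]
            fixed = True
        else:
            done_cols = set()
            for i in rows:
                for j in cols:
                    # Equation (1) is the single-error test.
                    if j not in done_cols and close(Mc[j] - Mc_p[j],
                                                    Mr[i] - Mr_p[i]):
                        M[i][j] -= Mr_p[i] - Mr[i]   # equation (2)
                        done_cols.add(j)
                        fixed = True
                        break   # at most one correction per row per pass
        if not fixed:
            break
    rows, cols, _, _ = locate(M, Mc, Mr)
    return rows, cols

A = [[float(i + j) for j in range(4)] for i in range(4)]
B = [[float(i * j + 1) for j in range(5)] for i in range(4)]
M_true = matmul(A, B)
Ac = [sum(col) for col in zip(*A)]
Br = [sum(row) for row in B]
Mc = [sum(a * b for a, b in zip(Ac, col)) for col in zip(*B)]
Mr = [sum(a * b for a, b in zip(row, Br)) for row in A]

M = [row[:] for row in M_true]
for (i, j), e in {(0, 0): 7.0, (1, 2): -4.0, (3, 1): 2.5, (3, 4): 9.0}.items():
    M[i][j] += e                    # hypothetical injected errors

rows, cols, _, _ = locate(M, Mc, Mr)
print(len(rows) * len(cols))        # 12
print(repair(M, Mc, Mr))            # ([], [])
print(M == M_true)                  # True
```

In this hypothetical layout the first pass corrects the two isolated errors through equation (1), and the second pass corrects the two errors sharing a row from their column checksums, so none of the 8 misjudged intersections is ever touched. Any rows or columns still returned by `repair` would be candidates for recomputation.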

In fault-injection simulations, this method is more efficient than the existing method. We ran fault-injection simulation tests on matrix multiplications of different sizes. Fig. 5 compares the running time of this algorithm with that of the existing EXABFT algorithm when 10 errors are injected; the algorithm of the invention executes more efficiently than the existing EXABFT algorithm.

The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art can make several improvements and variations without departing from the technical principles of the invention, and these improvements and variations shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A hardening method for a parallel matrix multiplication algorithm, for reducing the ABFT overhead of matrix multiplication, characterized in that it comprises the following steps:

(1) first, the inputs and outputs of the matrix multiplication are encoded, the computed result is verified against the encoded values, and a list of all possible errors is saved;

(2) the error list is preprocessed to exclude misjudged errors and avoid unnecessary corrections, wherein errors are excluded using a relative error method and an additional error check is performed before correction; the remaining errors are then corrected; if one or more errors have been corrected, the error information is updated, and most errors are corrected after several iterations;

(3) the remaining errors that cannot be corrected by the algorithm are handled by recomputation.