CN108733628B

CN108733628B - Parallel matrix multiplication algorithm reinforcing method

Info

Publication number: CN108733628B
Application number: CN201810502409.1A
Authority: CN
Inventors: 王海滨; 王杨圣; 戴茜茜; 惠志坚; 叶静; 孙洪文
Original assignee: Changzhou Campus of Hohai University
Current assignee: Changzhou Campus of Hohai University
Priority date: 2018-05-23
Filing date: 2018-05-23
Publication date: 2020-01-03
Anticipated expiration: 2038-05-23
Also published as: CN108733628A

Abstract

The invention discloses a reinforcement method for a parallel matrix multiplication algorithm, which reduces the overhead of ABFT (algorithm-based fault tolerance) reinforcement of matrix multiplication and comprises the following steps: (1) the inputs and output of the matrix multiplication are encoded, the calculation result is checked against the encoded values, and a list of all possible errors is stored; (2) the error list is preprocessed to eliminate misjudged errors and avoid unnecessary corrections, wherein the elimination uses a relative error method that adds an error check before correction, after which the remaining errors are corrected; whenever one or more errors are corrected, the error list is updated, so that most errors can be corrected over multiple iterations; (3) a recalculation strategy is adopted for the remaining errors that the algorithm cannot correct. The reinforcement method improves system reliability and execution efficiency.

Description

Parallel matrix multiplication algorithm reinforcing method

Technical Field

The invention relates to a reinforcement technique for parallel matrix multiplication algorithms, applicable to any technical field that uses matrix multiplication, such as image processing and data statistics.

Background

At present, the parallel computing architecture of the graphics processing unit (GPU) greatly improves the speed of large-scale computation and shows great potential in high-performance computing. GPUs are used in a variety of areas, such as image processing, data statistics, and other high-performance computing applications, and are becoming increasingly popular in modern industry. In recent years, GPU manufacturers such as NVIDIA have been developing GPU computing platforms for autonomous driving applications.

Energetic particles can flip bits in memory elements or cause transient voltage pulses in other logic circuits, such as computational units. As CMOS fabrication processes continue to shrink, logic circuits become more sensitive to soft errors caused by high-energy particles. Numerous experimental results indicate that GPUs have a higher error rate than other integrated circuit devices under high-energy particle strikes. It should be noted that reliability requirements are application dependent. GPU reliability is critical in applications such as spacecraft, satellites, or autonomous driving, where soft errors can have extremely serious consequences. In personal entertainment applications such as audio or video, a certain number of soft errors can be tolerated.

Error correction code (ECC) mechanisms are among the most common reinforcement techniques for memory and can also be applied in GPUs to reduce soft error rates. However, this approach incurs high costs in time, space, and power consumption, and only certain families of high-end GPUs are equipped with ECC. Other common reinforcement methods, such as redundancy and checkpointing, mainly rely on recomputation after an error is detected. One redundancy-based reinforcement technique is triple modular redundancy (TMR), which has been shown experimentally to improve system reliability. However, although TMR handles soft errors effectively, it triples resource consumption, and in some applications resources are limited.

Therefore, an algorithm-based reinforcement technique for matrix multiplication is proposed that improves both system reliability and execution efficiency.

Disclosure of Invention

The invention aims to provide an ABFT-based reinforcement method for matrix multiplication that achieves algorithm reinforcement with fewer resources; error injection simulation results show that its execution efficiency is higher than that of the prior art. The invention focuses on correcting errors when multiple errors are randomly distributed. In that case, a conventional checksum verification algorithm may flag more error locations than there are actual errors, which wastes time on unnecessary corrections, even though in most cases only a few of the flagged locations are real errors. To solve this problem, the invention provides a new ABFT-based reinforcement technique that further reduces the overhead.

The technical scheme of the invention is as follows:

A reinforcement method for a parallel matrix multiplication algorithm, used for reducing the ABFT overhead of matrix multiplication and FFT, comprises the following steps:

(1) First, the inputs and output of the matrix multiplication are encoded, the calculation result is checked against the encoded values, and a list of all possible errors is stored.

(2) The error list is preprocessed to eliminate misjudged errors and avoid unnecessary corrections. The elimination uses a relative error method: an error check is added before any error is corrected. The remaining errors are then corrected. Whenever one or more errors are corrected, the error information is updated, so that most errors can be corrected over multiple iterations.

(3) A recalculation strategy is adopted for the remaining errors that the algorithm cannot correct.

The invention has the following beneficial effects:

The reinforcement method improves both system reliability and execution efficiency.

Drawings

FIG. 1 is a schematic diagram of a specific encoding process;

FIG. 2 is a schematic diagram of an exemplary error distribution map;

FIG. 3 is an example of randomly distributed errors in embodiment 1, in which black dots are real errors and gray dots are regarded as potential errors;

FIG. 4 is an example of a procedure for correcting a randomly distributed error in embodiment 1;

FIG. 5 is a time consuming comparison of the present invention algorithm to the existing EXABFT algorithm.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

(1) First, the inputs and output of the matrix multiplication are encoded, the calculation result is checked against the encoded values, and all possible error positions are stored. The specific encoding process is shown in FIG. 1.

The sum of each column of matrix A is computed and appended to A as an extra row, giving Ac; likewise, the sum of each row of matrix B is appended to B as an extra column, giving Br. Multiplying the encoded matrices yields Mc and Mr, the column sums and row sums of the result matrix M. Because Mc and Mr were never affected by radiation effects in a large number of experiments, they are assumed to be the correct column and row sums. After the calculation is finished, the column sums Mc' and row sums Mr' are recomputed from M. Comparing Mc with Mc' and Mr with Mr' locates the errors: the integers #Err_row and #Err_col store the number of rows and the number of columns containing errors, the arrays fault_rows and fault_cols hold the indexes of the erroneous rows and columns, and the intersections of the erroneous rows and columns are the possible positions of calculation errors.
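The encoding and detection step can be illustrated with a minimal pure-Python sketch. The function names (`matmul`, `encode_and_multiply`, `detect`) and the absolute tolerance `tol` are assumptions made for illustration and do not come from the patent; the reference checksums are computed here as separate small products, which is arithmetically equivalent to multiplying the encoded matrices Ac and Br.

```python
def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def encode_and_multiply(A, B):
    """Return (M, Mc, Mr): the product and its reference column/row checksums."""
    col_sums_A = [sum(col) for col in zip(*A)]    # checksum row appended to A (Ac)
    row_sums_B = [[sum(row)] for row in B]        # checksum column appended to B (Br)
    M = matmul(A, B)
    Mc = matmul([col_sums_A], B)[0]               # expected column sums of M
    Mr = [r[0] for r in matmul(A, row_sums_B)]    # expected row sums of M
    return M, Mc, Mr

def detect(M, Mc, Mr, tol=1e-9):
    """Recompute Mc' and Mr' from M and return (fault_rows, fault_cols)."""
    Mc_new = [sum(col) for col in zip(*M)]        # Mc'
    Mr_new = [sum(row) for row in M]              # Mr'
    fault_rows = [i for i, (x, y) in enumerate(zip(Mr, Mr_new)) if abs(x - y) > tol]
    fault_cols = [j for j, (x, y) in enumerate(zip(Mc, Mc_new)) if abs(x - y) > tol]
    return fault_rows, fault_cols

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
M, Mc, Mr = encode_and_multiply(A, B)
M[1][0] += 5                                      # simulate a soft error in one element
print(detect(M, Mc, Mr))                          # ([1], [0])
```

Here `len(fault_rows)` and `len(fault_cols)` play the role of #Err_row and #Err_col.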

(2) The error list is preprocessed to eliminate misjudged errors and avoid unnecessary corrections. The elimination uses a relative error method. A typical error distribution is shown in FIG. 2, in which the gray areas are misjudged areas. Before an error is corrected, an additional check determines whether equation (1) holds for the intersection of row i and column j. If it holds, the intersection is the only error in row i and column j; otherwise, the intersection is a misjudged area.

Mc[j] - Mc'[j] = Mr[i] - Mr'[i]    (1)
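With floating-point data, the two sides of equation (1) are accumulated in different orders and rarely match bit for bit, which is one plausible reason for comparing them by relative error. The sketch below is an assumption about how such a check might look; the helper name `passes_equation_1` and the tolerance value are hypothetical and are not specified by the patent.

```python
import math

def passes_equation_1(d_col, d_row, rel_tol=1e-9):
    """True when the column and row checksum deviations agree within rel_tol."""
    return math.isclose(d_col, d_row, rel_tol=rel_tol)

d_col = 0.1 + 0.2   # stands in for Mc[j] - Mc'[j], summed in one order
d_row = 0.3         # stands in for Mr[i] - Mr'[i], summed in another order

print(d_col == d_row)                    # False: exact comparison rejects a true match
print(passes_equation_1(d_col, d_row))   # True: relative comparison accepts it
```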

The remaining real errors are corrected using equation (2) or (3). If one or more errors are corrected, the error information fault_rows and fault_cols is updated and encoding and error detection are performed again; most errors can be corrected after several iterations.

M_correct[i, j] = M_error[i, j] - (Mr'[i] - Mr[i])    (2)

M_correct[i, j] = M_error[i, j] - (Mc'[j] - Mc[j])    (3)
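The iterative correction can be sketched as follows. This is one possible reading of the procedure, not the patented implementation: the function names, the tolerances, the `max_iters` cap, and the rule that a single remaining faulty row (or column) is repaired directly with equation (3) (or (2)) are all assumptions made for illustration.

```python
import math

def checksum_faults(M, Mc, Mr, tol):
    """Return deviations (Mr'-Mr, Mc'-Mc) and the faulty row/column indexes."""
    d_row = [sum(row) - r for row, r in zip(M, Mr)]
    d_col = [sum(col) - c for col, c in zip(zip(*M), Mc)]
    rows = [i for i, d in enumerate(d_row) if abs(d) > tol]
    cols = [j for j, d in enumerate(d_col) if abs(d) > tol]
    return d_row, d_col, rows, cols

def correct_errors(M, Mc, Mr, rel_tol=1e-6, tol=1e-9, max_iters=10):
    """Correct M in place; return (fault_rows, fault_cols) left for recomputation."""
    for _ in range(max_iters):
        d_row, d_col, rows, cols = checksum_faults(M, Mc, Mr, tol)
        fixed = False
        if len(rows) == 1:                       # all errors lie in one row
            for j in cols:
                M[rows[0]][j] -= d_col[j]        # equation (3)
                fixed = True
        elif len(cols) == 1:                     # all errors lie in one column
            for i in rows:
                M[i][cols[0]] -= d_row[i]        # equation (2)
                fixed = True
        else:
            used_cols = set()
            for i in rows:
                for j in cols:
                    # equation (1): a match marks a single error at (i, j)
                    if j not in used_cols and math.isclose(d_row[i], d_col[j], rel_tol=rel_tol):
                        M[i][j] -= d_row[i]      # equation (2)
                        used_cols.add(j)
                        fixed = True
                        break
        if not fixed:
            break
    _, _, rows, cols = checksum_faults(M, Mc, Mr, tol)
    return rows, cols
```

A caller would pass the result matrix together with the reference checksums Mc and Mr; an empty pair of lists means every detected error was corrected, while non-empty lists identify the rows and columns that still need recomputation. Note that equation (1) can be satisfied by coincidence when two different errors have equal magnitude, so this sketch inherits that limitation.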

(3) The remaining erroneous elements that cannot be corrected by equation (2) or (3) are recalculated.
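A minimal sketch of the recalculation fallback is given below. It assumes that every intersection of the remaining faulty rows and columns is recomputed from the original inputs; the function name `recompute` and this choice of which elements to recompute are illustrative assumptions, since the patent states only that the uncorrectable elements are recalculated.

```python
def recompute(M, A, B, fault_rows, fault_cols):
    """Recompute M[i][j] = sum_k A[i][k] * B[k][j] at every remaining candidate."""
    for i in fault_rows:
        for j in fault_cols:
            M[i][j] = sum(A[i][k] * B[k][j] for k in range(len(B)))
    return M

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
M = [[19, 0], [43, 99]]               # column 1 holds two corrupted values
print(recompute(M, A, B, [0, 1], [1]))  # [[19, 22], [43, 50]]
```

Recomputing only the candidate elements costs one dot product per element, which is far cheaper than repeating the whole multiplication when few candidates remain.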

Example 1:

As shown in FIG. 3, the errors are randomly distributed. Checksum verification detects deviations in the checksums of 3 rows and 4 columns, that is, 12 possible error points, although there are in fact only 4 errors. Applying equation (1) identifies two of these locations as single errors in their row and column, and these are corrected first. Once they have been corrected using equation (2) or (3), fault_rows and fault_cols are updated; the remaining 2 errors lie in the same row and can therefore be corrected with very few operations. The correction process is shown in FIG. 4. In this example, the proposed scheme avoids unnecessary correction of "false" errors and reduces the number of potential errors within very few iterations.
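The counting in this example can be reproduced with a small sketch. The error positions and magnitudes below are hypothetical, chosen only to match the described layout (4 real errors spread over 3 rows and 4 columns, two of them sharing a row); they are not taken from FIG. 3.

```python
import math

# Hypothetical injected errors: (row, column) -> deviation from the correct value.
errors = {(0, 0): 1.5, (1, 2): -2.0, (3, 1): 0.75, (3, 4): 4.0}
n = 5  # assumed size of the result matrix

# Checksum deviations Mr'[i] - Mr[i] and Mc'[j] - Mc[j] caused by those errors.
d_row = [sum(e for (i, _), e in errors.items() if i == r) for r in range(n)]
d_col = [sum(e for (_, j), e in errors.items() if j == c) for c in range(n)]

fault_rows = [i for i, d in enumerate(d_row) if d != 0]
fault_cols = [j for j, d in enumerate(d_col) if d != 0]
candidates = [(i, j) for i in fault_rows for j in fault_cols]

# Equation (1): keep only the intersections whose row and column deviations agree.
single = [(i, j) for i, j in candidates
          if math.isclose(d_row[i], d_col[j], rel_tol=1e-9)]

print(len(candidates))  # 12
print(single)           # [(0, 0), (1, 2)]
```

After the two single errors are corrected, only row 3 remains faulty, and its two errors can be removed column by column with equation (3).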

Error injection simulations show that the method is more efficient than the existing method. Error injection tests were performed on matrix multiplications of different sizes. FIG. 5 compares the running time of the proposed algorithm with that of the existing EXABFT algorithm when 10 errors are injected; the proposed algorithm executes more efficiently than the existing EXABFT algorithm.

The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims

1. A reinforcement method for a parallel matrix multiplication algorithm, used for reducing the ABFT overhead of matrix multiplication, characterized by comprising the following steps:

(1) first, encoding the inputs and output of the matrix multiplication, checking the calculation result against the encoded values, and storing a list of all possible errors;

(2) preprocessing the error list to eliminate misjudged errors and avoid unnecessary corrections, wherein the elimination uses a relative error method, an error check is added before the errors are corrected, and the error list is preprocessed according to whether equation (1) is satisfied:

Mc[j] - Mc'[j] = Mr[i] - Mr'[i]    (1);

if it is satisfied, the intersection point is the only error in the ith row and the jth column; otherwise, the intersection point is a misjudged area;

the remaining errors are then corrected using equation (2) or (3):

M_correct[i, j] = M_error[i, j] - (Mr'[i] - Mr[i])    (2);

M_correct[i, j] = M_error[i, j] - (Mc'[j] - Mc[j])    (3);

wherein Mc and Mr are the correct column sums and row sums, and Mc' and Mr' are obtained by computing the column sums and row sums of the matrix M after the calculation is finished; if one or more errors are corrected, the error information is updated, and most errors are corrected after multiple iterations;