CN109815104A

CN109815104A - GPGPU program approximate analysis system and method based on soft error perception

Info

Publication number: CN109815104A
Application number: CN201910107441.4A
Authority: CN
Inventors: 魏晓辉; 岳恒山; 谭婧炜佳; 孙冰怡; 徐海啸
Original assignee: Jilin University
Current assignee: Jilin University
Priority date: 2019-02-02
Filing date: 2019-02-02
Publication date: 2019-05-28
Anticipated expiration: 2039-02-02
Also published as: CN109815104B

Abstract

The present invention provides a GPGPU program approximate analysis system and method based on soft error perception, used to perform reliability analysis on GPGPU programs. The system performs multiple soft error simulations and classifies the erroneous output result produced after each soft error. During classification, erroneous outputs of the difference occurrence type (SDC) are further divided into a difference-acceptable type and a difference-unacceptable type, according to whether the error (difference) between the erroneous output result and the standard output result exceeds the user quality requirement. This reflects the approximation property of a program, namely that it can tolerate errors within a certain range, so the classification performed by the embodiments of the invention is an approximate classification, and the reliability analysis based on this approximate classification is a "reliability approximate analysis". Reliability approximate analysis helps to identify the errors that are truly serious; designing protection strategies on this basis reduces unnecessary protection and overhead.

Description

GPGPU program approximate analysis system and method based on soft error perception

Technical field

The present invention relates to the field of computers, and in particular to a GPGPU program approximate analysis system and method based on soft error perception.

Background technique

A GPGPU (general-purpose graphics processing unit) is derived from the component that processes images in a computer, originally called the GPU (graphics processing unit). Because this component can also run non-graphics high-performance computing (HPC) programs such as numerical simulation, temperature simulation and deep learning, it came to be called a "general-purpose graphics processing unit".

The GPGPU architecture consists of multiple SMs (streaming multiprocessors). Through the SMs, a GPU can run thousands of threads in parallel, all of which execute the same program code.

A GPGPU integrates thousands of computing cores on a very small chip. The higher the integration density of the chip, the more susceptible its circuits are to alpha particles and high-energy neutrons from space.

When a computer is affected by alpha particles or high-energy neutrons, a stored 0 or 1 can flip, for example 001 becomes 000. Such a flip is called a "bit flip", and a program error caused by a bit flip is called a "soft error" (so named because the circuit itself is not damaged; only the data is corrupted).
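The bit flip described above can be illustrated with a short Python sketch. The helper name `flip_bit` is hypothetical and is used here only for illustration; it models the corruption of a stored value, not any hardware mechanism.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the bit at position `bit` inverted (0 = least significant)."""
    return value ^ (1 << bit)


stored = 0b001                    # value held in memory before the particle strike
corrupted = flip_bit(stored, 0)   # the lowest bit flips from 1 to 0

print(f"{stored:03b} -> {corrupted:03b}")  # prints "001 -> 000"
```

Flipping the same bit a second time restores the original value, which is why the circuit is considered undamaged: only the stored data changed.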

A traditional GPU is used only for processing image data. After a soft error occurs, at most a few pixels of the image are corrupted (as shown in Fig. 1a), which does not affect the user's needs. When a GPGPU runs HPC programs, however, those programs have certain reliability requirements, so a protection strategy must be designed to eliminate the influence of soft errors on the program.

Before a protection strategy is designed, reliability analysis must be performed on the GPGPU program to find the parts of the program that strongly affect the result, so that protection strategies can then be designed in a targeted manner.

Summary of the invention

In view of this, the present invention provides a GPGPU program approximate analysis system and method based on soft error perception, so as to perform reliability analysis on GPGPU programs.

To achieve the above object, the present invention provides the following technical solutions:

A kind of GPGPU program approximate analysis system based on soft error perception, comprising:

A receiving unit, configured to receive program code, a quality metric formula, a soft error mode and a user quality requirement, wherein the user quality requirement includes the maximum error value that the user can tolerate;

A simulation unit, configured to perform multiple soft error simulations on the program code according to the soft error mode;

An analysis unit, configured to determine, according to the quality metric formula, the error severity type corresponding to the output result of each soft error simulation, and to perform reliability approximate analysis according to the error severity types;

Wherein determining, according to the quality metric formula, the error severity type corresponding to the output result of each soft error simulation includes:

Calculating, according to the quality metric formula, the error between a target erroneous output result and the standard output result, where the target erroneous output result denotes the output result of any one soft error simulation;

If the error is zero, determining that the error severity type of the target erroneous output result is the no-difference type;

If the error is greater than zero and not greater than the maximum error value, determining that the error severity type of the target erroneous output result is the difference-acceptable type;

If the error is greater than the maximum error value, determining that the error severity type of the target erroneous output result is the difference-unacceptable type, where the difference-acceptable type and the difference-unacceptable type both belong to the difference occurrence type;

If the program code terminates execution, determining that the error severity type of the target erroneous output result is the crash type.

Optionally, each soft error simulation includes: injecting a soft error into the program code according to the soft error mode, and running the program code into which the soft error has been injected. The soft error mode includes an error type and an error position. The error position indicates at least one of: the function in which the error occurs, the instruction type in which the error occurs, and the data bit in which the error occurs. The error type indicates the number of bits that are flipped.

Optionally, the analysis unit is further configured to collect the error information related to each soft error simulation, where the error information includes the name of the kernel function in which the soft error occurs, the type of the instruction in which the soft error occurs, and the data bit in which the bit flip occurs. In performing reliability approximate analysis according to the error severity types, the analysis unit is specifically configured to perform multi-dimensional error approximate analysis according to the error severity type corresponding to each soft error simulation and the collected error information. The multi-dimensional error approximate analysis includes one or a combination of: program-level error approximate analysis, kernel-function-level error approximate analysis, instruction-type-level error approximate analysis, and data-bit-level error approximate analysis.

Optionally, in performing program-level error approximate analysis, the analysis unit is specifically configured to analyze the sensitivity of the program code to soft errors. In performing kernel-function-level error approximate analysis, the analysis unit is specifically configured to perform at least one of: analyzing the sensitivity of the program code to errors in a specific kernel function, and analyzing the sensitivity of the same program code to errors in different kernel functions. In performing instruction-type-level error approximate analysis, the analysis unit is specifically configured to perform at least one of: analyzing the sensitivity of the same program code to errors in different instruction types, analyzing the sensitivity of different program codes to errors in the same instruction type, and analyzing the sensitivity of the same instruction type in different kernel functions of the same program code. In performing data-bit-level error approximate analysis, the analysis unit is specifically configured to perform at least one of: analyzing the sensitivity of the same program code to errors occurring in different data bits, analyzing the sensitivity of the same program code to different error types, and comparing the sensitivity of different program codes to errors occurring in different data bits.

Optionally, in analyzing the sensitivity of the program code to a specific instruction type, the analysis unit is specifically configured to: count, over the soft error simulations in which errors are injected into instructions of the specific instruction type in the program code, the number of no-difference results N_Masked, the number of difference-acceptable results N_acceptable, the number of difference-unacceptable results N_unacceptable, and the number of crash results N_Detected, where N denotes a count and the subscripts Masked, acceptable, unacceptable and Detected denote the specific error severity type; calculate the proportion of the difference-acceptable type within the difference occurrence type according to the formula N_acceptable / (N_acceptable + N_unacceptable); and calculate the proportion of errors that, under approximation, have no influence on the program code according to the formula (N_acceptable + N_Masked) / (N_Masked + N_acceptable + N_unacceptable + N_Detected).

Optionally, if the standard output result is a single numerical value, the quality metric formula includes:

$$\text{rel-diff}_i = \frac{|G_i - C_i|}{|G_i|}$$

Wherein G_i denotes the standard output result, C_i denotes the erroneous output result, and rel-diff_i denotes the error between the erroneous output result and the standard output result.

Optionally, if the standard output result is a matrix, the quality metric formula includes:

$$\text{rel-l2-norm} = \frac{\lVert G - C \rVert}{\lVert G \rVert}$$

Wherein G denotes the standard output result, C denotes the erroneous output result, and rel-l2-norm denotes the error between the erroneous output result and the standard output result.

A GPGPU program approximate analysis method based on soft error perception, comprising:

Receiving program code, a quality metric formula, a soft error mode and a quality requirement, wherein the quality requirement includes the maximum tolerable error value;

Performing multiple soft error simulations on the program code according to the soft error mode;

Determining, according to the quality metric formula, the error severity type corresponding to each soft error simulation;

Performing reliability approximate analysis according to the error severity types;

Wherein determining, according to the quality metric formula, the error severity type corresponding to each soft error simulation includes:

Calculating, according to the quality metric formula, the error between a target erroneous output result and the standard output result, where the target erroneous output result denotes the output result of any one soft error simulation;

If the error is zero, determining that the error severity type of the target soft error simulation is the no-difference type, where the target soft error simulation denotes any one soft error simulation;

If the error is greater than zero and not greater than the maximum error value, determining that the error severity type of the target soft error simulation is the difference-acceptable type;

If the error is greater than the maximum error value, determining that the error severity type of the target soft error simulation is the difference-unacceptable type, where the difference-acceptable type and the difference-unacceptable type both belong to the difference occurrence type;

If the program code terminates execution, determining that the error severity type of the target soft error simulation is the crash type.

Optionally, the method further includes collecting the error information related to each soft error simulation, where the error information includes the name of the kernel function in which the soft error occurs, the type of the instruction in which the soft error occurs, and the data bit in which the bit flip occurs. Performing reliability approximate analysis according to the error severity types includes performing multi-dimensional error approximate analysis according to the error severity type corresponding to each soft error simulation and the collected error information. The multi-dimensional error approximate analysis includes one or a combination of: program-level error approximate analysis, kernel-function-level error approximate analysis, instruction-type-level error approximate analysis, and data-bit-level error approximate analysis.

Optionally, the program-level error approximate analysis includes analyzing the sensitivity of the program code to soft errors;

The kernel-function-level error approximate analysis includes at least one of: analyzing the sensitivity of the program code to errors in a specific kernel function, and analyzing the sensitivity of the same program code to errors in different kernel functions;

The instruction-type-level error approximate analysis includes at least one of: analyzing the sensitivity of the same program code to errors in different instruction types, analyzing the sensitivity of different program codes to errors in the same instruction type, and analyzing the sensitivity of the same instruction type in different kernel functions of the same program code;

The data-bit-level error approximate analysis includes at least one of: analyzing the sensitivity of the same program code to errors occurring in different data bits, analyzing the sensitivity of the same program code to different error types, and comparing the sensitivity of different program codes to errors occurring in different data bits.

It should be noted that although high-performance programs place high demands on the accuracy of their output results, in many cases they can still tolerate errors within a certain range. For example, when a high-performance program is used to estimate room temperature by physical simulation, a small gap between the estimated value and the true value is still acceptable. In other words, high-performance programs also have an "approximation" property: the program can tolerate errors within a certain range and accept results that are not perfectly accurate.

Based on this approximation property, the technical solution provided by the embodiments of the present invention performs multiple soft error simulations and classifies the erroneous output result produced after each soft error. During classification, erroneous outputs of the difference occurrence type (SDC) are further divided into a difference-acceptable (SDCs-acceptable) type and a difference-unacceptable (SDCs-unacceptable) type, according to whether the error (difference) between the erroneous output result and the standard output result exceeds the QoS (user quality requirement). This reflects the approximation property of a program, namely that it can tolerate errors within a certain range, so the classification performed by the embodiments of the invention is an approximate classification, and the reliability analysis based on this approximate classification is a "reliability approximate analysis". Reliability approximate analysis helps to identify the errors that are truly serious; designing protection strategies on this basis reduces unnecessary protection and overhead.

Detailed description of the invention

Fig. 1a is a schematic diagram of a soft error corrupting some pixels of an image;

Fig. 1b is a schematic diagram of the GPGPU architecture provided by an embodiment of the present invention;

Fig. 2 is an exemplary structural diagram of the approximate analysis system provided by an embodiment of the present invention;

Fig. 3 is an exemplary flowchart of the approximate analysis method provided by an embodiment of the present invention.

Specific embodiment

The embodiments of the present invention disclose a GPGPU program approximate analysis system and method based on soft error perception, so as to perform reliability analysis on GPGPU programs.

The GPGPU is introduced first:

In recent years the data volume of applications across industries has surged, and the growth rate of internet data far exceeds the growth rate of the computing and storage resources of hardware platforms. The computing capability and energy efficiency provided by existing data processing technology can hardly meet computing demands. It has been found that internet data volume will increase 50-fold over the next 10 years, whereas computer storage and computing capability will improve only 10-fold.

The GPGPU emerged against this background. A GPGPU is a new type of computing platform. Because it supports highly concurrent multithreaded operation, it is increasingly used for high-performance computing. Unlike a traditional GPU platform, which is used only to compute image data, the GPGPU is increasingly applied in fields such as numerical simulation, data mining and artificial intelligence.

The GPGPU uses the SIMT (single instruction, multiple threads) programming model: all threads in the same program kernel execute the same program code on different data.

The architecture of the GPU is shown in Fig. 1b: the GPU includes multiple SMs and can run multiple threads in parallel through the SMs.

Specifically, these threads execute the program instructions of the same code segment (that is, they execute the same program instructions with different operands). A code segment contains many program instructions; for example, the first program instruction adds two numbers, the second program instruction multiplies two numbers, and so on. At the same moment, different SMs may have reached different program instructions: one SM may be executing the 100th program instruction while another is executing the 200th.

The threads are divided into multiple CTAs (cooperative thread arrays, i.e. thread blocks), each consisting of hundreds of threads. Threads are distributed to the SMs in units of CTAs; that is, each SM may be assigned at least one thread block.

Each CTA is further divided into multiple thread clusters (warps). The size of a warp is fixed, for example 32 threads or 16 threads. In the SM pipeline, one ready warp is taken from the assigned CTAs each time to execute. Threads in the same warp execute in SIMD (single instruction, multiple data) mode; that is, at any given moment the threads in the same warp execute the same program instruction.

For example, at time t0 every thread in a warp may execute the program instruction that assigns a + b to c. The difference is that each thread uses its own values of a and b, so the value assigned to c also differs from thread to thread.
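The warp behaviour in this example can be modelled in a few lines of Python. This is only a sequential sketch of the idea that every thread executes the same instruction on its own operands; the names `WARP_SIZE` and `run_warp` are hypothetical, and a real GPU executes the lanes in hardware rather than in a loop.

```python
WARP_SIZE = 32  # assumed warp size; the text also mentions 16


def run_warp(a_values, b_values):
    """Model one warp executing the single instruction c = a + b.

    Every lane (thread) runs the same instruction but reads its own a and b.
    """
    return [a + b for a, b in zip(a_values, b_values)]


# Each thread t holds different operands.
a = [t for t in range(WARP_SIZE)]
b = [10 * t for t in range(WARP_SIZE)]
c = run_warp(a, b)

print(c[:4])  # results of the first four threads
```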

As mentioned above, soft errors can occur when a GPGPU executes program code. Some HPC programs in particular have high reliability requirements, so reliability analysis must be performed on them.

In existing reliability analysis methods, hundreds or thousands of error injections are performed on instructions of a specific type in the program code, and the program code with the injected error is run each time.

Meanwhile, the output of a run of the program without any injected error is recorded, i.e. the "uncontaminated output". The output of the program code with injected errors is then monitored and, according to error severity, divided into three classes:

Masked: the program output does not differ from the true result;

SDC: the program output differs from the true result;

Detected: the program terminates directly, forced to stop because of the severity of the error.

It should be noted that existing reliability analysis methods generally use the diff command to compare the text file of the program output with the text file of the true result. diff compares two text files line by line and prints each changed line on the command line.

When the program produces output (a program that terminates directly produces no output), the classification can be made with the diff command: if the program output does not differ from the true result, the output of diff is empty and the run is classified as Masked; otherwise the output of diff is not empty and the run is classified as SDC.
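The diff-based three-way classification used by existing methods can be sketched as follows. Python's standard `difflib` module stands in for the diff command here, which is an assumption made for portability; the function name `classify_by_diff` and the convention that a terminated run is represented by `None` are likewise hypothetical.

```python
import difflib
from typing import Optional


def classify_by_diff(golden_text: str, faulty_text: Optional[str]) -> str:
    """Classify one injected run as Masked, SDC or Detected.

    golden_text: output of the run without injected errors.
    faulty_text: output of the run with an injected error, or None if the
                 program terminated and produced no output.
    """
    if faulty_text is None:
        return "Detected"
    # Line-by-line comparison; an empty diff means the outputs are identical.
    diff_lines = list(
        difflib.unified_diff(
            golden_text.splitlines(), faulty_text.splitlines(), lineterm=""
        )
    )
    return "Masked" if not diff_lines else "SDC"


print(classify_by_diff("21.5\n22.0", "21.5\n22.0"))  # identical outputs
print(classify_by_diff("21.5\n22.0", "21.5\n22.4"))  # outputs differ
print(classify_by_diff("21.5\n22.0", None))          # run terminated
```

Note that this comparison only reports whether the outputs differ; it says nothing about how large the difference is.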

Statistical analysis is then used to analyze how errors occurring in different types of instructions influence the result, and the parts that are critical to the program are identified from the analysis results, so that protection strategies can subsequently be designed in a targeted manner.

In the above existing reliability analysis method (which may also be called a reliability resilience analysis method), all SDC and Detected errors are regarded as unacceptable.

For example, if errors are injected into addition instructions 1000 times, of which 200 are Masked, 500 are SDCs and 300 are Detected, then according to the existing reliability analysis method the probability that an error in an addition instruction causes a serious error is (500 + 300) / 1000 = 0.8.

However, the inventors found in the course of making the invention that although high-performance programs place high demands on the accuracy of their output results, in many cases they can still tolerate errors within a certain range. For example, when a high-performance program is used to estimate room temperature by physical simulation, a small gap between the estimated value and the true value is still acceptable. In other words, high-performance programs also have an "approximation" property: the program can tolerate errors within a certain range and accept results that are not perfectly accurate. Continuing the previous example, the inventors found that of the 500 SDCs, 400 output results do not affect normal use.

Therefore, existing reliability analysis methods overestimate the severity of errors, which makes the analysis results inaccurate. As a consequence, much unnecessary protection is applied subsequently, causing unnecessary overhead.
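The effect of the overestimate can be worked through numerically with the counts of the running example (200 Masked, 500 SDCs of which 400 do not affect normal use, 300 Detected). The sketch below applies the two ratios defined earlier for the instruction-type analysis; the last quantity, the share of unacceptable and Detected results, is an assumption introduced here by analogy with the existing method's estimate and is not a formula stated in the text.

```python
# Counts from the running example: 1000 injections into addition instructions.
n_masked = 200
n_acceptable = 400      # SDCs that do not affect normal use
n_unacceptable = 100    # remaining SDCs (500 - 400)
n_detected = 300
total = n_masked + n_acceptable + n_unacceptable + n_detected

# Existing method: every SDC and every Detected result counts as serious.
conventional_serious = (n_acceptable + n_unacceptable + n_detected) / total

# Proportion of the difference-acceptable type within the SDC type.
acceptable_share_of_sdc = n_acceptable / (n_acceptable + n_unacceptable)

# Proportion of injections that, under approximation, do not affect the program.
unaffected_ratio = (n_acceptable + n_masked) / total

# Assumed complement: only unacceptable SDCs and Detected results are serious.
approximate_serious = (n_unacceptable + n_detected) / total

print(conventional_serious, acceptable_share_of_sdc,
      unaffected_ratio, approximate_serious)
```

Under these counts the existing method reports 0.8, while only 0.4 of the injections lead to a result the user would actually reject.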

In view of this, the present invention provides a GPGPU program approximate analysis system and method based on soft error perception, which performs reliability analysis on GPGPU programs on the basis of the approximation property described above.

The core idea of the invention is as follows: multiple soft error simulations are performed, and the erroneous output result produced after each soft error is classified. During classification, erroneous outputs of the difference occurrence type (SDC) are further divided into a difference-acceptable (SDCs-acceptable) type and a difference-unacceptable (SDCs-unacceptable) type, according to whether the error (difference) between the erroneous output result and the standard output result exceeds the QoS (user quality requirement). Reliability analysis is then performed on the basis of this classification.

The approximate analysis system may also be understood as an approximate analysis framework.

Fig. 2 shows an exemplary structure of the approximate analysis system, which may include a receiving unit 1, a simulation unit 2 and an analysis unit 3.

The functions of these units are described below in conjunction with the method embodiments.

Fig. 3 shows an exemplary flow of the analysis method executed by the approximate analysis system, which may include at least the following steps:

S0: the approximate analysis system (approximate analysis framework) receives parameters.

The parameters are generally submitted to the framework by the user who needs the approximate analysis.

Specifically, the parameters may include:

1. Application: including the program code and the standard data;

The program code is the program code to be analyzed, and the standard data includes the standard input data and the standard output data of the program.

The standard output data is the true data output by the program code from the standard input data when no soft error occurs.

It can be understood that, for the comparison to be meaningful, the program code must process the same input data in every soft error simulation; only then are its output results comparable with the standard output data.

Therefore, the standard input data must be submitted to the framework in addition to the standard output data.

2. Quality Metric: the quality metric formula;

The quality metric formula is used to quantify the difference between the erroneous output result and the standard output data. The quality metric formula differs depending on whether the program code outputs a single numerical value or a matrix.

In one example, if the output of the program code is a single numerical value, the quality metric formula includes:

$$\text{rel-diff}_i = \frac{|G_i - C_i|}{|G_i|} \qquad \text{(formula one)}$$

Wherein G_i denotes the standard output result, C_i denotes the erroneous output result, rel-diff_i denotes the error between the erroneous output result and the standard output result, and |*| denotes taking the absolute value.

In another example, if the output result of the program code is a matrix, the quality metric formula includes:

$$\text{rel-l2-norm} = \frac{\lVert G - C \rVert}{\lVert G \rVert} \qquad \text{(formula two)}$$

Wherein G denotes the standard output result, C denotes the erroneous output result, and rel-l2-norm denotes the error between the erroneous output result and the standard output result, also called the relative L2 error value.

The operation performed by formula two is as follows: the matrices G and C are subtracted to obtain a matrix, the L2 norm of this matrix is calculated (the L2 norm is a single numerical value), and the result is divided by the L2 norm of G. ||*|| denotes calculating the L2 norm.

The L2 norm is calculated as the square root of the sum of the squares of all elements of the matrix. That is, for any matrix A containing N elements, with a_i denoting the i-th element of A, its L2 norm can be expressed as follows:

$$\lVert A \rVert = \sqrt{\sum_{i=1}^{N} a_i^{2}}$$

3. Error Mode: the soft error mode;

The soft error mode includes an error type and an error position.

4. QoS threshold: the user quality requirement.

The user quality requirement includes the maximum error value that the user can tolerate. The maximum error value may also be an empirical value that is not provided by the user.

In one example, step S0 may be executed by the receiving unit 1 described above.
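The two quality metrics among the parameters of step S0 can be sketched in plain Python. The scalar metric is written here as |G − C| / |G|, which is an assumption based on the name rel-diff and the absolute-value notation in the text; the matrix metric follows the description of formula two. The function names are hypothetical, and both functions assume a non-zero standard output.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def rel_diff(golden: float, faulty: float) -> float:
    """Relative difference for a single-value output (assumes golden != 0)."""
    return abs(golden - faulty) / abs(golden)


def l2_norm(matrix: Matrix) -> float:
    """Square root of the sum of the squares of all matrix elements."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def rel_l2_norm(golden: Matrix, faulty: Matrix) -> float:
    """Relative L2 error: ||G - C|| / ||G|| (assumes ||G|| != 0)."""
    diff = [
        [g - c for g, c in zip(g_row, c_row)]
        for g_row, c_row in zip(golden, faulty)
    ]
    return l2_norm(diff) / l2_norm(golden)


G = [[3.0, 4.0], [0.0, 0.0]]
C = [[3.0, 4.0], [0.0, 0.5]]
print(rel_diff(20.0, 19.0))   # relative error of a scalar output
print(rel_l2_norm(G, C))      # relative L2 error of a matrix output
```

Both functions return zero when the erroneous output equals the standard output, which is the condition later used to recognise a Masked result.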

S1: multiple soft error simulations are performed on the program code according to the soft error mode.

In one example, step S1 may be executed by the simulation unit 2 described above.

Specifically, each soft error simulation may include: injecting a soft error into the program code according to the soft error mode, and running the program code into which the soft error has been injected.

It should be noted that, as mentioned above, a GPGPU can run multiple threads in parallel, and every thread runs the same program code. In each soft error simulation, one thread can therefore be selected at random and the soft error injected into the program code that it executes.

As mentioned above, the soft error mode includes an error type and an error position.

The error type indicates the number of bits that are flipped, for example a single-bit flip or an N-bit flip (N greater than 1).

The error position may indicate at least one of: the function in which the error occurs, the instruction type in which the error occurs, and the data bit in which the error occurs.

Taking 32-bit data as an example, the data bit in which the error occurs may indicate that the error occurs in the high-order bits or the low-order bits or, more precisely, that the error occurs in the x-th bit.

Further, the instruction types include addition, multiplication, division, shift, store, load (value fetch), and so on.

It should be noted that, taking a single-bit flip as an example and assuming that the error position indicates only the addition instruction, the program code may contain thousands of addition instructions; in each soft error simulation, a single-bit-flip error can be injected into one randomly selected addition instruction.

More specifically, if the addition instruction assigns a + b to c, the error can be injected by applying a single-bit flip to the value of c; for a load instruction, the error can be injected by applying a single-bit flip to the value that it loads.

Error injection for the other instruction types is similar and is not repeated here.
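A single-bit flip in the result of an addition can be sketched as below. The sketch assumes that c is a 32-bit IEEE 754 floating-point value and uses Python's `struct` module to reach its bit pattern; the helper name `flip_float32_bit` is hypothetical, and a real injector would modify the destination register of the instruction on the GPU rather than a Python variable.

```python
import struct


def flip_float32_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = lowest, 31 = sign) in the float32 encoding of `value`."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


a, b = 1.5, 2.0
c = a + b                            # fault-free result of the addition

c_low = flip_float32_bit(c, 0)       # flip in the lowest mantissa bit
c_high = flip_float32_bit(c, 30)     # flip in the highest exponent bit

print(c, c_low, c_high)
```

A flip in a low-order bit changes the result only slightly, while a flip in a high-order bit changes it by many orders of magnitude, which is why the data bit of the flip is treated as a separate analysis dimension.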

Still taking a single-bit flip as an example, if the error position indicates only the function in which the error occurs, then in each soft error simulation a single-bit-flip error can be injected into one randomly selected instruction of that function.

Still taking a single-bit flip as an example, if the error position indicates both the function in which the error occurs and the instruction type (for example, addition), then in each soft error simulation a single-bit-flip error can be injected into one randomly selected addition instruction of that function.

Still taking a single-bit flip as an example, if the error position indicates only that the error occurs in the high-order bits, then in each soft error simulation a single-bit-flip error in the high-order bits can be injected into one randomly selected instruction of the program code.

The remaining cases follow by analogy and are not repeated here.
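The selection rules above can be summarised in one sketch: filter the instructions by whatever the error position specifies, pick one candidate at random, then pick the bits to flip. All names here (`Instruction`, `ErrorMode`, `pick_injection_site`) and the representation of a program as a flat list of instructions are hypothetical simplifications, not the data structures of the described system.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Instruction:
    kernel: str   # kernel function containing the instruction
    kind: str     # instruction type, e.g. "add", "mul", "load"


@dataclass(frozen=True)
class ErrorMode:
    flipped_bits: int = 1                  # error type: number of bits flipped
    kernel: Optional[str] = None           # error position: function, if given
    kind: Optional[str] = None             # error position: instruction type, if given
    bit_range: Tuple[int, int] = (0, 31)   # error position: allowed data bits


def pick_injection_site(
    instructions: Sequence[Instruction], mode: ErrorMode, rng: random.Random
) -> Tuple[Instruction, List[int]]:
    """Randomly choose one matching instruction and the bit positions to flip."""
    candidates = [
        ins for ins in instructions
        if (mode.kernel is None or ins.kernel == mode.kernel)
        and (mode.kind is None or ins.kind == mode.kind)
    ]
    if not candidates:
        raise ValueError("no instruction matches the error position")
    target = rng.choice(candidates)
    low, high = mode.bit_range
    bits = sorted(rng.sample(range(low, high + 1), mode.flipped_bits))
    return target, bits


program = [
    Instruction("k1", "add"), Instruction("k1", "mul"),
    Instruction("k2", "add"), Instruction("k2", "load"),
]
# Single-bit flip in a high-order bit of an addition inside kernel k2.
mode = ErrorMode(flipped_bits=1, kernel="k2", kind="add", bit_range=(24, 31))
site, bits = pick_injection_site(program, mode, random.Random(0))
print(site, bits)
```

Leaving a field of the error mode unset widens the candidate set, which corresponds to the cases in which the error position specifies only a function, only an instruction type, or only a bit range.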

S2: the error severity type corresponding to the output result of each soft error simulation is determined according to the quality metric formula.

Specifically, step S2 may be executed by the analysis unit 3 described above.

Step S2 classifies the erroneous output results produced after soft errors occur.

In one example, the output result of any one soft error simulation (which may be called the target erroneous output result) may be classified as follows:

Step a: the error between the target erroneous output result and the standard output result is calculated according to the quality metric formula.

For the quality metric formula, refer to the description above; it is not repeated here.

It should be noted that, as mentioned above, existing reliability analysis methods use the diff command to check whether the erroneous output differs from the true result. However, the diff command can only report that a difference exists; it does not quantify the difference.

In the embodiments of the present invention, by contrast, the quality metric formula is used to quantify the difference between the erroneous output and the true result, yielding a quantized value (that is, the error).

On the basis of this quantized value, the erroneous outputs of the SDC class can subsequently be classified more finely.

Step b:

If the error is zero, the error severity type of the target erroneous output result is determined to be the no-difference type (denoted Masked);

If the error is greater than zero and not greater than the maximum error value, the error severity type of the target erroneous output result is determined to be the difference-acceptable type (denoted acceptable or SDCs-acceptable);

If the error is greater than the maximum error value, the error severity type of the target erroneous output result is determined to be the difference-unacceptable type (denoted unacceptable or SDCs-unacceptable); the difference-acceptable type and the difference-unacceptable type both belong to the difference occurrence type (denoted SDC);

If the program code terminates execution, the error severity type of the target erroneous output result is determined to be the crash type (denoted Detected).

In the above classification, depending on whether the error (difference) between the erroneous output result and the standard output result exceeds the QoS (user quality requirement), erroneous outputs of the difference-occurred type (SDC) are further divided into the difference-acceptable (SDCs-acceptable) type and the difference-unacceptable (SDCs-unacceptable) type. This reflects the approximate nature of programs, which can tolerate errors within a certain range; the classification performed in the embodiments of the present invention is therefore an approximate classification.
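The classification in Step b can be sketched in Python as follows. This is a minimal illustration, not the implementation of the analysis unit: the names Severity and classify are hypothetical, a crashed run is represented by passing None as the error, and an error exactly equal to the maximum error is assumed to count as acceptable.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    MASKED = "Masked"
    SDC_ACCEPTABLE = "SDCs-acceptable"
    SDC_UNACCEPTABLE = "SDCs-unacceptable"
    DETECTED = "Detected"


def classify(error: Optional[float], max_error: float) -> Severity:
    """Map a quantized error to an error severity type.

    `error` is None when the program terminated and produced no output
    (an assumption made for this sketch).
    """
    if error is None:
        return Severity.DETECTED
    if error == 0:
        return Severity.MASKED
    if error <= max_error:
        # Assumption: the boundary value itself is still acceptable.
        return Severity.SDC_ACCEPTABLE
    return Severity.SDC_UNACCEPTABLE
```

For example, with a maximum error of 0.05, `classify(0.02, 0.05)` yields `Severity.SDC_ACCEPTABLE` and `classify(0.2, 0.05)` yields `Severity.SDC_UNACCEPTABLE`.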

S3: collect the error information related to each soft error simulation.

Specifically, step S3 can be executed by the aforementioned analysis unit 3.

The error information includes kname (the name of the kernel function in which the soft error occurred), instruction-type (the type of the instruction in which the soft error occurred), and data bit-location (the data bit in which the bit flip occurred).

Based on this information, the analysis framework can subsequently perform error analysis at the program level, kernel function level, instruction type level, and data bit level.
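One possible way to hold the collected error information is sketched below in Python. The record layout, the function count_by, and the sample kernel names, instruction types, and bit positions are all hypothetical; only the three collected fields (kname, instruction type, data bit location) come from the description.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionRecord:
    kname: str             # kernel function in which the soft error occurred
    instruction_type: str  # type of the instruction that was hit
    bit_location: int      # data bit in which the bit flip occurred
    severity: str          # "Masked", "SDCs-acceptable", "SDCs-unacceptable", "Detected"


def count_by(records, key):
    """Count severity outcomes per group; `key` selects the analysis dimension."""
    counts = {}
    for record in records:
        counts.setdefault(key(record), Counter())[record.severity] += 1
    return counts


# Made-up sample data, for illustration only.
records = [
    InjectionRecord("kernel_A", "add", 3, "Masked"),
    InjectionRecord("kernel_A", "mul", 30, "SDCs-unacceptable"),
    InjectionRecord("kernel_B", "add", 12, "SDCs-acceptable"),
    InjectionRecord("kernel_B", "add", 31, "Detected"),
]
by_kernel = count_by(records, lambda r: r.kname)
by_instruction = count_by(records, lambda r: r.instruction_type)
by_bit = count_by(records, lambda r: r.bit_location)
```

Keeping one record per simulation lets the same data be regrouped along any of the dimensions without rerunning the simulations.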

S4: perform reliability approximate analysis according to the error severity types.

Specifically, step S4 can be executed by the aforementioned analysis unit 3.

Specifically, analysis unit 3 can perform multi-dimensional error approximate analysis according to the error severity type corresponding to each soft error simulation and the collected error information.

Further, the multi-dimensional error approximate analysis may include one or a combination of the following: program-level error approximate analysis, kernel-function-level error approximate analysis, instruction-type-level error approximate analysis, and data-bit-level error approximate analysis.

The various error approximate analyses are introduced separately below.

I. Program-level error approximate analysis

Program-level error approximate analysis focuses on the sensitivity of the program code to soft errors.

Specifically, the sensitivity of the program code to soft errors can be analyzed as follows:

Step a1: over all soft error simulations, count the number of no-difference outcomes N_Masked, the number of difference-acceptable outcomes N_acceptable, the number of difference-unacceptable outcomes N_unacceptable, and the number of crash outcomes N_Detected;

Step b1: calculate the proportion of the difference-acceptable type within the difference-occurred type according to the formula SDCs-Acceptance-Ratio = N_acceptable / (N_acceptable + N_unacceptable) × 100%.

Here, "SDCs-Acceptance-Ratio" denotes the proportion of difference-acceptable outcomes among difference-occurred outcomes.

Step c1: calculate the proportion of soft errors with approximately no impact on the program code according to the formula Approximate-Acceptance-Proportion = (N_acceptable + N_Masked) / (N_Masked + N_acceptable + N_unacceptable + N_Detected) × 100%.

Here, "Approximate-Acceptance-Proportion" denotes the approximate no-impact proportion.

For example, suppose that 1000 error injections are performed on the program code and that, among the resulting erroneous outputs, 200 are Masked, 300 are Detected, and 500 are SDCs. Applying the QoS threshold, these 500 SDCs are divided into 400 SDCs-acceptable and 100 SDCs-unacceptable.

That is, N_Masked = 200, N_Detected = 300, N_acceptable = 400, N_unacceptable = 100.

Then SDCs-Acceptance-Ratio = 400/500 × 100% = 80%.

Approximate-Acceptance-Proportion = (400 + 200)/1000 × 100% = 60%.
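The two ratios and the worked example above can be reproduced with a short Python sketch. The function names are hypothetical, and returning None when a denominator is zero is an assumption of this sketch; the description does not address that case.

```python
def sdcs_acceptance_ratio(n_acceptable, n_unacceptable):
    """Percentage of SDC outcomes that stay within the user's maximum error."""
    sdc_total = n_acceptable + n_unacceptable
    if sdc_total == 0:
        return None  # assumption: undefined when no SDC occurred
    return 100.0 * n_acceptable / sdc_total


def approximate_acceptance_proportion(n_masked, n_acceptable, n_unacceptable, n_detected):
    """Percentage of all injections with approximately no impact."""
    total = n_masked + n_acceptable + n_unacceptable + n_detected
    if total == 0:
        return None  # assumption: undefined when nothing was injected
    return 100.0 * (n_acceptable + n_masked) / total


# Counts from the worked example: 1000 injections in total.
ratio = sdcs_acceptance_ratio(400, 100)
proportion = approximate_acceptance_proportion(200, 400, 100, 300)
print(ratio, proportion)
```

With the counts from the example, the two values are 80% and 60% respectively.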

II. Kernel-function-level error approximate analysis

Kernel-function-level error approximate analysis focuses on the sensitivity of the same program code to different kernel functions (that is, to soft errors occurring in them); the sensitivity of the program code to a specific kernel function can of course also be analyzed.

For example, the sensitivity of the same program code to kernel function A and to kernel function B can be analyzed.

The sensitivity of the program code to kernel function A and its sensitivity to kernel function B can first be calculated separately and then compared.

Of the two, kernel-function-level error approximate analysis places more emphasis on comparing the sensitivity of the same program code to different kernel functions.

Specifically, the sensitivity of the program code to a specific kernel function can be analyzed as follows:

Step a2: over the soft error simulations in which errors are injected into instructions of any type within a specific kernel function (for example, kernel function A) of the program code, count the number of no-difference outcomes N_Masked, the number of difference-acceptable outcomes N_acceptable, the number of difference-unacceptable outcomes N_unacceptable, and the number of crash outcomes N_Detected;

Step b2: calculate the proportion of the difference-acceptable type within the difference-occurred type according to the formula SDCs-Acceptance-Ratio = N_acceptable / (N_acceptable + N_unacceptable) × 100%.

Step c2: calculate the proportion of soft errors in the specific kernel function with approximately no impact on the program code according to the formula Approximate-Acceptance-Proportion = (N_acceptable + N_Masked) / (N_Masked + N_acceptable + N_unacceptable + N_Detected) × 100%.

Suppose that 1000 error injections are performed on kernel function A in the program code and that, among the resulting erroneous outputs, 200 are Masked, 300 are Detected, and 500 are SDCs. Applying the QoS threshold, these 500 SDCs are divided into 400 SDCs-acceptable and 100 SDCs-unacceptable.

That is, N_Masked = 200, N_Detected = 300, N_acceptable = 400, N_unacceptable = 100.

Then SDCs-Acceptance-Ratio = 400/500 × 100% = 80%.

Approximate-Acceptance-Proportion = (400 + 200)/1000 × 100% = 60%.

Note that the above concerns sensitivity to soft errors in general; that is, errors may be injected into instructions of any type.
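Comparing kernel functions amounts to computing the same two ratios once per kernel. The Python sketch below does this for a list of (kernel name, severity) pairs; the function name per_group_metrics, the kernel names, and the sample counts are made up for illustration.

```python
from collections import Counter, defaultdict


def per_group_metrics(outcomes):
    """Return {group: (SDCs-Acceptance-Ratio, Approximate-Acceptance-Proportion)}.

    `outcomes` is an iterable of (group, severity) pairs, where severity is one
    of "Masked", "SDCs-acceptable", "SDCs-unacceptable", "Detected".
    """
    counts = defaultdict(Counter)
    for group, severity in outcomes:
        counts[group][severity] += 1

    metrics = {}
    for group, c in counts.items():
        sdc_total = c["SDCs-acceptable"] + c["SDCs-unacceptable"]
        total = sum(c.values())
        # Assumption: the ratio is undefined (None) if the group has no SDCs.
        ratio = 100.0 * c["SDCs-acceptable"] / sdc_total if sdc_total else None
        proportion = 100.0 * (c["SDCs-acceptable"] + c["Masked"]) / total
        metrics[group] = (ratio, proportion)
    return metrics


# Made-up sample outcomes for two hypothetical kernel functions.
outcomes = (
    [("kernel_A", "Masked")] * 2
    + [("kernel_A", "SDCs-acceptable")] * 4
    + [("kernel_A", "SDCs-unacceptable")] * 1
    + [("kernel_A", "Detected")] * 3
    + [("kernel_B", "Masked")] * 1
    + [("kernel_B", "SDCs-acceptable")] * 1
    + [("kernel_B", "SDCs-unacceptable")] * 3
)
metrics = per_group_metrics(outcomes)
```

In this made-up data, kernel_B has the lower values for both ratios, which would mark it as the more sensitive of the two kernels. The same function can be reused for other dimensions by changing what is passed as the group, for example an instruction type or a (kernel, instruction type) pair.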

III. Instruction-type-level error approximate analysis

Instruction-type-level error approximate analysis focuses on the following analyses:

1) analyzing the sensitivity of the same program code to different instruction types;

For example, the sensitivity of the same program code to addition instructions and to multiplication instructions can be analyzed.

The sensitivity of the program code to addition instructions and its sensitivity to multiplication instructions can first be calculated separately and then compared.

Specifically, the sensitivity of the program code to a specific instruction type (for example, addition instructions) can be calculated as follows:

Step a3: over the soft error simulations in which errors are injected into instructions of the specific instruction type in the program code, count the number of no-difference outcomes N_Masked, the number of difference-acceptable outcomes N_acceptable, the number of difference-unacceptable outcomes N_unacceptable, and the number of crash outcomes N_Detected;

Step b3: calculate the proportion of the difference-acceptable type within the difference-occurred type according to the formula SDCs-Acceptance-Ratio = N_acceptable / (N_acceptable + N_unacceptable) × 100%.

Step c3: calculate the proportion of soft errors with approximately no impact on the program code according to the formula Approximate-Acceptance-Proportion = (N_acceptable + N_Masked) / (N_Masked + N_acceptable + N_unacceptable + N_Detected) × 100%.

Taking addition as the specific instruction type as an example:

Suppose that 1000 error injections are performed on the addition instructions in the program code and that, among the resulting erroneous outputs, 200 are Masked, 300 are Detected, and 500 are SDCs. Applying the QoS threshold, these 500 SDCs are divided into 400 SDCs-acceptable and 100 SDCs-unacceptable.

That is, N_Masked = 200, N_Detected = 300, N_acceptable = 400, N_unacceptable = 100.

Then SDCs-Acceptance-Ratio = 400/500 × 100% = 80%.

Approximate-Acceptance-Proportion = (400 + 200)/1000 × 100% = 60%.

2) analyzing the sensitivity of the same instruction type in different program codes, or in other words, the sensitivity of different program codes to the same instruction type.

Specifically, the sensitivity of each program code to the specific instruction type can be calculated separately and then compared.

3) analyzing the sensitivity of the same instruction type in different kernel functions of the same program code;

Taking addition instructions as an example, the sensitivity to addition instructions of each of the different kernel functions in the same program code can be calculated separately and then compared.

IV. Data-bit-level error approximate analysis

Data-bit-level error approximate analysis focuses on the following analyses:

1) analyzing the sensitivity of the same program code to errors occurring in different data bits;

For example, the sensitivity of the same program code to high-order bits and to low-order bits can be analyzed.

The sensitivity of the program code to high-order bits and its sensitivity to low-order bits can first be calculated separately and then compared.

2) analyzing the sensitivity of the same program code to different error types;

Here, the error type refers to how many bits are flipped. For example, error types may include single-bit flips, multi-bit flips, and so on.

Taking the analysis of the sensitivity of the same program code to single-bit flips and to multi-bit (for example, double-bit) flips as an example, the sensitivity of the program code to single-bit flips and its sensitivity to multi-bit flips can first be calculated separately and then compared.

Specifically, taking single-bit flips as an example, the sensitivity of the program code to an error type can be analyzed as follows:
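To make the notion of error type concrete, the following Python sketch flips one or more bits of a value stored as a 32-bit IEEE 754 float. It is only a host-side illustration under that storage assumption; the function name flip_bits is hypothetical, and the sketch does not represent how the injection is performed on the GPGPU.

```python
import struct


def flip_bits(value: float, bit_positions) -> float:
    """Flip the given bits (0 = least significant) of a float32 value."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    for pos in bit_positions:
        if not 0 <= pos < 32:
            raise ValueError("bit position must be in the range 0..31")
        raw ^= 1 << pos
    # Reinterpret the modified bit pattern as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw))
    return flipped


single = flip_bits(1.0, [0])      # single-bit flip in the lowest mantissa bit
multi = flip_bits(1.0, [0, 31])   # double-bit flip: lowest mantissa bit and sign bit
```

A single-bit flip passes one position and a multi-bit flip passes several, so the number of positions corresponds to the error type described above.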

Step a4: over the soft error simulations in which single-bit-flip errors are injected into the instructions of the program code, count the number of no-difference outcomes N_Masked, the number of difference-acceptable outcomes N_acceptable, the number of difference-unacceptable outcomes N_unacceptable, and the number of crash outcomes N_Detected;

Step b4: calculate the proportion of the difference-acceptable type within the difference-occurred type according to the formula SDCs-Acceptance-Ratio = N_acceptable / (N_acceptable + N_unacceptable) × 100%.

Step c4: calculate the proportion of single-bit flips with approximately no impact on the program code according to the formula Approximate-Acceptance-Proportion = (N_acceptable + N_Masked) / (N_Masked + N_acceptable + N_unacceptable + N_Detected) × 100%.

Suppose that 1000 single-bit error injections are performed on the instructions in the program code and that, among the resulting erroneous outputs, 200 are Masked, 300 are Detected, and 500 are SDCs. Applying the QoS threshold, these 500 SDCs are divided into 400 SDCs-acceptable and 100 SDCs-unacceptable.

That is, N_Masked = 200, N_Detected = 300, N_acceptable = 400, N_unacceptable = 100.

Then SDCs-Acceptance-Ratio = 400/500 × 100% = 80%.

Approximate-Acceptance-Proportion = (400 + 200)/1000 × 100% = 60%.

3) comparing the sensitivity of different program codes to errors occurring in different data bits.

Suppose there are program code a and program code b. The sensitivity of program code a to high-order bits and to low-order bits, and the sensitivity of program code b to high-order bits and to low-order bits, can first be calculated.

By comparing the sensitivities, it can be found that the higher the data bit in which an error occurs, the more likely it is to cause serious harm.
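The intuition behind this observation can be illustrated with simple arithmetic on an unsigned integer: flipping bit k changes the value by 2^k, so the relative error grows with the bit position. The Python sketch below is only such an illustration, with hypothetical function names, an arbitrary value of 1000, and an arbitrary maximum error of 0.05; it is not a result of the analysis described here.

```python
def relative_error_after_flip(value: int, bit: int) -> float:
    """Relative error caused by flipping one bit of a non-negative integer."""
    if value == 0:
        raise ValueError("relative error is undefined for a zero reference value")
    flipped = value ^ (1 << bit)
    return abs(flipped - value) / abs(value)


def acceptable_bits(value: int, max_error: float, width: int = 32):
    """Bit positions whose single-bit flip stays within max_error."""
    return [
        bit
        for bit in range(width)
        if relative_error_after_flip(value, bit) <= max_error
    ]


# Arbitrary example: value 1000 with a 5% maximum relative error.
tolerated = acceptable_bits(1000, 0.05)
```

For this example only the six lowest bits (0 through 5) stay within the maximum error; a flip in any higher bit exceeds it.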

As can be seen, based on the "approximate characteristic", the embodiments of the present invention perform multiple soft error simulations and classify the erroneous output results produced after soft errors occur. In the classification, depending on whether the error (difference) between the erroneous output result and the standard output result exceeds the QoS (user quality requirement), erroneous outputs of the difference-occurred type (SDC) are further divided into the difference-acceptable (SDCs-acceptable) type and the difference-unacceptable (SDCs-unacceptable) type. This reflects the approximate nature of programs, which can tolerate errors within a certain range; the classification performed in the embodiments of the present invention is therefore an approximate classification. Reliability analysis performed on the basis of this approximate classification is "reliability approximate analysis". Reliability approximate analysis helps identify the truly serious errors, and designing protection strategies on this basis can reduce unnecessary protection and overhead.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A GPGPU program approximate analysis system based on soft error perception, characterized by comprising:

a receiving unit, for receiving program code, a quality metric formula, a soft error mode, and a user quality requirement; the user quality requirement includes the maximum error tolerable by the user;

an analysis unit, for determining, according to the quality metric formula, the error severity type corresponding to the output result of each soft error simulation, and for performing reliability approximate analysis according to the error severity type;

wherein determining, according to the quality metric formula, the error severity type corresponding to the output result of each soft error simulation includes:

calculating the error between a target erroneous output result and a standard output result according to the quality metric formula; the target erroneous output result denotes the output result of any single soft error simulation;

if the error is zero, determining that the error severity type of the target erroneous output result is the no-difference type;

if the error is greater than zero and not greater than the maximum error, determining that the error severity type of the target erroneous output result is the difference-acceptable type;

if the error is greater than the maximum error, determining that the error severity type of the target erroneous output result is the difference-unacceptable type; wherein the difference-acceptable type and the difference-unacceptable type both belong to the difference-occurred type;

if the program code terminates execution, determining that the error severity type of the target erroneous output result is the crash type.

2. The system as claimed in claim 1, characterized in that each soft error simulation includes: performing soft error injection on the program code according to the soft error mode, and running the program code into which the soft error has been injected;

the soft error mode includes an error type and an error location; wherein the error location indicates at least one of the function in which the error occurs, the instruction type in which the error occurs, and the data bit in which the error occurs; the error type indicates the number of bits that are flipped.

3. The system as claimed in claim 2, characterized in that

the analysis unit is further used for: collecting the error information related to each soft error simulation, the error information including the name of the kernel function in which the soft error occurred, the type of the instruction in which the soft error occurred, and the data bit in which the bit flip occurred;

in terms of performing reliability approximate analysis according to the error severity type, the analysis unit is specifically used for:

performing multi-dimensional error approximate analysis according to the error severity type corresponding to each soft error simulation and the collected error information;

the multi-dimensional error approximate analysis includes one or a combination of: program-level error approximate analysis, kernel-function-level error approximate analysis, instruction-type-level error approximate analysis, and data-bit-level error approximate analysis.

4. The system as claimed in claim 3, characterized in that

in terms of performing program-level error approximate analysis, the analysis unit is specifically used for:

analyzing the sensitivity of the program code to soft errors;

in terms of performing kernel-function-level error approximate analysis, the analysis unit is specifically used for:

at least one of analyzing the sensitivity of the program code to a specific kernel function, and analyzing the sensitivity of the same program code to different kernel functions;

in terms of performing instruction-type-level error approximate analysis, the analysis unit is specifically used for:

at least one of analyzing the sensitivity of the same program code to different instruction types, analyzing the sensitivity of the same instruction type in different program codes, and analyzing the sensitivity of the same instruction type in different kernel functions of the same program code;

in terms of performing data-bit-level error approximate analysis, the analysis unit is specifically used for:

at least one of analyzing the sensitivity of the same program code to errors occurring in different data bits, analyzing the sensitivity of the same program code to different error types, and comparing the sensitivity of different program codes to errors occurring in different data bits.

5. The system as claimed in claim 4, characterized in that

in terms of analyzing the sensitivity of the program code to a specific instruction type, the analysis unit is specifically used for:

over the soft error simulations in which errors are injected into instructions of the specific instruction type in the program code, counting the number of no-difference outcomes N_Masked, the number of difference-acceptable outcomes N_acceptable, the number of difference-unacceptable outcomes N_unacceptable, and the number of crash outcomes N_Detected, wherein N denotes a quantity and the subscripts Masked, acceptable, unacceptable, and Detected denote the specific error severity types;

calculating the proportion of the difference-acceptable type within the difference-occurred type according to the formula N_acceptable / (N_acceptable + N_unacceptable);

calculating the proportion of soft errors with approximately no impact on the program code according to the formula (N_acceptable + N_Masked) / (N_Masked + N_acceptable + N_unacceptable + N_Detected).

6. The system as claimed in any one of claims 1-5, characterized in that, if the standard output result is a single numerical value, the quality metric formula includes:

wherein G_i denotes the standard output result, C_i denotes the erroneous output result, and rel-diff_i denotes the error between the erroneous output result and the standard output result.

7. The system as claimed in any one of claims 1-5, characterized in that, if the standard output result is a matrix, the quality metric formula includes:

wherein G denotes the standard output result, C denotes the erroneous output result, and rel-l2-norm denotes the error between the erroneous output result and the standard output result.

8. A GPGPU program approximate analysis method based on soft error perception, characterized by comprising:

receiving program code, a quality metric formula, a soft error mode, and a quality requirement; the quality requirement includes a tolerable maximum error;

wherein the determining, according to the quality metric formula, of the error severity type corresponding to each soft error simulation includes:

calculating the error between a target erroneous output result and a standard output result according to the quality metric formula; the target erroneous output result denotes the output result corresponding to any single soft error simulation;

if the error is zero, determining that the error severity type of the target soft error simulation is the no-difference type; the target soft error simulation denotes any single soft error simulation;

if the error is greater than zero and less than the maximum error, determining that the error severity type of the target soft error simulation is the difference-acceptable type;

if the error is greater than the maximum error, determining that the error severity type of the target soft error simulation is the difference-unacceptable type; wherein the difference-acceptable type and the difference-unacceptable type both belong to the difference-occurred type;

if the program code terminates execution, determining that the error severity type of the target soft error simulation is the crash type.

9. The method as claimed in claim 8, characterized by further comprising:

collecting the error information related to each soft error simulation, the error information including the name of the kernel function in which the soft error occurred, the type of the instruction in which the soft error occurred, and the data bit in which the bit flip occurred;

the performing of reliability approximate analysis according to the error severity type includes:

10. The method as claimed in claim 9, characterized in that

the program-level error approximate analysis includes: analyzing the sensitivity of the program code to soft errors;

the kernel-function-level error approximate analysis includes at least one of: analyzing the sensitivity of the program code to a specific kernel function, and analyzing the sensitivity of the same program code to different kernel functions;

the instruction-type-level error approximate analysis includes at least one of: analyzing the sensitivity of the same program code to different instruction types, analyzing the sensitivity of the same instruction type in different program codes, and analyzing the sensitivity of the same instruction type in different kernel functions of the same program code;

the data-bit-level error approximate analysis includes at least one of: analyzing the sensitivity of the same program code to errors occurring in different data bits, analyzing the sensitivity of the same program code to different error types, and comparing the sensitivity of different program codes to errors occurring in different data bits.