CN113610154A - GPGPU program SDC error detection method and device - Google Patents

Publication number
CN113610154A
Authority
CN
China
Prior art keywords
sdc
instruction
program
detected
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110903201.2A
Other languages
Chinese (zh)
Other versions
CN113610154B (en)
Inventor
魏晓辉
姜楠
谭婧炜佳
李翔
王晓楠
岳恒山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN202110903201.2A
Publication of CN113610154A
Application granted
Publication of CN113610154B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a method and a device for detecting SDC errors in a GPGPU program. The method comprises: acquiring a program to be detected, and determining the SDC fragile instructions, i.e. those with a high SDC tendency, among the instructions to be detected of the program; the program to be detected is a GPGPU program, and the SDC tendency of an instruction is positively correlated with the probability that the instruction causes an SDC error; constructing the instruction paths of each basic block according to the dependency relationships among the SDC fragile instructions in that basic block; the instruction paths comprise a first class, containing a plurality of SDC fragile instructions that have a dependency relationship, and/or a second class, containing a single SDC fragile instruction that has no dependency relationship with any other SDC fragile instruction; and copying each instruction path to obtain a copy path, and detecting SDC errors in the program to be detected based on the instruction path and the copy path. By copying only a small number of instructions, the method and the device detect a large number of SDC errors while ensuring program reliability, which improves error detection efficiency.

Description

GPGPU program SDC error detection method and device
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for detecting SDC errors of a GPGPU program.
Background
With the continuous development of highly integrated circuit technology, nanometer-scale circuits are easily struck by high-energy particles from space, which causes bits to flip. Errors of this kind, which corrupt data without damaging the circuit, are called soft errors. During program execution a soft error propagates inside a thread through the data the thread uses and the instructions it executes, and may ultimately affect the result. A soft error can have one of three effects on the program execution result: a masked (MASKED) error, where the soft error is masked and the application result is not affected; a detected unrecoverable error (DUE), where the application crashes, hangs, or exits in an abnormal state; and a silent data corruption (SDC) error, where no abnormal information appears during execution but the final program output differs from the correct output. In recent years the volume of application data in many industries has grown rapidly, far faster than the computing and storage resources of hardware platforms, and the computing capacity and energy efficiency provided by existing data processing technology can hardly meet application requirements. In this context, general-purpose graphics processing units (GPGPUs) have emerged. The GPGPU is a new type of computing platform that is increasingly used for high-performance computing because it supports highly concurrent threads. Unlike the traditional GPU platform, which is only used for computing image data, the GPGPU is increasingly used in fields such as numerical simulation, data mining, and artificial intelligence.
When a conventional GPU that processes image data suffers an error, only some pixels of the image are damaged and the user's requirements are not affected. A GPGPU running high-performance programs, however, places reliability requirements on the program, so SDC errors need to be detected to eliminate their influence on the program.
At present, SDC errors generated during program execution are mainly detected through full instruction copying. Full instruction copying, however, must create a copy instruction for every instruction and adds extra comparison instructions and register overhead; the extra registers can reduce program parallelism, and the added instructions can greatly increase program execution time. In addition, full instruction copying detects all transient errors in the program, yet some transient errors never cause an SDC error, so its detection efficiency is low. Therefore, how to improve the efficiency of detecting SDC errors is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for detecting SDC errors in a GPGPU program, which can detect a large number of SDC errors by copying a small number of instructions while ensuring program reliability, thereby improving error detection efficiency. The specific scheme is as follows:
a first aspect of the present application provides a method for detecting SDC errors in a GPGPU program, including:
acquiring a program to be detected, and determining SDC fragile instructions with high SDC tendency in the instructions to be detected of the program to be detected; the program to be detected is a GPGPU program, and the SDC tendency and the probability of SDC error of the instruction to be detected are in positive correlation;
constructing an instruction path corresponding to each basic block according to the dependency relationship among the SDC fragile instructions in each basic block of the program to be detected; wherein the instruction paths are of a first class comprising a plurality of the SDC fragile instructions having dependencies and/or of a second class comprising a single SDC fragile instruction having no dependencies with any of the SDC fragile instructions;
and copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path.
Optionally, the acquiring the program to be detected, and determining an SDC fragile instruction with a high SDC tendency in the instructions to be detected of the program to be detected includes:
acquiring a program to be detected, and determining characteristic information of an instruction to be detected of the program to be detected; the characteristic information represents the SDC tendency of the instruction to be detected;
inputting the instruction to be detected and the characteristic information thereof into a trained SDC vulnerability prediction model so that the SDC vulnerability prediction model can output SDC vulnerability instructions with high SDC tendency in the instruction to be detected; the SDC vulnerability prediction model is obtained by training a blank model constructed based on a machine learning algorithm by using a training set, wherein the training set comprises a sample instruction and a corresponding sample label, and the sample label is determined and obtained based on the characteristic information of the sample instruction and an error injection result obtained after the error injection operation is carried out on the sample instruction.
Optionally, before the instruction to be detected and the feature information thereof are input into the trained SDC vulnerability prediction model, the method further includes:
acquiring a sample instruction, and determining characteristic information of the sample instruction;
performing error injection operation on the sample instruction for preset times, judging whether the number of error injection results with SDC errors is larger than a preset threshold value, if so, judging whether the characteristic information meets SDC tendency conditions, and if so, labeling the sample instruction by using a sample label representing that the sample instruction is an SDC fragile instruction to obtain the training set;
and training a blank model constructed based on a deep learning algorithm by using the training set to obtain a trained SDC vulnerability prediction model.
Optionally, the feature information includes instruction attribute information, error propagation information, and shared memory information;
the instruction attribute information is a feature vector representing instruction types and instruction functions, the error propagation information is a feature vector of the total instruction number, the shielding error instruction number and the program crash instruction number in the error propagation process, and the shared memory information is a feature vector of shared loading information and shared storage information.
Optionally, the copying the instruction path to obtain a corresponding copy path includes:
and storing the value of each SDC fragile instruction in the instruction path in the original register into a new register to obtain a corresponding copy path.
Optionally, the detecting the SDC error in the program to be detected based on the instruction path and the copy path includes:
comparing the instruction operation result in the copy path with the instruction operation result in the instruction path by inserting a comparison instruction at the end of the copy path, and if the instruction operation result in the copy path is not consistent with the instruction operation result in the instruction path, judging that the SDC fragile instruction in the instruction path has SDC error.
Optionally, after the detecting the SDC error in the program to be detected based on the instruction path and the copy path, the method further includes:
and if the SDC error is detected, sending alarm information so as to repair the program to be detected with the SDC error according to the alarm information.
A second aspect of the present application provides a GPGPU program SDC error detection apparatus, including:
an acquisition module, configured to acquire a program to be detected and determine the SDC fragile instructions with high SDC tendency among the instructions to be detected of the program to be detected; the program to be detected is a GPGPU program, and the SDC tendency and the probability of SDC error of the instruction to be detected are in positive correlation;
the construction module is used for constructing an instruction path corresponding to each basic block according to the dependency relationship among the SDC fragile instructions in each basic block of the program to be detected; wherein the instruction paths are of a first class comprising a plurality of the SDC fragile instructions having dependencies and/or of a second class comprising a single SDC fragile instruction having no dependencies with any of the SDC fragile instructions;
and the copying module is used for copying the instruction path to obtain a corresponding copy path and detecting the SDC error in the program to be detected based on the instruction path and the copy path.
In the method, a program to be detected is obtained first, and SDC fragile instructions with high SDC tendency in the instructions to be detected of the program to be detected are determined; the program to be detected is a GPGPU program, and the SDC tendency and the probability of SDC error of the instruction to be detected are in positive correlation; then constructing an instruction path corresponding to each basic block according to the dependency relationship among the SDC fragile instructions in each basic block of the program to be detected; wherein the instruction paths are of a first class comprising a plurality of the SDC fragile instructions having dependencies and/or of a second class comprising a single SDC fragile instruction having no dependencies with any of the SDC fragile instructions; and finally, copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path. Therefore, the SDC fragile instructions with high SDC tendency are screened from the program to be detected, then the instruction paths and the corresponding copy paths are constructed for the screened SDC fragile instructions according to the dependency relationship among the instructions by taking the basic block as a unit, a large number of SDC errors are detected by copying a small number of instructions under the condition that the reliability of the program is ensured, and the error detection efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a method for detecting SDC errors in a GPGPU program according to the present application;
FIG. 2 is a diagram illustrating a soft error result when the GPGPU processes image data according to the present disclosure;
FIG. 3 is a diagram illustrating soft error results when the GPGPU processes high-performance data according to the present disclosure;
FIG. 4 is a block diagram illustrating a specific instruction path construction provided herein;
FIG. 5 is a diagram illustrating instruction path replication in accordance with an embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for constructing an SDC vulnerability prediction model according to the present application;
FIG. 7 is a graphical illustration of the relationship between soft error type and SDC propensity provided herein;
FIG. 8 is a schematic diagram illustrating an error masked instruction during error propagation according to the present application;
FIG. 9 is a diagram illustrating feature vector information of a program instruction provided herein;
FIG. 10 is a schematic structural diagram of a device for detecting SDC errors in a GPGPU program according to the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, SDC errors generated during program execution are mainly detected through full instruction copying. Full instruction copying, however, must create a copy instruction for every instruction and adds extra comparison instructions and register overhead; the extra registers can reduce program parallelism, and the added instructions can greatly increase program execution time. In addition, full instruction copying detects all transient errors in the program, yet some transient errors never cause an SDC error, so its detection efficiency is low. To address these defects, the present application screens out the SDC fragile instructions with high SDC tendency from the program to be detected and then, taking the basic block as a unit, constructs instruction paths and corresponding copy paths for the screened SDC fragile instructions according to the dependency relationships among the instructions. A large number of SDC errors are thus detected by copying a small number of instructions while program reliability is ensured, which improves error detection efficiency.
Fig. 1 is a flowchart of a method for detecting SDC errors in a GPGPU program according to an embodiment of the present disclosure. Referring to fig. 1, the method for detecting SDC errors in a GPGPU includes:
S11: acquiring a program to be detected, and determining SDC fragile instructions with high SDC tendency in the instructions to be detected of the program to be detected; the program to be detected is a GPGPU program, and the SDC tendency and the probability that the instruction to be detected has SDC errors are in positive correlation.
In this embodiment, a program to be detected is first acquired, and the SDC fragile instructions with high SDC tendency among its instructions to be detected are then determined. The program to be detected is a GPGPU program, and the SDC tendency is positively correlated with the probability that the instruction to be detected causes an SDC error. The SDC tendency of an instruction is the probability that an error occurring in the instruction ultimately causes the program to produce an SDC error; the higher the SDC tendency of an instruction, the more likely an error in it leads to an SDC error during execution. When the SDC tendency of an instruction is above a certain threshold, the instruction is considered an SDC fragile instruction.
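The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `select_fragile`, the instruction strings, the tendency values, and the threshold of 0.4 are all hypothetical and do not come from the application.

```python
from typing import Dict, List


def select_fragile(sdc_tendency: Dict[str, float], threshold: float) -> List[str]:
    """Return the instructions whose SDC tendency is above the threshold."""
    return [inst for inst, p in sdc_tendency.items() if p > threshold]


# Hypothetical per-instruction SDC tendencies (probability that an error in the
# instruction ends as an SDC error); the values are invented for illustration.
tendency = {"fmul r3": 0.62, "iadd r1": 0.05, "ld.shared r4": 0.48, "bra L2": 0.30}
fragile = select_fragile(tendency, threshold=0.4)
print(fragile)
```

Only the instructions returned here would later be placed on instruction paths and copied; the rest run unprotected.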
As described above, as chip integration increases, circuits become more susceptible to alpha particles and high-energy neutrons. A GPGPU integrates thousands of computing cores on a very small chip, so bit flips are more likely to cause soft errors in programs. A computer stores data in binary; when it is affected by alpha particles or high-energy neutrons, a stored '0' or '1' can jump to the other value, for example '001' changes to '000'. Such a jump is called a bit flip, and a jump that flips only one bit at a time is called a single-bit flip. Fig. 2 and Fig. 3 show the effect of soft errors when the GPGPU processes image data and when it processes high-performance data, respectively. The SDC error detection scheme of this embodiment can broadly improve the reliable execution of GPGPU applications.
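A single-bit flip is simply an XOR with a one-hot mask. The short Python sketch below reproduces the '001' to '000' example from the text; the helper name `flip_bit` and the `width` parameter are illustrative assumptions.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Flip one bit of a `width`-bit register value (a single-bit flip)."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the register width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


# The example from the text: '001' becomes '000' when bit 0 flips.
flipped = flip_bit(0b001, 0, width=3)
print(format(flipped, "03b"))
```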
In this embodiment, the SDC vulnerability of the instruction is mainly predicted by a machine learning method, so as to obtain an instruction with a high SDC tendency. Specifically, firstly, a program to be detected is obtained, and characteristic information of a command to be detected of the program to be detected is determined; and the characteristic information represents the SDC tendency of the instruction to be detected. And then inputting the instructions to be detected and the characteristic information thereof into a trained SDC vulnerability prediction model so that the SDC vulnerability prediction model can output SDC vulnerability instructions with high SDC tendency in the instructions to be detected. In this embodiment, the SDC vulnerability prediction model is a model obtained by training a blank model constructed based on a machine learning algorithm using a training set, where the training set includes a sample instruction and a corresponding sample label, and the sample label is determined based on feature information of the sample instruction and an error injection result obtained by performing an error injection operation on the sample instruction.
S12: constructing an instruction path corresponding to each basic block according to the dependency relationship among the SDC fragile instructions in each basic block of the program to be detected; wherein the instruction paths are of a first class comprising a plurality of the SDC fragile instructions having dependencies and/or of a second class comprising a single SDC fragile instruction having no dependencies with any of the SDC fragile instructions.
In this embodiment, after the SDC fragile instructions with high SDC tendency have been screened out, the instruction paths of each basic block are constructed according to the dependency relationships among the SDC fragile instructions in that basic block. Because the dependencies between the instructions in each basic block are not known in advance, the kinds of instruction path that appear vary, but they generally fall into two classes: a first class of paths comprising a plurality of SDC fragile instructions that have dependencies, and/or a second class of paths comprising a single SDC fragile instruction that has no dependency on any other SDC fragile instruction. To prevent large-scale propagation of errors, each basic block in the program is treated as a basic copy unit. Fig. 4 shows the instruction paths in a basic block, of which there is at least one. Each instruction path is constructed from the dependency relationships between instructions: instructions that depend on one another form a path of the first class, and an instruction that has no dependency on other instructions exists alone as a path of the second class.
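One possible way to realize the path construction is sketched below in Python. The application does not spell out the algorithm (Fig. 4 is not reproduced here), so this sketch makes explicit assumptions: a dependency means that a fragile instruction reads a register whose most recent writer in the same basic block is also fragile, and fragile instructions linked by such dependencies are grouped with a union-find structure. The `Inst` type, the register names, and `build_paths` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Inst:
    name: str
    dest: str
    srcs: Tuple[str, ...]


def build_paths(block: List[Inst], fragile: Set[int]) -> List[List[int]]:
    """Group the fragile instructions of one basic block into instruction paths.

    Returns lists of instruction indices: a list with several entries is a
    first-class path, a list with one entry is a second-class path.
    """
    parent = {i: i for i in fragile}

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    last_writer: Dict[str, int] = {}  # register -> index of its latest writer
    for idx, inst in enumerate(block):
        if idx in fragile:
            for src in inst.srcs:
                producer = last_writer.get(src)
                if producer is not None and producer in fragile:
                    parent[find(idx)] = find(producer)  # same path
        last_writer[inst.dest] = idx  # every write, fragile or not, updates this

    groups: Dict[int, List[int]] = {}
    for i in sorted(fragile):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


block = [
    Inst("mul", "r2", ("r0", "r1")),  # 0: fragile
    Inst("add", "r3", ("r2", "r0")),  # 1: fragile, reads r2 from 0
    Inst("mov", "r4", ("r5",)),       # 2: not fragile
    Inst("shl", "r6", ("r7",)),       # 3: fragile, independent
    Inst("sub", "r8", ("r3", "r4")),  # 4: fragile, reads r3 from 1
]
paths = build_paths(block, fragile={0, 1, 3, 4})
print(paths)
```

In this example instructions 0, 1 and 4 form a first-class path, while instruction 3 stands alone as a second-class path.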
S13: and copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path.
In this embodiment, the instruction path is copied to obtain a corresponding copy path, and SDC errors in the program to be detected are detected based on the instruction path and the copy path. Specifically, the value in the original register of each SDC fragile instruction in the instruction path is first stored into a new register to obtain the corresponding copy path. A comparison instruction inserted at the end of the copy path then compares the instruction execution result of the copy path with that of the instruction path; if the two results are not consistent, it is determined that an SDC error has occurred in an SDC fragile instruction in the instruction path.
There are two traditional methods for generating duplicate instructions: (1) generate a copy instruction before each original instruction and add a comparison instruction after the original instruction, notifying the upper-layer application immediately if an error is found; (2) for each basic block in the program, add a duplicate instruction after each original instruction in the basic block and add a comparison instruction at the end of the basic block, notifying the upper-layer application if an error is found. Method (1) uses only a few extra registers to store the copied data, which saves register overhead, but its comparison and notification instructions are too costly. Method (2) saves the overhead of comparison and notification instructions, but uses a large number of additional registers to copy the whole basic block. In this embodiment, every instruction on each path is copied to form a copy path; the copying process is shown in Fig. 5. The value in the original register of the original instruction is copied into a new register, the original instruction and the copy instruction are computed separately, and their results are stored in different registers. This is done for each SDC fragile instruction on the instruction path, eventually forming a copy path. A comparison instruction is inserted at the end of the path to compare the execution results of the original instruction path and the copy path. If the results differ, a soft error has occurred on the path. Further, if an SDC error is detected, alarm information is sent so that the program to be detected can be repaired according to the alarm information.
If the comparison results of one or more paths in a basic block differ, a soft error has occurred in that basic block; a warning is sent to the system, and the system takes a corresponding remedial measure such as rolling back the program.
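The copy-and-compare idea can be simulated in Python with two register files. This is a minimal sketch, not the actual GPGPU instrumentation: registers are dictionary entries, the `s_` prefix for the new (shadow) registers, the three-entry `OPS` table, and the `fault_at`/`fault_bit` parameters used to simulate a bit flip are all assumptions made for illustration.

```python
import operator
from typing import Dict, List, Optional, Tuple

PathInst = Tuple[str, str, Tuple[str, str]]  # (opcode, dest, (src_a, src_b))
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_with_copy_path(path: List[PathInst], regs: Dict[str, int],
                       fault_at: Optional[int] = None, fault_bit: int = 0) -> bool:
    """Run an instruction path and its copy path; return True if they disagree."""
    orig = dict(regs)                               # original registers
    copy = {"s_" + r: v for r, v in regs.items()}   # values stored in new registers
    for i, (op, dest, (a, b)) in enumerate(path):
        orig[dest] = OPS[op](orig[a], orig[b])      # original instruction
        if i == fault_at:
            orig[dest] ^= 1 << fault_bit            # simulated single-bit flip
        copy["s_" + dest] = OPS[op](copy["s_" + a], copy["s_" + b])  # copy instruction
    last = path[-1][1]
    return orig[last] != copy["s_" + last]          # comparison at the end of the path


path = [("mul", "r2", ("r0", "r1")), ("add", "r3", ("r2", "r0"))]
regs = {"r0": 3, "r1": 4}
clean = run_with_copy_path(path, regs)
faulty = run_with_copy_path(path, regs, fault_at=0, fault_bit=0)
print(clean, faulty)
```

Placing a single comparison at the end of the path, rather than after every instruction, is what keeps the comparison overhead low; the price is that a corrupted intermediate value is only noticed if it changes the final result of the path.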
Therefore, the program to be detected is obtained first, and the SDC fragile instruction with high SDC tendency in the instructions to be detected of the program to be detected is determined; the program to be detected is a GPGPU program, and the SDC tendency and the probability of SDC error of the instruction to be detected are in positive correlation; then constructing an instruction path corresponding to each basic block according to the dependency relationship among the SDC fragile instructions in each basic block of the program to be detected; wherein the instruction paths are of a first class comprising a plurality of the SDC fragile instructions having dependencies and/or of a second class comprising a single SDC fragile instruction having no dependencies with any of the SDC fragile instructions; and finally, copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path. According to the method and the device, the SDC fragile instructions with high SDC tendency are screened out from the program to be detected, then the instruction paths and the corresponding copy paths are established for the screened SDC fragile instructions according to the dependency relationship among the instructions by taking the basic blocks as units, a large number of SDC errors are detected by copying a small number of instructions under the condition that the reliability of the program is guaranteed, and the error detection efficiency is improved.
Fig. 6 is a flowchart of a method for constructing an SDC vulnerability prediction model according to an embodiment of the present application. Referring to fig. 6, the method for constructing the SDC vulnerability prediction model includes:
S21: a sample instruction is obtained, and characteristic information of the sample instruction is determined.
In this embodiment, a sample instruction is obtained and its feature information is determined. The same SDC vulnerability prediction model can be used to predict programs of the same type; some of the instructions in such programs are taken as the sample instructions, and their feature information is then determined. Experiments show that the SDC tendency of an instruction is related to the intrinsic properties of the instruction and to factors encountered during error propagation. Instructions of different types and functions differ clearly in SDC tendency. As shown in Fig. 7, computation instructions have a higher SDC tendency and address calculation instructions a lower one; loop control instructions have a lower SDC tendency and branch control instructions a higher one. In addition, instructions that can mask errors and instructions that tend to crash the program, when encountered during error propagation, lower the SDC tendency of the instruction in which the error originated. As shown in Fig. 8, when an error propagates into the left shift instruction (second row), the 32-bit data is shifted left by eight bits and the eight rightmost bits are filled with zeros, so a flip in any of the eight bits that are shifted out is discarded and 25% (8/32) of the errors are masked. Conversely, an error that propagates into shared memory enlarges the error propagation area and therefore increases the SDC tendency of the instruction.
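The 25% figure for the left shift can be checked by brute force. The Python sketch below flips each of the 32 bits of an operand in turn, applies an 8-bit left shift truncated to 32 bits, and counts how many flips leave the result unchanged. The operand value and the helper names are arbitrary choices for illustration.

```python
MASK32 = 0xFFFFFFFF


def shl32(x: int, n: int) -> int:
    """32-bit logical left shift: bits shifted past bit 31 are discarded."""
    return (x << n) & MASK32


def masked_fraction(value: int, shift: int) -> float:
    """Fraction of single-bit flips in `value` that do not change shl32(value, shift)."""
    golden = shl32(value, shift)
    masked = sum(1 for bit in range(32)
                 if shl32(value ^ (1 << bit), shift) == golden)
    return masked / 32


fraction = masked_fraction(0x12345678, 8)
print(fraction)  # 8 of the 32 possible flips fall in the bits that are shifted out
```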
Based on this, the feature information in this embodiment includes instruction attribute information, error propagation information, and shared memory information. The instruction attribute information is a feature vector representing the instruction type and instruction function; the error propagation information is a feature vector representing the total number of instructions, the number of error-masking instructions, and the number of program-crash instructions in the error propagation process; and the shared memory information is a feature vector of shared-load information and shared-store information. The specific definition of the feature vector is shown in Fig. 9.
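As a rough illustration of how the three groups of feature information could be flattened into one numeric vector, consider the Python sketch below. Fig. 9 is not reproduced in this text, so the concrete encoding is an assumption: the category lists `INST_TYPES` and `INST_FUNCS`, the one-hot encoding, and the field names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical category lists; the actual definition is given in Fig. 9.
INST_TYPES = ("arithmetic", "memory", "control")
INST_FUNCS = ("data", "address", "loop", "branch")


@dataclass
class InstFeatures:
    inst_type: str       # instruction attribute: type
    inst_func: str       # instruction attribute: function
    total_insts: int     # instructions reached during error propagation
    masking_insts: int   # of those, instructions that can mask an error
    crash_insts: int     # of those, instructions that tend to crash the program
    shared_loads: int    # shared-memory loads reached by the error
    shared_stores: int   # shared-memory stores reached by the error

    def to_vector(self) -> List[float]:
        type_part = [1.0 if self.inst_type == t else 0.0 for t in INST_TYPES]
        func_part = [1.0 if self.inst_func == f else 0.0 for f in INST_FUNCS]
        counts = [self.total_insts, self.masking_insts, self.crash_insts,
                  self.shared_loads, self.shared_stores]
        return type_part + func_part + [float(c) for c in counts]


vec = InstFeatures("arithmetic", "data", 10, 2, 1, 0, 1).to_vector()
print(vec)
```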
S22: and performing error injection operation on the sample instruction for preset times, judging whether the number of error injection results with SDC errors is larger than a preset threshold value, if so, judging whether the characteristic information meets SDC tendency conditions, and if so, labeling the sample instruction by using a sample label representing that the sample instruction is an SDC fragile instruction to obtain the training set.
In this embodiment, errors are injected into the sample instruction a preset number of times, and it is determined whether the number of error injection results that show an SDC error is greater than a preset threshold. This is a selective error injection process. Error injection is an experimental technique that uses software to simulate the soft errors that actually occur in hardware. Some of the instructions in the program are selected, errors are injected into each of them, and the SDC vulnerability of each instruction is judged preliminarily from the error injection results together with a preset threshold given by the user. If the proportion of SDC outcomes among the error injection results of an instruction exceeds the user-given threshold, it is further determined whether the feature information meets the SDC tendency condition, i.e. whether the feature information indicates a high SDC tendency. If so, the instruction is judged to be an SDC fragile instruction; otherwise it is judged to be a non-SDC-fragile instruction. A sample instruction judged fragile is labeled with a sample label indicating that it is an SDC fragile instruction, which yields the training set. Note that, in the error injection process, each execution of the program randomly selects one instruction of one thread and injects a bit-flip error at a random position of its register; after errors have been injected at multiple positions of an instruction, the overall error distribution of that instruction is obtained.
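The labeling rule can be sketched as follows in Python. The sketch assumes that each injection run has already been reduced to an observed output (or `None` when the run crashed or hung) and that the SDC tendency condition on the feature information has been evaluated elsewhere and is passed in as a boolean; the function names and the numeric values are hypothetical.

```python
from typing import List, Optional


def classify_outcome(golden: int, observed: Optional[int]) -> str:
    """Classify one injection run against the fault-free (golden) output."""
    if observed is None:           # crash, hang, or abnormal exit
        return "due"
    return "masked" if observed == golden else "sdc"


def label_sample(outcomes: List[str], threshold: float,
                 meets_tendency_condition: bool) -> int:
    """Return 1 (SDC fragile) only if both checks described above pass."""
    sdc_ratio = outcomes.count("sdc") / len(outcomes)
    return int(sdc_ratio > threshold and meets_tendency_condition)


# Four hypothetical injection runs on one sample instruction; golden output is 10.
outcomes = [classify_outcome(10, o) for o in (10, 11, None, 12)]
label = label_sample(outcomes, threshold=0.4, meets_tendency_condition=True)
print(outcomes, label)
```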
S23: Train a blank model constructed based on a deep learning algorithm with the training set to obtain a trained SDC vulnerability prediction model.
In this embodiment, a blank (untrained) model constructed based on a deep learning algorithm is trained with the training set to obtain the trained SDC vulnerability prediction model. In other words, a classifier is trained on the training set to predict the SDC tendency of program instructions, and the set of SDC fragile instructions predicted by this machine learning classifier is the set of instructions worth protecting. Protecting only the SDC (silent data corruption) fragile instructions in the program makes it possible to detect most SDC errors in the program while reducing the instruction copying overhead.
It can be seen that, in this embodiment of the application, a sample instruction is first acquired and its characteristic information is determined. Error injection is then performed on the sample instruction a preset number of times, and it is determined whether the number of injection results exhibiting an SDC error exceeds a preset threshold; if so, it is determined whether the characteristic information satisfies the SDC tendency condition, and if it does, the sample instruction is labeled with a sample label indicating that it is an SDC fragile instruction, so as to obtain the training set. Finally, a blank model constructed based on a deep learning algorithm is trained with the training set to obtain the trained SDC vulnerability prediction model. Because the SDC vulnerability of the instructions in the program to be detected is predicted by machine learning, only a small amount of error injection is needed, which avoids the time cost of large-scale error injection.
Referring to Fig. 10, an embodiment of the present application further discloses a GPGPU program SDC error detection device, which includes:
an acquisition module 11, configured to acquire a program to be detected and determine the SDC fragile instructions with high SDC tendency among the instructions to be detected of the program to be detected; wherein the program to be detected is a GPGPU program, and the SDC tendency is positively correlated with the probability that an SDC error occurs in the instruction to be detected;
a building module 12, configured to build, according to the dependency relationships among the SDC fragile instructions in each basic block of the program to be detected, an instruction path corresponding to each basic block; wherein the instruction paths are of a first class, comprising a plurality of SDC fragile instructions that have dependency relationships with one another, and/or of a second class, comprising a single SDC fragile instruction that has no dependency relationship with any other SDC fragile instruction;
a copying module 13, configured to copy the instruction path to obtain a corresponding copy path, and detect SDC errors in the program to be detected based on the instruction path and the copy path.
It can be seen that, in this embodiment, the program to be detected is first acquired, and the SDC fragile instructions with high SDC tendency among its instructions to be detected are determined; the program to be detected is a GPGPU program, and the SDC tendency is positively correlated with the probability that an SDC error occurs in the instruction to be detected. Next, an instruction path corresponding to each basic block is constructed according to the dependency relationships among the SDC fragile instructions within that basic block; the instruction paths are of a first class, comprising a plurality of SDC fragile instructions that have dependency relationships with one another, and/or of a second class, comprising a single SDC fragile instruction that has no dependency relationship with any other SDC fragile instruction. Finally, the instruction path is copied to obtain a corresponding copy path, and SDC errors in the program to be detected are detected based on the instruction path and the copy path. The present application thus screens out the SDC fragile instructions with high SDC tendency from the program to be detected and then, taking the basic block as the unit, establishes instruction paths and corresponding copy paths for the screened instructions according to their dependency relationships. A large number of SDC errors can be detected by copying only a small number of instructions while program reliability is maintained, which improves error detection efficiency.
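As a rough illustration of how instruction paths might be built inside one basic block, the Python sketch below groups SDC fragile instructions that are linked by register dependencies. The `Instr` record, the register names, and the choice of grouping by connected components with a union-find structure are all assumptions; the text only states that paths follow the dependency relationships among fragile instructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    idx: int        # position inside the basic block
    dest: str       # destination register
    srcs: tuple     # source registers
    fragile: bool   # predicted to be SDC fragile?


def build_paths(block):
    """Split the fragile instructions of one basic block into instruction paths.

    Returns (multi, single): paths with several dependent fragile instructions
    (first class) and paths holding one independent instruction (second class).
    """
    last_writer = {}  # register -> idx of the fragile instr that last wrote it
    parent = {}       # union-find forest over fragile instruction indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for ins in block:
        if not ins.fragile:
            # A non-fragile writer breaks the chain through this register.
            last_writer[ins.dest] = None
            continue
        parent[ins.idx] = ins.idx
        for reg in ins.srcs:
            producer = last_writer.get(reg)
            if producer is not None:
                parent[find(ins.idx)] = find(producer)
        last_writer[ins.dest] = ins.idx

    groups = {}
    for ins in block:  # program order keeps every path ordered
        if ins.fragile:
            groups.setdefault(find(ins.idx), []).append(ins.idx)

    multi = [p for p in groups.values() if len(p) > 1]
    single = [p for p in groups.values() if len(p) == 1]
    return multi, single


example_block = [
    Instr(0, "r1", ("r0",), True),
    Instr(1, "r2", ("r1", "r1"), True),
    Instr(2, "r3", ("r9", "r9"), False),
    Instr(3, "r4", ("r2", "r3"), True),
    Instr(4, "r5", ("r8",), True),
]
multi_paths, single_paths = build_paths(example_block)
```

For the example block, instructions 0, 1 and 3 form one first-class path because each reads a register written by the previous fragile instruction, while instruction 4 has no fragile producer and forms a second-class path on its own.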
In some specific embodiments, the acquisition module 11 specifically includes:
a characteristic determining unit, configured to acquire a program to be detected and determine characteristic information of the instructions to be detected of the program to be detected; wherein the characteristic information represents the SDC tendency of the instruction to be detected;
a prediction unit, configured to input the instruction to be detected and its characteristic information into a trained SDC vulnerability prediction model, so that the SDC vulnerability prediction model outputs the SDC fragile instructions with high SDC tendency among the instructions to be detected; wherein the SDC vulnerability prediction model is obtained by training a blank model constructed based on a machine learning algorithm with a training set, the training set comprises sample instructions and corresponding sample labels, and each sample label is determined based on the characteristic information of the sample instruction and the error injection results obtained after the error injection operation is performed on the sample instruction.
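The two units above can be sketched in Python as a feature encoder followed by a selection step. The three groups of fields mirror the characteristic information listed in claim 4 (instruction attributes, error propagation counts, shared memory usage), but the concrete field names, the `INSTR_TYPES` categories, the 0.5 cutoff, and especially `toy_predict` are hypothetical. `toy_predict` is a hand-written stand-in, not a trained SDC vulnerability prediction model.

```python
from dataclasses import dataclass

# Hypothetical instruction categories used for one-hot encoding.
INSTR_TYPES = ("arith", "logic", "load", "store", "branch")


@dataclass
class InstrFeatures:
    instr_type: str        # instruction attribute information
    total_propagated: int  # error propagation: instructions reached by the error
    masking_count: int     # error propagation: instructions that mask the error
    crash_count: int       # error propagation: instructions that may crash
    shared_load: bool      # shared memory information: shared load
    shared_store: bool     # shared memory information: shared store

    def to_vector(self):
        """Flatten the characteristic information into one numeric vector."""
        one_hot = [1.0 if self.instr_type == t else 0.0 for t in INSTR_TYPES]
        return one_hot + [
            float(self.total_propagated),
            float(self.masking_count),
            float(self.crash_count),
            float(self.shared_load),
            float(self.shared_store),
        ]


def toy_predict(vec):
    """Stand-in scorer (NOT a trained model): fewer masking instructions
    along the propagation path means a higher assumed SDC tendency."""
    total, masked = vec[5], vec[6]
    return 0.0 if total == 0 else 1.0 - masked / total


def select_fragile(instructions, predict, cutoff=0.5):
    """Keep the ids of instructions whose predicted tendency exceeds cutoff."""
    return [iid for iid, feats in instructions
            if predict(feats.to_vector()) > cutoff]


candidates = [
    ("i0", InstrFeatures("load", 10, 2, 1, True, False)),
    ("i1", InstrFeatures("arith", 10, 9, 0, False, False)),
]
fragile_ids = select_fragile(candidates, toy_predict)
```

Any classifier exposing a score for a feature vector could be passed as `predict`; the sketch only fixes the interface between the characteristic determining unit and the prediction unit.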
In some specific embodiments, the copying module 13 specifically includes:
a storage unit, configured to store the value held in the original register by each SDC fragile instruction in the instruction path into a new register, so as to obtain the corresponding copy path;
a comparison unit, configured to insert a comparison instruction at the end of the copy path to compare the instruction operation result of the copy path with that of the instruction path, and, if the two are inconsistent, determine that an SDC error has occurred in the SDC fragile instructions of the instruction path.
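The copy-and-compare idea can be sketched in Python on a toy register file. This is one possible reading of the storage and comparison units, under stated assumptions: instructions are modeled as `(dest, op, srcs)` tuples, shadow registers are named with a hypothetical `_dup` suffix, each duplicate is executed just before its original so that it still sees unmodified inputs, and the `fault` parameter exists only to simulate a bit-flip in the original path.

```python
import operator


def make_copy_path(path, suffix="_dup"):
    """Duplicate a path, redirecting every register written inside the path
    to a new (shadow) register so the copy does not disturb the original."""
    written = set()
    copy = []
    for dest, op, srcs in path:
        new_srcs = tuple(s + suffix if s in written else s for s in srcs)
        copy.append((dest + suffix, op, new_srcs))
        written.add(dest)
    return copy


def run_with_check(path, regs, fault=None, suffix="_dup"):
    """Execute the path and its copy, then compare results at the path end.

    fault = (index, bit) flips one bit of the original instruction's result
    at that index, simulating a soft error. Returns (result, sdc_detected).
    """
    regs = dict(regs)
    copy = make_copy_path(path, suffix)
    for i, ((dest, op, srcs), (cdest, cop, csrcs)) in enumerate(zip(path, copy)):
        # Duplicate first, so it reads inputs before the original overwrites them.
        regs[cdest] = cop(*(regs[s] for s in csrcs))
        value = op(*(regs[s] for s in srcs))
        if fault is not None and fault[0] == i:
            value ^= 1 << fault[1]
        regs[dest] = value
    last = path[-1][0]
    # The comparison "instruction" inserted at the end of the copy path.
    return regs[last], regs[last] != regs[last + suffix]


example_path = [
    ("r1", operator.mul, ("r0", "r0")),
    ("r2", operator.add, ("r1", "r3")),
]
clean = run_with_check(example_path, {"r0": 3, "r3": 4})
faulty = run_with_check(example_path, {"r0": 3, "r3": 4}, fault=(0, 2))
```

Comparing only once at the end of the path, rather than after every instruction, is what keeps the number of inserted comparison instructions small; an error that is masked before reaching the end of the path is not reported.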
In some specific embodiments, the SDC error detection device further includes:
a sample determining module, configured to acquire a sample instruction and determine characteristic information of the sample instruction;
a labeling module, configured to perform a preset number of error injection operations on the sample instruction, determine whether the number of injection results exhibiting an SDC error exceeds a preset threshold, if so, determine whether the characteristic information satisfies the SDC tendency condition, and if it does, label the sample instruction with a sample label indicating that it is an SDC fragile instruction, so as to obtain the training set;
a training module, configured to train a blank model constructed based on a deep learning algorithm with the training set to obtain a trained SDC vulnerability prediction model;
a warning module, configured to send warning information if an SDC error is detected, so that the program to be detected in which the SDC error occurred can be repaired according to the warning information.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The GPGPU program SDC error detection method and device provided by the present invention have been described in detail above. A specific example is used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method and its core idea. A person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (8)

1. A GPGPU program SDC error detection method, characterized by comprising:
acquiring a program to be detected, and determining the SDC fragile instructions with high SDC tendency among the instructions to be detected of the program to be detected; wherein the program to be detected is a GPGPU program, and the SDC tendency is positively correlated with the probability that an SDC error occurs in the instruction to be detected;
constructing an instruction path corresponding to each basic block according to the dependency relationships among the SDC fragile instructions in each basic block of the program to be detected; wherein the instruction paths are of a first class, comprising a plurality of SDC fragile instructions that have dependency relationships with one another, and/or of a second class, comprising a single SDC fragile instruction that has no dependency relationship with any other SDC fragile instruction;
copying the instruction path to obtain a corresponding copy path, and detecting SDC errors in the program to be detected based on the instruction path and the copy path.
2. The GPGPU program SDC error detection method of claim 1, wherein the acquiring of the program to be detected and the determining of the SDC fragile instructions with high SDC tendency among the instructions to be detected of the program to be detected comprise:
acquiring a program to be detected, and determining characteristic information of the instructions to be detected of the program to be detected; wherein the characteristic information represents the SDC tendency of the instruction to be detected;
inputting the instruction to be detected and its characteristic information into a trained SDC vulnerability prediction model, so that the SDC vulnerability prediction model outputs the SDC fragile instructions with high SDC tendency among the instructions to be detected; wherein the SDC vulnerability prediction model is obtained by training a blank model constructed based on a machine learning algorithm with a training set, the training set comprises sample instructions and corresponding sample labels, and each sample label is determined based on the characteristic information of the sample instruction and the error injection results obtained after the error injection operation is performed on the sample instruction.
3. The GPGPU program SDC error detection method of claim 2, wherein before the inputting of the instruction to be detected and its characteristic information into the trained SDC vulnerability prediction model, the method further comprises:
acquiring a sample instruction, and determining characteristic information of the sample instruction;
performing a preset number of error injection operations on the sample instruction, determining whether the number of injection results exhibiting an SDC error exceeds a preset threshold, if so, determining whether the characteristic information satisfies the SDC tendency condition, and if it does, labeling the sample instruction with a sample label indicating that it is an SDC fragile instruction, so as to obtain the training set;
training a blank model constructed based on a deep learning algorithm with the training set to obtain the trained SDC vulnerability prediction model.
4. The GPGPU program SDC error detection method of claim 2 or 3, wherein the characteristic information comprises instruction attribute information, error propagation information and shared memory information;
the instruction attribute information is a feature vector representing instruction types and instruction functions, the error propagation information is a feature vector of the total number of instructions, the number of error-masking instructions and the number of program-crash instructions in the error propagation process, and the shared memory information is a feature vector of shared load information and shared store information.
5. The GPGPU program SDC error detection method of claim 1, wherein the copying of the instruction path to obtain a corresponding copy path comprises:
storing the value held in the original register by each SDC fragile instruction in the instruction path into a new register, so as to obtain the corresponding copy path.
6. The GPGPU program SDC error detection method of claim 1, wherein the detecting of SDC errors in the program to be detected based on the instruction path and the copy path comprises:
inserting a comparison instruction at the end of the copy path to compare the instruction operation result of the copy path with the instruction operation result of the instruction path, and, if the two are inconsistent, determining that an SDC error has occurred in the SDC fragile instructions of the instruction path.
7. The GPGPU program SDC error detection method of claim 1, 2, 3, 5 or 6, wherein after the detecting of SDC errors in the program to be detected based on the instruction path and the copy path, the method further comprises:
if an SDC error is detected, sending alarm information, so that the program to be detected in which the SDC error occurred can be repaired according to the alarm information.
8. A GPGPU program SDC error detection device, characterized by comprising:
an acquisition module, configured to acquire a program to be detected and determine the SDC fragile instructions with high SDC tendency among the instructions to be detected of the program to be detected; wherein the program to be detected is a GPGPU program, and the SDC tendency is positively correlated with the probability that an SDC error occurs in the instruction to be detected;
a construction module, configured to construct an instruction path corresponding to each basic block according to the dependency relationships among the SDC fragile instructions in each basic block of the program to be detected; wherein the instruction paths are of a first class, comprising a plurality of SDC fragile instructions that have dependency relationships with one another, and/or of a second class, comprising a single SDC fragile instruction that has no dependency relationship with any other SDC fragile instruction;
a copying module, configured to copy the instruction path to obtain a corresponding copy path and detect SDC errors in the program to be detected based on the instruction path and the copy path.
CN202110903201.2A 2021-08-06 2021-08-06 GPGPU program SDC error detection method and device Active CN113610154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110903201.2A CN113610154B (en) 2021-08-06 2021-08-06 GPGPU program SDC error detection method and device

Publications (2)

Publication Number Publication Date
CN113610154A true CN113610154A (en) 2021-11-05
CN113610154B CN113610154B (en) 2023-12-29

Family

ID=78339773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110903201.2A Active CN113610154B (en) 2021-08-06 2021-08-06 GPGPU program SDC error detection method and device

Country Status (1)

Country Link
CN (1) CN113610154B (en)

Also Published As

Publication number Publication date
CN113610154B (en) 2023-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant