CN113610154B - GPGPU program SDC error detection method and device - Google Patents

Info

Publication number
CN113610154B
Authority
CN
China
Prior art keywords
instruction
sdc
path
program
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110903201.2A
Other languages
Chinese (zh)
Other versions
CN113610154A (en
Inventor
魏晓辉
姜楠
谭婧炜佳
李翔
王晓楠
岳恒山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN202110903201.2A
Publication of CN113610154A
Application granted
Publication of CN113610154B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a GPGPU program SDC error detection method and device. The method comprises the following steps: acquiring a program to be detected, and determining, among the instructions to be detected of that program, the SDC fragile instructions that have a high SDC tendency, wherein the program to be detected is a GPGPU program and the SDC tendency is positively correlated with the probability that the instruction to be detected causes an SDC error; constructing the instruction paths corresponding to each basic block according to the dependency relationships among the SDC fragile instructions in each basic block of the program to be detected, wherein an instruction path is a first type path comprising a plurality of SDC fragile instructions that have dependency relationships and/or a second type path comprising a single SDC fragile instruction that has no dependency relationship with any other SDC fragile instruction; and copying the instruction path to obtain a copy path, and detecting SDC errors in the program to be detected based on the instruction path and the copy path. According to the method and the device, a large number of SDC errors are detected with a small number of instruction copies while the reliability of the program is guaranteed, which improves the error detection efficiency.

Description

GPGPU program SDC error detection method and device
Technical Field
The invention relates to the technical field of computers, in particular to a GPGPU program SDC error detection method and device.
Background
With the continued development of highly integrated circuit technology, nanoscale circuits are easily struck by high-energy particles from space, which causes bit flips. Errors in which the circuit itself is not damaged but data is corrupted are known as soft errors. During program execution, a soft error can propagate inside a thread along with the data the thread uses and the instructions it executes, and can finally affect the result. A soft error can affect the execution result of a program in three ways: a masked error, in which the soft error is masked and does not affect the application result; a detected unrecoverable error (DUE), in which the application crashes, hangs, or exits in an abnormal state; and a silent data corruption (SDC) error, in which no exception information appears during execution but the final program output differs from the correct output. In recent years the volume of application data in many industries has grown rapidly, and internet data volume is growing far faster than hardware platforms and storage resources, so the computing capacity and energy efficiency offered by existing data processing technology can hardly meet application computing requirements. In this context, general-purpose graphics processing units (GPGPUs), a new kind of computing platform, are increasingly used for high-performance computing because they support highly concurrent thread execution. Unlike conventional GPU platforms, which are used only to process image data, GPGPUs are increasingly used in fields such as numerical simulation, data mining, and artificial intelligence.
When a conventional GPU processes image data, an error destroys only some pixels of the image and does not affect the user's needs. A GPGPU running high-performance programs, by contrast, imposes reliability requirements on those programs, so SDC errors must be detected to eliminate their influence on the program.
At present, SDC errors that occur during program execution are mainly detected through full instruction replication. Full instruction replication, however, creates a duplicate of every instruction and adds extra comparison instructions and register overhead; the extra registers can reduce program parallelism, and the added instructions can greatly increase program execution time. In addition, full instruction replication detects all transient errors in the program, yet some transient errors never cause an SDC error, so the detection efficiency is low. Therefore, how to improve the SDC error detection efficiency of a program is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
Accordingly, the present invention is directed to a method and apparatus for detecting SDC errors in a GPGPU program, which can detect a large number of SDC errors by a small number of instruction copies while ensuring the reliability of the program, thereby improving the error detection efficiency. The specific scheme is as follows:
A first aspect of the present application provides a method for detecting SDC errors of a GPGPU program, including:
acquiring a program to be detected, and determining an SDC fragile instruction with high SDC tendency in the instructions to be detected of the program to be detected; the program to be detected is a GPGPU program, and the SDC tendency and the probability of the SDC error of the instruction to be detected are in positive correlation;
constructing instruction paths corresponding to the basic blocks according to the dependency relationship among the SDC fragile instructions in the basic blocks of the program to be detected; wherein the instruction path is a first type path comprising a plurality of SDC fragile instructions with dependency and/or a second type path comprising a single SDC fragile instruction without dependency with any of the SDC fragile instructions;
and copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path.
Optionally, the acquiring the program to be detected, determining an SDC fragile instruction with high SDC tendency in the instructions to be detected of the program to be detected, includes:
acquiring a program to be detected, and determining characteristic information of an instruction to be detected of the program to be detected; wherein the characteristic information characterizes the SDC tendency of the instruction to be detected;
inputting the to-be-detected instruction and the characteristic information thereof into a trained SDC vulnerability prediction model so that the SDC vulnerability prediction model outputs an SDC vulnerability instruction with high SDC tendency in the to-be-detected instruction; the SDC vulnerability prediction model is a model obtained by training a blank model constructed based on a machine learning algorithm by using a training set, wherein the training set comprises sample instructions and corresponding sample labels, and the sample labels are obtained by determining error injection results after error injection operation is performed on the sample instructions based on characteristic information of the sample instructions.
Optionally, before the inputting the to-be-detected instruction and the feature information thereof into the trained SDC vulnerability prediction model, the method further includes:
acquiring a sample instruction and determining characteristic information of the sample instruction;
performing an error injection operation on the sample instruction a preset number of times, and judging whether the number of error injection results in which an SDC error occurs is larger than a preset threshold value; if so, judging whether the characteristic information meets the SDC tendency condition; and if so, labeling the sample instruction with a sample label representing that the sample instruction is an SDC fragile instruction, so as to obtain the training set;
and training the blank model constructed based on the deep learning algorithm by utilizing the training set to obtain a trained SDC vulnerability prediction model.
Optionally, the feature information includes instruction attribute information, error propagation information and shared memory information;
the instruction attribute information is a feature vector representing the instruction type and the instruction function, the error propagation information is a feature vector of the total number of instructions, the number of error-masking instructions and the number of program-crash instructions in the error propagation process, and the shared memory information is a feature vector of shared load information and shared store information.
Optionally, the copying the instruction path to obtain a corresponding copy path includes:
and storing the value in the original register of each SDC fragile instruction in the instruction path into a new register to obtain a corresponding copy path.
Optionally, the detecting the SDC error in the program to be detected based on the instruction path and the copy path includes:
inserting a comparison instruction at the end of the duplicate path to compare the instruction running result of the duplicate path with that of the instruction path, and, if the two results are inconsistent, judging that an SDC fragile instruction in the instruction path has an SDC error.
Optionally, after detecting the SDC error in the program to be detected based on the instruction path and the duplicate path, the method further includes:
if the SDC error is detected, sending out alarm information, so that the program to be detected in which the SDC error occurred is repaired according to the alarm information.
A second aspect of the present application provides a GPGPU program SDC error detection apparatus, including:
the acquisition module is used for acquiring a program to be detected and determining an SDC fragile instruction with high SDC tendency in the instructions to be detected of the program to be detected; the program to be detected is a GPGPU program, and the SDC tendency and the probability of the SDC error of the instruction to be detected are in positive correlation;
the construction module is used for constructing an instruction path corresponding to each basic block according to the dependency relationship between the SDC fragile instructions in each basic block of the program to be detected; wherein the instruction path is a first type path comprising a plurality of SDC fragile instructions with dependency and/or a second type path comprising a single SDC fragile instruction without dependency with any of the SDC fragile instructions;
and the copying module is used for copying the instruction path to obtain a corresponding copy path and detecting the SDC error in the program to be detected based on the instruction path and the copy path.
In the application, a program to be detected is firstly obtained, and an SDC fragile instruction with high SDC tendency in the instructions to be detected of the program to be detected is determined; the program to be detected is a GPGPU program, and the SDC tendency and the probability of the SDC error of the instruction to be detected are in positive correlation; then constructing instruction paths corresponding to the basic blocks according to the dependency relationship among the SDC fragile instructions in the basic blocks of the program to be detected; wherein the instruction path is a first type path comprising a plurality of SDC fragile instructions with dependency and/or a second type path comprising a single SDC fragile instruction without dependency with any of the SDC fragile instructions; and finally, copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path. Therefore, the method and the device screen the SDC fragile instructions with high SDC tendency from the program to be detected, then construct instruction paths and corresponding copy paths for the screened SDC fragile instructions according to the dependency relationship among the instructions by taking the basic blocks as units, and detect a large number of SDC errors by copying a small amount of instructions under the condition of ensuring the reliability of the program, so that the error detection efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for detecting SDC errors of a GPGPU program provided by the present application;
FIG. 2 is a graph of soft error results of processing image data by a GPGPU according to the present application;
FIG. 3 is a graph of soft error results for a GPGPU processing high performance data;
FIG. 4 is a schematic diagram of a specific instruction path architecture provided herein;
FIG. 5 is a schematic diagram illustrating one embodiment of the instruction path replication provided herein;
FIG. 6 is a flowchart of a method for constructing an SDC vulnerability prediction model provided by the present application;
FIG. 7 is a schematic diagram showing the relationship between soft error type and SDC tendency provided in the present application;
FIG. 8 is a schematic diagram of an error masked instruction in the error propagation process provided in the present application;
FIG. 9 is a schematic diagram of feature vector information of a program instruction provided in the present application;
FIG. 10 is a schematic structural diagram of an SDC error detection device for a GPGPU program provided in the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present, SDC errors that occur during program execution are mainly detected through full instruction replication. Full instruction replication, however, creates a duplicate of every instruction and adds extra comparison instructions and register overhead; the extra registers can reduce program parallelism, and the added instructions can greatly increase program execution time. In addition, full instruction replication detects all transient errors in the program, yet some transient errors never cause an SDC error, so the detection efficiency is low. To address these defects, the application provides a GPGPU program SDC error detection scheme. SDC fragile instructions with a high SDC tendency are first screened from the program to be detected. Then, taking the basic block as the unit, instruction paths and corresponding duplicate paths are constructed for the screened SDC fragile instructions according to the dependency relationships among them. In this way a large number of SDC errors are detected by copying a small number of instructions while the reliability of the program is ensured, which improves the error detection efficiency.
Fig. 1 is a flowchart of a method for detecting SDC errors in a GPGPU program according to an embodiment of the present application. Referring to fig. 1, the method for detecting SDC errors of the GPGPU program includes:
S11: acquiring a program to be detected, and determining an SDC fragile instruction with high SDC tendency in the instructions to be detected of the program to be detected; the program to be detected is a GPGPU program, and the SDC tendency and the probability of the SDC error of the instruction to be detected are in positive correlation.
In this embodiment, a program to be detected is first obtained, and the SDC fragile instructions with a high SDC tendency among its instructions to be detected are then determined; the program to be detected is a GPGPU program, and the SDC tendency is positively correlated with the probability that the instruction to be detected causes an SDC error. The SDC tendency of an instruction is the probability that an error occurring in the instruction eventually causes the program to produce an SDC error; that is, the higher the SDC tendency of an instruction, the more likely an error in it leads to an SDC error during execution. When the SDC tendency of an instruction is above a certain threshold, the instruction is considered an SDC fragile instruction.
As described above, as chip integration increases, circuits become more susceptible to alpha particles and high-energy neutrons. A GPGPU integrates thousands of computing cores on a very small chip, so bit flips that cause soft errors in a program are more likely. A computer stores data in binary form; when it is affected by alpha particles or high-energy neutrons from space, a stored "0" or "1" can change, for example "001" becoming "000". Such a change is called a bit flip, and a change in which only one bit flips at a time is called a single-bit flip. Fig. 2 and Fig. 3 show the influence of soft errors on a GPGPU processing image data and on a GPGPU processing high-performance data, respectively. The SDC error detection scheme of this embodiment can broadly support the reliable execution of GPGPU applications.
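The single-bit flip described above can be illustrated with a minimal Python sketch. The helper name `flip_bit` and the 3-bit width used in the example are assumptions made for illustration only; they are not part of the disclosed method.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with exactly one bit inverted, truncated to `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


# Reproduce the "001" -> "000" example from the text: flip bit 0 of a 3-bit word.
original = 0b001
corrupted = flip_bit(original, 0, width=3)
print(format(original, "03b"), "->", format(corrupted, "03b"))  # 001 -> 000
```

Flipping the same bit twice restores the original value, which is why a single-bit flip can be modeled as one XOR with a one-hot mask.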
In this embodiment, the SDC vulnerability of the instruction is predicted mainly by a machine learning method, so as to obtain an instruction with a higher SDC vulnerability. Specifically, firstly, a program to be detected is obtained, and characteristic information of an instruction to be detected of the program to be detected is determined; wherein the characteristic information characterizes an SDC tendency of the instruction to be detected. And then inputting the to-be-detected instruction and the characteristic information thereof into a trained SDC vulnerability prediction model so that the SDC vulnerability prediction model outputs an SDC vulnerability instruction with high SDC tendency in the to-be-detected instruction. In this embodiment, the SDC vulnerability prediction model is a model obtained by training a blank model constructed based on a machine learning algorithm with a training set, where the training set includes a sample instruction and a corresponding sample label, and the sample label is determined based on feature information of the sample instruction and an error injection result obtained by performing error injection operation on the sample instruction.
S12: constructing instruction paths corresponding to the basic blocks according to the dependency relationship among the SDC fragile instructions in the basic blocks of the program to be detected; the instruction path is a first type path comprising a plurality of SDC fragile instructions with dependency relationships and/or a second type path comprising a single SDC fragile instruction without dependency relationships with any one of the SDC fragile instructions.
In this embodiment, after the SDC fragile instructions with a higher SDC tendency are screened out, the instruction paths corresponding to each basic block are constructed according to the dependency relationships among the SDC fragile instructions in each basic block of the program to be detected. Because the dependencies between instructions differ from one basic block to another, the kind of instruction path obtained is not fixed in advance, but the paths generally fall into two broad categories: a first type path comprising a plurality of SDC fragile instructions that have dependency relationships, and/or a second type path comprising a single SDC fragile instruction that has no dependency relationship with any other SDC fragile instruction. To prevent large-scale propagation of errors, each basic block in the program is taken as the basic copy unit. FIG. 4 illustrates the construction of at least one instruction path in a basic block: instructions that depend on one another form a path according to their dependency relationships, which is a first type path, and an instruction that has no dependency relationship with other instructions forms a path on its own, which is a second type path.
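The path construction described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Instr` record, the use of register def-use relations as the dependency relationship, and the grouping of connected SDC fragile instructions with a union-find structure are all hypothetical choices, since the text does not fix a concrete algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    idx: int        # position inside the basic block
    dest: str       # destination register
    srcs: tuple     # source registers
    fragile: bool   # True if predicted to be SDC fragile


def build_paths(block):
    """Group the SDC fragile instructions of one basic block into paths."""
    fragile = [ins for ins in block if ins.fragile]
    parent = {ins.idx: ins.idx for ins in fragile}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    last_writer = {}  # register -> index of the most recent writer
    for ins in block:
        if ins.fragile:
            for reg in ins.srcs:
                producer = last_writer.get(reg)
                # Only dependencies between two fragile instructions link paths.
                if producer is not None and producer in parent:
                    parent[find(ins.idx)] = find(producer)
        last_writer[ins.dest] = ins.idx

    groups = {}
    for ins in fragile:
        groups.setdefault(find(ins.idx), []).append(ins.idx)
    return sorted(groups.values())


block = [
    Instr(0, "r1", ("r0",), True),
    Instr(1, "r2", ("r1",), True),
    Instr(2, "r3", ("r9",), False),
    Instr(3, "r4", ("r2", "r3"), True),
    Instr(4, "r5", ("r8",), True),
]
print(build_paths(block))  # [[0, 1, 3], [4]]
```

In this example instructions 0, 1 and 3 form a first type path, while instruction 4 has no dependency on another fragile instruction and forms a second type path by itself.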
S13: and copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path.
In this embodiment, the instruction path is copied to obtain a corresponding copy path, and the SDC error in the program to be detected is detected based on the instruction path and the copy path. Specifically, the value in the original register used by each SDC fragile instruction in the instruction path is first stored into a new register to obtain the corresponding duplicate path. A comparison instruction is then inserted at the end of the duplicate path to compare the instruction running result of the duplicate path with that of the instruction path; if the two results are inconsistent, it is judged that an SDC fragile instruction in the instruction path has an SDC error.
There are two conventional methods of generating duplicate instructions: (1) generating a copy instruction before each original instruction and adding a comparison instruction after the original instruction, and immediately notifying the upper-layer application if an error is found; (2) for each basic block in the program, adding a copy instruction after each original instruction in the basic block and adding a comparison instruction only at the end of the basic block, and notifying the upper-layer application if an error is found. The difference between the two is that method (1) uses only a few extra registers to store the copied data, which saves register overhead, but incurs a large overhead in comparison and notification instructions, whereas method (2) saves the overhead of comparison and notification instructions but needs a large number of additional registers to copy the whole basic block. In this embodiment, for each path, each instruction in the path is replicated to form a duplicate path; the replication process is shown in fig. 5. The value in the original register of the original instruction is copied into a new register, the original instruction and the duplicate instruction are computed separately, and their results are stored in different registers. This is done for each SDC fragile instruction on the instruction path, eventually forming a duplicate path. A comparison instruction is inserted at the end of the path to compare the execution results of the original instruction path and the duplicate path. If the results differ, a soft error has occurred in this path. Further, if the SDC error is detected, alarm information is sent out, and the program to be detected in which the SDC error occurred is repaired according to the alarm information.
If there are differences in the comparison result of one or more paths in a basic block, which indicates that a soft error occurs in the basic block, a warning needs to be sent to the system, and corresponding remedial measures such as rollback procedures and the like are taken by the system.
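The copy-and-compare scheme described above can be simulated with a short Python sketch. It is a software model under assumptions, not GPU code: the tuple encoding of instructions, the `shadow` dictionary standing in for the new registers, and the optional `fault` argument used to simulate a bit flip are all hypothetical.

```python
import operator

MASK32 = 0xFFFFFFFF


def run_with_copy_path(path, regs, fault=None):
    """Run an instruction path next to its copy and compare at the path end.

    path  : list of (dest, op, src_a, src_b) tuples, in program order
    regs  : dict mapping register name -> 32-bit value on entry to the path
    fault : optional (instruction_index, bit) flipped in the ORIGINAL result
    Returns (value of the last destination register, mismatch flag).
    """
    regs = dict(regs)   # original registers
    shadow = {}         # new registers that hold the copy path's values
    for i, (dest, op, a, b) in enumerate(path):
        for src in (a, b):
            # First use of an operand from outside the path: copy it once.
            if src not in shadow:
                shadow[src] = regs[src]
        original = op(regs[a], regs[b]) & MASK32
        duplicate = op(shadow[a], shadow[b]) & MASK32
        if fault is not None and fault[0] == i:
            original ^= 1 << fault[1]   # simulated single-bit flip
        regs[dest] = original
        shadow[dest] = duplicate
    last = path[-1][0]
    # Single comparison inserted at the end of the path.
    return regs[last], regs[last] != shadow[last]


path = [("r3", operator.add, "r1", "r2"),
        ("r4", operator.mul, "r3", "r1")]
entry = {"r1": 3, "r2": 4}
print(run_with_copy_path(path, entry))                # (21, False)
print(run_with_copy_path(path, entry, fault=(0, 2)))  # (9, True)
```

Because the comparison happens only once per path, an error that is masked before the end of the path raises no alarm, which matches the goal of flagging only errors that still affect the path result.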
As can be seen, in the embodiment of the present application, a program to be detected is first obtained, and an SDC fragile instruction having a high SDC tendency in the instructions to be detected of the program to be detected is determined; the program to be detected is a GPGPU program, and the SDC tendency and the probability of the SDC error of the instruction to be detected are in positive correlation; then constructing instruction paths corresponding to the basic blocks according to the dependency relationship among the SDC fragile instructions in the basic blocks of the program to be detected; wherein the instruction path is a first type path comprising a plurality of SDC fragile instructions with dependency and/or a second type path comprising a single SDC fragile instruction without dependency with any of the SDC fragile instructions; and finally, copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path. According to the method and the device, the SDC fragile instructions with high SDC tendency are screened from the program to be detected, then the instruction paths and the corresponding copy paths are constructed for the screened SDC fragile instructions according to the dependency relationship among the instructions by taking the basic blocks as units, and a large number of SDC errors are detected through a small number of instruction copies under the condition that the reliability of the program is ensured, so that the error detection efficiency is improved.
Fig. 6 is a flowchart of a method for constructing an SDC vulnerability prediction model according to an embodiment of the present application. Referring to fig. 6, the method for constructing the SDC vulnerability prediction model includes:
S21: acquiring a sample instruction and determining characteristic information of the sample instruction.
In this embodiment, a sample instruction is acquired, and feature information of the sample instruction is determined. The same SDC vulnerability prediction model can be used to predict programs of the same type; in that case, part of the instructions in the program are used as the sample instructions, and the feature information of those sample instructions is then determined. It should be noted that the SDC tendency of an instruction is found experimentally to be related to the inherent properties of the instruction and to factors encountered during error propagation. Instructions of different types and different functions have distinct SDC tendencies. As shown in fig. 7, compute instructions have a higher SDC tendency, address compute instructions a lower one, control loop instructions a lower one, and control branch instructions a higher one. In addition, instructions encountered during error propagation that can mask errors, and instructions that easily cause the program to crash, reduce the SDC tendency of an instruction. As shown in FIG. 8, when the error propagates into the left shift instruction (second row), the data is shifted left by eight bits and the eight bits on the right are padded with zeros, so 25% (8/32) of the errors are masked. Conversely, an error that propagates into shared memory enlarges the area over which the error spreads and therefore increases the SDC tendency of the instruction.
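The 25% masking figure for the left shift can be checked with a minimal Python sketch that flips each of the 32 input bits in turn and counts how many flips leave the shifted result unchanged. The function name `masked_fraction` and the sample operand are assumptions made for illustration.

```python
def masked_fraction(value: int, shift: int, width: int = 32) -> float:
    """Fraction of single-bit flips in `value` that a left shift hides."""
    mask = (1 << width) - 1
    clean = (value << shift) & mask
    masked = 0
    for bit in range(width):
        faulty = ((value ^ (1 << bit)) << shift) & mask
        if faulty == clean:   # the flipped bit was shifted out of the word
            masked += 1
    return masked / width


print(masked_fraction(0x12345678, 8))  # 0.25
```

Only flips in the top eight bits are shifted out of the 32-bit word, so 8 of the 32 possible single-bit errors are masked regardless of the operand value.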
Based on this, the feature information in the embodiment of the present application includes instruction attribute information, error propagation information and shared memory information. The instruction attribute information is a feature vector characterizing the instruction type and the instruction function; the error propagation information is a feature vector of the total number of instructions, the number of error-masking instructions and the number of program-crash instructions in the error propagation process; and the shared memory information is a feature vector of shared load information and shared store information. The specific definition of the feature vector is shown in fig. 9.
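One possible way to hold these three groups of features is sketched below in Python. The exact fields are defined in fig. 9, which is not reproduced here, so every field name in this sketch is a hypothetical placeholder chosen to mirror the categories named in the text.

```python
from dataclasses import dataclass, astuple


@dataclass
class InstructionFeatures:
    # Instruction attribute information (hypothetical one-hot style flags).
    is_compute: int
    is_address_compute: int
    is_control_loop: int
    is_control_branch: int
    # Error propagation information (counts along the propagation path).
    propagated_instr_count: int
    masking_instr_count: int
    crash_instr_count: int
    # Shared memory information.
    shared_loads: int
    shared_stores: int

    def to_vector(self):
        """Flatten the record into a numeric feature vector."""
        return [float(x) for x in astuple(self)]


features = InstructionFeatures(1, 0, 0, 0, 12, 2, 1, 0, 3)
print(features.to_vector())
```

Keeping the three groups in one flat vector makes the record directly usable as input to a classifier, whatever model is eventually chosen.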
S22: performing an error injection operation on the sample instruction a preset number of times, and judging whether the number of error injection results in which an SDC error occurs is larger than a preset threshold value; if so, judging whether the characteristic information meets the SDC tendency condition; and if so, labeling the sample instruction with a sample label representing that the sample instruction is an SDC fragile instruction, so as to obtain the training set.
In this embodiment, the error injection operation is performed on the sample instruction a preset number of times, and it is determined whether the number of error injection results in which an SDC error occurs is greater than a preset threshold. This is a selective error injection process. Error injection is an experimental technique that uses software simulation to imitate soft errors that would actually occur in hardware. Part of the instructions in the program are selected, errors are injected into each of them, and the SDC vulnerability of each instruction is preliminarily judged from the error injection results and a preset threshold given by the user. If the percentage of SDC outcomes among the error injection results of an instruction exceeds the preset threshold given by the user, it is further judged whether the feature information meets the SDC tendency condition, that is, whether the feature information indicates a high SDC tendency. If so, the instruction is judged to be an SDC fragile instruction; otherwise it is judged to be a non-SDC-fragile instruction. The sample instruction is then labeled with a sample label characterizing it as an SDC fragile instruction, so as to obtain the training set. It should be noted that, in the above error injection process, during each execution of the program an instruction of a thread is randomly selected and a bit flip error is injected into a random location of its register, so that after errors have been injected into a plurality of locations of the instruction, the overall error distribution of the instruction can be obtained.
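The labeling procedure above can be sketched in Python as follows. This is a toy software model under assumptions: `toy_program` stands in for a real GPGPU kernel, the `features_ok` flag stands in for the unspecified SDC tendency condition on the feature information, and the function names, trial count and threshold are hypothetical.

```python
import random


def classify_outcome(golden, run):
    """Classify one injection run as 'masked', 'due' or 'sdc'."""
    try:
        out = run()
    except Exception:
        return "due"    # crash or abnormal exit
    return "masked" if out == golden else "sdc"


def label_sample(program, data, trials, threshold, features_ok, seed=0):
    """Return (is_sdc_fragile, sdc_ratio) for one sample instruction."""
    rng = random.Random(seed)
    golden = program(data, None)    # fault-free reference output
    sdc = 0
    for _ in range(trials):
        bit = rng.randrange(32)     # random bit position in the register
        if classify_outcome(golden, lambda: program(data, bit)) == "sdc":
            sdc += 1
    ratio = sdc / trials
    return bool(ratio > threshold and features_ok), ratio


def toy_program(data, bit):
    """Stand-in program: optional bit flip, then a left shift by eight."""
    x = data
    if bit is not None:
        x ^= 1 << bit
    return (x << 8) & 0xFFFFFFFF


fragile, ratio = label_sample(toy_program, 0x1234, 1000, 0.5, True)
print(fragile, round(ratio, 2))
```

In this toy case roughly three quarters of the injected flips survive the shift and change the output, so the sample exceeds a 0.5 threshold and is labeled SDC fragile when the feature condition also holds.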
S23: Train the blank model constructed based on the deep learning algorithm using the training set, to obtain a trained SDC vulnerability prediction model.
In this embodiment, the blank model constructed based on the deep learning algorithm is trained with the training set to obtain the trained SDC vulnerability prediction model. In other words, the training set is used to train a classifier that predicts the SDC tendency of program instructions, and the set of SDC fragile instructions predicted by the machine learning classifier is the set of instructions worth protecting. Protecting only the SDC fragile instructions in the program allows most of the SDC errors in the program to be detected while reducing the overhead of instruction duplication.
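As a rough illustration of the training step, the sketch below fits a classifier on labeled sample instructions and then uses it to decide which instructions to protect. It deliberately swaps in a plain logistic regression written in pure Python for the blank model of the embodiment, whose architecture is not given here; the three-element feature vectors, the learning rate, the epoch count, and the 0.5 cutoff are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    """Predicted probability that the instruction with features x is SDC fragile."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train(samples, labels, lr=0.5, epochs=5000):
    """Full-batch gradient descent on the logistic loss (stand-in model)."""
    n, n_feat = len(samples), len(samples[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(samples, labels):
            err = score(w, b, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_fragile(model, x, cutoff=0.5):
    w, b = model
    return score(w, b, x) > cutoff

# Hypothetical training set: one feature vector per sample instruction,
# label 1 = SDC fragile, 0 = not SDC fragile.
SAMPLES = [[1, 0.9, 1], [1, 0.8, 0], [0, 0.7, 1],
           [0, 0.1, 0], [1, 0.2, 0], [0, 0.3, 1]]
LABELS = [1, 1, 1, 0, 0, 0]

model = train(SAMPLES, LABELS)
protected = [x for x in SAMPLES if predict_fragile(model, x)]
```

Only the instructions in `protected` would then be passed on to path construction and duplication, which is where the saving in duplication overhead comes from.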
It can be seen that, in this embodiment of the application, a sample instruction is first acquired and its characteristic information is determined. The error injection operation is then performed on the sample instruction a preset number of times, and it is judged whether the number of error injection results in which an SDC error occurs is greater than a preset threshold; if so, it is judged whether the characteristic information meets the SDC tendency condition; if so, the sample instruction is labeled with a sample label indicating that it is an SDC fragile instruction, so as to obtain the training set. Finally, the blank model constructed based on the deep learning algorithm is trained with the training set to obtain a trained SDC vulnerability prediction model. Because the SDC vulnerability of the instructions in the program to be detected is predicted by a machine learning method, only a small amount of error injection is needed, which saves the time that a large amount of error injection would consume.
Referring to Fig. 10, an embodiment of the application correspondingly further discloses a GPGPU program SDC error detection device, including:
an obtaining module 11, configured to acquire a program to be detected and determine the SDC fragile instructions having a high SDC tendency among the instructions to be detected of the program to be detected; wherein the program to be detected is a GPGPU program, and the SDC tendency is positively correlated with the probability that an SDC error occurs in the instruction to be detected;
a construction module 12, configured to construct the instruction paths corresponding to each basic block according to the dependency relationships between the SDC fragile instructions in each basic block of the program to be detected; wherein an instruction path is a first type path comprising a plurality of SDC fragile instructions that have dependency relationships, and/or a second type path comprising a single SDC fragile instruction that has no dependency relationship with any other SDC fragile instruction;
a copying module 13, configured to copy the instruction path to obtain a corresponding copy path, and to detect SDC errors in the program to be detected based on the instruction path and the copy path.
As can be seen, in this embodiment of the application, a program to be detected is first acquired, and the SDC fragile instructions having a high SDC tendency among its instructions to be detected are determined; the program to be detected is a GPGPU program, and the SDC tendency is positively correlated with the probability that an SDC error occurs in the instruction to be detected. Instruction paths corresponding to each basic block are then constructed according to the dependency relationships between the SDC fragile instructions in each basic block of the program to be detected; an instruction path is a first type path comprising a plurality of SDC fragile instructions that have dependency relationships, and/or a second type path comprising a single SDC fragile instruction that has no dependency relationship with any other SDC fragile instruction. Finally, the instruction path is copied to obtain a corresponding copy path, and SDC errors in the program to be detected are detected based on the instruction path and the copy path. In this way, the SDC fragile instructions with a high SDC tendency are screened out of the program to be detected, and instruction paths and corresponding copy paths are constructed for the screened instructions basic block by basic block according to the dependency relationships between them. A large number of SDC errors are thus detected with a small number of instruction copies while program reliability is preserved, which improves error detection efficiency.
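One plausible reading of the path construction step is sketched below: within a basic block, SDC fragile instructions connected by register data dependencies are grouped into a first type path, and a fragile instruction with no such link forms a second type path on its own. The `Inst` record, the read-after-write notion of dependency, and the union-find grouping are assumptions made for illustration and are not taken from the embodiment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Inst:
    idx: int        # position inside the basic block
    dst: str        # destination register
    srcs: tuple     # source registers
    fragile: bool   # predicted SDC fragile?

def build_paths(block):
    """Group the fragile instructions of one basic block into instruction paths."""
    last_writer = {}                                     # register -> idx of latest writer
    parent = {i.idx: i.idx for i in block if i.fragile}  # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]                # path compression
            a = parent[a]
        return a

    for inst in block:
        if inst.fragile:
            for reg in inst.srcs:
                producer = last_writer.get(reg)
                if producer in parent:                   # producer is fragile as well
                    parent[find(inst.idx)] = find(producer)
        last_writer[inst.dst] = inst.idx

    groups = {}
    for idx in list(parent):                             # program order
        groups.setdefault(find(idx), []).append(idx)
    paths = sorted(groups.values(), key=lambda p: p[0])
    return [("first-type" if len(p) > 1 else "second-type", p) for p in paths]

# Hypothetical basic block: instruction 2 is not fragile, instruction 4 is isolated.
BLOCK = [
    Inst(0, "r1", ("r0",), True),
    Inst(1, "r2", ("r1",), True),
    Inst(2, "r3", ("r9",), False),
    Inst(3, "r4", ("r2", "r3"), True),
    Inst(4, "r5", ("r8",), True),
]
```

For `BLOCK`, instructions 0, 1 and 3 are chained through `r1` and `r2` and form one first type path, while instruction 4 becomes a second type path.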
In some embodiments, the obtaining module 11 specifically includes:
a feature determining unit, configured to acquire a program to be detected and determine characteristic information of the instructions to be detected of the program to be detected; wherein the characteristic information characterizes the SDC tendency of the instruction to be detected;
a prediction unit, configured to input the instructions to be detected and their characteristic information into a trained SDC vulnerability prediction model, so that the SDC vulnerability prediction model outputs the SDC fragile instructions having a high SDC tendency among the instructions to be detected; wherein the SDC vulnerability prediction model is obtained by training a blank model constructed based on a machine learning algorithm with a training set, the training set comprises sample instructions and corresponding sample labels, and each sample label is determined from the error injection results obtained after the error injection operation is performed on the sample instruction, based on the characteristic information of the sample instruction.
In some embodiments, the copying module 13 specifically includes:
a storage unit, configured to store the value in the original register of each SDC fragile instruction in the instruction path into a new register, so as to obtain the corresponding duplicate path;
a comparison unit, configured to compare the instruction execution result of the duplicate path with the instruction execution result of the instruction path by inserting a comparison instruction at the end of the duplicate path, and to judge that an SDC error has occurred in an SDC fragile instruction of the instruction path if the two results are inconsistent.
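The duplicate-and-compare scheme handled by these two units can be simulated in a few lines of Python. This is a sketch under assumptions: registers are modeled as dictionary entries, the `shadow` dictionary stands for the new registers, and the optional `fault` argument (a step index and a bit position) is a hypothetical way to flip a bit in the original path only.

```python
def run_with_copy_path(path, regs, fault=None):
    """Execute an instruction path together with its duplicate.

    path  : list of (dst, op, srcs) tuples in program order
    regs  : initial values of the original registers
    fault : optional (step, bit); flips that bit of the original dst after `step`
    Returns (final value of the original path, sdc_detected).
    """
    regs = dict(regs)
    shadow = {}                                   # new registers of the duplicate path
    for step, (dst, op, srcs) in enumerate(path):
        for src in srcs:
            # Store each original input value into a new register on first use.
            shadow.setdefault(src, regs[src])
        regs[dst] = op(*[regs[s] for s in srcs])      # original instruction
        shadow[dst] = op(*[shadow[s] for s in srcs])  # duplicate instruction
        if fault is not None and fault[0] == step:
            regs[dst] ^= 1 << fault[1]                # soft error hits the original only
    final = path[-1][0]
    # Comparison inserted at the end of the duplicate path.
    return regs[final], regs[final] != shadow[final]

# Hypothetical two-instruction path: r1 = r0 + 5; r2 = r1 * 2
PATH = [
    ("r1", lambda a: a + 5, ("r0",)),
    ("r2", lambda a: a * 2, ("r1",)),
]
```

Because the comparison sits only at the end of the path, an error injected into `r1` is caught after it has propagated to `r2`, so one comparison covers the whole chain of dependent instructions.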
In some embodiments, the SDC error detection apparatus further comprises:
a sample determining module, configured to acquire a sample instruction and determine characteristic information of the sample instruction;
a labeling module, configured to perform the error injection operation on the sample instruction a preset number of times, and judge whether the number of error injection results in which an SDC error occurs is greater than a preset threshold; if so, judge whether the characteristic information meets the SDC tendency condition; if so, label the sample instruction with a sample label indicating that the sample instruction is an SDC fragile instruction, so as to obtain the training set;
a training module, configured to train the blank model constructed based on the deep learning algorithm using the training set, to obtain a trained SDC vulnerability prediction model;
an alarm module, configured to send out alarm information if an SDC error is detected, so that the program to be detected in which the SDC error occurred can be repaired according to the alarm information.
In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The GPGPU program SDC error detection method and device provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. Meanwhile, those skilled in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the invention.

Claims (6)

1. A GPGPU program SDC error detection method, characterized by comprising the following steps:
acquiring a program to be detected, and determining an SDC fragile instruction with high SDC tendency in the instructions to be detected of the program to be detected; the program to be detected is a GPGPU program, and the SDC tendency and the probability of the SDC error of the instruction to be detected are in positive correlation;
constructing instruction paths corresponding to the basic blocks according to the dependency relationship among the SDC fragile instructions in the basic blocks of the program to be detected; wherein the instruction path is a first type path comprising a plurality of SDC fragile instructions with dependency and/or a second type path comprising a single SDC fragile instruction without dependency with any of the SDC fragile instructions;
copying the instruction path to obtain a corresponding copy path, and detecting SDC errors in the program to be detected based on the instruction path and the copy path;
wherein copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path, comprises:
storing the value of the original register of each SDC fragile instruction in the instruction path into a new register, respectively calculating the SDC fragile instruction and the duplicate instruction, and storing the calculation results of the SDC fragile instruction and the duplicate instruction into different registers to obtain corresponding duplicate paths;
and comparing the instruction running result in the duplicate path with the instruction running result in the instruction path by inserting a comparison instruction at the end of the duplicate path, and if the instruction running result is inconsistent with the instruction running result in the instruction path, judging that the SDC fragile instruction in the instruction path has an SDC error.
2. The method for detecting SDC errors in a GPGPU program according to claim 1, wherein the acquiring the program to be detected, determining an SDC fragile instruction having a high SDC tendency in the instructions to be detected of the program to be detected, comprises:
acquiring a program to be detected, and determining characteristic information of an instruction to be detected of the program to be detected; wherein the characteristic information characterizes the SDC tendency of the instruction to be detected;
inputting the to-be-detected instruction and the characteristic information thereof into a trained SDC vulnerability prediction model so that the SDC vulnerability prediction model outputs an SDC vulnerability instruction with high SDC tendency in the to-be-detected instruction; the SDC vulnerability prediction model is a model obtained by training a blank model constructed based on a machine learning algorithm by using a training set, wherein the training set comprises sample instructions and corresponding sample labels, and the sample labels are obtained by determining error injection results after error injection operation is performed on the sample instructions based on characteristic information of the sample instructions.
3. The method for detecting SDC errors in a GPGPU program according to claim 2, wherein, before inputting the instruction to be detected and the characteristic information thereof into the trained SDC vulnerability prediction model, the method further comprises:
acquiring a sample instruction and determining characteristic information of the sample instruction;
performing the error injection operation on the sample instruction a preset number of times, and judging whether the number of error injection results in which an SDC error occurs is greater than a preset threshold; if so, judging whether the characteristic information meets the SDC tendency condition; if so, labeling the sample instruction with a sample label indicating that the sample instruction is an SDC fragile instruction, so as to obtain the training set;
and training the blank model constructed based on the deep learning algorithm using the training set, to obtain a trained SDC vulnerability prediction model.
4. The GPGPU program SDC error detection method according to claim 2 or 3, wherein the characteristic information comprises instruction attribute information, error propagation information, and shared memory information;
the instruction attribute information is a feature vector representing the instruction type and the instruction function; the error propagation information is a feature vector representing the number of all instructions, the number of error-masking instructions, and the number of program-crash instructions in the error propagation process; and the shared memory information is a feature vector representing shared-memory load information and shared-memory store information.
5. The GPGPU program SDC error detection method according to claim 1, 2 or 3, wherein, after detecting the SDC error in the program to be detected based on the instruction path and the duplicate path, the method further comprises:
if an SDC error is detected, sending out alarm information, so that the program to be detected in which the SDC error occurred is repaired according to the alarm information.
6. A GPGPU program SDC error detection apparatus, comprising:
the acquisition module is used for acquiring a program to be detected and determining an SDC fragile instruction with high SDC tendency in the instructions to be detected of the program to be detected; the program to be detected is a GPGPU program, and the SDC tendency and the probability of the SDC error of the instruction to be detected are in positive correlation;
the construction module is used for constructing an instruction path corresponding to each basic block according to the dependency relationship between the SDC fragile instructions in each basic block of the program to be detected; wherein the instruction path is a first type path comprising a plurality of SDC fragile instructions with dependency and/or a second type path comprising a single SDC fragile instruction without dependency with any of the SDC fragile instructions;
the copying module is used for copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path;
wherein, in the copying module, copying the instruction path to obtain a corresponding copy path, and detecting the SDC error in the program to be detected based on the instruction path and the copy path, comprises:
storing the value of the original register of each SDC fragile instruction in the instruction path into a new register, respectively calculating the SDC fragile instruction and the duplicate instruction, and storing the calculation results of the SDC fragile instruction and the duplicate instruction into different registers to obtain corresponding duplicate paths;
and comparing the instruction running result in the duplicate path with the instruction running result in the instruction path by inserting a comparison instruction at the end of the duplicate path, and if the instruction running result is inconsistent with the instruction running result in the instruction path, judging that the SDC fragile instruction in the instruction path has an SDC error.
CN202110903201.2A 2021-08-06 2021-08-06 GPGPU program SDC error detection method and device Active CN113610154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110903201.2A CN113610154B (en) 2021-08-06 2021-08-06 GPGPU program SDC error detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110903201.2A CN113610154B (en) 2021-08-06 2021-08-06 GPGPU program SDC error detection method and device

Publications (2)

Publication Number Publication Date
CN113610154A CN113610154A (en) 2021-11-05
CN113610154B CN113610154B (en) 2023-12-29

Family

ID=78339773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110903201.2A Active CN113610154B (en) 2021-08-06 2021-08-06 GPGPU program SDC error detection method and device

Country Status (1)

Country Link
CN (1) CN113610154B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334903A (en) * 2018-02-06 2018-07-27 南京航空航天大学 A kind of instruction SDC fragility prediction techniques based on support vector regression
CN109063775A (en) * 2018-08-03 2018-12-21 南京航空航天大学 Instruction SDC fragility prediction technique based on shot and long term memory network
CN111159011A (en) * 2019-12-09 2020-05-15 南京航空航天大学 Instruction vulnerability prediction method and system based on deep random forest
CN112765609A (en) * 2020-12-31 2021-05-07 南京航空航天大学 Multi-bit SDC fragile instruction identification method based on single-class support vector machine

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10997027B2 (en) * 2017-12-21 2021-05-04 Arizona Board Of Regents On Behalf Of Arizona State University Lightweight checkpoint technique for resilience against soft errors

Also Published As

Publication number Publication date
CN113610154A (en) 2021-11-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant