CN115408270A - Soft error detection system and method for heterogeneous SoC chip multi-core processor - Google Patents

Soft error detection system and method for heterogeneous SoC chip multi-core processor

Info

Publication number
CN115408270A
Authority
CN
China
Prior art keywords
kernel
core
data
error detection
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210922152.1A
Other languages
Chinese (zh)
Inventor
闫允一
乔良全
赖晓玲
丁嘉鑫
张晋新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210922152.1A
Publication of CN115408270A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a soft error detection system and method for a heterogeneous SoC chip multi-core processor. The system comprises an APU (Application Processing Unit) application unit and an RPU (Real-time Processing Unit) error detection unit. The APU application unit comprises a core application module and a data sampling module: the core application module deploys three core applications in different cores, and the data sampling module collects the field information (the execution context) of the core application in each core at preset times. The RPU error detection unit comprises a data receiving module and a soft error detection module: the data receiving module receives the sampled data and makes a preliminary timeout judgment based on the arrival times of the sampled data from the different cores. If a core times out, it is judged to be a faulty core; if no core times out, the soft error detection module performs soft error detection on the sampled information. Because the timer and the user code are independent of each other, the user code does not need to be analyzed, which improves the efficiency of engineering designers.

Description

Soft error detection system and method for heterogeneous SoC chip multi-core processor
Technical Field
The invention belongs to the technical field of soft error detection, and particularly relates to a soft error detection system and method for a heterogeneous SoC chip multi-core processor.
Background
Space radiation particles capable of producing radiation effects are captured by the Earth's magnetic field and exist in large numbers in the near-Earth space environment. These particles consist mainly of protons, electrons, alpha particles, heavy ions, gamma rays and the like, and they form the Earth's radiation belts in near-Earth space. When a space radiation particle with sufficiently high energy strikes a PN junction of an integrated circuit, it can flip a logic state or cause the circuit to fail; the resulting system failure is called a single event effect. In 2010, the American company Xilinx introduced an "FPGA + processor" SoC chip and named the first product Zynq-7000. The Zynq-7000 tightly integrates an ARM Cortex-A9 MPCore processor system and FPGA programmable logic on the same chip, which makes Zynq series devices flexible, configurable and high-performing. As a result, Zynq devices attracted the interest of many aerospace organizations, including NASA, as soon as they were introduced. Xilinx later released an MPSoC chip combining an FPGA, a quad-core ARM Cortex-A53 processor and a dual-core ARM Cortex-R5 processor, which greatly improved processing performance.
At present there are mainly two types of soft error detection methods for Zynq multi-core processors. The first takes the dual-core processor of a Zynq-7000 device as the key application object: the dual-core processor operates in a dual-core comparison mode, and a separate checking circuit is placed in the FPGA. The second takes a Zynq MPSoC device as the protection object: the FPGA carries the application functions, and the multi-core processor schedules the overall task. The multi-core processor operates in a three-core comparison mode, and the information comparison unit is hosted by the real-time processing unit, which has a lockstep mechanism. The three-core comparison mode can not only detect soft errors but also determine which core is in error. Both methods implement soft error detection at the assembly language level. Specifically, the assembly code of a task is divided into basic blocks according to certain rules, and a checkpoint is set at the end of each basic block; the basic blocks serve to locate where in the processor a soft error occurred. When the task reaches a checkpoint, it sends the specified variable information to the checking circuit or the information comparison unit, which judges whether a single-event soft error has occurred in the processor.
The prior art has the following disadvantages: 1) It is implemented at the assembly language level, whereas current tasks usually consist of large bodies of C code, so the engineering difficulty is high; 2) The rules for dividing basic blocks are complicated, and for a large-scale task dividing the basic blocks takes a great deal of effort and is inefficient; 3) With dual-core comparison the faulty core cannot be located, which hinders subsequent recovery of the faulty core; 4) In large-scale space missions the FPGA mostly serves as the functional core, so its logic resources are scarce, and designing a checking circuit specifically for the processor takes up FPGA logic resources that the design needs.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a soft error detection system and a soft error detection method for a heterogeneous SoC chip multi-core processor. The technical problem addressed by the invention is solved by the following technical scheme:
One aspect of the invention provides a soft error detection system for a heterogeneous SoC chip multi-core processor, comprising an APU application unit and an RPU error detection unit, wherein,
the APU application unit comprises a core application module and a data sampling module, wherein the core application module deploys three identical core applications in different processor cores and sets preset running times for the three core applications; the data sampling module is used for collecting the field information of the core applications in the different cores of the core application module at preset times;
the RPU error detection unit comprises a data receiving module and a soft error detection module, wherein the data receiving module is used for receiving the sampled data obtained by the data sampling module and making a preliminary timeout judgment based on the arrival times of the sampled data from the different cores; if a core times out, that core is directly judged to be a faulty core, and if none of the cores times out, the collected data is passed to the soft error detection module, which performs soft error detection on the sampled information to obtain the fault location and fault type.
In an embodiment of the present invention, the core application module further includes a sampling timer, and when a timer interrupt of the sampling timer is triggered, each core is switched to the data sampling module to perform data sampling operation on core applications in different cores in the core application module.
In one embodiment of the present invention, the running times of the different cores in the core application module satisfy the following condition:
Figure BDA0003778192480000031
where t_1 denotes the start time at which the first core starts to run, t_2 denotes the start time at which the second core starts to run, t_3 denotes the start time at which the third core starts to run, Δt_1 denotes the time difference between the first core and the second core, Δt_2 denotes the time difference between the second core and the third core, Δt denotes the time difference between the first core and the third core, and T_m denotes the running time of the core application module for a single run.
In one embodiment of the invention, the data sampling module comprises a data sampling component and a data transmission component, wherein,
the data sampling component is used for obtaining the key information needed to detect single-event soft errors in the core application module, the key information comprising the core ID and the data in the general registers, floating-point registers, program counter, stack pointer register and program status register;
the data sending component is used for setting the interrupt mask toward the data receiving module so as to inform the RPU error detection unit that the data sampling task is complete.
In one embodiment of the invention, the data receiving module includes an information logging component, a timing component, and a preliminary failure analysis component, wherein,
the information logging component receives and stores the sampled data from the different cores of the data sampling module;
the timing component is used for recording, upon receiving the start interrupt request initiated by the APU application unit, the arrival time of each interrupt and the order in which the cores' interrupts arrive, forming a core vector O = <F, S, T, Δt_FS, Δt_ST>, where F denotes the number of the first core received, S denotes the number of the second core received, T denotes the number of the third core received, Δt_FS denotes the difference between the arrival times of the sampling signals of the first and second cores received, and Δt_ST denotes the difference between the arrival times of the sampling signals of the second and third cores received;
the preliminary fault analysis component uses the core vector O provided by the timing component to judge whether any core sent its interrupt early or failed to send it before the timeout, preliminarily identifies the abnormal core, and updates the abnormal state E_i of that core.
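The timing and preliminary fault analysis components described above can be illustrated with a minimal C sketch. The names `Arrival`, `CoreVector`, `build_vector` and `classify_arrival`, the tick unit, and the tolerance `tol` are all assumptions made for illustration; the text specifies only the contents of O and that an early or overdue interrupt marks a core as abnormal. The rule used here (core i is expected at its reported start time plus T_m) is one plausible reading, not the disclosed algorithm.

```c
#include <stdint.h>

/* Hypothetical record of one sampling interrupt as seen by the RPU. */
typedef struct {
    int      core_id;  /* 1, 2 or 3 */
    uint64_t time;     /* arrival time in timer ticks (assumed unit) */
} Arrival;

/* Core vector O = <F, S, T, dt_FS, dt_ST>. */
typedef struct {
    int      F, S, T;       /* core numbers in order of arrival */
    uint64_t dt_FS, dt_ST;  /* gaps between consecutive arrivals */
} CoreVector;

/* Build O from three arrivals that are already sorted by arrival time. */
static CoreVector build_vector(const Arrival a[3])
{
    CoreVector o;
    o.F = a[0].core_id;
    o.S = a[1].core_id;
    o.T = a[2].core_id;
    o.dt_FS = a[1].time - a[0].time;
    o.dt_ST = a[2].time - a[1].time;
    return o;
}

typedef enum { CORE_OK = 0, CORE_EARLY = 1, CORE_OVERDUE = 2 } CoreState;

/* Assumed rule: a core that started at `start` should deliver its sample
 * at start + t_m, give or take `tol` ticks. */
static CoreState classify_arrival(uint64_t start, uint64_t t_m,
                                  uint64_t arrival, uint64_t tol)
{
    uint64_t expected = start + t_m;
    if (arrival + tol < expected) return CORE_EARLY;
    if (arrival > expected + tol) return CORE_OVERDUE;
    return CORE_OK;
}
```

A core that never sends its interrupt would be handled by a watchdog on the RPU side rather than by `classify_arrival`; that part is omitted here.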
In one embodiment of the invention, the soft error detection module comprises a similarity analysis component, a failure cause analysis component, and a failure model matching component, wherein,
the similarity analysis component is used for receiving the data and the abnormal state of each kernel from the data receiving module and calculating the overall similarity of the kernel data;
the fault cause analysis component is used for analyzing and positioning the fault cause according to a preset abnormal bit quantity calculation rule and a fault kernel positioning algorithm;
and the fault model matching component is used for obtaining a fault model according to the fault reason analysis and positioning result.
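The "abnormal bit quantity calculation rule" and the "fault kernel positioning algorithm" are not spelled out in this part of the text. As a stand-in, the sketch below uses a plain two-out-of-three comparison: the abnormal bit count is taken to be the Hamming distance between two register snapshots, and the faulty core is the one whose snapshot disagrees with the other two. The function names and the snapshot layout (an array of 64-bit words) are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in a 64-bit word (portable loop). */
static unsigned popcount64(uint64_t x)
{
    unsigned n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* Hamming distance between two register snapshots of n words each. */
static unsigned abnormal_bits(const uint64_t *a, const uint64_t *b, size_t n)
{
    unsigned bits = 0;
    for (size_t i = 0; i < n; i++)
        bits += popcount64(a[i] ^ b[i]);
    return bits;
}

/* Returns 0 if all three snapshots agree, 1..3 for the core that differs
 * from the other two, or -1 if no two snapshots agree. */
static int locate_faulty_core(const uint64_t *c1, const uint64_t *c2,
                              const uint64_t *c3, size_t n)
{
    unsigned d12 = abnormal_bits(c1, c2, n);
    unsigned d13 = abnormal_bits(c1, c3, n);
    unsigned d23 = abnormal_bits(c2, c3, n);
    if (d12 == 0 && d13 == 0) return 0;
    if (d23 == 0) return 1;
    if (d13 == 0) return 2;
    if (d12 == 0) return 3;
    return -1;
}
```

The distance returned by `abnormal_bits` could then feed a fault model lookup, for example by distinguishing single-bit from multi-bit upsets; that mapping is likewise an assumption.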
Another aspect of the present invention provides a soft error detection method for a heterogeneous SoC chip multi-core processor, including:
S1: deploying three identical core applications in different processor cores and setting preset time differences for the three core applications;
S2: collecting the field information of the core applications in the different cores at preset times;
S3: making a preliminary timeout judgment based on the arrival times of the collected data from the different cores; if a core times out, directly judging that core to be a faulty core, and if none of the cores times out, executing step S4;
S4: performing soft error detection on the sampled information of the cores to obtain the fault location and fault type.
In an embodiment of the present invention, the S1 further includes:
setting a sampling timer, so that when the timer interrupt of the sampling timer is triggered, each core switches to a data sampling module to sample the data of the core application in that core.
In one embodiment of the invention, S3 comprises:
when a timer interrupt request of the sampling timer is received, recording the arrival time of each interrupt and the order in which the cores' interrupts arrive, forming a core vector O = <F, S, T, Δt_FS, Δt_ST>, where F denotes the number of the first core received, S denotes the number of the second core received, T denotes the number of the third core received, Δt_FS denotes the difference between the arrival times of the sampling signals of the first and second cores received, and Δt_ST denotes the difference between the arrival times of the sampling signals of the second and third cores received;
judging, from the core vector O, whether any core sent its interrupt early or failed to send it before the timeout, preliminarily identifying the abnormal core, and updating the abnormal state E_i of that core.
In one embodiment of the present invention, the S4 includes:
calculating the overall similarity of the kernel data according to the sampling data and the abnormal state of each kernel;
and analyzing the fault reason and positioning the fault according to a preset abnormal bit quantity calculation rule and a fault kernel positioning algorithm, and acquiring the fault type.
Compared with the prior art, the invention has the beneficial effects that:
1. The soft error detection system for a heterogeneous SoC chip multi-core processor is implemented at the C language level, which greatly reduces engineering difficulty compared with soft error detection methods at the assembly language level. The invention adopts a timer-based soft error detection method in which the timer and the user code are independent of each other, so engineering designers only need to set up the timer in the user code and do not need to analyze the user code, which improves their efficiency.
2. The most distinctive feature of the method is that it supports locating the faulty core and accurately matching the fault model, which provides the basis for targeted recovery of the faulty core afterwards; in addition, the three-core comparison function is moved into the processor cores, leaving ample FPGA resources available.
3. The data sampling and detection method provided by the invention not only supports locating the faulty core, but can also locate the cause of the fault more precisely and analyze the effect the fault may have on the task, further improving detection precision.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
FIG. 1 is a block diagram of a soft error detection system for a heterogeneous SoC chip multi-core processor according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the working process of a soft error detection system for a heterogeneous SoC chip multi-core processor according to an embodiment of the present invention;
FIG. 3 is a flow chart of single-core data sampling in an APU application unit according to an embodiment of the present invention;
FIG. 4 is a timing diagram of multi-core synchronous sampling in an APU application unit according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a data receiving module of an RPU error detection unit according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a soft error detection module of an RPU error detection unit according to an embodiment of the present invention;
FIG. 7 is a flow chart of abnormal fault detection according to an embodiment of the present invention;
FIG. 8 is a flow chart of a soft error detection method for a heterogeneous SoC chip multi-core processor according to an embodiment of the present invention.
Detailed Description
To further explain the technical means adopted by the present invention to achieve its intended purpose, and their effects, the heterogeneous SoC chip multi-core processor soft error detection system and method of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments.
The foregoing and other technical content, features and effects of the present invention will become apparent from the following detailed description of the embodiments taken in conjunction with the accompanying drawings. The description of the specific embodiments gives a deeper and more concrete understanding of the technical means adopted and the effects achieved; the accompanying drawings, however, are provided for reference and illustration only and do not limit the technical solution of the present invention.
It should be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between such entities or operations. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or device that comprises a list of elements does not include only those elements but may include other elements not expressly listed. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in an article or device that comprises the element.
Example one
SoC stands for System on Chip: several systems are integrated on the same chip, which is composed of an FPGA, a multi-core processor and other parts, with the multi-core processor responsible for management and scheduling. This embodiment addresses the single event effect on SoC chips in a space radiation environment and focuses on a method for detecting and identifying soft errors in the multi-core processor, so that soft errors in the processor are detected and located in time. The embodiment of the invention uses an MPSoC chip containing an FPGA, a quad-core Cortex-A53 processor and a dual-core Cortex-R5 processor, where the quad-core Cortex-A53 processor is called the Application Processing Unit (APU) and the dual-core Cortex-R5 processor is called the Real-time Processing Unit (RPU). Memory chips fabricated at 65 nm are already known to suffer single event upsets. As manufacturing processes advance (the chip in this embodiment uses a 28 nm process), operating voltages fall and the probability of a single event effect in memories and flip-flops keeps rising.
This embodiment provides a single-event soft error detection system for a multi-core processor based on a heterogeneous SoC chip. The system requires the RPU to be configured in lockstep mode so that it serves as the high-reliability unit of the heterogeneous SoC chip. The embodiment provides a synchronous data sampling mechanism for the APU, which ensures that the cores of the APU multi-core processor sample data synchronously. The APU soft error detection technique provided by the embodiment can find soft errors in the APU in time, locate the abnormal core, and further identify the type of soft error.
The RPU and the APU exchange information by inter-processor communication, using shared memory and inter-processor interrupts. The specific implementation of shared memory and inter-processor interrupts is known to those skilled in the art and is not limited in this specification.
Referring to FIG. 1 and FIG. 2, FIG. 1 is a block diagram of a soft error detection system for a heterogeneous SoC chip multi-core processor according to an embodiment of the present invention, and FIG. 2 is a schematic diagram of its working process. The soft error detection system comprises an APU application unit 1 and an RPU error detection unit 2. The APU application unit 1 comprises a core application module 11 and a data sampling module 12. The core application module 11 deploys three identical core applications in different processor cores and sets preset differences between their running times; the data sampling module 12 collects the field information of the core applications in the different cores of the core application module at preset times.
The RPU error detection unit 2 comprises a data receiving module 21 and a soft error detection module 22. The data receiving module 21 receives the sampled data from the data sampling module 12 and makes a preliminary timeout judgment based on the arrival times of the sampled data from the different cores. If a core times out, that core is directly judged to be a faulty core; if none of the cores times out, the collected data is passed to the soft error detection module 22. The soft error detection module 22 performs soft error detection on the sampled information, obtains the fault location and fault type, and causes the APU application unit to enter a fault recovery mechanism when a soft error has occurred.
Specifically, this embodiment refers to the APU as the APU application unit 1, which comprises the core application module 11 and the data sampling module 12, and to the RPU as the RPU error detection unit 2, which comprises the data receiving module 21 and the soft error detection module 22. The APU application unit 1 is the decision maker that manages and schedules the heterogeneous SoC chip: the core application module 11 carries the specific tasks, and the data sampling module 12 collects the field information of the core application module 11 to provide the underlying data for soft error detection. The RPU error detection unit 2 runs in lockstep and is responsible for detecting and analyzing soft errors in the APU application unit.
Further, the core application module 11 makes three copies of the core application it carries and deploys them in different processor cores. Because the core applications are identical, they may compete for peripherals or internal resources. To avoid the adverse effects of such competition, a certain time difference is set between the three core applications, after which the core application module 11 starts running. The core application is the object of soft error detection and makes up the core application module; all other modules serve soft error detection for the core application. Note that in the heterogeneous SoC chip used here, the Arm Cortex-A53 processor in the APU is a quad-core processor, but this task scenario uses only three of the cores. The remaining core may be used in later tasks.
Further, the core application module 11 also includes a sampling timer. When the timer interrupt of the sampling timer is triggered, each core switches to the data sampling module, which samples the data of the core application in that core. The sampling timer ensures that data sampling on the three cores is synchronized. Once the core application has run for the preset time, the timer interrupt is triggered, the core application module is suspended, and execution enters the data sampling module. The data sampling module 12 collects the field information of the core application module 11 as it was before the interrupt and sends the sampled data by means of an inter-processor interrupt (IPI).
The data receiving module 21 receives the sampling information of the three cores in turn and makes a preliminary timeout judgment based on the arrival times of the three cores' sampling signals. If a core times out, it can be directly judged to be a faulty core, and the faulty core should enter the fault recovery mechanism. If none of the three cores' sampling information times out, the data receiving module 21 passes the sampling signals on to the soft error detection module 22.
The soft error detection module 22 diagnoses and locates single-event soft errors and analyzes the soft error type. If the soft error detection module 22 finds a soft error, the APU application unit 1 enters the fault recovery mechanism. If it finds none, it goes on to determine whether the core application has finished: if so, the next task cycle begins; if not, the core application resumes running.
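The decision taken at the end of each sampling cycle, as described in the two paragraphs above, can be summarized in a few lines of C. The enum and function names are hypothetical, and the three inputs are assumed to come from the data receiving module, the soft error detection module and the core application respectively.

```c
/* Possible outcomes of one sampling cycle (names are illustrative). */
typedef enum {
    ACT_RECOVER,     /* enter the fault recovery mechanism */
    ACT_NEXT_CYCLE,  /* core application finished: start the next task cycle */
    ACT_RESUME       /* core application not finished: resume running it */
} Action;

/* timed_out:    a core missed its deadline (preliminary judgment)
 * soft_error:   the soft error detection module found an error
 * app_finished: the core application has completed */
static Action next_action(int timed_out, int soft_error, int app_finished)
{
    if (timed_out || soft_error)
        return ACT_RECOVER;
    return app_finished ? ACT_NEXT_CYCLE : ACT_RESUME;
}
```

Because a timeout is checked first, the similarity analysis is skipped entirely when a core has already been judged faulty.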
FIG. 3 is a flow chart of single-core data sampling in the APU application unit according to an embodiment of the present invention. The data sampling module 12 of this embodiment comprises a data sampling component and a data sending component, which together carry out the data sampling work. The data sampling component collects, promptly and accurately, the key information needed to detect single-event soft errors in the core application module: the core ID and the data in the general registers, floating-point registers, program counter, stack pointer register and program status register. In this embodiment, the data sampling module 12 defines the sampling signal as type HeartInfo, subject to Rule 1 and Rule 2; the specific definition and use are shown in Table 1.
Rule 1: the cores are numbered 1, 2 and 3, corresponding to the sampling signals HB_A, HB_B and HB_C respectively, and this correspondence does not change;
Rule 2: once determined, the core numbers are fixed.
Table 1. Definition of the sampling signal
Figure BDA0003778192480000111
In this embodiment, the data sampling module 12 is triggered by a timer interrupt. When the timer interrupt of the sampling timer fires, the core application module 11 is suspended and execution switches automatically to the data sampling module 12. The processor responds to the interrupt and saves the interrupt context, pushing the pre-interrupt context of the core application module onto the stack so that the core application module 11 can be restored after the data sampling module 12 finishes sampling. The data sampling module 12 reads the stack register value to obtain the address of the interrupt stack, obtains all register data of the core application module by traversing the interrupt stack, and sorts and packs the data into the HeartInfo type.
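The packing step can be sketched as follows. The actual HeartInfo layout is given in Table 1, which is not reproduced in the text, so the field layout here is hypothetical. The saved-frame layout (31 general registers followed by the return address and the saved status word) and the separate floating-point save area are also assumptions about how the interrupt entry code stores the context.

```c
#include <stdint.h>
#include <string.h>

#define N_GPR        31  /* AArch64 x0..x30 */
#define N_FPR_WORDS  64  /* 32 SIMD/FP registers x 128 bits, as 64-bit words */
#define FRAME_PC     31  /* assumed slot of the saved return address */
#define FRAME_PSTATE 32  /* assumed slot of the saved program status */
#define FRAME_WORDS  33

/* Hypothetical layout of the sampling signal. */
typedef struct {
    uint32_t core_id;           /* 1, 2 or 3 (Rule 1) */
    uint64_t gpr[N_GPR];        /* general registers */
    uint64_t fpr[N_FPR_WORDS];  /* floating-point registers */
    uint64_t pc;                /* program counter at the interrupt */
    uint64_t sp;                /* stack pointer before the interrupt */
    uint64_t pstate;            /* program status register */
} HeartInfo;

/* Walk the saved interrupt frame and pack it into a HeartInfo. */
static HeartInfo pack_heartinfo(uint32_t core_id, const uint64_t *frame,
                                const uint64_t *fp_frame, uint64_t sp)
{
    HeartInfo hb;
    memset(&hb, 0, sizeof hb);  /* clear padding so snapshots compare equal */
    hb.core_id = core_id;
    memcpy(hb.gpr, frame, sizeof hb.gpr);
    memcpy(hb.fpr, fp_frame, sizeof hb.fpr);
    hb.pc = frame[FRAME_PC];
    hb.pstate = frame[FRAME_PSTATE];
    hb.sp = sp;
    return hb;
}
```

Zeroing the structure first matters if the receiver compares snapshots byte by byte, since padding would otherwise hold arbitrary values.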
Subsequently, the data sending component calls Write_Mem(u64 Addr, u64 HB), which takes the address in the shared memory area to write to and the sampling signal, and writes the sampled data to the shared memory area. The data sending component 122 then sets the IPI interrupt mask IPI_MASK, informing the RPU error detection unit that the data sampling task is complete.
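The write-then-notify sequence can be tried out on a host machine with the simulation below. It is not the patent's Write_Mem implementation: `shared_mem` stands in for the shared memory area, `ipi_pending` stands in for the IPI trigger register, and `write_mem` and `send_sample` are hypothetical helpers with signatures chosen for the simulation.

```c
#include <stddef.h>
#include <stdint.h>

#define SHM_WORDS 256
static uint64_t shared_mem[SHM_WORDS];  /* stand-in for the shared region */
static uint32_t ipi_pending;            /* stand-in for the IPI trigger bits */

/* Copy n words of a sample into the shared region at a word offset.
 * Returns 0 on success, -1 if the write would run past the region. */
static int write_mem(size_t offset, const uint64_t *hb, size_t n)
{
    if (offset > SHM_WORDS || n > SHM_WORDS - offset)
        return -1;
    for (size_t i = 0; i < n; i++)
        shared_mem[offset + i] = hb[i];
    return 0;
}

/* Write the sample first, then raise the notification bit for the core. */
static int send_sample(unsigned core, size_t offset,
                       const uint64_t *hb, size_t n)
{
    if (core < 1 || core > 3)
        return -1;
    if (write_mem(offset, hb, n) != 0)
        return -1;  /* never notify for a sample that was not written */
    ipi_pending |= 1u << core;
    return 0;
}
```

On real hardware the data write would typically need to be made visible (for example with a memory barrier and, where caches are involved, a cache flush) before the interrupt is raised, so that the RPU never reads a partially written sample.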
Referring to FIG. 4, FIG. 4 is a timing diagram of multi-core synchronous sampling in the APU application unit according to an embodiment of the present invention. The multi-core synchronous sampling mechanism is the basis for correctly identifying abnormal cores in the APU application unit. As shown in FIG. 4, three project files with identical code are first generated for the APU application unit 1; they differ only in the lscript.ld linker script, which assigns each project a different OCM address space, for example 0x0001-0x1000 for the first core, 0x1001-0x2000 for the second core and 0x2001-0x3000 for the third core. Second, each core has a start interrupt and a timer interrupt. The start interrupt is triggered at the start of the core application module 11 and informs the RPU error detection unit 2 of the order in which the cores run and of their start times. The start interrupts send IPIs to the RPU error detection unit 2 in turn, reporting the start times t_1, t_2, t_3 of the three cores of the APU application unit 1; each core then initializes its private timer and starts running the core application module. The timer interrupt is initialized after the start interrupt has been triggered. The start interrupt and timer interrupt are set up as follows:
XIpiPsu_TriggerIpi(); // trigger the start interrupt
TimerInterruptInit(); // initial setup of the timer interrupt
With continued reference to FIG. 4, the running time of the data sampling module must be less than the running time of the core application module. In the APU application unit 1, let T_m be the running time of the core application module 11 and T_s the running time of the data sampling module. T_m is a constant and is also the trigger time of the timer interrupt; T_s may vary with the RPU error detection unit. T_s and T_m must satisfy Rule 3.
Rule 3: One completion of the core application module followed by the data sampling module is taken as one period. The period length is not fixed, and within any task period T_s is always less than T_m, i.e.,
Figure BDA0003778192480000132
It should be noted that different kernel running times are one of the preconditions for synchronous sampling of multi-core data and are also the basis on which the soft error detection module performs single-event soft error detection. The difference between kernel running times is far less than the running time T_m of the core application module. The kernel running times satisfy formula (1).
Figure BDA0003778192480000131
where t_1 is the time at which the first kernel starts to run, t_2 is the time at which the second kernel starts to run, t_3 is the time at which the third kernel starts to run, Δt_1 is the time difference between the first kernel and the second kernel, Δt_2 is the time difference between the second kernel and the third kernel, and Δt is the time difference between the first kernel and the third kernel.
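The timing constraints can be checked with a few lines of arithmetic. The sketch below computes the three start-time differences from their verbal definitions and tests them together with Rule 3. Because formula (1) is available only as an image, the sign convention (later start minus earlier start) and the `ratio` threshold that stands in for "far less than T_m" are assumptions, as are the function names.

```python
def start_offsets(t1: float, t2: float, t3: float):
    """Return (dt1, dt2, dt): first-to-second, second-to-third, first-to-third."""
    return t2 - t1, t3 - t2, t3 - t1


def timing_ok(t1: float, t2: float, t3: float,
              t_m: float, t_s: float, ratio: float = 0.01) -> bool:
    """Check staggered starts, total offset << T_m (assumed ratio), and T_s < T_m."""
    dt1, dt2, dt = start_offsets(t1, t2, t3)
    staggered = dt1 > 0 and dt2 > 0       # the kernels start at different times
    small_offset = dt <= ratio * t_m      # assumed reading of "far less than"
    return staggered and small_offset and t_s < t_m
```

For example, with start times 0, 2 and 5 and T_m = 1000, the offsets are 2, 3 and 5, and the check passes for any T_s below 1000.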
After the timer interrupt of the sampling timer is triggered, each kernel switches to the data sampling module to complete the data sampling of the core application module. Each kernel then waits for the soft error detection module to send back a continue command, upon which the data sampling module resets the sampling timer, ensuring that the running times of the three kernels always remain consistent.
Further, the memory of the RPU error detection unit 2 includes two task modules, which are a data receiving module 21 and a soft error detection module 22. The data receiving module 21 is responsible for receiving the data of the APU application unit, and the soft error detecting module 22 is responsible for detecting whether the APU application unit generates a single-event soft error.
Specifically, referring to FIG. 5, FIG. 5 is a schematic structural diagram of the data receiving module of the RPU error detection unit according to an embodiment of the present invention. The data receiving module 21 of this embodiment includes three receiving ports HB_A, HB_B, HB_C and three output ports HB_A′, HB_B′, HB_C′. The receiving ports receive the sampled data from the data sampling module. Each output port HB_i′, i = A, B, C, adds the kernel abnormal state E_i, i = A, B, C, to the sampled data, and sends the sampled data HB_i together with the abnormal state E_i to the soft error detection module 22.
With continued reference to FIG. 5, the data receiving module 21 includes a timing component, an information recording component, and a preliminary fault analysis component. The information recording component receives and saves sampled data from the different kernels of the data sampling module. Upon receiving a start interrupt request initiated by the APU application unit 1, the timing component records the arrival time of each interrupt and the order in which the interrupting kernels arrive, forming a kernel vector O = <F, S, T, Δt_FS, Δt_ST>, where F is the number of the first kernel received, S is the number of the second kernel received, T is the number of the third kernel received, Δt_FS is the difference between the arrival times of the sampling signals of the first and second kernels, and Δt_ST is the difference between the arrival times of the sampling signals of the second and third kernels. The preliminary fault analysis component uses the kernel vector O provided by the timing component to judge whether any kernel sent its interrupt early or failed to send an interrupt before the timeout, preliminarily identifies the abnormal kernel, and updates the kernel abnormal state E_i.
Then, after the soft error detection module judges that the kernels are error-free, the kernel time differences and order are updated to provide a reliable reference for subsequent sampling-signal reception.
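The kernel vector O described above can be built directly from the recorded arrival times. The following sketch is illustrative only: the `KernelVector` type, the `build_kernel_vector` function, and the representation of an arrival as a (kernel number, timestamp) pair are assumptions, not names from the source.

```python
from typing import List, NamedTuple, Tuple


class KernelVector(NamedTuple):
    F: str        # number of the first kernel received
    S: str        # number of the second kernel received
    T: str        # number of the third kernel received
    dt_fs: float  # arrival-time difference, first vs. second kernel
    dt_st: float  # arrival-time difference, second vs. third kernel


def build_kernel_vector(arrivals: List[Tuple[str, float]]) -> KernelVector:
    """Order three (kernel, arrival_time) records by time and form O."""
    if len(arrivals) != 3:
        raise ValueError("expected exactly three sampling signals")
    (f, tf), (s, ts), (t, tt) = sorted(arrivals, key=lambda a: a[1])
    return KernelVector(f, s, t, ts - tf, tt - ts)
```

For instance, arrivals B at 12, A at 10 and C at 15 give the order A, B, C with time differences 2 and 3.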
With continued reference to FIG. 5, the preliminary fault analysis component uses the kernel vector O provided by the timing component to judge whether a kernel sent its interrupt early or failed to send an interrupt before the timeout, preliminarily identifies the abnormal kernel, and updates the kernel abnormal state E_i. In this embodiment, the preliminary fault analysis component applies a three-level "R-Y-G" decision to the arrival of the APU sampling signals, where R indicates that a kernel's sampling signal is seriously early or seriously late, Y indicates that the kernel is in a sub-health state, and G indicates that the kernel is healthy. The kernel state is determined according to the following rules.
Rule 4: If the current kernel order equals the known kernel order and each kernel time difference is less than or equal to the corresponding known time difference, all three kernels are judged to be healthy. That is: if (<F′,S′,T′> = <F,S,T>) ∧ (Δt_F′S′ ≤ Δt_FS) ∧ (Δt_S′T′ ≤ Δt_ST), then E_i = G, i = A, B, C.
Rule 5: If the current kernel order equals the known kernel order, one kernel time difference is less than or equal to the known time difference, and the other kernel time difference is greater than the known time difference but no more than twice the known time difference, the abnormal kernel ID can be determined, and the abnormal kernel is judged to be in a sub-health state. That is: if <F′,S′,T′> = <F,S,T>, Δt_m′ ≤ Δt_m, and Δt_n < Δt_n′ ≤ 2Δt_n, with m, n ∈ {FS, ST}, m ≠ n, then E_i = Y, i ∈ (m∪n − m∩n), and E_j = G, j ∈ m∩n.
It should be noted that a kernel time difference is the time difference between two kernel sampling signals received consecutively by the data receiving module; because the data receiving module receives three sampling signals, there are two kernel time differences. The known time differences are those from the last time the data receiving module received the sampling signals, i.e., Δt_FS and Δt_ST described above.
Rule 6: If the current kernel order equals the known kernel order, one kernel time difference is less than or equal to the known time difference, and the other kernel time difference is more than twice the known time difference, the abnormal kernel ID can be determined, and the abnormal kernel is judged to be seriously early or seriously late. That is: if <F′,S′,T′> = <F,S,T>, Δt_m′ ≤ Δt_m, and Δt_n′ > 2Δt_n, with m, n ∈ {FS, ST}, m ≠ n, then E_i = R, i ∈ (m∪n − m∩n), and E_j = G, j ∈ m∩n.
Rule 7: If the current kernel order does not equal the known kernel order, and either the larger of the two current kernel time differences is at most 1.5 times the smaller one or the larger current time difference is smaller than the known maximum time difference, all kernels are judged to be healthy. That is: if <F′,S′,T′> ≠ <F,S,T>, and 1.5·min{Δt_m′, Δt_n′} ≥ max{Δt_m′, Δt_n′} or max{Δt_m′, Δt_n′} < max{Δt_m, Δt_n}, with m, n ∈ {FS, ST}, m ≠ n, then E_i = G, i = A, B, C.
Rule 8: If the current kernel order does not equal the known kernel order, the larger current time difference exceeds the known maximum time difference, and the larger current time difference is more than 1.5 times but at most 2 times the smaller one, the abnormal kernel ID can be determined, and the abnormal kernel is judged to be in a sub-health state. That is: if <F′,S′,T′> ≠ <F,S,T>, max{Δt_m′, Δt_n′} > max{Δt_m, Δt_n}, and 1.5·min{Δt_m′, Δt_n′} < max{Δt_m′, Δt_n′} ≤ 2·min{Δt_m′, Δt_n′}, with m, n ∈ {FS, ST}, m ≠ n, then E_i = Y, i ∈ (m∪n − m∩n), and E_j = G, j ∈ m∩n.
Rule 9: If the current kernel order does not equal the known kernel order, the larger current time difference exceeds the known maximum time difference, and the larger current time difference is more than 2 times the smaller one, the abnormal kernel ID can be determined, and the abnormal kernel is judged to be seriously early or seriously late. That is: if <F′,S′,T′> ≠ <F,S,T>, max{Δt_m′, Δt_n′} > max{Δt_m, Δt_n}, and max{Δt_m′, Δt_n′} > 2·min{Δt_m′, Δt_n′}, with m, n ∈ {FS, ST}, m ≠ n, then E_i = R, i ∈ (m∪n − m∩n), and E_j = G, j ∈ m∩n.
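Rules 4 to 9 amount to a small decision procedure over the known and current kernel vectors. The sketch below is one possible reading, not the patented implementation: it returns the decision level (G, Y or R) and the time difference that triggered it (FS or ST), and deliberately does not map the result to individual kernel states E_i, since that depends on how the set expression m∪n − m∩n is interpreted. The function name, the tuple layout, and the `None` result for cases the rules do not cover (for example, both time differences exceeding their known values) are assumptions.

```python
from typing import Optional, Tuple

Vector = Tuple[Tuple[str, str, str], float, float]  # (order, dt_fs, dt_st)


def classify(known: Vector, current: Vector) -> Tuple[Optional[str], Optional[str]]:
    """Return (level, gap): level is 'G', 'Y', 'R' or None; gap is 'FS', 'ST' or None."""
    k_order, k_fs, k_st = known
    c_order, c_fs, c_st = current

    if c_order == k_order:
        fs_ok, st_ok = c_fs <= k_fs, c_st <= k_st
        if fs_ok and st_ok:
            return "G", None                          # Rule 4
        if fs_ok != st_ok:                            # exactly one gap is stretched
            gap, cur, ref = ("ST", c_st, k_st) if fs_ok else ("FS", c_fs, k_fs)
            return ("Y", gap) if cur <= 2 * ref else ("R", gap)   # Rules 5 and 6
        return None, None                             # both stretched: not covered

    lo, hi = min(c_fs, c_st), max(c_fs, c_st)
    known_max = max(k_fs, k_st)
    if 1.5 * lo >= hi or hi < known_max:
        return "G", None                              # Rule 7
    if hi > known_max:
        gap = "FS" if c_fs >= c_st else "ST"
        return ("Y", gap) if hi <= 2 * lo else ("R", gap)         # Rules 8 and 9
    return None, None                                 # hi == known_max: not covered
```

With a known vector of order (A, B, C) and time differences 10 and 10, a current sample with the same order and differences 8 and 15 is classified as Y on the ST gap, while differences 8 and 25 give R.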
Further, the information recording component is responsible for recording the sampling signals generated by the APU application unit 1. Each sampling signal is sent in the form of an interrupt, so it must be protected from being lost when the data receiving module exits. In the data receiving module, each sampling signal is allocated a fixed address space, and a pointer to that address space is used to store the sampling signal promptly within the interrupt. The soft error detection module can then access the sampling signal through the pointer.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a soft error detection module of an RPU error detection unit according to an embodiment of the present invention. The soft error detection module 22 comprises a similarity analysis component, a fault reason analysis component and a fault model matching component, wherein the similarity analysis component is used for receiving data and abnormal states from each kernel of the data receiving module and calculating the overall similarity of the kernel data; the fault cause analysis component is used for analyzing and positioning the fault cause according to a preset abnormal bit quantity calculation rule and a fault kernel positioning algorithm; and the fault model matching component is used for obtaining a fault model according to the fault reason analysis and positioning result.
Specifically, the soft error detection module 22 is the key step in determining APU abnormalities and is central to single-event error detection. The soft error detection module 22 of this embodiment includes a similarity analysis component, a fault cause analysis component, and a fault model matching component, and has three input ports and three output ports. The similarity analysis component calculates the overall similarity of the kernel data, the fault cause analysis component analyzes why the similarity does not match, and the fault model matching component maps the different fault causes onto fault models. The input port data of the soft error detection module 22 are the outputs HB_A′, HB_B′, HB_C′ of the data receiving module 21, which contain the data and abnormal state of each kernel; the outputs of the soft error detection module are the fault modes F_A, F_B, F_C of the kernels.
The similarity analysis component calculates the overall similarity of the sampled data from the APU using formula (2):
Figure BDA0003778192480000171
where HB_i, i = A, B, C, is the sampled data sent by the data sampling module, and J is the overall similarity of the three sets of sampled data. The overall similarity is used to judge whether a single event upset has occurred in the APU application unit 1. If the overall similarity is not equal to 100%, at least one kernel is considered to have suffered a single event upset, but at this point it is not yet possible to determine which kernel is affected or where in the abnormal kernel the upset occurred.
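Since formula (2) is available only as an image, its exact form cannot be reproduced here. As a stand-in, the sketch below measures overall similarity as the percentage of bit positions on which all three samples agree; this choice, and the name `overall_similarity`, are assumptions. The only property the surrounding text relies on is preserved: the result is 100% exactly when the three samples are identical.

```python
def overall_similarity(hb_a: bytes, hb_b: bytes, hb_c: bytes) -> float:
    """Percentage of bit positions on which all three samples agree (stand-in for J)."""
    if not hb_a or not (len(hb_a) == len(hb_b) == len(hb_c)):
        raise ValueError("samples must be non-empty and of equal length")
    total_bits = 8 * len(hb_a)
    disagree = 0
    for a, b, c in zip(hb_a, hb_b, hb_c):
        # A bit is set in the mask if B or C differs from A at that position.
        disagree += bin((a ^ b) | (a ^ c)).count("1")
    return 100.0 * (total_bits - disagree) / total_bits
```

A single flipped bit in one of three 8-byte samples lowers the value below 100%, which is the condition that triggers further fault cause analysis.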
Further, not every single event upset causes a processor task error or failure; erroneous register data may be quickly overwritten by subsequent data. However, errors in an instruction address or data register can propagate latently through module operation, so an abnormal bit caused by a single event upset may also cause data errors or abnormal instruction jumps in subsequent task operation. Therefore, when the overall similarity is 100%, the APU application unit is considered free of single event upsets; when the overall similarity is not equal to 100%, further fault cause analysis is carried out.
Referring to fig. 7, fig. 7 is a flowchart illustrating an abnormal fault detection method according to an embodiment of the present invention. After the overall similarity J is determined to be not equal to 100%, the sampling signal needs to be further analyzed, and the abnormal fault type, the abnormal bit number and the fault kernel are determined. Before the analysis of the sampling signals is specifically executed, an abnormal bit number calculation rule and a fault kernel positioning algorithm are formulated.
Rule 10: Abnormal bit number calculation rule. Registers (including general-purpose registers, floating-point registers, the program counter, the stack pointer register, and the program status register) are known to be 64 bits wide. Let J_reg be the similarity of a register; it is calculated as in formula (2), except that the data of the current register in the different kernels are used when calculating the similarity of each register. Let n_reg be the number of abnormal bits of the register; then
Figure BDA0003778192480000181
Thereby obtaining the similarity and the number of abnormal bits of each register.
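Rule 10's formula is likewise only available as an image. Based on the later statement that the abnormal bit count expresses by how many bits an abnormal register differs from a normal one, the sketch below counts the differing bits of two 64-bit register values and derives a per-register similarity from that count. Both function names and the linear relation between the count and the similarity are assumptions.

```python
MASK64 = (1 << 64) - 1  # registers are 64 bits wide


def abnormal_bits(reference: int, observed: int) -> int:
    """Number of bit positions in which two 64-bit register values differ."""
    return bin((reference ^ observed) & MASK64).count("1")


def register_similarity(reference: int, observed: int) -> float:
    """Assumed per-register similarity: share of the 64 bits that match, in percent."""
    return 100.0 * (64 - abnormal_bits(reference, observed)) / 64
```

For example, register values 0xFF and 0xFE differ in one bit, giving one abnormal bit and a similarity just below 100%.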
Further, the fault kernel positioning algorithm executes the following steps:
Step (1): Divide the three kernels A, B, C into the groups AB, AC and BC.
Step (2): Perform similarity analysis on the three groups of kernels.
Specifically, with reference to formula (2), the overall similarity of the sampling signals is calculated for each pair of the three kernels; the kernels are grouped in pairs and the calculation is performed three times to obtain the similarity between every two kernels.
Step (3): If one of the three groups has 100% similarity and the other two groups have less than 100% similarity, a faulty kernel is judged to exist.
Step (4): Locate the faulty kernel. The faulty kernel satisfies K_fault ∈ (G_error1 ∪ G_error2) and
Figure BDA0003778192480000182
where K_fault denotes the faulty kernel, G_error1 and G_error2 denote the abnormal groups with less than 100% similarity, and G_normal denotes the normal group with 100% similarity.
Step (5): If none of the three groups has 100% similarity, at least two faulty kernels exist, and the system should perform soft error recovery immediately.
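The five steps above can be sketched as a pairwise comparison. In this sketch, byte-for-byte equality stands in for "100% similarity", and the second condition of Step (4), which appears only as an image in the source, is assumed to mean that the faulty kernel is not a member of the normal group. The function name and return format are illustrative.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple


def locate_faulty_kernel(samples: Dict[str, bytes]) -> Tuple[str, Optional[str]]:
    """Return ('none', None), ('single', kernel) or ('multiple', None)."""
    groups = list(combinations(sorted(samples), 2))        # Step 1: AB, AC, BC
    normal = [g for g in groups
              if samples[g[0]] == samples[g[1]]]           # Step 2: pairwise comparison
    if len(normal) == len(groups):
        return "none", None                                # all three samples agree
    if len(normal) == 1:                                   # Step 3: one normal group
        abnormal = [g for g in groups if g not in normal]
        suspects = set(abnormal[0]) | set(abnormal[1])     # G_error1 ∪ G_error2
        faulty = suspects - set(normal[0])                 # assumed: not in G_normal
        return "single", faulty.pop()                      # Step 4
    # Step 5: no group matches. (Exactly two matching groups cannot occur
    # with exact equality, because equality is transitive.)
    return "multiple", None
```

If kernels A and B deliver identical samples and C differs, the groups AC and BC are abnormal, AB is normal, and C is reported as the faulty kernel.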
With continued reference to FIG. 7, Rule 10 is applied cyclically to perform similarity analysis in turn on the general-purpose registers, floating-point registers, program status register, PC register, stack pointer register, and storage-area sample data, to determine whether a single-event soft error has occurred in any of these registers or data. The abnormal fault type is then determined, and the name of the abnormal register and its number of abnormal bits are output; the number of abnormal bits quantifies the degree of single event upset in the abnormal register, i.e., by how many bits the data in the abnormal register differ from the data in the normal register. The abnormal fault type classifies the specific position of the single-event soft error in the sampling signal: general-purpose and floating-point register fault F_gf, program status register fault F_spsr, PC register fault F_pc, stack pointer register fault F_sp, and storage area fault F_mem. Finally, the faulty kernel K_fault is determined using the fault kernel positioning algorithm.
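A compact way to picture this loop is a table that maps each register class to its fault type, followed by a bitwise comparison per register. The sketch below compares a normal sample with a suspect one and reports, per fault type, the abnormal registers and their abnormal bit counts. The dictionary layout of a sample, the class keys (`gpr`, `fpr`, `spsr`, `pc`, `sp`, `mem`), and the function name are assumptions; only the fault type names come from the text.

```python
from typing import Dict, List, Tuple

MASK64 = (1 << 64) - 1

# Register class -> fault type named in the text.
FAULT_TYPE = {
    "gpr": "F_gf", "fpr": "F_gf",   # general-purpose and floating-point registers
    "spsr": "F_spsr",               # program status register
    "pc": "F_pc",                   # PC register
    "sp": "F_sp",                   # stack pointer register
    "mem": "F_mem",                 # storage area
}


def classify_faults(normal: Dict[str, List[int]],
                    suspect: Dict[str, List[int]]) -> Dict[str, List[Tuple[str, int]]]:
    """Map fault type -> [(register name, abnormal bit count), ...]."""
    report: Dict[str, List[Tuple[str, int]]] = {}
    for cls, fault in FAULT_TYPE.items():
        pairs = zip(normal.get(cls, []), suspect.get(cls, []))
        for idx, (good, bad) in enumerate(pairs):
            n = bin((good ^ bad) & MASK64).count("1")   # differing bits
            if n:
                report.setdefault(fault, []).append((f"{cls}[{idx}]", n))
    return report
```

An empty report means no register class differs; otherwise each entry names the abnormal register and quantifies the upset in bits.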
In summary, the soft error detection module 22 can diagnose and locate soft errors and analyze the type of soft error. If the soft error detection module 22 finds a soft error, the APU application unit 1 enters the fault recovery mechanism. If the soft error detection module 22 does not find a soft error, it determines whether the core application has finished running: if so, the next task period begins; if not, the core application resumes running.
The embodiment of the invention aims to solve the problem of detecting and identifying soft errors in the multi-core processor of a heterogeneous SoC (system on chip). The invention constructs a timer-based data sampling method for the multi-core processor, which provides the data basis for processor soft error detection. Compared with control-flow-based soft error detection methods, which are implemented at the assembly language level, the timer-based multi-core data sampling method is raised to the level of a high-level programming language, which greatly reduces design difficulty. The invention designs a soft error detection algorithm adapted to the sampled data, so that soft errors in the multi-core processor are detected in time. To better evaluate the influence of soft errors on a program, the invention designs a fault model matching scheme for soft errors, which further classifies the soft errors, predicts the damage they may cause to the system, and provides guidance for the processor's fault-tolerance mechanism.
Example two
On the basis of the foregoing embodiments, the present embodiment provides a soft error detection method for a heterogeneous SoC chip multi-core processor, please refer to fig. 8, where the detection method includes:
S1: Deploy three identical core applications in different processor cores, and set a preset time difference for the three core applications.
Specifically, in this embodiment, the hosted core application is replicated into three copies, which are deployed in different processor cores. Because the core applications are completely identical, they may contend for peripheral or internal resources. To avoid the adverse effects that such resource contention may cause, a certain time difference is set between the three core applications.
Different kernel running times are one of the preconditions for synchronous sampling of multi-core data and are also the basis of the soft error detection module. The difference between kernel running times is far less than the running time T_m of the core application module. The kernel running times satisfy formula (1) in the first embodiment.
Further, step S1 also includes:
setting a sampling timer, so that when the timer interrupt of the sampling timer is triggered, each kernel switches to the data sampling module to sample the data of the core applications in the different kernels.
S2: Collect the context information of the core applications in the different kernels at a preset time.
Specifically, this embodiment refers to the APU as the APU application unit 1, which includes a core application module 11 and a data sampling module 12. The data sampling module 12 collects the context information of the core application module 11 and provides the basic data for soft error detection. The data sampling module 12 of this embodiment includes a data sampling component and a data sending component, which together carry out the data sampling operation. The data sampling component acquires, promptly and accurately, the key information needed to detect single-event soft errors in the core application module; this information consists of the kernel ID and the data in the general-purpose registers, floating-point registers, program counter, stack pointer register, and program status register. In this embodiment, the sampling signal is defined as the HeartInfo type in the data sampling module 12.
S3: Perform a preliminary timeout judgment according to the arrival times of the collected data from the different kernels. When a kernel times out, directly judge that kernel to be a faulty kernel; if none of the kernels times out, execute step S4.
Specifically, step S3 of this embodiment includes:
when a timer interrupt request of the sampling timer is received, recording the arrival time of each interrupt and the order in which the interrupting kernels arrive to form a kernel vector O = <F, S, T, Δt_FS, Δt_ST>, where F is the number of the first kernel received, S is the number of the second kernel received, T is the number of the third kernel received, Δt_FS is the difference between the arrival times of the sampling signals of the first and second kernels, and Δt_ST is the difference between the arrival times of the sampling signals of the second and third kernels;
judging, according to the kernel vector O provided by the timing component, whether any kernel sent its interrupt early, preliminarily identifying the abnormal kernel, and updating the kernel abnormal state E_i.
S4: Perform soft error detection on the sampling information of the kernels to obtain the fault position and fault type.
Step S4 of the present embodiment includes: calculating the overall similarity of the kernel data according to the sampling data and the abnormal state of each kernel; and analyzing the fault reasons and positioning the faults according to a preset abnormal bit quantity calculation rule and a fault kernel positioning algorithm, and obtaining the fault type. For a specific processing procedure, please refer to embodiment one, which is not described herein again.
The data sampling and detection method provided by this embodiment supports faulty kernel location, locates fault causes at a finer granularity, and analyzes the influence of faults on tasks, thereby further improving detection precision. Its most notable features are that it supports faulty kernel location, which enables targeted recovery of the faulty kernel afterwards, and that it moves the three-kernel comparison function into a processor core, leaving ample FPGA resources free.
In the embodiments provided in the present invention, it should be understood that the apparatus and method disclosed in the present invention can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware form, and can also be realized in a form of hardware and a software functional module.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments and it is not intended to limit the invention to the specific embodiments described. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A soft error detection system of a heterogeneous SoC chip multi-core processor is characterized by comprising an APU application unit and an RPU error detection unit, wherein,
the APU application unit comprises a core application module and a data sampling module, wherein the core application module respectively deploys three identical core applications in different processor cores and sets preset running time for the three core applications; the data sampling module is used for collecting the field information of the core application in different cores in the core application module at preset time;
the RPU error detection unit comprises a data receiving module and a soft error detection module, wherein the data receiving module is used for receiving sampling data obtained by the data sampling module and carrying out preliminary overtime judgment according to the arrival time of the sampling data of different kernels, when a certain kernel is overtime, the current kernel is directly judged to be a fault kernel, if the sampling time of a plurality of kernels is not overtime, the acquired data is transmitted to the soft error detection module, and the soft error detection module is used for carrying out soft error detection on the sampling information to obtain a fault position and a fault type.
2. The soft error detection system for the heterogeneous SoC chip multi-core processor of claim 1, wherein the core application module further comprises a sampling timer, and when a timing interrupt of the sampling timer is triggered, each core is switched to the data sampling module to perform data sampling operation on core applications in different cores of the core application module.
3. The soft error detection system for the heterogeneous SoC chip multi-core processor according to claim 1, wherein the running time of different cores in the core application module satisfies the following requirements:
Figure FDA0003778192470000011
wherein t_1 indicates the time at which the first kernel starts to run, t_2 indicates the time at which the second kernel starts to run, t_3 indicates the time at which the third kernel starts to run, Δt_1 represents the time difference between the first and second kernels, Δt_2 represents the time difference between the second and third kernels, Δt represents the time difference between the first and third kernels, and T_m represents the running time of one run of the core application module.
4. The heterogeneous SoC chip multi-core processor soft error detection system of claim 1, wherein the data sampling module comprises a data sampling component and a data sending component, wherein,
the data sampling component is used for obtaining key information for detecting single-event soft errors of the core application module, and the key information comprises the kernel ID and the data in a general register, a floating-point register, a program counter, a stack pointer register and a program state register;
the data sending component is used for sending an interrupt mask to the data receiving module so as to inform the RPU error detection unit of completing a data sampling task.
5. The soft error detection system of a heterogeneous SoC chip multi-core processor of claim 1, wherein the data receiving module comprises an information recording component, a timing component, and a preliminary failure analysis component, wherein,
the information recording component receives and stores sampling data from different kernels of the data sampling module;
the timing component is used for recording the arrival time of each interrupt and the arrival sequence of interrupt kernels when receiving an initial interrupt request initiated by the APU application unit to form a kernel vector O = <F, S, T, Δt_FS, Δt_ST>, where F represents the first kernel number received, S represents the second kernel number received, T represents the third kernel number received, Δt_FS represents the time difference of arrival of the sampling signals of the first kernel and the second kernel, and Δt_ST represents the time difference of arrival of the sampling signals of the second kernel and the third kernel;
the preliminary fault analysis component judges whether each kernel sends an interrupt in advance or does not send an interrupt after overtime according to the kernel vector O provided by the timing component, preliminarily analyzes an abnormal kernel, and updates the kernel abnormal state E_i.
6. The heterogeneous SoC chip multi-core processor soft error detection system of any of claims 1 to 5, wherein the soft error detection module comprises a similarity analysis component, a failure cause analysis component, and a failure model matching component, wherein,
the similarity analysis component is used for receiving the data and the abnormal state of each kernel from the data receiving module and calculating the overall similarity of the kernel data;
the fault cause analysis component is used for analyzing and positioning the fault cause according to a preset abnormal bit quantity calculation rule and a fault kernel positioning algorithm;
and the fault model matching component is used for obtaining a fault model according to the fault reason analysis and positioning result.
7. A soft error detection method for a heterogeneous SoC chip multi-core processor is characterized by comprising the following steps:
S1: respectively deploying three identical core applications in different processor cores and setting a preset time difference value for the three core applications;
S2: acquiring field information of core applications in different cores at preset time;
S3: performing preliminary overtime judgment according to the arrival time of the acquired data from different kernels, directly judging that the current kernel is a fault kernel when a certain kernel is overtime, and executing the step S4 if the arrival time of a plurality of kernels is not overtime;
S4: carrying out soft error detection on the sampling information of the plurality of kernels to obtain a fault position and a fault type.
8. The soft error detection method for the heterogeneous SoC chip multi-core processor of claim 7, wherein the S1 further comprises:
and setting a sampling timer, so that when the timing interruption of the sampling timer is triggered, each core is switched to a data sampling module to perform data sampling work on core applications in different cores.
9. The method of claim 8, wherein S3 comprises:
when a timing interrupt request of the sampling timer is received, recording the arrival time of each interrupt and the arrival sequence of interrupt kernels to form a kernel vector O = <F, S, T, Δt_FS, Δt_ST>, where F represents the first kernel number received, S represents the second kernel number received, T represents the third kernel number received, Δt_FS represents the time difference of arrival of the sampling signals of the first kernel and the second kernel, and Δt_ST represents the time difference of arrival of the sampling signals of the second kernel and the third kernel;
judging whether each kernel sends an interrupt in advance or does not send an interrupt after overtime according to the kernel vector O provided by the timing component, preliminarily analyzing abnormal kernels, and updating the kernel abnormal state E_i.
10. The method for detecting the soft error of the heterogeneous SoC chip multi-core processor according to any one of claims 7 to 9, wherein the S4 comprises:
calculating the overall similarity of the kernel data according to the sampling data and the abnormal state of each kernel;
and analyzing the fault reason and positioning the fault according to a preset abnormal bit quantity calculation rule and a fault kernel positioning algorithm, and acquiring the fault type.
CN202210922152.1A 2022-08-02 2022-08-02 Soft error detection system and method for heterogeneous SoC chip multi-core processor Pending CN115408270A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210922152.1A CN115408270A (en) 2022-08-02 2022-08-02 Soft error detection system and method for heterogeneous SoC chip multi-core processor

Publications (1)

Publication Number Publication Date
CN115408270A (en) 2022-11-29

Family

ID=84159731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210922152.1A Pending CN115408270A (en) 2022-08-02 2022-08-02 Soft error detection system and method for heterogeneous SoC chip multi-core processor

Country Status (1)

Country Link
CN (1) CN115408270A (en)

Similar Documents

Publication Publication Date Title
Blough et al. The broadcast comparison model for on-line fault diagnosis in multicomputer systems: theory and implementation
JP2500038B2 (en) Multiprocessor computer system, fault tolerant processing method and data processing system
CN103415840A (en) Error management across hardware and software layers
KR100304319B1 (en) Apparatus and method for implementing time-lag duplexing techniques
Riesen et al. See applications run and throughput jump: The case for redundant computing in HPC
Some et al. A software-implemented fault injection methodology for design and validation of system fault tolerance
CN105243023B (en) Parallel Runtime error checking method
CN118245262A (en) GPU error recovery method and system
Ignat et al. Soft-error classification and impact analysis on real-time operating systems
US9092333B2 (en) Fault isolation with abstracted objects
CN115408270A (en) Soft error detection system and method for heterogeneous SoC chip multi-core processor
Hernandez et al. Low-cost checkpointing in automotive safety-relevant systems
US9563494B2 (en) Systems and methods for managing task watchdog status register entries
CN108052420B (en) Zynq-7000-based dual-core ARM processor single event upset resistance protection method
Moser et al. Design verification of SIFT
Vargas et al. Preliminary results of SEU fault-injection on multicore processors in AMP mode
US10846162B2 (en) Secure forking of error telemetry data to independent processing units
Yim et al. Pluggable watchdog: Transparent failure detection for MPI programs
Liu et al. A hardware-software collaborated method for soft-error tolerant MPSoC
US6209019B1 (en) Data processing system, computer network, and data processing method
US12072757B2 (en) Data processing system with tag-based queue management
BRUNELLE et al. Fault-tolerant software-Experiment with the sift operating system
Casseau et al. Special Session: Operating Systems under test: an overview of the significance of the operating system in the resiliency of the computing continuum
Lala et al. Reducing the probability of common-mode failure in the fault tolerant parallel processor
Wu et al. An Empirical Study on Environmental Factors for Reproducing Concurrent Software Failures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination