CN111581003B - Full-hardware dual-core lock-step processor fault-tolerant system

Info

Publication number: CN111581003B
Application number: CN202010356342.2A
Authority: CN
Inventors: 黄凯; 陈群; 蒋小文
Original assignee: Zhejiang University ZJU; CSG Electric Power Research Institute
Current assignee: Zhejiang University ZJU; Research Institute of Southern Power Grid Co Ltd
Priority date: 2020-04-29
Filing date: 2020-04-29
Publication date: 2021-12-28
Anticipated expiration: 2040-04-29
Also published as: CN111581003A

Abstract

The invention belongs to the field of microprocessors and provides a full-hardware dual-core lockstep processor fault-tolerant system comprising a master processor and a slave processor, characterized in that it further comprises a hardware fault-tolerant module, the hardware fault-tolerant module comprising a fault detection module, a fault recovery module and a fault isolation module; the master processor and the slave processor receive the same input signals, the master processor outputs signals to the outside, and the slave processor does not output signals to the outside. The full-hardware dual-core lockstep processor fault-tolerant system achieves rapid fault detection, speeds up fault recovery, does not affect system performance during fault isolation, and reduces the area overhead of fault tolerance while ensuring excellent reliability and real-time performance of processor fault tolerance.

Description

Full-hardware dual-core lock-step processor fault-tolerant system

Technical Field

The invention belongs to the field of microprocessors, and particularly relates to a fault-tolerant system of a full-hardware dual-core lockstep processor.

Background

With the advent of the Industry 4.0 era, industrial microcontrollers are playing an increasingly important role in the development of industrial automation in China. Compared with general consumer-grade applications, industrial microcontrollers place higher demands on reliability, low cost and real-time performance. Embedded processors, the core of industrial microcontrollers, face growing reliability challenges as process nodes shrink and low-power technologies develop. Smaller feature sizes and lower threshold voltages make semiconductor integrated circuits increasingly sensitive to circuit crosstalk, atmospheric radiation, high-energy particles produced by the decay of packaging materials, extreme temperatures, electromagnetic interference and similar factors, so the probability of interference-induced faults keeps rising. Most faults caused by such interference are transient faults, that is, random and temporary state changes or transients in the semiconductor caused by external disturbances; the function of the affected device can be recovered by a reset. However, while the processor is running, any single-bit error may lead to an erroneous output or to the failure of the whole system, which in industrial applications may cause huge property loss or even casualties.

Two fault-tolerance methods are currently in common industrial use for commercial processors: triple modular redundancy and checkpoint-based dual-core lockstep. The former uses three processors whose outputs are compared in hardware in real time and passed through a majority vote before being output; reliability and real-time performance are high, but the area overhead is too large. The latter uses two processors that are compared in hardware in real time to detect a fault, but recovery is performed in software: the correct state of the processor must be saved periodically as a checkpoint, and when a fault occurs the processor is restored to the previous checkpoint. This approach is less reliable during fault recovery because only the processor state visible to software can be restored, and when a hang-type error is encountered recovery may fail because the software cannot respond. In addition, fault tolerance of the processor's embedded cache is generally not considered. Thus, although checkpoint-based dual-core lockstep saves area by combining software and hardware, it has shortcomings in reliability, performance and real-time behavior.

Disclosure of Invention

In order to solve the technical problems in the prior art, the invention provides a fault-tolerant system of a full-hardware dual-core lockstep processor, and the specific technical scheme is as follows.

A full-hardware dual-core lockstep processor fault-tolerant system comprises a master processor, a slave processor and a hardware fault-tolerant module, wherein the hardware fault-tolerant module comprises a fault detection module, a fault recovery module and a fault isolation module; the master processor and the slave processor receive the same input signals, the master processor outputs signals to the outside, and the slave processor does not output signals to the outside.

Further, the fault detection module taps the relevant internal signals of the master processor and the slave processor through hard-wired connections and compares them, wherein the relevant signals comprise the signals of the internal control state registers of the master processor and the slave processor, the bus interface signals and the cache interface signals; the internal control state registers comprise the general-purpose registers, the program counter, the status register, and the control state registers of the tightly coupled IP inside the processor.

Further, fault recovery by the fault recovery module comprises the following two steps:

a. when no fault has occurred, the state information of the master processor and the slave processor at a correct node is stored in a rollback buffer; a correct node is an execution point, before a fault occurs, at which the master processor and the slave processor are running normally and their states have not diverged because of a transient error; the state information is the values of the control state registers in the master processor and the slave processor;

b. after a fault occurs, the master processor and the slave processor are reset by hardware; once the reset is complete, the master processor and the slave processor fetch instructions from address 0 again, at the same time the content of address 0 on the instruction bus is changed, and the state information of the correct node held in the rollback buffer is loaded into the master processor and the slave processor, so that both processors resume execution from the most recently stored correct node.

Further, the state information is loaded into the master processor and the slave processor as follows: the relevant control state registers in the master processor and the slave processor are identified, a data source carrying the state information to be restored is added to the conditional assignment of each such register, and the register value is restored when a pulse of the set signal is detected; the set signal is a pulse generated after the hardware reset of the master processor and the slave processor has completed.

Further, the fault isolation module is used to block erroneous write operations of the master processor and the slave processor and to roll back the external state.

Further, the external state includes the state of the external memory, the state of the peripheral interfaces or system IP, and the state of the caches inside the master processor and the slave processor.

Further, the memory is attached to the data bus of the master processor and the slave processor, and fault isolation of the memory is accomplished by providing a write operation buffer, wherein the write operation buffer comprises a write address buffer, a write data buffer, a PC (program counter) buffer and a fault PC buffer, each of which consists of 3 registers; the write address buffer stores the write address of each write operation, the write data buffer stores the write data of each write operation, the PC buffer stores the PC of the currently retiring instruction associated with each write operation, and the fault PC buffer stores the PCs of the instructions executed between the occurrence of the fault and the reset of the master processor and the slave processor.

Further, each write operation of the master processor and the slave processor to the memory is first held in the write operation buffer; when the master processor and the slave processor issue a further write operation after three write operations have filled the buffer, the oldest buffered write operation whose write address is not 0 is sent out, and so on; when the master processor and the slave processor need to read data from the memory, the read address is matched against the addresses in the write operation buffer, and if an address matches and is not 0, the data held in the write operation buffer is returned to the master processor and the slave processor; when a fault occurs and a state rollback is required, the write operation buffer invalidates every write operation whose PC equals a PC held in the fault PC buffer, namely by setting the corresponding write address in the write operation buffer to 0; when a bus master other than the master processor and the slave processor needs to access the memory, software ensures that the master processor and the slave processor perform three write operations to an unused memory address, so that the write operations still held in the write operation buffer are written out to the memory.

Further, the peripheral interfaces and the system IP are attached to the system bus of the master processor and the slave processor; write operations of the master processor and the slave processor are delayed by three cycles, and the timing of read operations is unchanged.

Further, the internal caches of the master processor and the slave processor operate in write-through mode, and when a fault occurs the following 8 cache lines are invalidated during fault recovery:

when there is no read data error in the cache, the last 4 write operation addresses of the master processor cache and the last 4 write operation addresses of the slave processor cache are used as the cache line addresses to be invalidated;

when a read data error occurs in the cache, the 1 address at which the read data error occurred, the last 3 write operation addresses of the master processor cache and the last 4 write operation addresses of the slave processor cache are used as the cache line addresses to be invalidated.

Advantageous effects:

The full-hardware dual-core lockstep processor fault-tolerant system achieves rapid fault detection, speeds up fault recovery, does not affect system performance during fault isolation, and reduces the area overhead of fault tolerance while ensuring excellent reliability and real-time performance of processor fault tolerance.

Drawings

FIG. 1 is a block diagram of a fault tolerant architecture for a dual core processor of the present invention;

FIG. 2 is a block diagram of a fault detection module of the present invention;

FIG. 3 is a schematic diagram of the preservation of correct node state of the present invention;

FIG. 4 is a block diagram of a status information reset circuit of the present invention;

FIG. 5 is a write operation buffer structure of the present invention;

FIG. 6 is a timing diagram illustrating an invalidation operation of a cache during a reset of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and embodiments.

The full-hardware dual-core lockstep processor fault-tolerant system provided by the invention adds a hardware fault-tolerant module to dual modular redundancy of the processor, and thereby achieves real-time fault detection and recovery as well as fault tolerance of the on-chip cache in write-through mode.

FIG. 1 shows the full-hardware dual-core lockstep processor fault-tolerant system, which includes a master processor, a slave processor and a hardware fault-tolerant module; the hardware fault-tolerant module comprises a fault detection module, a fault recovery module and a fault isolation module; the master processor and the slave processor receive the same input signals, the master processor outputs signals to the outside, and the slave processor cannot output signals to the outside.

As shown in FIG. 2, the fault detection module taps the relevant internal signals of the master processor and the slave processor through hard-wired connections and compares them. The relevant signals include the signals of the internal control state registers of the master processor and the slave processor, the bus interface signals, and the cache interface signals. The internal control state registers include the general-purpose registers; control registers inside the processor such as the program counter and the processor status register; and the control state registers of tightly coupled IP such as timers and interrupt controllers.
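The comparison can be pictured with a small software model. This is only an illustrative sketch: the class `CoreSignals`, its field names and the helper functions are hypothetical and do not come from the patent, which performs the comparison in hardware on hard-wired signals.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class CoreSignals:
    """Hypothetical per-cycle snapshot of the tapped signals of one core."""
    gprs: tuple    # general-purpose registers
    pc: int        # program counter
    status: int    # processor status register
    bus: tuple     # bus interface signals
    cache: tuple   # cache interface signals


def mismatched_fields(master: CoreSignals, slave: CoreSignals) -> list:
    """Return the names of the signal groups that differ between the cores."""
    return [f.name for f in fields(master)
            if getattr(master, f.name) != getattr(slave, f.name)]


def error_alarm(master: CoreSignals, slave: CoreSignals) -> bool:
    """Raise the alarm when any compared signal group differs."""
    return bool(mismatched_fields(master, slave))


# Usage: a single divergent program counter is enough to raise the alarm.
good = CoreSignals(gprs=(0, 0, 0, 0), pc=0x100, status=0, bus=(0, 0), cache=(0,))
bad = replace(good, pc=0x104)
alarm = error_alarm(good, bad)
```

In this model the alarm is a plain boolean; in the described system it is a hardware signal that feeds the fault isolation and recovery logic.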

Because faults occur at random times, the error alarm signal produced by a comparison mismatch must pass through two-stage synchronization before it is used as the fault isolation and recovery signal; this prevents metastability and the propagation of indeterminate states.
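A two-stage synchronizer is conventionally two flip-flops in series. The sketch below models only its clocked behavior; the class name and interface are hypothetical, and a software model cannot reproduce metastability itself, only the fact that the alarm is passed on after it has travelled through both stages.

```python
class TwoStageSynchronizer:
    """Behavioral model of two flip-flops in series (hypothetical interface)."""

    def __init__(self):
        self.stage1 = 0  # first flip-flop, samples the raw alarm
        self.stage2 = 0  # second flip-flop, drives the downstream logic

    def clock(self, raw_alarm: int) -> int:
        """Advance one clock edge and return the synchronized alarm."""
        self.stage2 = self.stage1   # second stage takes the old first-stage value
        self.stage1 = raw_alarm     # first stage samples the raw input
        return self.stage2


# Usage: feed a raw alarm sequence and collect the synchronized output.
sync = TwoStageSynchronizer()
synchronized = [sync.clock(bit) for bit in [0, 1, 1, 1, 0]]
```

The synchronized output only changes after the raw alarm has been held in the first stage for a clock edge, which is what gives a marginal sample time to settle in real hardware.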

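As shown in FIG. 3, fault recovery by the fault recovery module mainly includes the following two steps:

a. when no fault has occurred, the state information of the master processor and the slave processor at a correct node is stored in a rollback buffer; a correct node is an execution point, before a fault occurs, at which the master processor and the slave processor are running normally and their states have not diverged because of a transient error; the state information is the values of the control state registers in the master processor and the slave processor;

b. after a fault occurs, the master processor and the slave processor are reset by hardware; once the reset is complete, the master processor and the slave processor fetch instructions from address 0 again, at the same time the content of address 0 on the instruction bus is changed, and the state information of the correct node held in the rollback buffer is loaded into the master processor and the slave processor, so that both processors resume execution from the most recently stored correct node.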

As shown in FIG. 4, the state information is loaded into the master processor and the slave processor as follows: the relevant control state registers in the master processor and the slave processor are identified, a data source carrying the state information to be restored is added to the conditional assignment of each such register, and the register value is restored when a pulse of the set signal is detected; the set signal is a pulse generated after the hardware reset of the master processor and the slave processor has completed.
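The extra data source in the register's conditional assignment can be sketched as follows. The class `RestorableRegister`, its argument names, the dictionary used as rollback buffer and the priority order (reset, then set pulse, then normal write) are assumptions made for illustration; the patent only states that a restore data source is added and that it takes effect on the set pulse after reset.

```python
class RestorableRegister:
    """Control state register with an added restore data source (hypothetical)."""

    def __init__(self, reset_value: int = 0):
        self.reset_value = reset_value
        self.value = reset_value

    def clock(self, *, reset=False, set_pulse=False, restore_data=0,
              write_enable=False, write_data=0) -> int:
        """One clock edge of the conditional assignment."""
        if reset:                 # hardware reset clears the register
            self.value = self.reset_value
        elif set_pulse:           # added branch: reload the saved state
            self.value = restore_data
        elif write_enable:        # original functional write path
            self.value = write_data
        return self.value


# Usage: save at a correct node, suffer a fault, reset, then restore.
rollback_buffer = {}
pc = RestorableRegister()
pc.clock(write_enable=True, write_data=0x1234)
rollback_buffer["pc"] = pc.value                  # step a: store correct-node state
pc.clock(write_enable=True, write_data=0x9999)    # value corrupted by a fault
pc.clock(reset=True)                              # step b: hardware reset
pc.clock(set_pulse=True, restore_data=rollback_buffer["pc"])
```

After the final call the register holds the value saved at the correct node, so execution can resume from there.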

The fault recovery module can roll back the state of the master processor and the slave processor themselves, but it cannot roll back their external state, which consists of the state of the external memory, the state of the peripheral interfaces or system IP, and the state of the caches inside the master processor and the slave processor. The fault isolation module is therefore needed to block erroneous write operations of the master processor and the slave processor and to roll back the external state.

The memory is attached to the data bus of the master processor and the slave processor, so fault isolation requires modifying write operations on the data bus. For memory accesses, whether the data of a write operation has actually been written is not important; what matters is that when the master processor and the slave processor access the address again, they obtain the previously written value.

As shown in FIG. 5, fault isolation of the memory can be accomplished by providing a write operation buffer, which mainly comprises a write address buffer, a write data buffer, a PC (program counter) buffer and a fault PC buffer. The write address buffer stores the write address of each write operation, the write data buffer stores the write data of each write operation, the PC buffer stores the PC of the currently retiring instruction associated with each write operation, and the fault PC buffer stores the PCs of the instructions executed between the occurrence of the fault and the reset of the master processor and the slave processor.

After a fault occurs, the master processor and the slave processor can execute at most two more write operations, so no more than three erroneous write operations ever need to be isolated; each of these buffers therefore consists of 3 registers. Every write operation of the master processor and the slave processor to the memory is first held in the write operation buffer. When the processors issue a further write operation after three write operations have filled the buffer, the oldest buffered write operation whose write address is not 0 is sent out, and so on.

When the master processor and the slave processor need to read data from the memory, the read address is matched against the addresses in the write operation buffer; if an address matches and is not 0, the data held in the write operation buffer is returned to the master processor and the slave processor.

When a fault occurs and a state rollback is required, the write operation buffer invalidates every write operation whose PC equals a PC held in the fault PC buffer. This is done by setting the corresponding write address in the write operation buffer to 0; the data of an invalidated write operation can neither be written to the memory nor be read back by the master processor and the slave processor. When a bus master other than the master processor and the slave processor, such as a DMA controller, needs to access the memory, software ensures that the master processor and the slave processor perform three write operations to an unused memory address, so that the write operations still held in the write operation buffer are written out to the memory and the DMA access obtains the latest data.
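The behavior of the write operation buffer described above can be approximated with a small model. It is a sketch under stated assumptions: the class `WriteBuffer`, its method names, the dictionary standing in for the memory and the strict first-in-first-out eviction are all hypothetical, and the four hardware buffers are folded into one queue of `[address, data, pc]` entries.

```python
from collections import deque

INVALID_ADDR = 0   # a write address of 0 marks an invalidated entry
DEPTH = 3          # each buffer consists of 3 registers


class WriteBuffer:
    """Behavioral model of the write operation buffer (hypothetical API)."""

    def __init__(self, memory: dict):
        self.memory = memory      # stand-in for the external memory
        self.entries = deque()    # [addr, data, pc], oldest first

    def write(self, addr, data, pc):
        """Hold a new write; send out the oldest one once the buffer is full."""
        if len(self.entries) == DEPTH:
            old_addr, old_data, _ = self.entries.popleft()
            if old_addr != INVALID_ADDR:      # invalidated writes are dropped
                self.memory[old_addr] = old_data
        self.entries.append([addr, data, pc])

    def read(self, addr):
        """Return buffered data on an address match, else read the memory."""
        if addr != INVALID_ADDR:
            for entry_addr, data, _ in reversed(self.entries):  # newest first
                if entry_addr == addr:
                    return data
        return self.memory.get(addr, 0)

    def invalidate(self, fault_pcs):
        """On a fault, cancel writes whose PC is in the fault PC buffer."""
        for entry in self.entries:
            if entry[2] in fault_pcs:
                entry[0] = INVALID_ADDR

    def flush(self, unused_addr):
        """Three writes to an unused address push out all held writes."""
        for _ in range(DEPTH):
            self.write(unused_addr, 0, pc=None)
```

With this model, three writes stay invisible to the memory until a fourth arrives, reads of a buffered address are served from the buffer, and a write invalidated after a fault never reaches the memory even when the buffer is later flushed.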

The peripheral interfaces and the system IP are attached to the system bus of the master processor and the slave processor, so fault isolation requires modifying write operations on the system bus. The master processor and the slave processor access the peripheral interfaces and the system IP mainly to control the operating mode and state of the IP, so whether the data is actually written into the IP is very important. Because the master processor and the slave processor generally do not access these IPs frequently, write operations from the processors are simply delayed by three cycles on the system bus, while the timing of read operations is unchanged. On an AHB bus, the stall is implemented by pulling HREADY low.
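A rough behavioral sketch of the delayed write follows. The function name, the `fault_at_cycle` parameter and the rule that a fault arriving during the stall cancels the write are assumptions made for illustration; the patent states only that writes are delayed by three cycles and that the stall is realized with HREADY on AHB.

```python
STALL_CYCLES = 3  # write operations are delayed by three cycles


def system_bus_write(peripheral: dict, addr: int, data: int,
                     fault_at_cycle=None) -> bool:
    """Hold a write for STALL_CYCLES cycles, then commit it.

    Returns True if the write reached the peripheral. In this model a
    fault alarm raised during the stall cancels the write (assumption).
    """
    for cycle in range(STALL_CYCLES):
        # The bus is stalled during these cycles (HREADY held low on AHB).
        if fault_at_cycle == cycle:
            return False          # write is isolated and never committed
    peripheral[addr] = data       # stall over, write reaches the IP
    return True


# Usage: one normal write and one write cancelled by a fault in cycle 1.
regs = {}
system_bus_write(regs, 0x4000, 1)
system_bus_write(regs, 0x4004, 7, fault_at_cycle=1)
```

The delay gives the two-stage-synchronized alarm time to arrive before an erroneous write changes the state of a peripheral.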

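The internal caches of the master processor and the slave processor operate in write-through mode. When a fault occurs, fault isolation of the cache is accomplished by invalidating erroneous data and data belonging to a state ahead of the correct node in the cache. To ensure that the cache is correctly restored and isolated while keeping the fault recovery time short, the following 8 cache lines are invalidated during fault recovery:

when there is no read data error in the cache, the last 4 write operation addresses of the master processor cache and the last 4 write operation addresses of the slave processor cache are used as the cache line addresses to be invalidated;

when a read data error occurs in the cache, the 1 address at which the read data error occurred, the last 3 write operation addresses of the master processor cache and the last 4 write operation addresses of the slave processor cache are used as the cache line addresses to be invalidated.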

Specifically, as shown in FIG. 6, when a fault occurs, in the interval between the fault and the reset of the master processor and the slave processor, the CEN signal on the SRAM interface of the cache's tag memory is pulled low and 0 is written in sequence to the addresses of the 8 cache lines, invalidating the corresponding cache lines and thereby completing fault isolation of the cache.
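The choice of the 8 cache lines, as set out in the Disclosure of Invention, can be sketched in a few lines. The function names, the list-based address histories and the dictionary standing in for the tag memory are hypothetical, and the sketch reads the selection rule as 4 + 4 addresses without a read error and 1 + 3 + 4 addresses with one.

```python
def lines_to_invalidate(master_writes, slave_writes, read_error_addr=None):
    """Pick the 8 cache line addresses to invalidate.

    master_writes / slave_writes: write addresses seen by each cache,
    oldest first (hypothetical bookkeeping).
    """
    if read_error_addr is None:
        # No read data error: last 4 master writes + last 4 slave writes.
        return master_writes[-4:] + slave_writes[-4:]
    # Read data error: the faulty address + last 3 master + last 4 slave.
    return [read_error_addr] + master_writes[-3:] + slave_writes[-4:]


def invalidate(tag_ram: dict, line_addrs):
    """Write 0 to each selected tag entry in sequence."""
    for addr in line_addrs:
        tag_ram[addr] = 0


# Usage with made-up addresses.
history = [0x100, 0x110, 0x120, 0x130, 0x140]
tags = {addr: 1 for addr in history}
invalidate(tags, lines_to_invalidate(history, history))
```

Limiting the invalidation to a fixed set of 8 lines keeps the operation short enough to fit between the fault and the reset, instead of flushing the whole cache.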

Claims

1. A full-hardware dual-core lockstep processor fault-tolerant system, comprising a master processor, a slave processor and a hardware fault-tolerant module, wherein the hardware fault-tolerant module comprises a fault detection module, a fault recovery module and a fault isolation module; the master processor and the slave processor receive the same input signals, the master processor outputs signals to the outside, and the slave processor does not output signals to the outside; the fault isolation module is used to block erroneous write operations of the master processor and the slave processor and to roll back an external state, wherein the external state comprises the state of an external memory, the state of peripheral interfaces or system IP, and the state of the caches inside the master processor and the slave processor;

characterized in that the memory is attached to a data bus of the master processor and the slave processor, and fault isolation of the memory is accomplished by providing a write operation buffer, wherein the write operation buffer comprises a write address buffer, a write data buffer, a PC buffer and a fault PC buffer, each of which consists of 3 registers; the write address buffer stores the write address of each write operation, the write data buffer stores the write data of each write operation, the PC buffer stores the PC of the currently retiring instruction associated with each write operation, and the fault PC buffer stores the PCs of the instructions executed between the occurrence of the fault and the reset of the master processor and the slave processor.

2. The full-hardware dual-core lockstep processor fault-tolerant system according to claim 1, wherein the fault detection module taps the relevant internal signals of the master processor and the slave processor through hard-wired connections and compares them, wherein the relevant signals comprise the signals of the internal control state registers of the master processor and the slave processor, the bus interface signals and the cache interface signals; the internal control state registers comprise the general-purpose registers, the program counter, the status register, and the control state registers of the tightly coupled IP inside the processor.

3. The fault tolerant system of full hardware dual core lockstep processors according to claim 1, wherein said fault recovery module performing fault recovery comprises the following two steps:

4. The full-hardware dual-core lockstep processor fault-tolerant system according to claim 3, wherein the state information is loaded into the master processor and the slave processor as follows: the relevant control state registers in the master processor and the slave processor are identified, a data source carrying the state information to be restored is added to the conditional assignment of each such register, and the register value is restored when a pulse of the set signal is detected; the set signal is a pulse generated after the hardware reset of the master processor and the slave processor has completed.

5. The full-hardware dual-core lockstep processor fault-tolerant system according to claim 1, wherein each write operation of the master processor and the slave processor to the memory is first held in the write operation buffer; when the master processor and the slave processor issue a further write operation after three write operations have filled the buffer, the oldest buffered write operation whose write address is not 0 is sent out, and so on; when the master processor and the slave processor need to read data from the memory, the read address is matched against the addresses in the write operation buffer, and if an address matches and is not 0, the data held in the write operation buffer is returned to the master processor and the slave processor; when a fault occurs and a state rollback is required, the write operation buffer invalidates every write operation whose PC equals a PC held in the fault PC buffer, namely by setting the corresponding write address in the write operation buffer to 0; when a bus master other than the master processor and the slave processor needs to access the memory, software ensures that the master processor and the slave processor perform three write operations to an unused memory address, so that the write operations still held in the write operation buffer are written out to the memory.

6. The full-hardware dual-core lockstep processor fault-tolerant system according to claim 1, wherein the peripheral interfaces and the system IP are attached to a system bus of the master processor and the slave processor, write operations of the master processor and the slave processor are delayed by three cycles, and the timing of read operations is unchanged.

7. The full-hardware dual-core lockstep processor fault-tolerant system according to claim 1, wherein the internal caches of the master processor and the slave processor operate in write-through mode, and when a fault occurs the following 8 cache lines are invalidated during fault recovery: