CN117112318A

CN117112318A - Dual-core fault-tolerant system based on RISC-V architecture

Info

Publication number: CN117112318A
Application number: CN202311092830.7A
Authority: CN
Inventors: 李朋凯; 何滇; 张野; 舒斌
Original assignee: Wuhu Research Institute of Xidian University
Current assignee: Wuhu Research Institute of Xidian University
Priority date: 2023-08-28
Filing date: 2023-08-28
Publication date: 2023-11-24

Abstract

The invention discloses a dual-core fault-tolerant system based on the RISC-V architecture, comprising: a first processor core, a second processor core, an instruction tightly coupled memory, a data tightly coupled memory, a plurality of bus matrix modules, an interrupt controller, an external device, a bus, and a plurality of error checking and correcting modules. The first processor core and the second processor core are both connected to the same bus matrix module; at least one bus matrix module is connected to the instruction tightly coupled memory, at least one to the data tightly coupled memory, at least one to the bus, and at least one to the interrupt controller; the interrupt controller and the external device are both connected to the bus; and the instruction tightly coupled memory and the data tightly coupled memory are each provided with a corresponding error checking and correcting module. The invention can reduce both the fault detection time and the fault recovery time.

Description

Dual-core fault-tolerant system based on RISC-V architecture

Technical Field

The invention belongs to the technical field of digital integrated circuits, and particularly relates to a dual-core fault-tolerant system based on a RISC-V architecture.

Background

With the rapid development of the new energy vehicle industry, the adoption of electronics and intelligent functions in automobiles has accelerated greatly. The control system, the power system and almost all driver assistance systems of an automobile integrate a large number of semiconductor chips. Environmental radiation and other factors can cause transient faults in these chips, known as soft errors, which challenge chip reliability.

In the related art, fault tolerance is a safety-critical computing technique that ensures normal operation of the processor through dual modular or multi-modular redundancy. Compared with triple modular redundancy, dual modular redundancy makes it harder to detect and handle faults, but it is widely used in automotive safety control because of its small area and relatively low cost. In dual modular redundancy, the processor consists of two cores that execute the same program in identical steps; this structure is also called a dual-core lockstep system. The dual-core lockstep structure improves reliability at the circuit design level: it determines whether the circuit has a fault by comparing the outputs of the two cores and reacts according to the fault type so that the system returns to a normal state, and it is generally used in safety-critical fields. A conventional dual-core lockstep structure generally compares only the results output by the cores and periodically saves the context to memory by adding checkpoints. Its drawback is that a fault may have occurred inside a core well before it becomes visible at the output. Reducing the fault recovery time then requires more checkpoints and more frequent saving of context information, which leads to high memory overhead and reduced performance, so applications with strict real-time requirements cannot be served.

Accordingly, there is a need to improve upon the deficiencies in the prior art.

Disclosure of Invention

To solve the above problems in the prior art, the invention provides a dual-core fault-tolerant system based on the RISC-V architecture. The technical problem addressed by the invention is solved by the following technical solution:

In a first aspect, the present invention provides a dual-core fault-tolerant system based on the RISC-V architecture, comprising:

a first processor core, a second processor core, an instruction tightly coupled memory, a data tightly coupled memory, a plurality of bus matrix modules, an interrupt controller, an external device, a bus, and a plurality of error checking and correcting modules;

the first processor core and the second processor core are both connected to the same bus matrix module, at least one bus matrix module is connected to the instruction tightly coupled memory, at least one bus matrix module is connected to the data tightly coupled memory, at least one bus matrix module is connected to the bus, at least one bus matrix module is connected to the interrupt controller, the interrupt controller and the external device are both connected to the bus, and the instruction tightly coupled memory and the data tightly coupled memory are each provided with a corresponding error checking and correcting module;

when the first processor core and the second processor core write data to the external device through the bus, the corresponding bus matrix module compares the output data of the first processor core with the output data of the second processor core, confirms that the two are identical, and then writes the data to the external device; when the first processor core and the second processor core read data from the external device through the bus, the corresponding bus matrix module splits the external device data into two copies and inputs them to the first processor core and the second processor core respectively.

The beneficial effects of the invention are as follows:

The dual-core fault-tolerant system based on the RISC-V architecture provides fault detection at the pipeline stage level, combined with an ECC hardening strategy for the register file, and does not need to save and restore the context of register resources, so the fault detection time and the fault recovery time are greatly reduced with very little impact on performance. In addition, because of the tightly coupled fault-tolerant structure, in which the memory, the interrupt controller and the peripherals are shared by the two cores, the area overhead of the whole fault-tolerant system is small, giving a significant cost advantage.

The present invention will be described in further detail with reference to the accompanying drawings and examples.

Drawings

FIG. 1 is a schematic diagram of a dual core fault tolerant system based on RISC-V architecture provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the internal structures of a first processor core and a second processor core provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of the error checking and correction module principle provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of fault tolerance parameters provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of pipeline stage fault detection provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of pipeline flushing provided by an embodiment of the present invention;

FIG. 7 is a schematic diagram of fatal fault detection provided by an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific examples, but embodiments of the present invention are not limited thereto.

Referring to FIG. 1, which is a schematic diagram of a dual-core fault-tolerant system based on the RISC-V architecture according to an embodiment of the present invention, the system includes: a first processor core, a second processor core, an instruction tightly coupled memory, a data tightly coupled memory, a plurality of bus matrix modules, an interrupt controller, an external device, a bus, and a plurality of error checking and correcting modules.

Specifically, referring to FIG. 1, this embodiment provides a dual-core fault-tolerant system based on the RISC-V architecture. It includes two 32-bit RISC-V processor cores with 5-stage pipelines, namely the first processor core and the second processor core. The two cores have the same structure, and the second processor core is the redundant core (Redundant CPU) of the first. The system also includes an ITCM, i.e. the instruction tightly coupled memory, with a data width of 64 bits, and a DTCM, i.e. the data tightly coupled memory, with a data width of 32 bits. Both memories are equipped with error checking and correcting (Error Checking and Correction, ECC) modules and are shared by the two cores. The system further includes a plurality of bus matrix (Crossbar) modules. The first and second processor cores access the shared memories, the interrupt controller (PLIC, CLIC) and the external device (Peripheral) through the bus (SystemBus), which keeps the area of the dual-core fault-tolerant system as small as possible. The two cores execute in strict synchronization, and both access the external device through the bus matrix modules.

In this embodiment, when the two processor cores write data to an external device through the bus, the bus matrix module continuously compares the output data of the two cores to determine whether there is an error and, if there is none, merges the two outputs into a single set of write data. When the two processor cores read data from the external device through the bus, the data is checked for errors and, if there is none, the bus matrix module splits the read data into two copies and sends one to each core. The dual-core fault-tolerant system of this embodiment therefore duplicates only the processor cores, not the memory, the interrupt controller or the external device, which keeps the area overhead as low as possible. The two cores execute in strict synchronization, and different constraints can be applied to them during synthesis and layout, which reduces the probability of common-cause failure and improves the fault detection rate.
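The write-compare and read-distribute behavior of the bus matrix module can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the names `WriteRequest`, `LockstepMismatch`, `merge_writes` and `fan_out_read` are hypothetical and do not appear in the source, and raising an exception stands in for whatever error signaling the actual hardware uses.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WriteRequest:
    """One bus write as issued by a single core (hypothetical field names)."""
    address: int
    data: int


class LockstepMismatch(Exception):
    """Raised when the two cores disagree on an outgoing write (assumed signaling)."""


def merge_writes(core0: WriteRequest, core1: WriteRequest) -> WriteRequest:
    """Compare both cores' writes and forward a single write only if they match."""
    if core0 != core1:
        raise LockstepMismatch(f"core0={core0} core1={core1}")
    return core0


def fan_out_read(read_data: int) -> Tuple[int, int]:
    """Deliver the same peripheral read data to both cores."""
    return read_data, read_data
```

Because only the comparison and the fan-out sit between the cores and the shared bus, the peripheral itself sees a single master, which is what allows the memory, interrupt controller and peripherals to remain unduplicated.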

It should be noted that, the embodiment shown in fig. 1 only schematically illustrates the positions and connection relationships of the modules in the dual-core fault tolerant system, and does not represent the actual positions and connection relationships thereof.

In an alternative embodiment of the present invention, referring to FIG. 2, which is a schematic diagram of the internal structure of the first processor core and the second processor core, the first processor core includes a first register file set, and the first register file set is provided with a corresponding first error checking and correcting module;

the second processor core includes a second register file set, and the second register file set is provided with a corresponding second error checking and correcting module.

Specifically, in this embodiment, considering that the first processor core and the second processor core need to compare their inter-stage pipeline data with each other, a first error checking and correcting module is provided for the first register file set (Register File) and a second error checking and correcting module is provided for the second register file set (Register File). The ECC module can correct any single-bit error and detect any double-bit error, which greatly improves the reliability of the storage.

In this embodiment, with continued reference to FIG. 2, if the inter-stage pipeline data compared between the first processor core and the second processor core differ, a fault is present, and only the instruction at the current pipeline PC (Program Counter) value needs to be re-executed. This works because a soft fault usually only flips bits in a register, so executing the instruction again overwrites the earlier error. If faults are present in several pipeline stages, an arbitration mechanism is required to determine the priority and thus the final flush PC value. In addition, this embodiment provides a checkpoint-based rollback mechanism, which can be used to recover from faults in the CSR registers. Faults that go undetected may cause program errors; as a last resort, a watchdog can perform a timeout reset.

Referring to FIG. 3, which is a schematic diagram of the principle of the error checking and correcting module according to an embodiment of the present invention, the error checking and correcting module includes an encoder and a decoder. The encoder is placed at the data input port of the memory, and the decoder is placed at the data output port of the memory. When the processor writes data, the input data data_in is encoded by the encoder to generate check bits, and the input data and the check bits are then written into the memory together. When the processor reads data, the check bits are read out at the same time, and the decoder determines whether any bit flipped while the data was stored. The ECC module can correct any single-bit error and detect any double-bit error, but cannot handle errors of more bits; since in most cases a storage element suffers only a single-bit error, the reliability of on-chip storage is greatly improved.
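The encoder and decoder described above can be sketched in software. The source does not say which code the ECC module uses; the sketch below assumes an extended Hamming code (6 Hamming check bits plus 1 overall parity bit over a 32-bit word, 39 bits in total), and the function names, bit ordering and status strings are all illustrative choices rather than details from the source.

```python
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32)   # Hamming check bits (assumed layout)
CODE_LEN = 39                             # position 0 = overall parity bit
DATA_POSITIONS = [p for p in range(1, CODE_LEN) if p not in PARITY_POSITIONS]


def _syndrome(bits):
    """XOR of the positions (1..38) of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, CODE_LEN):
        if bits[pos]:
            s ^= pos
    return s


def encode(data):
    """Encoder: turn a 32-bit value into a list of 39 bits (data + check bits)."""
    bits = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data >> i) & 1
    s = _syndrome(bits)
    for p in PARITY_POSITIONS:            # choose check bits so the syndrome is 0
        bits[p] = 1 if s & p else 0
    bits[0] = sum(bits[1:]) & 1           # overall parity: total weight is even
    return bits


def decode(bits):
    """Decoder: return (data, status), status in 'ok'/'corrected'/'uncorrectable'."""
    bits = list(bits)
    s = _syndrome(bits)
    odd = sum(bits) & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < CODE_LEN:
        bits[s] ^= 1                      # s == 0: the overall parity bit flipped
        status = "corrected"
    else:
        status = "uncorrectable"          # e.g. any double-bit error
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        data |= bits[pos] << i
    return data, status
```

With this construction every single-bit flip is corrected and every double-bit flip is reported as uncorrectable; three or more flips may be miscorrected, which matches the limitation stated in the text.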

In the above embodiment, an ECC protection mechanism is applied to the register file to determine whether a soft error has occurred in it. When only one of the two register files has a multi-bit error, the value read by the other core can be selected. This greatly improves the fault tolerance of the processor and avoids the long fault discovery time of a checkpoint and rollback mechanism.

In an alternative embodiment of the present invention, with continued reference to FIG. 2, the first register file set and the second register file set each have a data width of 39 bits, of which 7 bits are error correction code.
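The 39-bit width is consistent with a single-error-correcting, double-error-detecting extended Hamming code over a 32-bit word. The source does not name the code, so the following is only a plausibility check under that assumption, with m the number of data bits and r the number of Hamming check bits (both symbols are introduced here for illustration):

```latex
\begin{align*}
2^{r} &\ge m + r + 1, \qquad m = 32 \\
r = 5 &: \quad 2^{5} = 32 < 38 \quad \text{(insufficient)} \\
r = 6 &: \quad 2^{6} = 64 \ge 39 \quad \text{(sufficient)} \\
\text{check bits} &= r + 1 = 7 \quad \text{(one extra overall parity bit for double-error detection)} \\
\text{total width} &= m + r + 1 = 32 + 6 + 1 = 39
\end{align*}
```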

In an alternative embodiment of the present invention, with continued reference to FIG. 2, the first error checking and correcting module and the second error checking and correcting module each include one encoder and two decoders;

the first processor core and the second processor core perform the encoding operation through the encoder when writing back data, and perform the decoding operation through the two decoders when reading data.

Specifically, as shown in FIG. 2, in this embodiment, if the instruction set is RV32IMAC, a RISC-V processor must provide a register file of 32-bit registers. The register file has two read ports and one write port, and register 0 is hardwired to 0. Unlike the registers in the pipeline, the register file holds 32 registers, of which at most two are read per cycle; if a soft error occurs in one of the other registers during operation, it cannot be found in time and may eventually become unrecoverable. The conventional approach is to save the register file contents to memory periodically and restore them after an error occurs, but its time cost is excessive. In view of this, this embodiment sets the data width to 39 bits, including 7 bits of error correction code, for ECC protection of the register file. Encoding is performed on write-back, and decoding is performed on the two read ports when reading, which provides single-error correction and double-error detection. If there is no multi-bit error, each of the two processor cores uses its own read data; if only one processor core has an error, the value read by the other processor core is used, and the remaining probability of error is small and negligible. By applying an ECC protection mechanism to the register file, it is therefore possible to determine whether a soft error has occurred in the register file, and when only one of the two register files has a multi-bit error, the value read by the other core can be selected. This greatly improves the fault tolerance of the processor and avoids the long fault discovery time of a checkpoint and rollback mechanism.
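The cross-core selection rule for register reads can be written down compactly. This is a minimal sketch, assuming each core's decoder reports an "uncorrectable" flag alongside the decoded value; the name `select_operands`, the tuple layout and the `None` return for the both-cores-bad case are hypothetical, since the source treats that case as negligible and does not say how it is handled.

```python
from typing import Optional, Tuple

# (decoded register value, True if the decoder reported an uncorrectable error)
ReadResult = Tuple[int, bool]


def select_operands(core0: ReadResult, core1: ReadResult) -> Optional[Tuple[int, int]]:
    """Return the values each core should use for one register read port."""
    value0, bad0 = core0
    value1, bad1 = core1
    if bad0 and bad1:
        return None                 # not covered by the source; escalation is assumed
    if bad0:
        return value1, value1       # core 0 borrows the value read by core 1
    if bad1:
        return value0, value0       # core 1 borrows the value read by core 0
    return value0, value1           # normal case: each core uses its own read data
```

For example, `select_operands((0, True), (5, False))` yields `(5, 5)`: the core whose register file suffered a multi-bit error continues with the other core's value instead of rolling back to a checkpoint.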

In an alternative embodiment of the present invention, referring to FIG. 4 and FIG. 5, where FIG. 4 is a schematic diagram of the fault tolerance parameters and FIG. 5 is a schematic diagram of pipeline stage fault detection, the first processor core and the second processor core both use a 5-stage pipeline architecture. Each core includes instruction fetch, decode, execute, memory access and write-back stages, with pipeline registers placed between adjacent stages, giving four levels of pipeline registers; each level of pipeline register registers the current pipeline PC value corresponding to the current instruction;

in each clock cycle, the valid information of the pipeline registers in the first processor core is compared with the valid information of the pipeline registers in the second processor core; if they differ, a corresponding pipeline error flag is generated and a pipeline flush operation is executed.

Specifically, in this embodiment, with continued reference to FIG. 4, the fault tolerance parameters include the fault detection time and the fault recovery time, and a fault-tolerant system needs to reduce both as far as possible to achieve better real-time behavior. Most faults do not actually affect the execution results of the processor, so detecting and handling them early can reduce processor performance. However, the probability of a fault occurring is very small, and with the sensitive pipeline-stage fault detection mechanism proposed in this embodiment, early detection has little effect on performance because fault recovery is very fast, generally taking no more than a few clock cycles.

In this embodiment, referring to FIG. 5, the first processor core and the second processor core both use the classic 5-stage pipeline architecture with five stages: instruction fetch, decode, execute, memory access and write-back. Pipeline registers are placed between adjacent stages, giving four levels of pipeline registers, named stage1, stage2, stage3 and stage4, and each level registers the current pipeline value of the instruction it holds. In every clock cycle, a pipeline detector (Pipeline processor) compares the valid information of the corresponding pipeline registers of the two cores. If a difference is found, a corresponding pipeline error flag is generated, and only the instruction at the current pipeline PC needs to be re-executed, i.e. a pipeline flush (Pipeline Flush); a soft fault usually only flips bits in a register, so re-executing the instruction overwrites the earlier error. For an instruction that executes and writes back in a single cycle, a fault in the write-back stage requires 5 cycles for re-execution. Soft faults are exposed immediately at the pipeline stage, and fault recovery takes only a few clock cycles, depending on the specific instruction. If pipeline flushes are requested at the same time by several faults and by jump instructions, their priorities must be considered. By combining pipeline-stage fault detection with the pipeline flushing mechanism, this embodiment greatly reduces the fault detection time and the fault recovery time; the whole process generally takes no more than 10 clock cycles, which improves the reliability of the dual-core fault-tolerant system.
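The per-cycle comparison can be modeled as follows. This is a behavioral sketch only: the source does not define what the "valid information" of a pipeline register contains, so each register is represented here by a single integer, and the name `detect_pipeline_faults` and its return format are hypothetical.

```python
from typing import List, Sequence, Tuple

STAGES = ("stage1", "stage2", "stage3", "stage4")


def detect_pipeline_faults(
    core0_regs: Sequence[int],
    core1_regs: Sequence[int],
    stage_pcs: Sequence[int],
) -> List[Tuple[str, int]]:
    """Compare the four pipeline registers of both cores for one clock cycle.

    Returns one (stage name, PC to re-execute) pair per mismatching stage;
    an empty list means no error flag is raised in this cycle.
    """
    if not (len(core0_regs) == len(core1_regs) == len(stage_pcs) == len(STAGES)):
        raise ValueError("expected one entry per pipeline register level")
    return [
        (name, pc)
        for name, a, b, pc in zip(STAGES, core0_regs, core1_regs, stage_pcs)
        if a != b
    ]
```

A single bit flip in one core's stage3 register, for instance, produces exactly one entry naming stage3 and the PC held there, which is the instruction the flush would restart from.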

It should be noted that, as shown in fig. 5, the pipeline in the first processor core and the second processor core is a technology for improving the instruction execution efficiency, and divides the instruction execution process into a plurality of stages, and enables each stage to execute a different instruction; in this way, the first processor core and the second processor core may execute multiple instructions within the same clock cycle, thereby improving overall throughput and performance.

1. Instruction fetch (Instruction Fetch, IF): the next instruction is read from the instruction memory.

2. Instruction decode (Instruction Decode, ID): the instruction is decoded, and its type and operands are determined.

3. Execute (EX): the operation of the instruction is performed, which may include arithmetic and logic operations, memory accesses, and the like.

4. Memory access (MEM): if the instruction needs to access memory, the memory read or write operation is performed.

5. Write back (WB): the execution result is written back to the register file or memory.

Each stage has its own functions and tasks, and different instructions can be executed simultaneously in different stages in a pipelined manner. When an instruction enters the next stage, the previous stage may begin executing the next instruction, thereby achieving instruction level parallelism.

In an alternative embodiment of the present invention, referring to FIG. 6, which is a schematic diagram of pipeline flushing, when the valid information of a plurality of pipeline registers in the first processor core differs from that of the corresponding pipeline registers in the second processor core, an arbitration mechanism is used to determine the priority and thereby the current pipeline PC value to jump to.

Specifically, referring to FIG. 6, this embodiment addresses pipeline flushing when errors occur in several pipeline stages. In the extreme case where several pipeline stages are in error at the same time, an arbitration mechanism (Fault Arbiter) determines the priority: the flushed PC should be the PC value of the later stage, because the instruction in the later pipeline stage is the one that entered the pipeline first. The processor uses a static branch prediction strategy: a backward jump is predicted taken, and any other jump is predicted not taken. If the outcome of a jump instruction is inconsistent with the prediction, a pipeline flush is needed. If such a jump-induced flush coincides with the st1 error signal, the flushed PC should be the PC value calculated by the jump instruction; otherwise the processor would jump to the wrong address and cause a program error. If it coincides with the st2 error signal, the jump instruction must be re-executed, because its calculation result is derived from the stage2 register. In the other two cases, execution resumes from the PC value registered at the faulty stage. In this way, soft errors in the pipeline registers can be found and resolved quickly.
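The priority rules above can be captured in a few lines. The sketch below is one reading of the text, not the arbiter's actual logic: it assumes the priority order stage4, stage3, stage2, jump misprediction, stage1, and the function name, argument layout and `None` return for "no flush" are hypothetical.

```python
from typing import Optional, Sequence


def arbitrate_flush_pc(
    stage_errors: Sequence[bool],
    stage_pcs: Sequence[int],
    jump_mispredicted: bool = False,
    jump_target: Optional[int] = None,
) -> Optional[int]:
    """Pick the PC to restart from; index 0..3 corresponds to stage1..stage4."""
    if jump_mispredicted and jump_target is None:
        raise ValueError("a mispredicted jump needs a computed target")
    # Later stages hold older instructions, so their errors take priority.
    # A stage2 error also outranks the jump: the jump result came from stage2,
    # so the jump itself is re-executed from the PC registered there.
    for idx in (3, 2, 1):
        if stage_errors[idx]:
            return stage_pcs[idx]
    # The jump outranks a stage1 error: the stage1 instruction is discarded anyway.
    if jump_mispredicted:
        return jump_target
    if stage_errors[0]:
        return stage_pcs[0]
    return None  # nothing to flush this cycle
```

With stage PCs of 0x10C, 0x108, 0x104 and 0x100 for stage1 to stage4, simultaneous errors in stage1 and stage4 resolve to 0x100, while a mispredicted jump to 0x200 combined with a stage1 error resolves to 0x200.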

In an alternative embodiment of the present invention, referring to FIG. 7, which is a schematic diagram of fatal fault detection, the system further includes a hash operation module provided correspondingly for each level of pipeline register;

when a pipeline flush operation is executed for the first time, the flushed current pipeline PC value is compressed in width by the hash operation module and then stored in a register;

when a pipeline flush operation is executed for the second time, the flushed current pipeline PC value is compared with the pipeline PC value stored in the register during the previous flush; if the two are the same, the counter is incremented by 1, otherwise the counter is set to 0, and when the counter reaches a threshold, a fatal fault is determined.

Specifically, as shown in FIG. 7, in this embodiment, each time a pipeline fault causes a pipeline flush, the flushed PC value is hashed to compress its width and stored in a register, which reduces storage resources. When a new fault occurs, the new PC value is compared with the stored one: if the two are the same, the counter is incremented by 1; otherwise it is set to 0. When the counter reaches a given threshold, the fault is judged to be fatal. If there is a fatal fault in a pipeline stage, repeated flushing and re-execution cannot mask it, and the system keeps executing the same instruction. This embodiment detects that situation, and the system then decides whether to reset or to execute dedicated safety code for the fatal fault.
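The hash-and-count scheme can be sketched as a small state machine. The source does not specify the hash function, the compressed width or the threshold, so the XOR-fold hash, the 8-bit width, the threshold of 3 and the class name `FatalFaultDetector` below are all illustrative assumptions.

```python
from typing import Optional


class FatalFaultDetector:
    """Counts consecutive pipeline flushes that target the same (hashed) PC."""

    def __init__(self, threshold: int = 3, hash_bits: int = 8) -> None:
        self.threshold = threshold      # assumed value; not given in the source
        self.hash_bits = hash_bits      # assumed compressed width
        self.last_hash: Optional[int] = None
        self.counter = 0

    def _hash(self, pc: int) -> int:
        """XOR-fold a 32-bit PC into hash_bits bits (stand-in for the hash module)."""
        mask = (1 << self.hash_bits) - 1
        pc &= 0xFFFFFFFF
        folded = 0
        while pc:
            folded ^= pc & mask
            pc >>= self.hash_bits
        return folded

    def on_flush(self, pc: int) -> bool:
        """Record one flush; return True once the fault is considered fatal."""
        hashed = self._hash(pc)
        if self.last_hash is not None and hashed == self.last_hash:
            self.counter += 1           # same PC flushed again
        else:
            self.counter = 0            # different PC: start over
        self.last_hash = hashed
        return self.counter >= self.threshold
```

Storing only the compressed value saves register bits, at the cost that two different PCs can occasionally hash to the same value and be counted as a repeat; a wider hash reduces that risk.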

In this embodiment, pipeline-stage fault detection combined with the pipeline flushing mechanism greatly reduces the fault detection time and the fault recovery time. The whole process generally takes no more than 10 clock cycles, which meets the requirements of highly reliable, safety-critical real-time systems.

It should be noted that in this document relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the article or apparatus that comprises the element. The terms "connected," "coupled," and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The orientation or positional relationship indicated by "upper", "lower", "left", "right", etc. is based on the orientation or positional relationship shown in the drawings, is used merely for convenience and simplicity of description, and does not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; it therefore should not be construed as limiting the invention.

In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Further, one skilled in the art can join and combine the different embodiments or examples described in this specification.

The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. A dual-core fault-tolerant system based on RISC-V architecture, comprising: a first processor core, a second processor core, an instruction tightly coupled memory, a data tightly coupled memory, a plurality of bus matrix modules, an interrupt controller, an external device, a bus, and a plurality of error checking and correcting modules;

the first processor core and the second processor core are both connected to the same bus matrix module, at least one bus matrix module is connected to the instruction tightly coupled memory, at least one bus matrix module is connected to the data tightly coupled memory, at least one bus matrix module is connected to the bus, at least one bus matrix module is connected to the interrupt controller, the interrupt controller and the external device are both connected to the bus, and the instruction tightly coupled memory and the data tightly coupled memory are each provided with a corresponding error checking and correcting module;

when the first processor core and the second processor core write data into the external device through the bus, the corresponding bus matrix module compares the output data of the first processor core with the output data of the second processor core, ensures that the output data of the first processor core is the same as the output data of the second processor core, and writes the data into the external device; when the first processor core and the second processor core read the external device data through the bus, the corresponding bus matrix module divides the external device data into two parts and inputs the two parts into the first processor core and the second processor core respectively.

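2. The dual-core fault-tolerant system based on RISC-V architecture of claim 1, wherein said first processor core comprises a first set of register files, said first set of register files being correspondingly provided with a first error checking and correcting module;

said second processor core comprises a second set of register files, said second set of register files being correspondingly provided with a second error checking and correcting module.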

3. The dual-core fault-tolerant system based on RISC-V architecture of claim 2, wherein said first set of register files and said second set of register files each have a data width of 39 bits, of which 7 bits are error correction code.

4. The dual-core fault-tolerant system based on RISC-V architecture of claim 2, wherein said first error checking and correcting module and said second error checking and correcting module each comprise one encoder and two decoders;

the first processor core and the second processor core perform the encoding operation through the encoder when writing back data, and perform the decoding operation through the two decoders when reading data.

5. The dual-core fault-tolerant system based on a RISC-V architecture according to claim 1, wherein said first processor core and said second processor core are each a 5-stage pipeline architecture, said first processor core and said second processor core each include instruction fetch, decode, execute, memory access and write-back stages, pipeline registers are provided in adjacent stages, including 4-stage pipeline registers, each stage pipeline register registering a current pipeline PC value corresponding to a current instruction;

and comparing whether the effective information of the pipeline register in the first processor core and the effective information of the pipeline register in the second processor core are the same or not in each clock cycle, generating a corresponding pipeline error mark if the effective information of the pipeline register in the first processor core and the effective information of the pipeline register in the second processor core are different, and executing pipeline flushing operation.

6. The dual-core fault-tolerant system based on RISC-V architecture of claim 5, wherein when the valid information of a plurality of pipeline registers in said first processor core differs from that of the corresponding pipeline registers in said second processor core, an arbitration mechanism is used to determine a priority and to determine the current pipeline PC value, corresponding to the current instruction, to jump to.

7. The dual-core fault-tolerant system based on RISC-V architecture of claim 5, further comprising: a hash operation module provided correspondingly for each level of pipeline register;

when a pipeline flush operation is executed for the first time, the flushed current pipeline PC value is compressed in width by the hash operation module and then stored in a register;

when a pipeline flush operation is executed for the second time, the flushed current pipeline PC value is compared with the pipeline PC value stored in the register during the previous flush operation; if the two are the same, the counter is incremented by 1, otherwise the counter is set to 0, and a fatal fault is determined when the value of the counter reaches a threshold.