WO2008101386A1

WO2008101386A1 - Method of recovering single core exception in multi-core system

Info

Publication number: WO2008101386A1
Application number: PCT/CN2008/000224
Authority: WO
Inventors: Xiaoqiang Yan; Jiangning Li; Fang Xu
Original assignee: Maipu Communication Technology Co., Ltd.
Priority date: 2007-01-31
Filing date: 2008-01-30
Publication date: 2008-08-28
Also published as: CN101236515B; CN101236515A; RU2437144C2; RU2009139312A

Abstract

A method of recovering from a single-core exception in a multi-core system is provided. The method performs the recovery without interrupting operation when an exception occurs on a single core. The method includes the following steps: in the exception handler, the single core on which the exception occurred first sets its own state to “abnormal”, then selects a single core whose state is normal to help it recover, and the system scheduling module is informed so that it reassigns the system tasks, thereby shortening the recovery time.

Description

Recovery method for single core anomaly of multi-core system

The present invention relates to a multi-core CPU system, and more particularly to a method for recovering a single-core anomaly of a multi-core system.

Background technique

In an embedded system with a multi-core CPU (referred to as a multi-core system), whether it is a symmetric multi-core system or a master-slave multi-core system, an exception may occur on one of the cores. Such exceptions include illegal instructions, misaligned accesses, cache exceptions, data bus errors, and so on. They have many possible causes: an accidental hardware error, illegal data that the program handles incorrectly, or a rarely reached branch in the program. However, most of these errors are one-off damage to the system, because a fixed, regularly occurring anomaly would be discovered and resolved during system testing.

In the prior art, the usual practice for such a single-core exception is to record the exception information and then restart the entire system. Although this restores operation, it interrupts all services and shortens the system's running time. This matters all the more because multi-core systems are usually deployed at high-end or core positions, such as provincial core routers and program-controlled switches. Once these devices fail, the consequences are serious, and restarting the system to normal operation takes a long time, so the impact is very large. Therefore, extending the running time of multi-core systems is particularly important. At the same time, restarting the entire system for a non-fatal error is not worthwhile.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for recovering a single-core abnormality of a multi-core system that addresses the above-mentioned shortcomings of the prior art, so that when an exception occurs on a single core, recovery is performed without interrupting operation.

To solve this technical problem, the present invention adopts the following technical solution: a method for recovering a single-core abnormality of a multi-core system that includes a shared memory and a system scheduling module, characterized in that it comprises the following steps:

a. A storage unit is set in the shared memory to store the state value of each single core, and the initial state values of all single cores are set to "normal";

b. When an exception occurs on a single core, that core automatically enters the exception handler, sets its own state value to "abnormal", notifies a selected single core whose state is normal, and then actively enters an infinite loop;

c. The selected single core in the normal state sets the single core in the abnormal state to a reset state and notifies the system scheduling module; the system scheduling module reschedules the tasks belonging to the single core in the abnormal state to any other single cores in a normal state; the selected normal single core reclaims all the resources of the single core in the abnormal state, and finally releases the reset of the single core in the abnormal state;

d. After its reset is released, the single core in the abnormal state restarts, and sets its own state value to "to be recovered" once startup is complete;

e. The selected single core in the normal state detects that the state value of the single core in the abnormal state is "to be recovered", sets that state value to "normal", and notifies the system scheduling module.

Further, in step b, the notification is sent by an inter-core interrupt. Further, the system scheduling module determines the state of each single core according to the state value in the storage unit; when the state of a single core is abnormal, tasks are not scheduled to that core.

Specifically, the multi-core system is a symmetric multi-core system; in step b, the selected single core with normal status may be any single core with normal status.

Specifically, the multi-core system is a master-slave multi-core system; in step b, the selected single core with a normal state is a single core in a main state.

The beneficial effects of the present invention are as follows. When an exception occurs on a single core of the system, the tasks originally allocated to that core can be scheduled to other single cores so that they still run in time. This effectively guarantees that, before and after the single-core exception and its recovery, the operation of the system is not interrupted and system resources are not lost. After recovery, the previously abnormal single core works normally again, which prolongs the system's running time and enhances system reliability.

Drawings

Figure 1 is a flow chart of the procedure of the embodiment.

Detailed description

The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and embodiments.

In a multi-core system having a shared memory and a system scheduling module, the present invention sets aside a dedicated storage unit in the shared memory and uses a global array to store the state of each single core. The array index can be the core number, so that each element holds the state value of the corresponding core. The possible state values of a single core are defined as "normal", "abnormal", and "to be recovered", and the initial state values of all single cores are set to "normal". In a multi-core system, all tasks performed by a single core are assigned by the system scheduling module. A single-core state check is added to the system scheduling module: when it performs task scheduling, it first checks the state of each single core, and if a core's state is abnormal, no task is scheduled to that core. When an exception occurs on a single core, it is generally handled by the CPU's exception handler.
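
The state array and the scheduler's state check can be modelled in a few lines. The following Python sketch is an illustrative model only, not firmware: the names CoreState, NUM_CORES, core_states and pick_core, the core count of four, and the round-robin search order are assumptions, since the text does not prescribe a scheduling algorithm. Treating a core in the "to be recovered" state as not yet schedulable is likewise an assumption.

```python
from enum import Enum


class CoreState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    TO_BE_RECOVERED = "to be recovered"


NUM_CORES = 4  # assumed core count, for illustration only

# Stand-in for the dedicated unit in shared memory; the index is the core number.
core_states = [CoreState.NORMAL] * NUM_CORES


def pick_core(states, start=0):
    """Return the first core at or after `start` (wrapping) whose state is NORMAL.

    Returns None when no core is in the normal state.
    """
    count = len(states)
    for offset in range(count):
        core = (start + offset) % count
        if states[core] is CoreState.NORMAL:
            return core
    return None
```

In this model, marking a core as abnormal is enough to divert new tasks away from it, because the scheduler consults the array on every scheduling decision.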

In the exception handler of the present invention, the single core on which the exception occurred first sets its own state to "abnormal", then selects a single core in the normal state and notifies it by an inter-core interrupt. According to its scheduling algorithm, the system scheduling module transfers all the tasks of the abnormal single core to single cores in the normal state, so that the recovery work completes as soon as possible and the recovery time is shortened. After sending the notification, the abnormal single core enters an infinite loop and never exits the exception handler, which prevents further errors and damage.

In a symmetric multi-core system, any single core can set the state of other single cores, and any single core is able to reset one or more other single cores. Therefore, when an exception occurs on a single core, any single core in the normal state can be selected. The normal core can be selected by either a sequential search or a random search. Sequential search has the advantage of a simple algorithm and the disadvantage that the selected normal core is relatively fixed. Random search has the advantage that the selected normal core is not fixed, which can increase the probability of a successful recovery, and the disadvantage of a more complicated algorithm.
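
The two selection strategies can be contrasted with a short Python sketch. It is a model under stated assumptions, not the patented implementation: the function names select_sequential and select_random, the use of plain strings for state values, and the choice to skip the calling core explicitly are all hypothetical.

```python
import random


def select_sequential(states, self_core):
    """Return the lowest-numbered core other than `self_core` whose state is "normal"."""
    for core, state in enumerate(states):
        if core != self_core and state == "normal":
            return core
    return None  # no normal core available


def select_random(states, self_core, rng=random):
    """Return a randomly chosen core other than `self_core` whose state is "normal"."""
    candidates = [
        core
        for core, state in enumerate(states)
        if core != self_core and state == "normal"
    ]
    return rng.choice(candidates) if candidates else None
```

With the sequential search, the same low-numbered core is chosen every time as long as it stays normal; the random search spreads the recovery work across all normal cores at the cost of building a candidate list first.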

In a master-slave multi-core system, only the single core in the main state can recover other single cores that are in the abnormal state. That is to say, when an exception occurs on a single core, that core must notify the single core in the main state to perform the recovery operation.

Multi-core CPUs provide mechanisms for inter-core communication, one of which is the inter-core interrupt. Its advantage is speed: the target core is notified of the event immediately. Therefore, the present invention preferably uses inter-core interrupts to send the notification.

Example

In an embedded system with a symmetric multi-core CPU, as shown in Figure 1, in step 101 an exception occurs due to an illegal operation on single core A. At this time, only single core A jumps to the exception vector and enters the CPU exception handler, while the other single cores continue to operate normally. In the exception handler, single core A first records the exception information, including the exception type, the PC at the time of the exception, the values of all status registers, the stack structure, and so on.

In step 102, in the exception handler, the single core A modifies the value of the single core state in the shared memory storage unit to "abnormal". When the system scheduling module performs task scheduling, it first determines the status of the current single core. If the current single core status is abnormal, the task is not scheduled to the single core.

In step 103, single core A randomly selects, in the exception handler, a single core B whose state is normal, notifies single core B by an interrupt, and finally actively enters an infinite loop. It thus never exits the exception handler, which prevents it from re-executing the faulting instruction and raising the exception again.
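
Steps 101 to 103 can be summarised as a single routine run by the faulting core. The Python sketch below is a model only: handle_exception, its parameters, the send_ipi callback standing in for the inter-core interrupt, and the behaviour when no normal core exists (no notification is sent) are assumptions not stated in the text. Real firmware would spin forever at the end; the model returns instead so that it can be exercised.

```python
import random


def handle_exception(core, exception_info, states, log, send_ipi, rng=random):
    """Model of steps 101-103 as run by the faulting core.

    Returns the number of the helper core that was notified, or None.
    """
    # Step 101: record the exception information (type, PC, registers, ...).
    log.append((core, exception_info))

    # Step 102: mark this core as abnormal in the shared state array.
    states[core] = "abnormal"

    # Step 103: randomly pick a core whose state is normal and notify it.
    candidates = [c for c, s in enumerate(states) if s == "normal"]
    helper = rng.choice(candidates) if candidates else None
    if helper is not None:
        send_ipi(helper, core)  # stand-in for the inter-core interrupt

    # Firmware would now loop forever and never leave the handler.
    return helper
```

Because the core marks itself abnormal before it selects a helper, it can never select itself.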

In step 104, single core B, whose state is normal, receives the interrupt from single core A. The interrupt wakes B's daemon for single-core exception recovery, which determines which single core raised the exception and prepares to recover it.

In step 105, single core B sets single core A to the reset state by writing the CPU's global control register. The multi-core CPU guarantees that a single core held in the reset state executes no code, that is, it is stopped; once the reset is released, the core starts fetching instructions from a fixed start address, that is, it restarts.

In step 106, the single core B notifies the system scheduling module, and the system scheduling module schedules all the tasks originally belonging to the single core A to another normal core according to the scheduling algorithm, thereby ensuring the timeliness of the task execution.
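
The redistribution in step 106 might look like the following Python sketch. The text leaves the scheduling algorithm open, so the least-loaded placement used here is an assumption, as are the name reassign_tasks, the per-core task queues held as lists, and the decision to leave tasks in place when no normal core exists.

```python
def reassign_tasks(task_queues, states, failed_core):
    """Move every task queued on `failed_core` to cores whose state is "normal".

    Each task goes to the currently least-loaded normal core.
    Returns the number of tasks moved.
    """
    targets = [
        core
        for core, state in enumerate(states)
        if state == "normal" and core != failed_core
    ]
    if not targets:
        return 0  # assumption: with no normal core, tasks stay queued

    moved = 0
    while task_queues[failed_core]:
        task = task_queues[failed_core].pop(0)
        destination = min(targets, key=lambda core: len(task_queues[core]))
        task_queues[destination].append(task)
        moved += 1
    return moved
```

For example, three tasks queued on an abnormal core 0 are spread over normal cores 1 and 2 according to their current queue lengths, and the queue of core 0 is left empty.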

In step 107, the single core B reclaims all the resources originally belonging to the single core A into the system, and the resources mainly include: task queue, stack space, interrupt, and the like.

In step 108, single core B releases the reset of single core A by writing the CPU's global control register, whereupon single core A begins to restart. Single core B then polls the state value of single core A in the shared memory storage unit, waiting for it to become "to be recovered".

In step 201, the reset of single core A is released, and it starts fetching instructions from the CPU's fixed start address to restart.

In step 202, single core A re-executes its initialization; since fresh resources are used, the restart may succeed. After the boot is complete, single core A changes its own state in the shared memory storage unit to "to be recovered", indicating that it has booted.

In step 203, single core B detects that the state of single core A has become "to be recovered", indicating that single core A has booted. Single core B then changes the state of single core A in the shared memory storage unit to "normal" and notifies the system scheduling module that tasks can again be assigned to single core A.

The exception recovery ends.
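
The handshake between the helper core and the recovering core (steps 105, 108 and 201 to 203) can be modelled as a small state machine. The Python sketch below is illustrative only: the class RecoveryModel, its method names, the in_reset flags standing in for the CPU's global control register, and the scheduler_notices list standing in for the notification to the system scheduling module are all assumptions.

```python
class RecoveryModel:
    """Toy model of the shared state array and per-core reset lines."""

    def __init__(self, num_cores):
        self.state = ["normal"] * num_cores
        self.in_reset = [False] * num_cores
        self.scheduler_notices = []

    def hold_in_reset(self, target):
        """Helper core, step 105: place the abnormal core in the reset state."""
        self.in_reset[target] = True

    def release_reset(self, target):
        """Helper core, step 108: release the reset so the core restarts."""
        self.in_reset[target] = False

    def boot(self, core):
        """Recovering core, steps 201-202: restart, then report readiness."""
        if self.in_reset[core]:
            raise RuntimeError("a core held in reset executes no code")
        self.state[core] = "to be recovered"

    def poll_and_confirm(self, target):
        """Helper core, step 203: one polling pass over the target's state.

        Returns True once the target has been returned to service.
        """
        if self.state[target] != "to be recovered":
            return False
        self.state[target] = "normal"
        self.scheduler_notices.append(("core available", target))
        return True
```

A helper core would call poll_and_confirm repeatedly after release_reset; the call only succeeds after the recovering core has finished booting and written "to be recovered" itself.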

Claims

A method for recovering from a single-core exception in a multi-core system that includes a shared memory and a system scheduling module, comprising the steps of:

a. a storage unit is set in the shared memory to store the state value of each single core, and the initial state values of all single cores are set to "normal";

b. when an exception occurs on a single core, that core automatically enters the exception handler, sets its own state value to "abnormal", notifies a selected single core whose state is normal, and then actively enters an infinite loop;

c. the selected single core in the normal state sets the single core in the abnormal state to a reset state and notifies the system scheduling module; the system scheduling module reschedules the tasks belonging to the single core in the abnormal state to any other single cores in a normal state; the selected normal single core reclaims all resources of the single core in the abnormal state, and finally releases the reset of the single core in the abnormal state;

d. after its reset is released, the single core in the abnormal state restarts, and sets its own state value to "to be recovered" after startup is complete;

e. the selected single core in the normal state detects that the state value of the single core in the abnormal state is "to be recovered", sets that state value to "normal", and notifies the system scheduling module.

The method for recovering a single-core anomaly of a multi-core system according to claim 1, wherein in step b, the notification is sent by an inter-core interrupt.

The method for recovering a single-core abnormality of a multi-core system according to claim 1, wherein the system scheduling module determines the state of each single core according to the state value in the storage unit; when the state of a single core is abnormal, tasks are no longer scheduled to that single core.

The method for recovering a single-core anomaly of a multi-core system according to claim 1, 2 or 3, wherein the multi-core system is a symmetric multi-core system; in step b, the selected single core in the normal state may be any single core whose state is normal.

The method for recovering a single-core anomaly of a multi-core system according to claim 1, 2 or 3, wherein the multi-core system is a master-slave multi-core system; in step b, the selected single core in the normal state is the single core in the main state.