WO2008101386A1 - Method of recovering single core exception in multi-core system - Google Patents

Method of recovering single core exception in multi-core system

Publication number
WO2008101386A1
Authority
WO
WIPO (PCT)
Prior art keywords
core
single core
state
normal
abnormal
Prior art date
Application number
PCT/CN2008/000224
Other languages
French (fr)
Chinese (zh)
Inventor
Xiaoqiang Yan
Jiangning Li
Fang Xu
Original Assignee
Maipu Communication Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Maipu Communication Technology Co., Ltd.
Publication of WO2008101386A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4812Task transfer initiation or dispatching by interrupt, e.g. masked
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/2028Failover techniques eliminating a faulty processor or activating a spare
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/2033Failover techniques switching over of hardware resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2035Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware

Abstract

A method of recovering from a single-core exception in a multi-core system is provided, which allows recovery without interrupting operation when an exception occurs on a single core. The method includes the following steps: in the exception handler, the core on which the exception occurred first sets its own state to “abnormal”, then selects a core whose state is normal to help it recover, and the system scheduling module is notified to reassign the system's tasks, so as to shorten the recovery time.

Description

Method for recovering from a single-core exception in a multi-core system
Technical Field
The present invention relates to multi-core CPU systems, and in particular to a method for recovering from a single-core exception in a multi-core system.
Background Art
In an embedded system with a multi-core CPU (a "multi-core system" for short), whether symmetric or master-slave, any one core may raise an exception. Such exceptions include illegal instructions, unaligned accesses, cache exceptions, and data bus errors. They have many possible causes: a sporadic hardware error, illegal data that the program mishandles, or execution reaching a rarely taken branch of the program. Most of these errors, however, harm the system only once, because a fixed, reproducible fault would have been found and fixed during system testing.
In the prior art, when a single core raises an exception, the usual practice is simply to record the exception information and then restart the whole system. This restores operation, but it interrupts all services and reduces the system's uptime. The cost is high because multi-core systems today usually sit in high-end or core positions, such as provincial core routers and program-controlled switches. A failure of such equipment has serious consequences, and the system takes a long time to restart and return to normal operation, so the impact is very large. Extending the uptime of multi-core systems is therefore particularly important. Restarting the whole system over a non-fatal error is also not worthwhile.
Summary of the Invention
The technical problem to be solved by the present invention is, in view of the above shortcoming of the prior art, to provide a method for recovering from a single-core exception in a multi-core system, such that when a single core raises an exception, recovery is carried out without interrupting operation.
To solve this technical problem, the present invention adopts the following technical solution: a method for recovering from a single-core exception in a multi-core system that includes shared memory and a system scheduling module, characterized in that it comprises the following steps:
a. A storage unit is set up in the shared memory to store the state value of each core, and the initial state value of every core is set to "normal".
b. When a core raises an exception, it automatically enters the exception handler, sets its own state value to "abnormal", notifies a selected core whose state is normal, and then enters an infinite loop of its own accord.
c. The selected normal core puts the abnormal core into the reset state and notifies the system scheduling module, which schedules the tasks that belonged to the abnormal core to any other core in the normal state. The selected normal core reclaims all resources of the abnormal core and finally releases the abnormal core from reset.
d. Once released from reset, the abnormal core restarts, and when startup is complete it sets its own state value to "to be restored".
e. When the selected normal core detects that the state value of the abnormal core is "to be restored", it sets that core's state value to "normal" and notifies the system scheduling module.
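Steps a to e imply a small state machine for each core's entry in shared memory. The following Python sketch is only an illustration of that reading: the `CoreState` enum, the `ALLOWED` table and the `transition` helper are hypothetical names that do not appear in the source, and the rule that any other state change is rejected is an assumption.

```python
from enum import Enum


class CoreState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    TO_BE_RESTORED = "to be restored"


# State changes described by steps b, d and e, and which core writes each one.
ALLOWED = {
    (CoreState.NORMAL, CoreState.ABNORMAL): "faulty core (step b)",
    (CoreState.ABNORMAL, CoreState.TO_BE_RESTORED): "faulty core after restart (step d)",
    (CoreState.TO_BE_RESTORED, CoreState.NORMAL): "helper core (step e)",
}


def transition(states, core_id, new_state):
    """Apply one state change; reject any change the steps do not describe."""
    old_state = states[core_id]
    if (old_state, new_state) not in ALLOWED:
        raise ValueError(
            f"core {core_id}: {old_state.value} -> {new_state.value} is not allowed"
        )
    states[core_id] = new_state
    return ALLOWED[(old_state, new_state)]


# Step a: every core starts out "normal" (a core count of 4 is assumed).
states = [CoreState.NORMAL] * 4
```

Encoding the transitions as a table makes it easy to see that a core can only return to "normal" through "to be restored", that is, only after it has actually restarted.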
Further, in step b, the notification is sent as an inter-core communication interrupt.
Further, the system scheduling module judges the state of each core from the state values in the storage unit; once it judges a core's state to be abnormal, it no longer schedules tasks to that core.
Specifically, the multi-core system is a symmetric multi-core system, and in step b the selected normal core may be any core whose state is normal.
Specifically, the multi-core system is a master-slave multi-core system, and in step b the selected normal core is the core in the master state.
The beneficial effects of the present invention are as follows. When a core of the system raises an exception, the tasks originally assigned to that core can first be scheduled to other cores, so that they still run in time. This effectively ensures that, before and after the exception and the recovery, the operation of the system is not interrupted and no system resources are lost. Once recovered, the core works normally again, which extends the system's uptime and improves its reliability.
Brief Description of the Drawings
Figure 1 is the program flowchart of the embodiment.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the drawing and the embodiment.
In a multi-core system with shared memory and a system scheduling module, the present invention sets up a dedicated storage unit in the shared memory and uses a global array to store the core states. The array index can be the core number, so that each element holds the state value of one core. The possible state values of a core are defined as "normal", "abnormal" and "to be restored", and the initial state value of every core is set to "normal". In a multi-core system, all tasks that a core executes are assigned by the system scheduling module. A core-state check is added to the system scheduling module: when scheduling a task, the module first checks the current state of each core, and if a core's state is abnormal, it does not schedule tasks to that core. When a core raises an exception, the exception is normally handled by the CPU's exception handler.
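The state array and the scheduler's check can be sketched as follows. This is a minimal Python model under stated assumptions, not the patent's implementation: `NUM_CORES`, `schedulable_cores` and `schedule` are hypothetical names, the shortest-queue placement rule stands in for the unspecified scheduling algorithm, and treating "to be restored" cores as not yet schedulable is an assumption (the source only says that abnormal cores receive no tasks).

```python
from enum import Enum


class CoreState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    TO_BE_RESTORED = "to be restored"


NUM_CORES = 4  # assumed core count

# Global array in shared memory; the index is the core number.
core_state = [CoreState.NORMAL] * NUM_CORES


def schedulable_cores(states):
    """Return the core numbers the scheduler may give tasks to."""
    return [core for core, state in enumerate(states) if state is CoreState.NORMAL]


def schedule(task, states, run_queues):
    """Place a task on the eligible core with the shortest run queue."""
    candidates = schedulable_cores(states)
    if not candidates:
        raise RuntimeError("no core in the normal state")
    target = min(candidates, key=lambda core: len(run_queues[core]))
    run_queues[target].append(task)
    return target
```

Because the check reads the shared array on every scheduling decision, a core stops receiving work as soon as it marks itself abnormal, without any extra message to the scheduler.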
In the exception handler of the present invention, the core that raised the exception first sets its own state to "abnormal", then selects a core whose state is normal and notifies it using inter-core interrupt communication. The system scheduling module, following its scheduling algorithm, moves all tasks of the abnormal core to cores in the normal state, which ensures that recovery completes as soon as possible and shortens the recovery time. After the notification is sent, the abnormal core enters an infinite loop and never leaves the exception handler, which prevents further errors and damage.
In a symmetric multi-core system, any core can set the state of the other cores, and any core is able to reset one or more other cores, so when a core becomes abnormal, any core whose state is normal can be selected. The algorithm for selecting a normal core can be a sequential search or a random search. Sequential search has the advantage of being simple and the disadvantage that the selected normal core tends to be the same one each time. Random search has the advantage that the selected normal core is not fixed, which can increase the chance of a successful recovery, and the disadvantage of being more complex.
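The two selection strategies can be compared side by side. The sketch below is illustrative only; `select_sequential` and `select_random` are hypothetical names, and returning `None` when no normal core exists is an assumption, since the source does not describe that case.

```python
import random

NORMAL = "normal"
ABNORMAL = "abnormal"


def select_sequential(states, faulty):
    """Return the lowest-numbered normal core other than the faulty one, or None."""
    for core, state in enumerate(states):
        if core != faulty and state == NORMAL:
            return core
    return None


def select_random(states, faulty, rng=random):
    """Return a randomly chosen normal core other than the faulty one, or None."""
    candidates = [
        core for core, state in enumerate(states)
        if core != faulty and state == NORMAL
    ]
    return rng.choice(candidates) if candidates else None
```

The sequential version always returns the same core for the same state array, which is the "relatively fixed" behavior the text mentions; the random version spreads the recovery work across the normal cores at the cost of building the candidate list first.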
In a master-slave multi-core system, only the core in the master state can recover other, abnormal cores. In other words, when a core becomes abnormal, it must notify the master core before recovery can proceed.
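In the master-slave case the selection collapses to a fixed choice. The following sketch assumes that core 0 is the master (`MASTER_CORE` is a hypothetical constant) and returns `None` when the master is itself the faulty core or is not in the normal state; the source does not say what happens then, so that branch is purely an assumption.

```python
NORMAL = "normal"
ABNORMAL = "abnormal"

MASTER_CORE = 0  # assumed: core 0 is the master


def select_helper_master_slave(states, faulty, master=MASTER_CORE):
    """Return the master core if it is able to help, otherwise None."""
    if faulty == master or states[master] != NORMAL:
        return None  # case not covered by the source
    return master
```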
Every multi-core CPU provides a mechanism for inter-core communication. One such mechanism is the inter-core interrupt, which has the advantage of being very fast and delivering the event notification at the earliest possible moment. The present invention therefore preferably uses inter-core interrupts to send notifications.
Embodiment
In an embedded system with a symmetric multi-core CPU, as shown in Figure 1, in step 101 core A raises an exception because of an illegal operation. Only core A jumps to the exception vector and enters the CPU exception handler; the other cores keep running normally. In the exception handler, core A first records the exception information, including the exception type, the PC at the time of the exception, the values of all status registers, the stack structure, and so on.
In step 102, core A, still in the exception handler, changes its own state value in the shared-memory storage unit to "abnormal". When the system scheduling module schedules tasks, it first checks the state of the current core, and if that state is abnormal, it does not schedule tasks to that core.
In step 103, core A, in the exception handler, randomly selects a core B whose state is normal, notifies core B with an interrupt, and finally enters an infinite loop of its own accord. It thus never returns from the exception handler, which prevents it from re-executing the faulting instruction and raising the exception again.
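Steps 101 to 103, as seen from the faulty core, can be modeled in a few lines. This Python sketch is a simulation under assumptions, not firmware: `ExceptionRecord`, `handle_exception` and the `mailbox` list are hypothetical, the mailbox merely stands in for an inter-core interrupt, and the function returns where a real handler would spin forever.

```python
import random
from dataclasses import dataclass, field

NORMAL = "normal"
ABNORMAL = "abnormal"


@dataclass
class ExceptionRecord:
    core: int
    kind: str   # exception type, e.g. "illegal instruction"
    pc: int     # program counter at the time of the exception
    status_regs: dict = field(default_factory=dict)


def handle_exception(core, kind, pc, states, log, mailbox, rng=random):
    """Run steps 101-103 for the faulty core and return the chosen helper."""
    log.append(ExceptionRecord(core, kind, pc))   # step 101: record the exception
    states[core] = ABNORMAL                       # step 102: mark self abnormal
    candidates = [c for c, s in enumerate(states) if s == NORMAL]
    if not candidates:
        return None                               # no helper available (assumed case)
    helper = rng.choice(candidates)               # step 103: random selection
    mailbox.append((helper, core))                # stands in for the inter-core interrupt
    return helper  # real firmware would enter an infinite loop here instead
```

Marking the core abnormal before choosing a helper means the faulty core can never select itself, since it is no longer in the list of normal cores.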
In step 104, core B, whose state is normal, receives the interrupt from core A and wakes its single-core exception recovery daemon, which looks up which core raised the exception and prepares to recover it.
In step 105, core B puts core A into the reset state by writing the CPU's global control register. The multi-core CPU guarantees that a core held in reset executes no code, that is, it is stopped; once released from reset, it fetches and runs instructions from a fixed boot address, that is, it restarts.
In step 106, core B notifies the system scheduling module, which, following its scheduling algorithm, schedules all tasks that belonged to core A to another core whose state is normal, so that the tasks still run in time.
In step 107, core B reclaims into the system all resources that belonged to core A. These mainly include task queues, stack space, interrupts, and so on.
In step 108, core B releases core A from reset by writing the CPU's global control register, and core A begins to restart. Core B now polls core A's state value in the shared-memory storage unit and waits for it to become "to be restored".
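The helper core's side of the sequence (steps 105 to 108) is mostly about ordering: hold the faulty core in reset, move its tasks, reclaim its resources, and only then release it. The toy model below illustrates that ordering. Everything in it is an assumption made for the sketch: the `Soc` class, the one-reset-bit-per-core layout of `reset_bits`, the round-robin task migration, and the stack pool are not taken from the source.

```python
NORMAL = "normal"
ABNORMAL = "abnormal"


class Soc:
    """Toy model of the shared state the helper core touches."""

    def __init__(self, num_cores):
        self.states = [NORMAL] * num_cores
        self.reset_bits = 0  # hypothetical control register, one reset bit per core
        self.run_queues = [[] for _ in range(num_cores)]
        self.stacks = {core: f"stack{core}" for core in range(num_cores)}
        self.free_stacks = []  # pool that reclaimed stacks return to


def recover_core(soc, helper, faulty):
    """Run steps 105-108 on behalf of the helper; return the actions in order."""
    if soc.states[helper] != NORMAL:
        raise ValueError("helper core must be in the normal state")
    actions = []
    soc.reset_bits |= 1 << faulty                      # step 105: hold in reset
    actions.append("assert reset")
    targets = [c for c, s in enumerate(soc.states) if s == NORMAL]
    for i, task in enumerate(soc.run_queues[faulty]):  # step 106: migrate tasks
        soc.run_queues[targets[i % len(targets)]].append(task)
    actions.append("reschedule")
    soc.run_queues[faulty] = []                        # step 107: reclaim resources
    soc.free_stacks.append(soc.stacks.pop(faulty))
    actions.append("reclaim")
    soc.reset_bits &= ~(1 << faulty)                   # step 108: release reset
    actions.append("release reset")
    return actions
```

Keeping the faulty core in reset while its queues and stacks are reclaimed avoids any chance of it touching those resources during the handover.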
In step 201, core A is released from reset; it starts fetching and running instructions from the CPU's fixed boot address and restarts.
In step 202, core A runs its initialization again. Because it uses fresh resources, the restart is certain to succeed. When startup is complete, core A changes its own state in the shared-memory storage unit to "to be restored", indicating that it has finished starting up.
In step 203, core B detects that core A's state has become "to be restored", which shows that core A has finished starting up. Core B then changes core A's state in the shared-memory storage unit to "normal" and notifies the system scheduling module that tasks can be assigned to core A again.
Exception recovery ends.
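The final handshake (steps 201 to 203) is a polling loop on the shared state. The sketch below simulates it with a Python thread playing the restarting core. The names `reboot` and `wait_and_restore` are hypothetical, the sleep stands in for re-running initialization, and the timeout is an addition for the sketch: the source only says that core B waits.

```python
import threading
import time

NORMAL, ABNORMAL, TO_BE_RESTORED = "normal", "abnormal", "to be restored"


def reboot(states, core, boot_time=0.01):
    """Steps 201-202: the released core re-initializes, then reports in."""
    time.sleep(boot_time)  # stands in for re-running initialization
    states[core] = TO_BE_RESTORED


def wait_and_restore(states, core, notify_scheduler, timeout=1.0, interval=0.001):
    """Step 203: poll until the core reports in, then mark it normal."""
    deadline = time.monotonic() + timeout
    while states[core] != TO_BE_RESTORED:
        if time.monotonic() >= deadline:
            return False  # timeout is an assumption, not part of the source
        time.sleep(interval)
    states[core] = NORMAL
    notify_scheduler(core)
    return True
```

A caller would start `reboot` on a thread for the faulty core and call `wait_and_restore` on the helper's side, passing whatever callback tells the scheduler that the core is available again.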

Claims

权利要求书 Claim
1. 多核系统单核异常的恢复方法, 包括共享内存和系统调度 模块, 其特征在于, 包括以下步骤:  A method for recovering a single core exception of a multi-core system, including a shared memory and a system scheduling module, comprising the steps of:
a.在所述共享内存中设置存储单元,存储每个单核的状态值, 所有单核初始状态值设置为 "正常";  a storage unit is set in the shared memory, and the state value of each single core is stored, and all single core initial state values are set to "normal";
b. 某个单核发生异常时, 自动进入异常处理程序, 将自己状 态值设置为 "异常", 并通知一个被选择的状态正常的单核, 然后 该异常状态的单核主动进入死循环;  b. When an exception occurs in a single core, it automatically enters the exception handler, sets its own state value to "abnormal", and notifies a single core that is selected to be in a normal state, and then the single core of the abnormal state actively enters an infinite loop;
c. The selected normal core places the abnormal core in the reset state and notifies the system scheduling module; the system scheduling module reassigns the tasks that originally belonged to the abnormal core to any other core in the normal state; the selected normal core reclaims all resources of the abnormal core and finally releases the abnormal core from reset;
d. After being released from reset, the abnormal core restarts and, once startup is complete, sets its own state value to "to be restored";
e. After detecting that the state value of the abnormal core is "to be restored", the selected normal core sets that core's state value to "normal" and notifies the system scheduling module.
2. The method for recovering from a single-core exception in a multi-core system according to claim 1, characterized in that, in step b, the notification is sent by means of an inter-core communication interrupt.
3. The method for recovering from a single-core exception in a multi-core system according to claim 1, characterized in that the system scheduling module determines the state of each core from the state values in the storage unit; once it determines that a core's state is abnormal, it no longer schedules tasks to that core.
4. The method for recovering from a single-core exception in a multi-core system according to claim 1, 2 or 3, characterized in that the multi-core system is a symmetric multi-core system; in step b, the selected normal core may be any core in the normal state.
5. The method for recovering from a single-core exception in a multi-core system according to claim 1, 2 or 3, characterized in that the multi-core system is a master-slave multi-core system; in step b, the selected normal core is the core in the master state.
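The recovery sequence in steps c to e, together with the scheduling rule of claim 3 and the core-selection rules of claims 4 and 5, can be sketched as a small simulation. This is only an illustration under stated assumptions, not the applicant's implementation: the names `CoreState`, `Scheduler`, `pick_recovery_core` and `recover` are hypothetical, the shared dictionary stands in for the storage unit of state values, the "reset" state name and the least-loaded scheduling policy are assumptions, and the hardware reset, inter-core interrupt and resource reclamation are reduced to comments.

```python
from enum import Enum


class CoreState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    RESET = "reset"                    # assumed name for the held-in-reset state
    TO_BE_RESTORED = "to be restored"


class Scheduler:
    """Hypothetical stand-in for the system scheduling module."""

    def __init__(self, states):
        self.states = states           # shared state table (the "storage unit")
        self.tasks = {core: [] for core in states}

    def assign(self, task):
        # Claim 3: only cores whose state value is "normal" receive tasks.
        # Picking the least-loaded core is an assumption of this sketch.
        target = min(
            (c for c, s in self.states.items() if s is CoreState.NORMAL),
            key=lambda c: len(self.tasks[c]),
        )
        self.tasks[target].append(task)
        return target

    def migrate_from(self, failed):
        # Step c: tasks of the failed core go to other normal cores.
        pending, self.tasks[failed] = self.tasks[failed], []
        for task in pending:
            self.assign(task)


def pick_recovery_core(states, master=None):
    # Claim 5 (master-slave): the master core; claim 4 (symmetric): any normal core.
    if master is not None:
        return master
    return next(c for c, s in states.items() if s is CoreState.NORMAL)


def recover(states, scheduler, failed, master=None):
    helper = pick_recovery_core(states, master)
    # Step c: hold the failed core in reset, reschedule its tasks,
    # reclaim its resources (omitted here), then release it from reset.
    states[failed] = CoreState.RESET
    scheduler.migrate_from(failed)
    # Step d: the failed core restarts and marks itself "to be restored".
    states[failed] = CoreState.TO_BE_RESTORED
    # Step e: the helper core sees "to be restored" and marks the core normal.
    if states[failed] is CoreState.TO_BE_RESTORED:
        states[failed] = CoreState.NORMAL
    return helper
```

In a real system steps d and e would run on different cores and the helper would poll or be interrupted; here they run sequentially so the state transitions are easy to follow.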
PCT/CN2008/000224 2007-01-31 2008-01-30 Method of recovering single core exception in multi-core system WO2008101386A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710048366A CN101236515B (en) 2007-01-31 2007-01-31 Multi-core system single-core abnormity restoration method
CN200710048366.6 2007-01-31

Publications (1)

Publication Number Publication Date
WO2008101386A1 (en) 2008-08-28

Family

ID=39709613

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/000224 WO2008101386A1 (en) 2007-01-31 2008-01-30 Method of recovering single core exception in multi-core system

Country Status (3)

Country Link
CN (1) CN101236515B (en)
RU (1) RU2437144C2 (en)
WO (1) WO2008101386A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199230A (en) * 2020-10-19 2021-01-08 广东电网有限责任公司佛山供电局 Storage controller supporting multi-core system exception handling

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2541407A1 (en) * 2010-02-23 2013-01-02 Fujitsu Limited Multi-core processor system, control program, and control method
CN103150224B (en) * 2013-03-11 2015-11-11 杭州华三通信技术有限公司 For improving the electronic equipment and method that start reliability
US9367406B2 (en) * 2013-08-14 2016-06-14 Intel Corporation Manageability redundancy for micro server and clustered system-on-a-chip deployments
CN103425545A (en) * 2013-08-20 2013-12-04 浪潮电子信息产业股份有限公司 System fault tolerance method for multiprocessor server
CN103870350A (en) * 2014-03-27 2014-06-18 浪潮电子信息产业股份有限公司 Microprocessor multi-core strengthening method based on watchdog
CN104866460B (en) * 2015-06-04 2017-10-10 电子科技大学 A kind of fault-tolerant adaptive reconfigurable System and method for based on SoC
CN107872397A (en) * 2016-09-27 2018-04-03 阿里巴巴集团控股有限公司 Traffic scheduling method, dispatching platform and scheduling system during pressure survey
CN106844082A (en) * 2017-01-18 2017-06-13 联想(北京)有限公司 Processor predictive failure analysis method and device
CN113672363B (en) * 2021-07-21 2024-02-02 惠州华阳通用电子有限公司 Method for recovering multi-task exception and storage medium
CN114750774B (en) * 2021-12-20 2023-01-13 广州汽车集团股份有限公司 Safety monitoring method and automobile
CN115827355B (en) * 2023-01-10 2023-04-28 深流微智能科技(深圳)有限公司 Method and device for detecting abnormal core in graphics processor and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1729456A (en) * 2002-12-19 2006-02-01 英特尔公司 On-die mechanism for high-reliability processor
US20060069953A1 (en) * 2004-09-14 2006-03-30 Lippett Mark D Debug in a multicore architecture
CN1834950A (en) * 2005-03-15 2006-09-20 英特尔公司 Multicore processor having active and inactive execution cores

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5815651A (en) * 1991-10-17 1998-09-29 Digital Equipment Corporation Method and apparatus for CPU failure recovery in symmetric multi-processing systems
JP2000181890A (en) * 1998-12-15 2000-06-30 Fujitsu Ltd Multiprocessor exchange and switching method of its main processor
CN100361118C (en) * 2005-03-01 2008-01-09 华为技术有限公司 Multiple-CPU system and its control method

Also Published As

Publication number Publication date
CN101236515B (en) 2010-05-19
CN101236515A (en) 2008-08-06
RU2437144C2 (en) 2011-12-20
RU2009139312A (en) 2011-04-27

Similar Documents

Publication Publication Date Title
WO2008101386A1 (en) Method of recovering single core exception in multi-core system
US9798595B2 (en) Transparent user mode scheduling on traditional threading systems
US8412981B2 (en) Core sparing on multi-core platforms
US6996745B1 (en) Process for shutting down a CPU in a SMP configuration
TWI274991B (en) A method, apparatus, and system for buffering instructions
US9335998B2 (en) Multi-core processor system, monitoring control method, and computer product
US20060020852A1 (en) Method and system of servicing asynchronous interrupts in multiple processors executing a user program
JP5726340B2 (en) Processor system
WO2008008211A2 (en) A write filter cache method and apparatus for protecting the microprocessor core from soft errors
JP2005285119A (en) Method and system for executing user program in non deterministic processor
US7305578B2 (en) Failover method in a clustered computer system
JP2005285121A (en) Method and system of exchanging information between processors
JPH07311749A (en) Multiprocessor system and kernel substituting method
CN101887386A (en) Method and system for processing failure of redundant array of independent disk controller
US20190121689A1 (en) Apparatus and method for increasing resilience to faults
JP5673666B2 (en) Multi-core processor system, interrupt program, and interrupt method
CN115576734B (en) Multi-core heterogeneous log storage method and system
US20220318053A1 (en) Method of supporting persistence and computing device
US9176806B2 (en) Computer and memory inspection method
CN104978208A (en) Warm restart method and device thereof
JP5867630B2 (en) Multi-core processor system, multi-core processor system control method, and multi-core processor system control program
JP2010044699A (en) Information processor
Kimura et al. GPU-based first aid for system faults
JP4788516B2 (en) Dynamic replacement system, dynamic replacement method and program
JP2008217665A (en) Multiprocessor system, task scheduling method and task scheduling program

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 08706420; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 4501/CHENP/2009; Country of ref document: IN)
NENP Non-entry into the national phase (Ref country code: DE)
WWE WIPO information: entry into national phase (Ref document number: 2009139312; Country of ref document: RU)
122 Ep: PCT application non-entry in European phase (Ref document number: 08706420; Country of ref document: EP; Kind code of ref document: A1)