Method for Recovering from a Single-Core Exception in a Multi-Core System
Technical Field
The present invention relates to multi-core CPU systems, and in particular to a method for recovering from a single-core exception in a multi-core system.
Background Art
In an embedded system built on a multi-core CPU (hereinafter a multi-core system), whether a symmetric multi-core system or a master-slave multi-core system, any one core may encounter an exception. Such exceptions include illegal instructions, unaligned accesses, cache exceptions, data bus errors, and so on. They have many possible causes: an occasional hardware error, invalid data that the program handles incorrectly, or execution reaching a rarely taken branch of the program. Most of these errors, however, harm the system only once, because a fixed, regularly recurring fault would have been found and fixed during system testing.
In the prior art, the usual response to such a single-core exception is merely to record the exception information and then restart the entire system. This restores operation, but it interrupts all services and reduces the system's uptime. The cost is especially high because multi-core systems are generally deployed in high-end or core positions, such as provincial core routers and stored-program-control switches. A failure of such equipment has serious consequences, and the system takes a long time to return to normal operation after a restart, so the impact is very large. Extending the uptime of multi-core systems is therefore particularly important. Restarting the entire system because of a non-fatal error is also not worthwhile.
Summary of the Invention
The technical problem to be solved by the present invention is to address the above shortcomings of the prior art by providing a method for recovering from a single-core exception in a multi-core system, so that when a single core encounters an exception, recovery proceeds without interrupting system operation.
To solve this technical problem, the present invention adopts the following technical solution: a method for recovering from a single-core exception in a multi-core system that includes a shared memory and a system scheduling module, characterized in that it comprises the following steps:
a. setting up a storage unit in the shared memory to store the state value of each core, with the initial state value of every core set to "normal";
b. when a core encounters an exception, it automatically enters the exception handler, sets its own state value to "abnormal", and notifies a selected core whose state is normal; the abnormal core then deliberately enters an infinite loop;
c. the selected normal core places the abnormal core in the reset state and notifies the system scheduling module; the system scheduling module reschedules the tasks that belonged to the abnormal core onto any other core in the normal state; the selected normal core reclaims all resources of the abnormal core and finally releases the abnormal core from reset;
d. once released from reset, the abnormal core restarts, and after startup completes it sets its own state value to "pending recovery";
e. the selected normal core, on detecting that the state value of the abnormal core is "pending recovery", sets that core's state value to "normal" and notifies the system scheduling module.
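The state values and the party that writes each one in steps b, d, and e can be summarized as a small transition table. The following is a minimal Python sketch for illustration only; the names `ALLOWED` and `set_state` are hypothetical, and the idea of rejecting a write from the wrong party is an assumption added here, not something the method itself specifies.

```python
NORMAL, ABNORMAL, PENDING = "normal", "abnormal", "pending recovery"

# (old state, new state) -> which party writes it, per steps b, d and e.
ALLOWED = {
    (NORMAL, ABNORMAL): "self",    # step b: the faulting core marks itself
    (ABNORMAL, PENDING): "self",   # step d: the restarted core reports in
    (PENDING, NORMAL): "helper",   # step e: the selected normal core confirms
}

def set_state(states, core, new_state, writer):
    """Apply a transition if the table permits it, otherwise raise ValueError.

    Assumption: any writer other than the core itself counts as the helper.
    """
    role = "self" if writer == core else "helper"
    if ALLOWED.get((states[core], new_state)) != role:
        raise ValueError(
            f"core {writer} may not move core {core} "
            f"from {states[core]!r} to {new_state!r}"
        )
    states[core] = new_state
```

With two cores, core 0 can move itself to "abnormal" and later to "pending recovery", but only core 1 can return it to "normal".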
Further, in step b, the notification is sent by means of an inter-core interrupt.
Further, the system scheduling module judges the state of each core from the state values in the storage unit; once it determines that a core's state is abnormal, it stops scheduling tasks to that core.
Specifically, the multi-core system is a symmetric multi-core system; in step b, the selected normal core may be any core whose state is normal.
Specifically, the multi-core system is a master-slave multi-core system; in step b, the selected normal core is the core in the master state.
The beneficial effects of the present invention are as follows. When one core of the system encounters an exception, the tasks originally assigned to that core can first be rescheduled onto other cores so that they continue to run in time. This ensures that system operation is not interrupted, and system resources are not lost, before and after the exception and its recovery. Once recovered, the core works normally again, which extends the system's uptime and improves its reliability.
Brief Description of the Drawings
Figure 1 is a program flow chart of the embodiment.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawing and the embodiment.
In a multi-core system having a shared memory and a system scheduling module, the present invention sets up a dedicated storage unit in the shared memory and uses a global array to store the core states. The array can be indexed by core number, so that each element holds the state value of the corresponding core. The possible state values of a core are defined as "normal", "abnormal", and "pending recovery", and the initial state value of every core is "normal". In a multi-core system, all tasks executed by the cores are assigned by the system scheduling module. A core-state check is added to the system scheduling module: when scheduling tasks, the module first checks the current state of each core and does not schedule tasks to a core whose state is abnormal. When a core encounters an exception, the exception is generally handled by the CPU's exception handler.
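The state array and the scheduler-side check might look roughly like the following Python model. It is a sketch under stated assumptions: a plain list stands in for the shared-memory storage unit, `NUM_CORES`, `schedulable_cores`, and `dispatch` are hypothetical names, the least-loaded policy is only a placeholder for whatever scheduling algorithm the system uses, and treating "pending recovery" as not yet schedulable is an assumption inferred from the flow.

```python
from enum import Enum

class CoreState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    PENDING_RECOVERY = "pending recovery"

NUM_CORES = 4  # hypothetical core count

# Global state array; the index is the core number, every core starts normal.
core_state = [CoreState.NORMAL] * NUM_CORES

def schedulable_cores(states):
    """Return the core numbers the scheduler may assign tasks to.

    Assumption: only cores in the normal state are eligible.
    """
    return [core for core, st in enumerate(states) if st is CoreState.NORMAL]

def dispatch(task, states, run_queues):
    """Append the task to the least-loaded eligible core; return its number."""
    candidates = schedulable_cores(states)
    if not candidates:
        raise RuntimeError("no core in the normal state")
    target = min(candidates, key=lambda c: len(run_queues[c]))
    run_queues[target].append(task)
    return target
```

If core 2 is marked abnormal, `dispatch` only ever picks cores 0, 1, or 3, and core 2's run queue stays untouched.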
In the exception handler of the present invention, the core that encountered the exception first sets its own state to "abnormal", then selects a core whose state is normal and notifies it by means of an inter-core interrupt. The system scheduling module, following its scheduling algorithm, transfers all tasks of the abnormal core to cores in the normal state, so that recovery completes as quickly as possible and recovery time is shortened. After sending the notification, the abnormal core enters an infinite loop and never exits the exception handler, which prevents further errors and damage.
In a symmetric multi-core system, any core can set the state of the other cores and can reset one or more of them, so when a core becomes abnormal, any core whose state is normal may be selected. The normal core can be selected by sequential search or by random search. Sequential search is simple, but it tends to select the same normal core every time. Random search does not always select the same core, which increases the chance of a successful recovery, but the algorithm is more complex.
In a master-slave multi-core system, only the core in the master state can recover other, abnormal cores. That is, when a core becomes abnormal, it must notify the master core before the recovery operation can take place.
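The three ways of choosing the recovering core (sequential search, random search, and the fixed master core) can be compared side by side in a short Python sketch. The function names are hypothetical. Returning `None` when no suitable core exists, including the case where the master itself is the faulty core, is an assumption; the text does not say what happens in that situation.

```python
import random

NORMAL, ABNORMAL, PENDING = "normal", "abnormal", "pending recovery"

def pick_sequential(states, faulty):
    """Symmetric case: return the lowest-numbered normal core, or None."""
    for core, st in enumerate(states):
        if core != faulty and st == NORMAL:
            return core
    return None

def pick_random(states, faulty, rng=random):
    """Symmetric case: return a randomly chosen normal core, or None."""
    candidates = [c for c, st in enumerate(states)
                  if c != faulty and st == NORMAL]
    return rng.choice(candidates) if candidates else None

def pick_master(states, faulty, master):
    """Master-slave case: only the master core may perform the recovery."""
    if faulty != master and states[master] == NORMAL:
        return master
    return None
```

Passing a seeded `random.Random` instance as `rng` makes the random variant reproducible in tests.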
Every multi-core CPU provides mechanisms for inter-core communication. One of them is the inter-core interrupt, whose advantage is speed: it delivers the event notification immediately. The present invention therefore preferably uses inter-core interrupts to send notifications.
Embodiment
In an embedded system with a symmetric multi-core CPU, as shown in Figure 1, in step 101 core A performs an illegal operation and raises an exception. Only core A jumps to the exception vector and enters the CPU exception handler; the other cores continue to run normally. In the exception handler, core A first records the exception information, including the exception type, the PC value at the exception, the values of all status registers, the stack structure, and so on.
In step 102, core A, still in the exception handler, changes its own state value in the shared-memory storage unit to "abnormal". When the system scheduling module schedules tasks, it first checks the state of the candidate core and does not schedule tasks to a core whose state is abnormal.
In step 103, core A, in the exception handler, randomly selects a core B whose state is normal and notifies core B with an interrupt. Core A then deliberately enters an infinite loop, that is, it never returns from the exception handler, which prevents it from re-executing the faulting instruction and raising the exception again.
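Steps 101 to 103 on the faulting core can be modeled as a single handler function. This Python sketch is only a model: per-core `queue.Queue` mailboxes stand in for the inter-core interrupt, the message format and all names are hypothetical, and the function returns where a real handler would spin forever, so that the behavior can be tested.

```python
import queue
import random

NORMAL, ABNORMAL = "normal", "abnormal"

def exception_handler(core, fault, states, mailboxes, log, rng=random):
    """Model steps 101-103; return the notified core number, or None."""
    # Step 101: record the exception information (type, PC, and so on).
    log.append({"core": core, **fault})
    # Step 102: publish the abnormal state so the scheduler skips this core.
    states[core] = ABNORMAL
    # Step 103: pick a normal core at random and notify it.
    candidates = [c for c, st in enumerate(states) if st == NORMAL]
    if not candidates:
        return None  # assumption: no normal core is left to notify
    helper = rng.choice(candidates)
    mailboxes[helper].put(("core_exception", core))
    # On real hardware the core would now loop forever and never return
    # to the faulting instruction; this model returns instead.
    return helper
```

Because the handler marks its own core abnormal before searching, the faulting core can never select itself as the helper.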
In step 104, the normal core B receives the interrupt message from core A and wakes its single-core exception recovery daemon, which determines which core raised the exception and prepares to recover it.
In step 105, core B sets core A to the reset state by writing to the CPU's global control register. A multi-core CPU provides this capability: while a core is held in reset, it executes no code, that is, it is stopped; once released from reset, it fetches and runs instructions from a fixed start address, that is, it performs a restart.
In step 106, core B notifies the system scheduling module, which, following its scheduling algorithm, reschedules all tasks that belonged to core A onto another core whose state is normal, so that the tasks are still executed in time.
In step 107, core B returns all resources that belonged to core A to the system. These resources mainly include task queues, stack space, interrupts, and so on.
In step 108, core B releases core A from reset by writing to the CPU's global control register, and core A begins to restart. Core B then polls core A's state value in the shared-memory storage unit and waits for it to become "pending recovery".
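The helper-side sequence of steps 105 to 108 can be sketched as follows. This is an illustrative Python model, not the actual implementation: a boolean list `in_reset` stands in for the reset bits of the global control register, plain lists stand in for run queues and resources, and the round-robin redistribution is a placeholder for the system's own scheduling algorithm. All names are hypothetical.

```python
NORMAL, ABNORMAL = "normal", "abnormal"

def recover_core(faulty, states, in_reset, run_queues, owned, free_pool):
    """Model steps 105-108 as run by the helper core; return the moved tasks."""
    targets = [c for c, st in enumerate(states) if st == NORMAL]
    if not targets:
        raise RuntimeError("no normal core available")
    # Step 105: hold the faulty core in reset so it executes nothing.
    in_reset[faulty] = True
    # Step 106: hand the faulty core's tasks to normal cores (round-robin here).
    moved = list(run_queues[faulty])
    run_queues[faulty].clear()
    for i, task in enumerate(moved):
        run_queues[targets[i % len(targets)]].append(task)
    # Step 107: return the faulty core's resources to the system pool.
    free_pool.extend(owned[faulty])
    owned[faulty].clear()
    # Step 108: release reset; the core restarts from its fixed start address.
    in_reset[faulty] = False
    return moved
```

The order matters in this model: tasks and resources are taken away only while the faulty core is held in reset, so it cannot touch them during the handover.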
In step 201, core A is released from reset; it begins fetching and running instructions from the CPU's fixed start address and performs the restart.
In step 202, core A runs its initialization again. Because it uses fresh resources, the restart is certain to succeed. After startup completes, core A changes its own state in the shared-memory storage unit to "pending recovery" to indicate that it has finished starting up.
In step 203, core B detects that core A's state has become "pending recovery", which indicates that core A has finished starting up. Core B then changes core A's state in the shared-memory storage unit to "normal" and notifies the system scheduling module that tasks can again be assigned to core A.
The exception recovery is now complete.
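The handshake between the restarting core (steps 201 and 202) and the polling core (steps 108 and 203) can be exercised with two threads. The sketch below is a Python simulation under stated assumptions: threads stand in for cores, `time.sleep` stands in for initialization work, and the polling timeout is an addition made here for illustration; the embodiment itself simply waits for the state to change. All names are hypothetical.

```python
import threading
import time

NORMAL, ABNORMAL, PENDING = "normal", "abnormal", "pending recovery"

def reboot_core(states, core, boot_time):
    """Steps 201-202: re-run initialization, then report 'pending recovery'."""
    time.sleep(boot_time)  # stands in for the core's initialization work
    states[core] = PENDING

def wait_and_restore(states, core, timeout, poll_interval=0.001):
    """Step 203: poll until the core reports in, then mark it normal."""
    deadline = time.monotonic() + timeout
    while states[core] != PENDING:
        if time.monotonic() >= deadline:
            return False  # assumption: give up instead of polling forever
        time.sleep(poll_interval)
    states[core] = NORMAL
    return True

def simulate(boot_time=0.02, timeout=2.0):
    """Run core 0 (restarting) against core 1 (helper); return (ok, states)."""
    states = [ABNORMAL, NORMAL]
    booter = threading.Thread(target=reboot_core, args=(states, 0, boot_time))
    booter.start()
    ok = wait_and_restore(states, 0, timeout)
    booter.join()
    return ok, states
```

In this model, a helper that times out leaves the core in "pending recovery" once the restart finishes, so the scheduler still would not assign tasks to it until some core confirms the recovery.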