WO2009092322A1 - Method and apparatus for fault recovery of a multiprocessor system - Google Patents

Method and apparatus for fault recovery of a multiprocessor system

Info

Publication number
WO2009092322A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
main processor
hardware unit
normal
election
Prior art date
Application number
PCT/CN2009/070154
Other languages
English (en)
French (fr)
Inventor
Yunquan Xue
Feng Tang
Shaoyun Wu
Ya DENG
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2008-01-18
Filing date
2009-01-15
Publication date
2009-07-30
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2009092322A1 (zh)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16: Error detection or correction of the data by redundancy in hardware
    • G06F11/20: Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202: Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023: Failover techniques
    • G06F11/2028: Failover techniques eliminating a faulty processor or activating a spare

Definitions

  • The present invention relates to the field of computer technologies, and in particular to a method and apparatus for fault recovery of a multiprocessor system.
  • A multiprocessor system is a computer system with multiple microprocessors, including the traditional multiprocessor system consisting of multiple single-core chips, the multicore system with a single multicore chip, and the multiprocessor system consisting of multiple multicore chips.
  • In a multiprocessor system, because the multiple microprocessors can perform computation simultaneously, the processing power is far greater than that of an ordinary single-processor system. Multiprocessor systems are becoming more widely used because of this computing power.
  • Fault recovery is an important part of fault management.
  • Fault recovery refers to enabling the system to continue operating, by various means, after a system fault.
  • A common fault management method is to prepare a number of redundant processors as backup processors in advance. While the system is running normally, the redundant processors do not take part in the system's work; when one or more processors fail, the services and data on the failed processors are switched to the redundant processors, allowing the system to continue operating.
  • This method achieves fault recovery by switching the services and data on the faulty processor to redundant processors, but because the redundant processors do no work while the system is operating normally, it wastes processor resources.
  • If there are too few redundant processors and many processors in the system fail, the fault recovery capability is lost for lack of redundant processors; if there are too many, resources are wasted and costs rise.
  • Embodiments of the present invention provide a method and apparatus for multi-processor system failure recovery, thereby avoiding waste of processor system resources.
  • A method for fault recovery of a multiprocessor system, comprising:
  • the selected main processor isolates the failed hardware unit and reallocates the services assigned to the failed hardware unit to a normal hardware unit in the system that is capable of processing those services.
  • An apparatus for fault recovery of a multiprocessor system, comprising:
  • an isolation module, configured to control the main processor to isolate the failed hardware unit;
  • a service allocation module, configured to, after learning that the isolation of the failed hardware unit has succeeded, control the main processor to reallocate the services assigned to the failed hardware unit to a normal hardware unit in the system that is capable of processing those services.
  • FIG. 1 is a schematic diagram of a processing procedure according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a system according to an application embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of an apparatus according to an embodiment of the present invention
  • FIG. 4 is a schematic structural diagram of a module for determining a main processor according to an embodiment of the present invention.
  • In the embodiments of the present invention, the selected main processor isolates the failed hardware unit and reallocates the services assigned to it to a normal hardware unit in the system that is capable of processing those services.
  • The main processor isolates the failed hardware unit to prevent it from affecting the operation of the other normal hardware units in the system.
  • The isolation may be any operation that controls the failed hardware unit so that it does not affect other hardware units, such as stopping the failed hardware unit or blocking its communication with other processors.
  • Before the isolation step, the method may further include: the main processor controls the failed hardware unit to reset, and if the reset fails, the reset may be repeated. If the reset succeeds within a predetermined number of attempts, the fault recovery operation for the multiprocessor system is complete and no further recovery steps are needed.
  • The method provided by the embodiment of the present invention may further include setting a predetermined number of isolation attempts: when the main processor fails to isolate the failed hardware unit, it retries within that number of attempts; if the unit still cannot be isolated, this fault recovery operation by the main processor has failed.
  • The method provided by the embodiment of the present invention may further include setting a predetermined number of reallocation attempts: when the main processor fails to reallocate the services assigned to the failed hardware unit to a normal hardware unit capable of processing them, it retries within that number of attempts. If the reallocation succeeds within that number, the fault recovery operation is complete; otherwise, this fault recovery operation by the main processor has failed.
  • The main processor may be determined by, but is not limited to, either of the following methods: (1) designating any normal processor in the system that can control the operation of other processors as the main processor; (2) using an election algorithm to determine, according to preset election rules, a processor in the system as the main processor.
  • Determining the main processor by election may work as follows: when the system starts, restarts, or fails, the election algorithm determines a processor in the system as the main processor according to the preset election rules; or, while the system is running, when a predetermined trigger condition is met, the election algorithm replaces the main processor with a normal processor in the system according to the preset election rules.
  • The processing procedure of the embodiment of the present invention is shown in FIG. 1.
  • When the system detects that a processor has failed, the main processor performs a fault recovery operation on the system, which may include the following steps:
  • Step 1: Reset the faulty processor; this may be a software reset or a hardware reset.
  • Step 2: Determine whether the reset in step 1 succeeded. If so, go to step 8; otherwise, go to step 3.
  • Step 3: Determine whether the number of reset attempts has reached a preset threshold n. If so, go to step 4; otherwise, return to step 1.
  • The threshold n can be set by the operator according to actual needs.
  • Step 4: Isolate the faulty processor so that it does not affect the operation of the other normal processors in the system, for example by stopping the faulty processor, preventing it from accessing the system's shared memory, or preventing it from communicating with other processors in the system.
  • Step 5: Determine whether the isolation in step 4 succeeded. If so, go to step 6; otherwise, go to step 9.
  • Step 6: Reallocate the software services and data of the faulty processor to a normally working processor in the system that is capable of processing them, so that the system keeps running normally.
  • Step 7: Determine whether the reallocation in step 6 succeeded. If so, go to step 8; otherwise, go to step 9.
  • Step 8: Multiprocessor fault recovery is complete and the system runs normally.
  • Step 9: Multiprocessor fault recovery has failed.
  • If the isolation of the faulty processor fails, the main processor may repeat step 4; if the faulty processor has still not been isolated when the number of repetitions reaches a predetermined threshold, the fault recovery of the multiprocessor system by the main processor has failed.
  • If the reallocation of the software services and data fails, the main processor may repeat step 6; if the reallocation has still not succeeded when the number of repetitions reaches a predetermined threshold, the fault recovery of the multiprocessor system by the main processor has failed.
  • When an isolated hardware unit returns to normal, the system may detect this and send a message to the current main processor; alternatively, the recovered hardware unit itself reports its recovery to the current main processor. The main processor can then assign that unit the software services and data it can process when performing a later fault isolation and recovery operation.
  • When reallocating the software services and data of the faulty processor, the main processor may find several normally working processors that are capable of processing them. In that case, the main processor can allocate all of the software services and data of the failed hardware unit to one normally working processor, or distribute them, according to a predefined algorithm, among several normally working processors that have the processing capability.
  • In the application embodiment, the main processor may be determined by either of the following methods:
  • (1) Designating any normal processor in the system that can control the operation of other processors as the main processor.
  • The designation can be made by an operator when the system starts, restarts, runs normally, or fails. When the designated main processor fails, a new main processor is designated; alternatively, while the system is running, when a predetermined trigger condition is met, the operator replaces the main processor with another designated processor.
  • (2) Using an election algorithm to determine, according to preset election rules, a processor in the system as the main processor.
  • The election may take place when the system starts, restarts, or fails, in which case the election algorithm determines a processor in the system as the main processor according to the preset election rules; or, while the system is running, when a predetermined trigger condition is met, the election algorithm replaces the main processor with a normal processor in the system according to the preset election rules.
  • Election algorithms from the prior art, such as adaptive election algorithms or distributed election algorithms, may be used, among others.
  • (1) When the system fails, the election algorithm is executed to determine the main processor according to the preset election rules.
  • The election rules may include, but are not limited to, condition parameters such as processor working state, processing authority, and running speed, and priorities may be set for these parameters, for example <working state, processing authority, running speed> from high to low.
  • Under these rules, when the system fails, the election algorithm selects as the main processor the processor whose working state is normal, that is, not faulty (first priority, working state), that can control the operation of the other processors in the system (second priority, processing authority), that can control the failed hardware unit and the hardware unit capable of processing its services (second priority, processing authority), and that runs fastest among the processors satisfying the first two priorities (third priority, running speed).
  • (2) While the system is running, when a predetermined trigger condition is met, the election algorithm replaces the main processor with a normal processor in the system according to the preset election rules. For example, with a fixed time interval as the trigger condition and election rules comprising working state, processing authority, and running speed (from high to low priority), each time the interval elapses the election algorithm selects as the main processor the processor whose working state is normal (first priority), that can control the operation of the other processors in the system (second priority), and that runs fastest among the processors satisfying the first two priorities (third priority).
  • The technical solution provided by the embodiment of the present invention applies not only when a processor fails; when another controllable hardware unit in the system fails, the system can be recovered by the same steps.
  • When another hardware unit fails and its software services and data are reallocated according to the implementation above, they may be allocated to other normally working hardware units in the system that are capable of processing them. For example, when memory A in the system fails, the main processor resets memory A; if the reset fails, it isolates memory A; after the isolation succeeds, the data assigned to memory A is reallocated to other normally working memory in the system.
  • The fault recovery method provided by the embodiment of the present invention can be implemented by a software algorithm, which reduces coupling to specific hardware so that it applies to more hardware platforms; it also needs no redundant processors, which reduces hardware design complexity and system implementation cost.
  • The technical solution provided by the embodiments of the present invention is applicable to multiprocessor systems in which the microprocessors can communicate with one another by some means and the selected main processor can control the software and hardware operation of the other processors.
  • The main processor may control other processors by directly accessing their control registers and modifying those registers, for example to reset another processor, change its state, or stop it.
  • The microprocessors need not have the same structure or functions, and their spatial distribution is not restricted: they need not be integrated on the same integrated circuit substrate, mounted on the same circuit board, or placed in the same physical space.
  • As shown in FIG. 2, multiple microprocessors are connected by a bus to communicate; the processing authority of the microprocessors differs, for example, processor A controls device 1 and processor B controls devices 1-7.
  • Step 1: When processor A fails, processor C is determined to be the main processor by the election algorithm.
  • Step 2: Main processor C resets processor A.
  • Step 3: Determine whether the reset succeeded. If so, the fault recovery operation is complete; otherwise, go to step 4.
  • Step 4: Determine whether the number of reset attempts has reached the preset threshold n. If so, go to step 5; otherwise, go to step 2.
  • Step 5: Main processor C isolates processor A; the isolation may stop processor A, or prevent processor A from accessing device 1, and so on.
  • Step 6: Determine whether step 5 succeeded. If so, go to step 7; otherwise, the fault recovery of the multiprocessor system by the main processor has failed.
  • Step 7: Main processor C reallocates the service for device 1 that was assigned to processor A to processor B (processor B is capable of processing the service for device 1).
  • Step 8: Determine whether step 7 succeeded. If so, the fault recovery operation is complete; otherwise, the fault recovery of the multiprocessor system by the main processor has failed.
  • All or part of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions.
  • The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments above.
  • The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
  • The embodiment of the present invention further provides an apparatus for fault recovery of a multiprocessor system, whose structure is shown in FIG. 3.
  • The specific implementation structure may include:
  • Isolation module 1, configured to control the main processor to isolate the failed hardware unit so that it does not affect the normal operation of the other hardware units in the system.
  • If the isolation succeeds, the isolation module signals the service allocation module to start working; otherwise, the isolation module signals that the system fault recovery operation has failed, and the service allocation module does not work.
  • The main processor isolates the failed hardware unit to prevent it from affecting the operation of the other normal hardware units in the system; the isolation may be achieved, for example, by stopping the failed hardware unit or blocking its communication with other processors.
  • Service allocation module 2, configured to, after learning that the isolation of the failed hardware unit has succeeded, control the main processor to reallocate the services assigned to the failed hardware unit to a normal hardware unit in the system that is capable of processing them. If the reallocation succeeds, the system fault recovery operation has succeeded; otherwise, it has failed.
  • The apparatus may further include reset module 3, configured to control the main processor to perform a software or hardware reset on the failed hardware unit. If the reset succeeds within a predetermined number of attempts, the multiprocessor fault recovery operation is complete; otherwise, isolation module 1 is notified to perform its operation.
  • The apparatus may further include at least one module 4 for determining the main processor, as shown in FIG. 4.
  • Main processor designation module 41, configured to designate any normal processor in the system that can control the operation of other processors as the main processor;
  • or election algorithm module 42, configured to execute an election algorithm and determine a normal processor in the system as the main processor according to preset election rules.
  • Module 4 may further include detection module 43, configured to notify election algorithm module 42 to select the main processor when it detects system startup, system restart, a system fault, or a predetermined trigger condition being met while the system is running. In practice, the operator can set one or more trigger conditions on the condition trigger module to implement the corresponding trigger function.

Description

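Specification: Method and Apparatus for Fault Recovery of a Multiprocessor System
[1] This application claims priority to Chinese Patent Application No. 200810056461.5, filed with the Chinese Patent Office on January 18, 2008 and entitled "Method and Apparatus for Fault Recovery of a Multiprocessor System", which is incorporated herein by reference in its entirety.
[2] Technical Field
[3] The present invention relates to the field of computer technologies, and in particular to a method and apparatus for fault recovery of a multiprocessor system.
[4] Background of the Invention
[5] A multiprocessor system is a computer system with multiple microprocessors. It includes the traditional multiprocessor system composed of multiple single-core chips, the multicore system with a single multicore chip, and the multiprocessor system composed of multiple multicore chips. Because the microprocessors in a multiprocessor system can perform computation at the same time, its processing power is far greater than that of an ordinary single-processor system, and such systems are increasingly widely used for that reason.
[6] However, as the number of processors in a multiprocessor system grows, the probability of a system fault grows with it, and a fault in any one processor may affect the normal operation of the whole system. Fault management is therefore necessary for multiprocessor systems. Fault recovery is an important part of fault management; it refers to using various means to let the system continue running after a fault occurs.
[7] One common fault management method prepares several redundant processors in advance as backup processors. While the system runs normally, the redundant processors do not take part in its work; when one or more processors in the system are found to be faulty, the services and data on the faulty processors are switched to the redundant processors so that the system can keep running. This method achieves fault recovery by switching the services and data of the faulty processor to a redundant processor, but because the redundant processors do no work while the system operates normally, it wastes processor resources. In addition, if there are too few redundant processors, the system loses its fault recovery capability when many processors fail; if there are too many, resources are wasted and cost rises.
[8] Another common fault management method in the prior art runs the same software service on multiple processors; given the same input data, every processor should produce the same output data. If a processor fails and produces abnormal output, a majority rule is applied and the abnormal output is masked. This method achieves fault recovery by masking the abnormal data output by the faulty processor, but because multiple processors run the same software service, it lowers the working efficiency of the multiprocessor system and wastes processing capacity.
[9] In the course of implementing the present invention, the inventors found that the fault recovery operations for multiprocessor systems in the prior art all waste processor computing capacity and leave the multiprocessor system with low working efficiency.
[10] Summary of the Invention
[11] Embodiments of the present invention provide a method and apparatus for fault recovery of a multiprocessor system, so as to avoid wasting processor system resources.
[12] A method for fault recovery of a multiprocessor system, comprising:
[13] a selected main processor isolating a failed hardware unit, and reallocating the services assigned to the failed hardware unit to a normal hardware unit in the system that is capable of processing those services.
[14] An apparatus for fault recovery of a multiprocessor system, comprising:
[15] an isolation module, configured to control the main processor to isolate the failed hardware unit;
[16] a service allocation module, configured to, after learning that the isolation of the failed hardware unit has succeeded, control the main processor to reallocate the services assigned to the failed hardware unit to a normal hardware unit in the system that is capable of processing those services.
[17] As can be seen from the technical solutions provided by the embodiments above, because a selected main processor performs the fault recovery operation on the system and every processor resource in the system is fully used, the working efficiency of the multiprocessor system is improved and the implementation cost of the system is reduced.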
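[18] Brief Description of the Drawings
[19] FIG. 1 is a schematic diagram of the processing procedure according to an embodiment of the present invention;
[20] FIG. 2 is a schematic diagram of a system according to an application embodiment of the present invention;
[21] FIG. 3 is a schematic structural diagram of the apparatus provided by an embodiment of the present invention;
[22] FIG. 4 is a schematic structural diagram of the module for determining the main processor provided by an embodiment of the present invention.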
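[23] Modes for Carrying Out the Invention
[24] In the embodiments of the present invention, the selected main processor isolates the failed hardware unit and reallocates the services assigned to it to a normal hardware unit in the system that is capable of processing those services. The purpose of the isolation is to prevent the failed hardware unit from affecting the operation of the other normal hardware units in the system. The isolation may be any operation that controls the failed hardware unit so that it does not affect other hardware units, for example stopping the failed hardware unit or blocking its communication with other processors.
[25] Before the step in which the selected main processor isolates the failed hardware unit, the method may further include: the main processor controls the failed hardware unit to reset, and the reset may be repeated if it fails. If the reset succeeds within a predetermined number of attempts, the fault recovery operation for the multiprocessor system is complete and no further recovery steps are needed.
[26] The method provided by the embodiment of the present invention may further include: setting a predetermined number of isolation attempts; when the main processor fails to isolate the failed hardware unit, the operation is retried within that number of attempts; if the failed hardware unit is not isolated within that number, this fault recovery operation on the multiprocessor system by the main processor has failed.
[27] The method provided by the embodiment of the present invention may further include: setting a predetermined number of reallocation attempts; when the main processor fails to reallocate the services assigned to the failed hardware unit to a normal hardware unit capable of processing them, the operation is retried within that number of attempts. If the services are successfully reallocated within that number, the multiprocessor system fault recovery operation is complete; otherwise, this fault recovery operation by the main processor has failed.
[28] In the embodiments above, the main processor may be determined by, but is not limited to, either of the following methods:
[29] (1) designating any normal processor in the system that can control the operation of other processors as the main processor;
[30] (2) using an election algorithm to determine, according to preset election rules, a processor in the system as the main processor. The election rules can be set according to requirements in the actual application. The main processor determined by the election algorithm is a normal processor that can control the operation of other processors.
[31] Determining a processor in the system as the main processor by election may specifically be: when the system starts, restarts, or fails, the election algorithm determines a processor in the system as the main processor according to the preset election rules; or, while the system is running, when a predetermined trigger condition is met, the election algorithm replaces the main processor with a normal processor in the system according to the preset election rules.
[32] The specific implementation of the embodiments in practical applications is described in detail below with reference to the drawings.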
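[33] The processing procedure of the embodiment of the present invention is shown in FIG. 1. When the system detects that a processor has failed, the main processor performs a fault recovery operation on the system, which may include the following steps:
[34] Step 1: Reset the faulty processor; this may be a software reset or a hardware reset.
[35] Step 2: Determine whether the reset in step 1 succeeded. If so, go to step 8; otherwise, go to step 3.
[36] Step 3: Determine whether the number of reset attempts has reached a preset threshold n. If so, go to step 4; otherwise, return to step 1.
[37] The threshold n can be set by the operator according to actual needs.
[38] Step 4: Isolate the faulty processor so that it does not affect the operation of the other normal processors in the system, for example by stopping the faulty processor, preventing it from accessing the system's shared memory, or preventing it from communicating with other processors in the system.
[39] Step 5: Determine whether the isolation in step 4 succeeded. If so, go to step 6; otherwise, go to step 9.
[40] Step 6: Reallocate the software services and data of the faulty processor to a normally working processor in the system that is capable of processing them, so that the system keeps running normally.
[41] Step 7: Determine whether the reallocation in step 6 succeeded. If so, go to step 8; otherwise, go to step 9.
[42] Step 8: Multiprocessor fault recovery is complete and the system runs normally.
[43] Step 9: Multiprocessor fault recovery has failed.
[44] If the isolation of the faulty processor fails, the main processor may repeat step 4; if the faulty processor has still not been isolated when the number of repetitions reaches a predetermined threshold, the fault recovery of the multiprocessor system by the main processor has failed.
[45] If the reallocation of the faulty processor's software services and data fails, the main processor may repeat step 6; if the reallocation has still not succeeded when the number of repetitions reaches a predetermined threshold, the fault recovery of the multiprocessor system by the main processor has failed.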
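The reset, isolate, and reallocate sequence of steps 1 to 9, including the bounded retries, can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function names `retry` and `recover`, the callback parameters, the default attempt limits, and the returned status strings are all assumptions introduced here.

```python
from typing import Callable


def retry(operation: Callable[[], bool], max_attempts: int) -> bool:
    """Call operation until it reports success or max_attempts is used up."""
    for _ in range(max_attempts):
        if operation():
            return True
    return False


def recover(reset: Callable[[], bool],
            isolate: Callable[[], bool],
            reassign: Callable[[], bool],
            n_reset: int = 3,
            n_isolate: int = 3,
            n_reassign: int = 3) -> str:
    """Hypothetical driver run by the main processor for one failed unit."""
    # Steps 1-3: reset the failed unit, at most n_reset times (threshold n).
    if retry(reset, n_reset):
        return "recovered_by_reset"          # step 8
    # Steps 4-5: isolate the unit so it cannot disturb healthy units.
    if not retry(isolate, n_isolate):
        return "failed"                      # step 9
    # Steps 6-7: move the unit's services and data to capable healthy units.
    if not retry(reassign, n_reassign):
        return "failed"                      # step 9
    return "recovered_by_reassignment"       # step 8
```

Passing the three operations as callbacks keeps the control flow independent of how a given platform actually resets or isolates a unit, which matches the text's point that the method is a software algorithm with little coupling to specific hardware.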
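[46] In the embodiments above, when an isolated hardware unit returns to normal, the system may detect this and send a message to the current main processor; alternatively, the recovered hardware unit itself reports its recovery to the current main processor. The main processor can then assign that hardware unit the software services and data it can process when performing a later fault isolation and recovery operation.
[47] In the embodiments above, when reallocating the software services and data of the faulty processor, the main processor may find several normally working processors that are capable of processing them. In that case, the main processor can allocate all of the software services and data of the failed hardware unit to one normally working processor, or distribute them, according to a predefined algorithm, among several normally working processors that have the processing capability.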
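The text leaves the "predefined algorithm" for spreading services over several capable processors unspecified. The sketch below assumes one simple choice, least-loaded assignment with ties broken by processor name; the function `reassign_services` and its data layout (a mapping from healthy processor name to the set of services it can handle) are hypothetical.

```python
from typing import Dict, Iterable, Set


def reassign_services(services: Iterable[str],
                      capabilities: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each service of the failed unit to a capable healthy processor.

    capabilities lists only healthy processors. The balancing rule used
    here (fewest services assigned so far) is an assumption, not something
    the patent prescribes.
    """
    load = {name: 0 for name in capabilities}
    plan: Dict[str, str] = {}
    for service in services:
        candidates = [p for p, caps in capabilities.items() if service in caps]
        if not candidates:
            raise ValueError(f"no healthy processor can handle {service!r}")
        # Pick the least-loaded candidate; the name breaks ties deterministically.
        target = min(candidates, key=lambda p: (load[p], p))
        plan[service] = target
        load[target] += 1
    return plan
```

With a single capable processor the same function degenerates to the first option in the text, where everything goes to one normally working processor.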
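[48] In the application embodiment of the present invention, the main processor may be determined by either of the following methods:
[49] (1) Designating any normal processor in the system that can control the operation of other processors as the main processor.
[50] The designation can be made by an operator when the system starts, restarts, runs normally, or fails. When the designated main processor fails, a new main processor is designated; alternatively, while the system is running, when a predetermined trigger condition is met, the operator replaces the main processor with another designated processor.
[51] (2) Using an election algorithm to determine, according to preset election rules, a processor in the system as the main processor.
[52] The election may take place when the system starts, restarts, or fails, in which case the election algorithm determines a processor in the system as the main processor according to the preset election rules; or, while the system is running, when a predetermined trigger condition is met, the election algorithm replaces the main processor with a normal processor in the system according to the preset election rules. Election algorithms from the prior art, such as adaptive election algorithms or distributed election algorithms, may be used, among others.
[53] To make the election procedure easier to follow, the determination of the main processor is described in detail below for two cases: when the system fails, and when the system is running normally.
[54] (1) When the system fails, the election algorithm is executed to determine the main processor according to the preset election rules. The election rules may include, but are not limited to, condition parameters such as processor working state, processing authority, and running speed, and priorities may be set for these parameters, for example <working state, processing authority, running speed> from high to low. When the system fails, the election algorithm, based on these rules, selects as the main processor the processor whose working state is normal, that is, not faulty (first priority, working state), that can control the operation of the other processors in the system (second priority, processing authority), that can control the failed hardware unit and the hardware unit capable of processing its services (second priority, processing authority), and that runs fastest among the processors satisfying the first two priorities (third priority, running speed).
[55] (2) While the system is running, when a predetermined trigger condition is met, the election algorithm replaces the main processor with a normal processor in the system according to the preset election rules. For example, with a fixed time interval as the trigger condition and election rules comprising working state, processing authority, and running speed (from high to low priority), each time the interval elapses the election algorithm selects as the main processor the processor whose working state is normal (first priority), that can control the operation of the other processors in the system (second priority), and that runs fastest among the processors satisfying the first two priorities (third priority).
[56] It should be noted that the two descriptions above are only two specific ways of selecting the main processor in the embodiments of the present invention and are not to be understood as limiting the scope of the patent. Parameters such as the election rules and trigger conditions should be determined according to the actual needs of the application.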
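The prioritized rule <working state, processing authority, running speed> amounts to two filters followed by a maximum. The sketch below is one way to express it, under assumptions: the `Processor` record, its field names, and the `must_control` argument (the set of units the candidate has to be able to control) are illustrative and not taken from the patent.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Processor:
    name: str
    healthy: bool             # priority 1: working state
    controls: FrozenSet[str]  # priority 2: units this processor may control
    speed: float              # priority 3: running speed


def elect_main(processors: Iterable[Processor],
               must_control: Set[str]) -> Optional[Processor]:
    """Return the fastest healthy processor that controls every required unit.

    Returns None when no processor satisfies the first two priorities.
    """
    eligible = [p for p in processors
                if p.healthy and must_control <= p.controls]
    return max(eligible, key=lambda p: p.speed, default=None)
```

Because speed is consulted only among processors that already pass the health and authority filters, a fast processor lacking authority over the failed unit is never chosen, which is the effect of the priority ordering described above.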
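[57] The technical solution provided by the embodiments above applies not only when a processor fails; when another controllable hardware unit in the system fails, the system can be recovered by the same steps. When another hardware unit fails and its software services and data are reallocated according to the implementation above, they may be allocated to other normally working hardware units in the system that are capable of processing them. For example, when memory A in the system fails, the main processor resets memory A; if the reset fails, it isolates memory A; after the isolation succeeds, the data assigned to memory A is reallocated to other normally working memory in the system.
[58] The fault recovery method provided by the embodiment of the present invention can be implemented by a software algorithm, which reduces coupling to specific hardware so that it applies to more hardware platforms; it also needs no redundant processors, which reduces hardware design complexity and system implementation cost.
[59] The technical solution provided by the embodiments of the present invention is applicable to multiprocessor systems in which the microprocessors can communicate with one another by some means and the selected main processor can control the software and hardware operation of the other processors. The main processor may control other processors by directly accessing their control registers and modifying those registers, for example to reset another processor, change its state, or stop it. In addition, the microprocessors need not have the same structure or functions, and their spatial distribution is not restricted: they need not be integrated on the same integrated circuit substrate, mounted on the same circuit board, or placed in the same physical space, so the embodiments have a wide range of application. Taking the multicore system shown in FIG. 2 as an example, the following describes in detail how the main processor controls the other processors in the system to complete system fault recovery:
[60] As shown in FIG. 2, multiple microprocessors are connected by a bus to communicate; the processing authority of the microprocessors differs, for example, processor A controls device 1 and processor B controls devices 1-7.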
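[61] Step 1: When processor A fails, processor C is determined to be the main processor by the election algorithm.
[62] Step 2: Main processor C resets processor A.
[63] Step 3: Determine whether the reset succeeded. If so, the fault recovery operation is complete; otherwise, go to step 4.
[64] Step 4: Determine whether the number of reset attempts has reached the preset threshold n. If so, go to step 5; otherwise, go to step 2.
[65] Step 5: Main processor C isolates processor A; the isolation may stop processor A, or prevent processor A from accessing device 1, and so on.
[66] Step 6: Determine whether step 5 succeeded. If so, go to step 7; otherwise, the fault recovery of the multiprocessor system by the main processor has failed.
[67] Step 7: Main processor C reallocates the service for device 1 that was assigned to processor A to processor B (processor B is capable of processing the service for device 1).
[68] Step 8: Determine whether step 7 succeeded. If so, the fault recovery operation is complete; otherwise, the fault recovery of the multiprocessor system by the main processor has failed.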
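The reset in step 2 and the isolation in step 5 rely on the main processor writing to another processor's control register. A minimal simulation follows; the register layout is entirely hypothetical (the bit positions `RESET_BIT`, `HALT_BIT`, and `BUS_OFF_BIT` are assumptions, since the patent does not define one), and a real platform would use memory-mapped I/O rather than a Python object.

```python
RESET_BIT = 0x1    # hypothetical: writing 1 holds the processor in reset
HALT_BIT = 0x2     # hypothetical: writing 1 stops instruction execution
BUS_OFF_BIT = 0x4  # hypothetical: writing 1 blocks bus and device access


class ControlRegister:
    """Stand-in for a control register of a peer processor."""

    def __init__(self) -> None:
        self.value = 0

    def set_bits(self, mask: int) -> None:
        self.value |= mask

    def clear_bits(self, mask: int) -> None:
        self.value &= ~mask

    def is_set(self, mask: int) -> bool:
        return (self.value & mask) == mask


def pulse_reset(reg: ControlRegister) -> None:
    """Assert and then release the reset bit, as a reset request might."""
    reg.set_bits(RESET_BIT)
    reg.clear_bits(RESET_BIT)


def isolate(reg: ControlRegister) -> bool:
    """Halt the processor and cut its bus access; report whether it took."""
    reg.set_bits(HALT_BIT | BUS_OFF_BIT)
    return reg.is_set(HALT_BIT | BUS_OFF_BIT)
```

Reading the register back after the write gives the main processor the success indication that step 6 checks.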
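[69] All or part of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments above. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
[70] The embodiment of the present invention further provides an apparatus for fault recovery of a multiprocessor system, whose structure is shown in FIG. 3. The specific implementation structure may include:
[71] Isolation module 1, configured to control the main processor to isolate the failed hardware unit so that it does not affect the normal operation of the other hardware units in the system. If the isolation succeeds, the isolation module signals the service allocation module that it can start working; otherwise, the isolation module signals that the system fault recovery operation has failed, and the service allocation module does not work.
[72] The main processor isolates the failed hardware unit to prevent it from affecting the operation of the other normal hardware units in the system; the isolation may be achieved, for example, by stopping the failed hardware unit or blocking its communication with other processors.
[73] Service allocation module 2, configured to, after learning that the isolation of the failed hardware unit has succeeded, control the main processor to reallocate the services assigned to the failed hardware unit to a normal hardware unit in the system that is capable of processing those services. If the reallocation succeeds, the system fault recovery operation has succeeded; otherwise, it has failed.
[74] The apparatus of the embodiment above may further include reset module 3, configured to control the main processor to perform a software or hardware reset on the failed hardware unit. If the reset succeeds within a predetermined number of attempts, the multiprocessor fault recovery operation is complete; otherwise, isolation module 1 is notified to perform its operation.
[75] The apparatus of the embodiment above may further include at least one module 4 for determining the main processor, shown in FIG. 4, which may include:
[76] main processor designation module 41, configured to designate any normal processor in the system that can control the operation of other processors as the main processor;
[77] or
[78] election algorithm module 42, configured to execute an election algorithm and determine a normal processor in the system as the main processor according to preset election rules.
[79] Module 4 may further include detection module 43, configured to notify election algorithm module 42 to select the main processor when it detects system startup, system restart, a system fault, or a predetermined trigger condition being met while the system is running. In practice, the operator can set one or more trigger conditions on the condition trigger module to implement the corresponding trigger function.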
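The relationship between the detection module and the election algorithm module can be pictured as an event handler that calls an election callback. This is a sketch under assumptions: the class `DetectionModule`, the event names, and the representation of operator-defined trigger conditions as predicates over a state dictionary are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional

# Events that always trigger an election, following the list in the text.
BUILT_IN_EVENTS = frozenset({"startup", "restart", "fault"})


class DetectionModule:
    """Notifies the election module when an election should be run."""

    def __init__(self,
                 run_election: Callable[[], str],
                 triggers: Iterable[Callable[[Dict], bool]] = ()) -> None:
        self._run_election = run_election
        # Operator-configured conditions evaluated while the system runs.
        self._triggers = list(triggers)

    def on_event(self, event: str, state: Optional[Dict] = None) -> Optional[str]:
        """Return the elected main processor's name, or None if no election ran."""
        state = state or {}
        if event in BUILT_IN_EVENTS or any(t(state) for t in self._triggers):
            return self._run_election()
        return None
```

A time-interval trigger, the example the text gives for elections during normal operation, would be one predicate in `triggers` that compares elapsed time against the configured interval.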
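[80] In summary, in the technical solution provided by the embodiments of the present invention, all normal processors in the system take part in service processing and no group of processors has to perform the same data processing work, so the processing capacity of every processor is fully used and the working efficiency and processing capacity of the multiprocessor system are improved.
[81] The above is only a preferred specific embodiment of the present invention, and the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be defined by the claims.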

Claims

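[1] A method for fault recovery of a multiprocessor system, characterized by comprising:
a selected main processor isolating a failed hardware unit, and reallocating the services assigned to the failed hardware unit to a normal hardware unit in the system that is capable of processing those services.
[2] The method according to claim 1, characterized in that, before the selected main processor isolates the failed hardware unit, the method further comprises:
the main processor controlling the failed hardware unit to reset, wherein if the reset succeeds within a predetermined number of attempts, the multiprocessor system fault recovery operation is complete.
[3] The method according to claim 2, characterized in that the selected main processor isolating the failed hardware unit comprises:
setting a predetermined number of isolation attempts, the main processor isolating the failed hardware unit, and, when the isolation of the failed hardware unit fails, retrying the operation within the predetermined number of isolation attempts.
[4] The method according to claim 2, characterized in that reallocating the services assigned to the failed hardware unit comprises:
setting a predetermined number of reallocation attempts, the main processor reallocating the services assigned to the failed hardware unit to a normal hardware unit in the system that is capable of processing those services, and, when that reallocation fails, retrying the operation within the predetermined number of reallocation attempts.
[5] The method according to any one of claims 1 to 4, characterized in that the main processor is selected by:
designating any normal processor in the system that can control the operation of other processors as the main processor;
or
using an election algorithm to determine, according to preset election rules, a normal processor in the system as the main processor.
[6] The method according to claim 5, characterized in that using an election algorithm to determine, according to preset election rules, a normal processor in the system as the main processor comprises:
when the system starts, using the election algorithm to determine, according to the preset election rules, a normal processor in the system as the main processor;
or
when the system restarts, using the election algorithm to determine, according to the preset election rules, a normal processor in the system as the main processor;
or
when the system fails, using the election algorithm to determine, according to the preset election rules, a normal processor in the system as the main processor;
or
while the system is running, when a predetermined trigger condition is met, using the election algorithm to replace the main processor with a normal processor in the system according to the preset election rules.
[7] An apparatus for fault recovery of a multiprocessor system, characterized by comprising:
an isolation module, configured to control a main processor to isolate a failed hardware unit;
a service allocation module, configured to, after learning that the isolation of the failed hardware unit has succeeded, control the main processor to reallocate the services assigned to the failed hardware unit to a normal hardware unit in the system that is capable of processing those services.
[8] The apparatus according to claim 7, characterized in that the apparatus further comprises:
a reset module, configured to control the main processor to reset the failed hardware unit, wherein if the reset succeeds within a predetermined number of attempts, the multiprocessor fault recovery operation is complete; otherwise, the isolation module is notified to isolate the failed hardware unit.
[9] The apparatus according to claim 7 or 8, characterized in that the apparatus further comprises:
a main processor designation module, configured to designate any normal processor in the system that can control the operation of other processors as the main processor;
or
an election algorithm module, configured to execute an election algorithm and determine, according to preset election rules, a normal processor in the system as the main processor.
[10] The apparatus according to claim 9, characterized in that, if the election algorithm module is used, the apparatus further comprises a detection module, configured to notify the election algorithm module to determine the main processor when it detects system startup, system restart, a system fault, or a predetermined trigger condition being met while the system is running.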
PCT/CN2009/070154 2008-01-18 2009-01-15 Method and apparatus for fault recovery of a multiprocessor system WO2009092322A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200810056461.5 2008-01-18
CNA2008100564615A CN101216793A (zh) 2008-01-18 2008-01-18 Method and apparatus for fault recovery of a multiprocessor system

Publications (1)

Publication Number Publication Date
WO2009092322A1 (zh) 2009-07-30

Family

ID=39623229

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/070154 WO2009092322A1 (zh) 2008-01-18 2009-01-15 Method and apparatus for fault recovery of a multiprocessor system

Country Status (2)

Country Link
CN (1) CN101216793A (zh)
WO (1) WO2009092322A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105278651A (zh) * 2015-11-27 2016-01-27 Institute of Microelectronics of the Chinese Academy of Sciences Redundant control system

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
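CN101216793A (zh) * 2008-01-18 2008-07-09 Huawei Technologies Co., Ltd. Method and apparatus for fault recovery of a multiprocessor system
CN102053873B (zh) * 2011-01-13 2012-12-05 Zhejiang University Cache-aware method for guaranteeing virtual machine fault isolation on a multi-core processor
CN103425545A (zh) * 2013-08-20 2013-12-04 Inspur Electronic Information Industry Co., Ltd. System fault tolerance method for a multiprocessor server
WO2015042925A1 (zh) * 2013-09-29 2015-04-02 Huawei Technologies Co., Ltd. Server control method and server control device
CN105446833B (zh) * 2013-09-29 2020-04-14 Huawei Technologies Co., Ltd. Server control method and server control device
CN103634312A (zh) * 2013-11-26 2014-03-12 Guangzhou Jingrui Information Technology Co., Ltd. Device management method for fast multi-audio synchronization based on audio sharing
CN105700975B (zh) * 2016-01-08 2019-05-24 Huawei Technologies Co., Ltd. Method and apparatus for hot removal and hot addition of a central processing unit (CPU)
EP3471339B1 (en) 2016-10-31 2020-04-29 Huawei Technologies Co., Ltd. Method and enabling device for starting physical device
CN111132282B (zh) * 2018-11-01 2021-06-01 Huawei Device Co., Ltd. Application processor wake-up method and apparatus for a mobile terminal
CN109947586A (zh) * 2019-03-20 2019-06-28 Inspur Power Commercial Systems Co., Ltd. Method, apparatus, and medium for isolating a faulty device
CN111611111B (zh) * 2020-05-22 2020-12-22 Beijing Zhongke Haixun Digital Technology Co., Ltd. Fast fault recovery method and system for multiprocessor signal processing equipment
CN116051018B (zh) * 2022-11-25 2023-07-14 Beijing Duoke Information Technology Co., Ltd. Election processing method and apparatus, electronic device, and computer-readable storage medium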

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
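CN1928832A (zh) * 2005-09-09 2007-03-14 International Business Machines Corporation Method and system for state tracking and recovery in a multiprocessing computing system
CN1987804A (zh) * 2005-12-22 2007-06-27 International Business Machines Corporation Method and system for redundancy protection in a parallel computing system
CN101216793A (zh) * 2008-01-18 2008-07-09 Huawei Technologies Co., Ltd. Method and apparatus for fault recovery of a multiprocessor system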

Also Published As

Publication number Publication date
CN101216793A (zh) 2008-07-09

Similar Documents

Publication Publication Date Title
WO2009092322A1 (zh) Method and apparatus for fault recovery of a multiprocessor system
US8423816B2 (en) Method and computer system for failover
US7418627B2 (en) Cluster system wherein failover reset signals are sent from nodes according to their priority
JP4934642B2 (ja) Computer system
US8346933B2 (en) Virtual machine location system, virtual machine location method, program, virtual machine manager, and server
US8788672B2 (en) Microprocessor with software control over allocation of shared resources among multiple virtual servers
US8032786B2 (en) Information-processing equipment and system therefor with switching control for switchover operation
WO2016165304A1 (zh) Method and management device for instance node management
JP2007207219A (ja) Computer system management method, management server, computer system, and program
US11640314B2 (en) Service provision system, resource allocation method, and resource allocation program
JP2006178969A (ja) System and method for replacing an inoperable master workload management process
JP2007172334A (ja) Method, system, and program for ensuring redundancy in a parallel computing system
JP2003330740A (ja) Multiplexed computer system, logical computer allocation method, and logical computer allocation program
JP2007164305A (ja) Boot control method, computer system, and processing program therefor
EP2360614B1 (en) Information processing device and hardware setting method for said information processing device
JP2006285810A (ja) Cluster-configured computer system and system reset method therefor
WO2013190694A1 (ja) Computer recovery method, computer system, and storage medium
JP2004272899A (ja) Reset method in a computer system
KR20180094369A (ko) Network device and interrupt management method thereof
JP2009003537A (ja) Computer
CN109358982B (zh) Hard disk self-healing apparatus and method, and hard disk
CN106528276A (zh) Fault handling method based on task scheduling
JP2009026182A (ja) Program execution system and execution device
US10528397B2 (en) Method, device, and non-transitory computer readable storage medium for creating virtual machine
US20140059389A1 (en) Computer and memory inspection method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09704358

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09704358

Country of ref document: EP

Kind code of ref document: A1