CN117112296A - Fault processing method and device for redundant system, electronic equipment and storage medium - Google Patents

Fault processing method and device for redundant system, electronic equipment and storage medium

Info

Publication number
CN117112296A
Authority
CN
China
Prior art keywords
switch chip
processor
instruction
sending
monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311013917.0A
Other languages
Chinese (zh)
Inventor
张顺顺
王晓松
刘振
徐通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202311013917.0A
Publication of CN117112296A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/142 Reconfiguring to eliminate the error
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention provides a fault processing method and device for a redundant system, electronic equipment and a storage medium, applied to a system management device. The method comprises the following steps: monitoring all processors in the redundant system according to the monitoring link; in response to monitoring a fault processor, determining whether a working server that is in operation is mounted on a first switch chip corresponding to the fault processor; if yes, sending a first unloading instruction to the first switch chip and a first mounting instruction to a second switch chip corresponding to the target processor; in response to the second switch chip completing the mounting of the working server, sending a restarting instruction to the fault processor; and in response to the successful restarting of the fault processor, sending a second unloading instruction to the second switch chip and a second mounting instruction to the first switch chip. When a CPU fails, the system service is seamlessly switched to a normal CPU, meeting the high-reliability requirement of a redundant system.

Description

Fault processing method and device for redundant system, electronic equipment and storage medium
Technical Field
The present invention relates to the field of fault processing technologies, and in particular, to a fault processing method and apparatus for a redundant system, an electronic device, and a storage medium.
Background
As more and more businesses move online, servers must carry more and more data, and the more data they carry, the greater the risk they bear. Large volumes of data undergo continuous interactive computation every day, and data can be lost for many reasons. What matters is that, when a production system fails, data recovery and business takeover can be carried out effectively and rapidly so that the system does not stop and business continuity is ensured; this is a problem every enterprise has to face. When a server suffers a network attack, an intrusion, a power failure, or an operational error, the data the enterprise has deployed on the server may be lost, which has a significant business impact on the enterprise. The purpose of system redundancy is therefore that, when an accident happens, the original system can be recovered quickly and safely and the normal operation of the service is ensured within a certain range.
Existing dual-path or multi-path servers are not truly redundant in design. They only guarantee that the system is not powered off when the main CPU fails, by switching the key control rights of the system to the secondary CPU; however, the equipment attached under the PCIE data link goes offline when the main CPU fails, and the user's related processing operations cannot be completed. In the first scheme currently in use, a third-party CPLD monitors every CPU module for abnormalities and controls an electronic switch, which switches the management signal link of the intelligent cabinet's management system to the master management module or the slave management module. However, once the master management module fails, the devices attached under that module stay offline until the module fault is handled, and user requests cannot be served. Moreover, when the master device has no fault, the slave device is always idle, which works against the need for high equipment density and wastes computing resources.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a failure processing method, apparatus, electronic device, and storage medium that can realize a redundant system with high concurrency and high reliability.
In a first aspect, a fault handling method of a redundant system is provided, and the fault handling method is applied to a system management device, and the method includes:
monitoring all processors in the redundant system according to the monitoring link;
in response to monitoring a fault processor, determining whether a working server which is working is mounted on a first switch chip corresponding to the fault processor;
if yes, a first unloading instruction is sent to the first switch chip, and a first mounting instruction is sent to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restarting instruction to the fault processor;
and responding to the successful restarting of the fault processor, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip so as to mount the working server.
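As a non-limiting illustration, the sequence of steps above can be sketched as a small event-driven state machine. The class name, state names, instruction labels and destination labels below are hypothetical and do not appear in the application; the sketch only records which instruction would be sent in response to which event, and does not talk to real hardware.

```python
from enum import Enum, auto


class State(Enum):
    MONITORING = auto()   # no fault is being handled
    MIGRATING = auto()    # first unload/mount instructions sent
    RESTARTING = auto()   # restart instruction sent to the fault processor
    RESTORING = auto()    # second unload/mount instructions sent


class FailoverManager:
    """Hypothetical system management device that records instructions."""

    def __init__(self):
        self.state = State.MONITORING
        self.sent = []  # (destination, instruction) pairs in sending order

    def on_fault(self, has_working_server: bool) -> None:
        # Fault detected: switch over only if a working server is mounted.
        if self.state is State.MONITORING and has_working_server:
            self.sent.append(("first_switch", "unload_1"))
            self.sent.append(("second_switch", "mount_1"))
            self.state = State.MIGRATING

    def on_mount_complete(self) -> None:
        if self.state is State.MIGRATING:
            # Second switch chip finished mounting: restart the fault processor.
            self.sent.append(("fault_processor", "restart"))
            self.state = State.RESTARTING
        elif self.state is State.RESTORING:
            # First switch chip finished re-mounting: back to normal monitoring.
            self.state = State.MONITORING

    def on_restart_success(self) -> None:
        if self.state is State.RESTARTING:
            self.sent.append(("second_switch", "unload_2"))
            self.sent.append(("first_switch", "mount_2"))
            self.state = State.RESTORING
```

Keeping the transitions keyed on completion events mirrors the "in response to" wording of the steps: no instruction is issued until the preceding step has been reported as complete.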
In one embodiment, the monitoring all processors in the redundant system according to the monitoring link includes:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
and determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
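The three-link decision described above can be illustrated with a minimal sketch: a processor is declared faulty only when all three monitoring links have alarmed, and is merely kept under observation when one or two have. The link labels and the return values "healthy", "watch" and "faulty" are illustrative assumptions, not terms from the application.

```python
# Hypothetical labels for the three monitoring links named in the text.
LINKS = frozenset({"heartbeat", "interrupt_alarm", "abnormal_info_routing"})


def assess(alarms: set) -> str:
    """Classify a processor from the set of links that have alarmed."""
    unknown = set(alarms) - LINKS
    if unknown:
        raise ValueError(f"unknown monitoring links: {sorted(unknown)}")
    if not alarms:
        return "healthy"
    if set(alarms) == LINKS:
        return "faulty"   # all three links alarmed
    return "watch"        # one or two alarmed: keep monitoring the others
```

Requiring agreement of all three links trades detection speed for robustness against a single link being disturbed.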
In one embodiment, the sending the first offload instruction to the first switch chip and sending the first mount instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and in response to the second switch chip completing the mounting of the working server, sending reset information to the working server.
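The ordering of the first switchover (unload, register modification, mount, reset) can be sketched as follows. `FakeSwitch` is a hypothetical in-memory stand-in for a switch chip, and the trace labels are assumptions made for illustration; a real implementation would wait for completion reports from the hardware instead of return values.

```python
class FakeSwitch:
    """Hypothetical in-memory stand-in for a switch chip."""

    def __init__(self, name, mounted=()):
        self.name = name
        self.mounted = set(mounted)
        self.registers_updated = False

    def unload(self, servers) -> bool:
        self.mounted -= set(servers)
        return True  # report completion

    def mount(self, servers) -> bool:
        self.mounted |= set(servers)
        return True  # report completion


def first_switchover(first: FakeSwitch, second: FakeSwitch, servers) -> list:
    """Return the ordered trace of actions taken by the management device."""
    trace = ["unload_1"]
    if not first.unload(servers):
        return trace                      # stop: unloading not completed
    first.registers_updated = True        # modify register configuration
    trace.append("modify_registers")
    trace.append("mount_1")
    if not second.mount(servers):
        return trace                      # stop: mounting not completed
    trace.append("reset_servers")         # reset information to the servers
    return trace
```

Each later action is gated on the completion of the previous one, which is the property the embodiment relies on.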
In one embodiment, the determining the target processor from the all processors according to the number of the working servers includes:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
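The selection rule above (idle servers not smaller than working servers) can be sketched as a simple filter. Excluding the fault processor from the candidates and picking the first qualifying processor are assumptions of this sketch; the application does not state a tie-breaking rule.

```python
from typing import Optional


def select_target(idle_counts: dict, fault_processor: str,
                  working_count: int) -> Optional[str]:
    """Pick a processor whose idle-server count covers the working servers.

    idle_counts maps a processor name to its number of idle servers.
    Returns None when no processor has enough idle capacity.
    """
    for cpu, idle in idle_counts.items():
        if cpu != fault_processor and idle >= working_count:
            return cpu  # assumption: the first qualifying processor wins
    return None
```

For example, with idle counts `{"cpu0": 4, "cpu1": 1, "cpu2": 3}`, a faulty `cpu0` and two working servers, the sketch selects `cpu2`.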
In one embodiment, the sending the second offload instruction to the second switch chip and the sending the second mount instruction to the first switch chip include:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, there is also provided a fault handling method of a redundancy system, applied to a first switch chip, the method including:
releasing work port resources in the fault processor in response to receiving a first unloading instruction sent by the system management device;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server according to the first high-speed serial computer expansion bus (PCIe) between the first switch chip and the fault processor.
In one embodiment, there is also provided a fault handling method of a redundant system, applied to a second switch chip, the method including:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the work server according to a second high-speed serial computer expansion bus with the target processor;
and responding to receiving a second unloading instruction sent by the system management device, and releasing the work port resources in the target processor.
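Seen from the switch chip side, the two embodiments above reduce to allocating port resources on a mounting instruction and releasing them on an unloading instruction. The sketch below models this with a simple port pool; the class, the instruction labels and the idea of tracking ports as integers inside the chip object are simplifying assumptions, not details from the application.

```python
class SwitchChip:
    """Hypothetical switch chip that tracks work port assignments."""

    MOUNT = {"mount_1", "mount_2"}
    UNLOAD = {"unload_1", "unload_2"}

    def __init__(self, total_ports: int):
        self.free_ports = list(range(total_ports))
        self.assigned = {}  # server name -> port number

    def handle(self, instruction: str, servers) -> None:
        if instruction in self.MOUNT:
            for server in servers:
                if server in self.assigned:
                    continue  # already mounted, nothing to allocate
                if not self.free_ports:
                    raise RuntimeError("no free work port available")
                self.assigned[server] = self.free_ports.pop(0)
        elif instruction in self.UNLOAD:
            for server in servers:
                port = self.assigned.pop(server, None)
                if port is not None:
                    self.free_ports.append(port)  # release the port
        else:
            raise ValueError(f"unknown instruction: {instruction}")
```

The same handler serves both chips: the first switch chip receives `unload_1` and later `mount_2`, while the second receives `mount_1` and later `unload_2`.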
In another aspect, there is provided a fault handling apparatus for a redundant system, for use in a system management apparatus, the apparatus comprising:
the monitoring module monitors all processors in the redundant system according to the monitoring link;
the determining module is used for determining whether a working server which works is mounted on a first switch chip corresponding to the fault processor or not in response to the monitoring of the fault processor;
a first sending module, configured to, if yes, send a first unloading instruction to the first switch chip and send a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
a second sending module, configured to send a restart instruction to the fault processor in response to the second switch chip completing the mounting of the working server;
and the third sending module is used for responding to the restarting success of the fault processor, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip so as to mount the working server.
In one embodiment, the monitoring module monitors all processors in the redundant system according to the monitoring link, including:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
and determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
In one embodiment, the sending, by the first sending module, the first offload instruction to the first switch chip and the sending, by the first sending module, the first mount instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
In one embodiment, the determining, by the first sending module, the target processor from the all processors according to the number of working servers includes:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
In one embodiment, the sending, by the third sending module, of the second offload instruction to the second switch chip and of the second mount instruction to the first switch chip includes:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, there is also provided a fault handling apparatus of a redundancy system, applied to a first switch chip, the apparatus including:
the first releasing module is used for responding to the first unloading instruction sent by the system management device and releasing the work port resources in the fault processor;
and the first allocation module is used for responding to the second mounting instruction sent by the system management device and reallocating the work port resources to the working server according to the first high-speed serial computer expansion bus (PCIe) between the first switch chip and the fault processor.
In one embodiment, there is also provided a fault handling apparatus of a redundancy system, applied to a second switch chip, the apparatus including:
the second allocation module is used for responding to the first mounting instruction sent by the system management device and allocating work port resources to the work server according to the second high-speed serial computer expansion bus with the target processor;
and the second releasing module is used for responding to the second unloading instruction sent by the system management device and releasing the work port resources in the target processor.
In yet another aspect, an electronic device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
monitoring all processors in the redundant system according to the monitoring link;
in response to monitoring a fault processor, determining whether a working server which is working is mounted on a first switch chip corresponding to the fault processor;
if yes, a first unloading instruction is sent to the first switch chip, and a first mounting instruction is sent to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restarting instruction to the fault processor;
and responding to the successful restarting of the fault processor, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip so as to mount the working server.
In one embodiment, the processor, when executing the computer program, performs the steps of:
the monitoring of all processors in the redundant system according to the monitoring link comprises:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
and determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
In one embodiment, the processor, when executing the computer program, performs the steps of:
the sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
In one embodiment, the processor, when executing the computer program, performs the steps of:
said determining a target processor from said all processors according to said number of working servers comprises:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
In one embodiment, the processor, when executing the computer program, performs the steps of:
the sending the second unloading instruction to the second switch chip and the sending the second mounting instruction to the first switch chip include:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, the processor, when executing the computer program, performs the steps of:
releasing work port resources in the fault processor in response to receiving a first unloading instruction sent by the system management device;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server according to the first high-speed serial computer expansion bus (PCIe) between the first switch chip and the fault processor.
In one embodiment, the processor, when executing the computer program, performs the steps of:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the working server according to a second high-speed serial computer expansion bus with the target processor;
and responding to receiving a second unloading instruction sent by the system management device, and releasing the work port resources in the target processor.
In yet another aspect, a computer readable storage medium is provided, having stored thereon a computer program which when executed by a processor performs the steps of:
monitoring all processors in the redundant system according to the monitoring link;
in response to monitoring a fault processor, determining whether a working server which is working is mounted on a first switch chip corresponding to the fault processor;
if yes, a first unloading instruction is sent to the first switch chip, and a first mounting instruction is sent to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restarting instruction to the fault processor;
and responding to the successful restarting of the fault processor, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip so as to mount the working server.
In one embodiment, the computer program when executed by a processor performs the steps of:
the monitoring of all processors in the redundant system according to the monitoring link comprises:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
and determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
In one embodiment, the computer program when executed by a processor performs the steps of:
the sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
In one embodiment, the computer program when executed by a processor performs the steps of:
said determining a target processor from said all processors according to said number of working servers comprises:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
In one embodiment, the computer program when executed by a processor performs the steps of:
the sending the second unloading instruction to the second switch chip and the sending the second mounting instruction to the first switch chip include:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, the computer program when executed by a processor performs the steps of:
releasing work port resources in the fault processor in response to receiving a first unloading instruction sent by the system management device;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server according to the first high-speed serial computer expansion bus (PCIe) between the first switch chip and the fault processor.
In one embodiment, the computer program when executed by a processor performs the steps of:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the work server according to a second high-speed serial computer expansion bus with the target processor;
and responding to receiving a second unloading instruction sent by the system management device, and releasing the work port resources in the target processor.
All processors in the redundant system are monitored according to the monitoring link; in response to monitoring a fault processor, it is determined whether a working server that is in operation is mounted on a first switch chip corresponding to the fault processor; if yes, a first unloading instruction is sent to the first switch chip and a first mounting instruction is sent to a second switch chip corresponding to a target processor in the redundant system, so that the working server is mounted to the target processor; in response to the second switch chip completing the mounting of the working server, a restarting instruction is sent to the fault processor; and in response to the successful restarting of the fault processor, a second unloading instruction is sent to the second switch chip and a second mounting instruction is sent to the first switch chip, so that the working server is re-mounted to the repaired fault processor. The CPUs work simultaneously to meet the demand for highly concurrent computation, and when one CPU goes down, the system service can be seamlessly switched to another CPU, so that the server system achieves highly concurrent data computation while meeting the requirement of high reliability.
Drawings
FIG. 1 is a system topology of a fault handling method for a redundant system;
FIG. 2 is a schematic diagram illustrating steps of a fault handling method for a redundant system applied to a system management device;
FIG. 3 is a system topology of a multiple switch chip interconnect system;
FIG. 4 is a schematic diagram of a fault handling device of a redundant system applied to a system management device;
FIG. 5 is an internal structural diagram of a computer device in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The system management device may be composed of a BMC, an mCPU (Management CPU, serving as the management center in the server) and a CPLD. The BMC monitors the working state of the CPU modules and notifies the mCPU of a fault state through an LPC/IIC (inter-integrated circuit bus) signal. The mCPU communicates with the PCIE FabricSwitch through UART (universal asynchronous receiver/transmitter, a universal serial data bus for asynchronous communication whose bidirectional communication allows full-duplex transmission and reception), and the CPU configuration is modified through the PCIE link; once the unloading and mounting are complete, the reset is issued through the CPLD. Whenever a Device unloading and mounting has to be completed again, the mCPU likewise communicates with the PCIE FabricSwitch through UART and modifies the register configuration, which will not be described in detail later.
In one embodiment, as shown in fig. 2, the present invention provides a fault handling method of a redundant system, applied to a system management device, the method comprising:
S201, monitoring all processors in a redundant system according to a monitoring link;
S202, determining, in response to monitoring a fault processor, whether a working server which is working is mounted on a first switch chip corresponding to the fault processor;
S203, if yes, sending a first unloading instruction to the first switch chip and sending a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
S204, sending a restarting instruction to the fault processor in response to the second switch chip completing the mounting of the working server;
S205, in response to the successful restarting of the fault processor, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip so as to mount the working server.
Specifically, the redundant system may include a plurality of CPUs and a PCIE FabricSwitch corresponding to each CPU. In the normal working state, each CPU is interconnected with devices through the PCIE link and the PCIE FabricSwitch and performs tasks such as data processing, control management, and high-performance computing; because the PCIE FabricSwitch uses and accesses the computing resources of its CPU, a device mounted on a PCIE FabricSwitch is equivalent to a device mounted on the corresponding CPU. Each CPU has 1 heartbeat monitoring signal (mcpu_heartError), 1 abnormal interrupt alarm signal (SMI_GPIO) and 1 abnormal information routing signal (MDI) connected to the processing device, and the processing device monitors the working state of each CPU in real time through the heartbeat monitoring link, the interrupt alarm link and the abnormal information routing link. When the processing device detects that a CPU has failed, it first determines whether a device that is in operation is mounted under the faulty CPU. If the mounted devices are idle, the faulty CPU is idle as well; a restarting instruction is then sent directly to the faulty CPU, and restarting it does not affect the mounted devices. If working server equipment is mounted, equipment switching must be completed, and after the faulty CPU has restarted and successfully resumed normal work, the server equipment is switched back. Thus, when there is no fault, the multiple CPUs work simultaneously and provide high computing power for high concurrency; when one CPU fails, the switchover is transparent to users, ensuring that the downstream devices keep working normally.
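The branch described in the preceding paragraph (restart directly when the mounted devices are idle, otherwise switch over, restart and switch back) can be sketched as a small planning function. The function name, the action labels and the `"faulty_cpu"` placeholder are hypothetical and serve only to illustrate the decision.

```python
def plan_recovery(devices: dict) -> list:
    """Plan the recovery actions for a faulty CPU.

    devices maps each device mounted under the faulty CPU to True if it is
    currently working and False if it is idle.
    """
    working = sorted(name for name, busy in devices.items() if busy)
    if not working:
        # Idle devices are unaffected by a restart, so restart directly.
        return [("restart", "faulty_cpu")]
    return [
        ("switch_to_target", working),  # move working devices away first
        ("restart", "faulty_cpu"),
        ("switch_back", working),       # restore them after a good restart
    ]
```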
In one embodiment, monitoring all processors in the redundant system according to the monitoring link includes:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
in response to one monitoring link corresponding to an alarm processor raising an alarm, continuing to monitor the alarm processor through the other two monitoring links;
in response to two monitoring links corresponding to the alarm processor raising alarms, continuing to monitor the alarm processor through the remaining monitoring link;
and in response to determining that all three monitoring links corresponding to the alarm processor have raised alarms, determining that the alarm processor has failed.
Specifically, each CPU is connected to the processing device through one heartbeat monitoring signal (mcpu_heartError), one abnormal interrupt alarm signal (SMI_GPIO), and one abnormal information routing signal (MDI), and the processing device monitors the working state of each CPU module in real time through the heartbeat monitoring link, the interrupt alarm link, and the abnormal information routing link. To prevent a misjudgment caused by interference on a single link, the processing device waits for the feedback signals of all three monitoring links of the CPU, and determines that the CPU is in a fault state only when all three feedback signals (MDI, mcpu_heartError, SMI_GPIO) raise alarms.
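The three-link rule can be illustrated with a short Python sketch. The link keys in `LINKS`, the `assess` function, and the returned labels are hypothetical names introduced for this example; only the rule itself (a fault is declared when all three links alarm, otherwise monitoring continues) comes from the description.

```python
LINKS = ("heartbeat", "smi_gpio", "mdi")  # hypothetical keys for the three links


def assess(alarms: dict) -> str:
    """Classify a CPU from the alarm flags of its three monitoring links."""
    raised = sum(bool(alarms.get(link, False)) for link in LINKS)
    if raised == len(LINKS):
        return "failed"           # all three links alarm: declare a fault
    if raised > 0:
        return "keep_monitoring"  # one or two links alarm: may be interference
    return "healthy"
```

Requiring unanimity trades detection latency for robustness: a single disturbed link cannot trigger a restart or a device switch on its own.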
In one embodiment, sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of working servers;
sending the first unloading instruction to the first switch chip;
in response to the first switch chip completing the unloading of the working server, modifying a register configuration of the first switch chip and sending the first mounting instruction to the second switch chip;
and in response to the second switch chip completing the mounting of the working server, sending reset information to the working server.
Specifically, assuming that CPU0 fails, the processing device sends an unloading instruction to PCIE FabricSwitch0 and then modifies the internal register configuration of PCIE FabricSwitch0, so that PCIE FabricSwitch0 allows the three server devices mounted under it to be mounted under PCIE FabricSwitch1 (the second switch chip) through the interconnected ports. At this point, the three devices originally mounted under CPU0 are mounted under CPU1. After the unloading and mounting operations are completed, the processing device sends a PERST signal to each corresponding working server device; after the devices are reset, they are all mounted under CPU1 and work normally, as shown in the drawing.
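The ordering of the switching operation can be sketched in Python as below. The `send` callback, the command strings (`"UNLOAD"`, `"SET_REGISTER allow_cross_mount=1"`, `"MOUNT"`, `"PERST"`), and the `fail_over` function are assumptions made for illustration; the application does not define a command format. The sketch only shows that each step is issued after the previous one has been acknowledged.

```python
from typing import Callable, List

# Hypothetical transport: (destination, command) -> True if acknowledged.
Send = Callable[[str, str], bool]


def fail_over(send: Send, first_sw: str, second_sw: str, devices: List[str]) -> bool:
    """Move working devices from first_sw to second_sw; stop at the first failure."""
    if not send(first_sw, "UNLOAD"):
        return False
    # Only after unloading completes: allow cross-switch mounting, then mount.
    if not send(first_sw, "SET_REGISTER allow_cross_mount=1"):
        return False
    if not send(second_sw, "MOUNT"):
        return False
    # After mounting completes, reset each device so it links under the target CPU.
    return all(send(dev, "PERST") for dev in devices)
```

If any step is not acknowledged, the function returns `False` without issuing later steps, so a device is never reset before it has been mounted under the target CPU.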
In one embodiment, determining the target processor from all the processors according to the number of working servers includes:
determining the number of idle servers corresponding to each processor;
and determining a processor whose number of idle servers is not smaller than the number of working servers as the target processor.
Specifically, because the redundant system includes a plurality of CPUs and a PCIE FabricSwitch (that is, a switch chip) corresponding to each CPU, a target CPU must be determined when CPU0 fails. For example, PCIE FabricSwitch0 corresponding to the failed CPU0 has four devices mounted on it, of which only three are working, so the number of working servers is 3. The number of idle servers mounted on the PCIE FabricSwitch of the target CPU must then be no less than 3. The more idle servers there are, the more resources the CPU corresponding to that PCIE FabricSwitch has available to allocate to the three working servers of the failed CPU0. Therefore, when selecting the target CPU, the CPU whose PCIE FabricSwitch can mount the most servers and currently mounts the most idle servers can be selected. The more servers a CPU can mount, the stronger its performance; the more idle servers it currently mounts, the more processor resources it can allocate.
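The selection rule can be sketched in Python as follows. The `Cpu` record, its field names, and `pick_target` are hypothetical. The description names two preferences (most mountable servers, most idle servers) without ranking them, so comparing capacity first and idle count second is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cpu:
    name: str
    capacity: int      # servers its switch chip can mount
    idle_servers: int  # idle servers currently mounted
    healthy: bool = True


def pick_target(cpus: List[Cpu], working_needed: int) -> Optional[Cpu]:
    """Pick a healthy CPU with at least `working_needed` idle servers."""
    candidates = [c for c in cpus if c.healthy and c.idle_servers >= working_needed]
    if not candidates:
        return None  # no CPU can take over the working servers
    # Assumed tie-break: larger capacity first, then more idle servers.
    return max(candidates, key=lambda c: (c.capacity, c.idle_servers))
```

With three working servers to relocate, a CPU with two idle servers is excluded, and among the remaining candidates the one with the larger capacity is chosen.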
In one embodiment, sending the second unloading instruction to the second switch chip and sending the second mounting instruction to the first switch chip includes:
sending the second unloading instruction to the second switch chip;
in response to the second switch chip completing the unloading of the working server, sending the second mounting instruction to the first switch chip;
and in response to the first switch chip completing the re-mounting of the working server, modifying the register configuration and sending the reset information to the working server.
Specifically, after the mounting of the working server devices is completed, the processing device restarts the failed CPU0, and when CPU0 has restarted successfully, the upper-layer user equipment is notified that the restart is complete. While CPU0 is restarting, all devices work normally through CPU1, and service processing is not affected. When the failed module has restarted successfully and PCIE FabricSwitch0 and CPU0 are reconnected, an unloading instruction is sent to PCIE FabricSwitch1, which releases its port resources so that the three devices mounted under CPU1 are unloaded. A mounting instruction is then sent to PCIE FabricSwitch0, which allocates task resources in CPU0 to the currently working devices through its PCIE link with CPU0. Next, the register configuration of PCIE FabricSwitch0 is restored to its original value, that is, servers mounted under CPU0 are no longer allowed to be mounted under CPU1 through the port interconnection between PCIE FabricSwitch0 and PCIE FabricSwitch1. Finally, reset information is sent to the devices, and the devices formally start working after the reset.
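The switch-back order can be written as a small table-driven Python sketch. As before, the `send` callback, the command strings, and `switch_back` are hypothetical; only the order (unload from the second switch chip, mount on the first, restore the register configuration, reset the devices) follows the description.

```python
from typing import Callable, List

# Hypothetical transport: (destination, command) -> True if acknowledged.
Send = Callable[[str, str], bool]


def switch_back(send: Send, first_sw: str, second_sw: str, devices: List[str]) -> bool:
    """Return working devices to first_sw after the failed CPU has restarted."""
    steps = [
        (second_sw, "UNLOAD"),                           # release ports under the target CPU
        (first_sw, "MOUNT"),                             # re-allocate ports under the restarted CPU
        (first_sw, "SET_REGISTER allow_cross_mount=0"),  # restore the original configuration
    ] + [(dev, "PERST") for dev in devices]              # reset each device last
    # all() over a generator stops at the first step that is not acknowledged.
    return all(send(dst, cmd) for dst, cmd in steps)
```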
In one embodiment, there is also provided a fault handling method of a redundant system, applied to a first switch chip, the method including:
in response to receiving a first unloading instruction sent by the system management device, releasing work port resources in the fault processor;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server via a first high-speed serial computer expansion bus between the first switch chip and the fault processor.
Specifically, upon receiving the first unloading instruction sent by the processing device, PCIE FabricSwitch0 releases the work port resources corresponding to the three ports S4, S5, and S6, thereby unloading the three devices mounted under CPU0. Then, after the processing device modifies the register configuration of PCIE FabricSwitch0, the devices originally mounted under CPU0 are mounted under CPU1 through port communication with PCIE FabricSwitch1. Later, upon receiving the second mounting instruction sent by the processing device, PCIE FabricSwitch0 reallocates the work port resources in CPU0 to the working devices through the PCIE link that has been successfully re-established with CPU0.
In one embodiment, there is also provided a fault handling method of a redundant system, applied to a second switch chip, the method including:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the working server via a second high-speed serial computer expansion bus between the second switch chip and the target processor;
and in response to receiving a second unloading instruction sent by the system management device, releasing the work port resources in the target processor.
Specifically, as described above, upon receiving the first mounting instruction sent by the processing device, PCIE FabricSwitch1 mounts the devices originally mounted under CPU0 onto CPU1 through port communication with PCIE FabricSwitch0 and the PCIE link between itself and CPU1, so that the work port resources in CPU1 can be allocated to the three working devices. Upon receiving the second unloading instruction sent by the processing device, PCIE FabricSwitch1 releases the work port resources allocated to the three working devices, thereby unloading them.
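The behaviour of the two switch chips can be modelled with a toy Python class. `SwitchChip`, its fields, and the port counts are hypothetical; the class only tracks how work port resources are released on an unloading instruction and allocated on a mounting instruction, and says nothing about real PCIE configuration.

```python
class SwitchChip:
    """Toy model of one switch chip and the port resources of its CPU."""

    def __init__(self, name: str, free_ports: int):
        self.name = name
        self.free_ports = free_ports
        self.mounted = set()

    def handle_mount(self, devices) -> bool:
        """Allocate one port per new device; refuse if ports are insufficient."""
        new = set(devices) - self.mounted
        if len(new) > self.free_ports:
            return False  # not enough port resources; mount nothing
        self.mounted |= new
        self.free_ports -= len(new)
        return True

    def handle_unload(self, devices) -> None:
        """Release the ports of the listed devices that are mounted here."""
        released = set(devices) & self.mounted
        self.mounted -= released
        self.free_ports += len(released)
```

A round trip then reads: the first chip unloads the three devices, the second chip mounts them, and after the restart the second chip unloads them and the first chip mounts them again, leaving both chips with their original free port counts.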
Fig. 3 is a topology diagram of the interconnection of multiple switch chips. SW0, SW2, SW4, and SW6 in column A are regarded as the upstream SWs in the 2×4 topology, and SW1, SW3, SW5, and SW7 in column B are regarded as the downstream SWs. By interconnecting multiple switch chips, more upstream hosts and more downstream devices can be connected, and redundant backup and switching can then be performed between hosts and devices. The method can be applied to a cluster server or a data center to improve its stability and efficiency.
The scheme of the application has the following beneficial effects:
1) Devices in the redundant system are not divided into master and slave devices, so all of them can work simultaneously to meet high-concurrency computing requirements. At the same time, if one CPU goes down, system services can be switched seamlessly to another CPU, so that the server system achieves high-concurrency data computation while also meeting high-reliability requirements;
2) The scheme avoids the situation in which, when a master device such as a CPU fails, the slave devices such as the devices mounted under it remain idle, which is detrimental where dense device deployment is required; the problem of wasted computing resources can thus be effectively solved.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, a fault handling apparatus of a redundant system is provided, applied to a system management device, the apparatus including:
a monitoring module 401, configured to monitor all processors in the redundant system according to the monitoring link;
a determining module 402, configured to determine, in response to detecting a fault processor, whether a working server that is currently working is mounted on a first switch chip corresponding to the fault processor;
a first sending module 403, configured to, if so, send a first unloading instruction to the first switch chip and send a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
a second sending module 404, configured to send a restart instruction to the fault processor in response to the second switch chip completing the mounting of the working server;
and a third sending module 405, configured to, in response to the fault processor restarting successfully, send a second unloading instruction to the second switch chip and send a second mounting instruction to the first switch chip, so as to remount the working server.
In one embodiment, the monitoring, by the monitoring module, of all processors in the redundant system according to the monitoring link includes:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
in response to one monitoring link corresponding to an alarm processor raising an alarm, continuing to monitor the alarm processor through the other two monitoring links;
in response to two monitoring links corresponding to the alarm processor raising alarms, continuing to monitor the alarm processor through the remaining monitoring link;
and in response to determining that all three monitoring links corresponding to the alarm processor have raised alarms, determining that the alarm processor has failed.
In one embodiment, the sending, by the first sending module, of the first unloading instruction to the first switch chip and of the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of working servers;
sending the first unloading instruction to the first switch chip;
in response to the first switch chip completing the unloading of the working server, modifying a register configuration of the first switch chip and sending the first mounting instruction to the second switch chip;
and in response to the second switch chip completing the mounting of the working server, sending reset information to the working server.
In one embodiment, the determining, by the first sending module, of the target processor from all the processors according to the number of working servers includes:
determining the number of idle servers corresponding to each processor;
and determining a processor whose number of idle servers is not smaller than the number of working servers as the target processor.
In one embodiment, the sending, by the third sending module, of the second unloading instruction to the second switch chip and of the second mounting instruction to the first switch chip includes:
sending the second unloading instruction to the second switch chip;
in response to the second switch chip completing the unloading of the working server, sending the second mounting instruction to the first switch chip;
and in response to the first switch chip completing the re-mounting of the working server, modifying the register configuration and sending the reset information to the working server.
In one embodiment, there is also provided a fault handling apparatus of a redundant system, applied to a first switch chip, the apparatus including:
a first releasing module, configured to release work port resources in the fault processor in response to receiving a first unloading instruction sent by the system management device;
and a first allocation module, configured to, in response to receiving a second mounting instruction sent by the system management device, reallocate the work port resources to the working server via a first high-speed serial computer expansion bus between the first switch chip and the fault processor.
In one embodiment, there is also provided a fault handling apparatus of a redundant system, applied to a second switch chip, the apparatus including:
a second allocation module, configured to, in response to receiving a first mounting instruction sent by the system management device, allocate work port resources to the working server via a second high-speed serial computer expansion bus between the second switch chip and the target processor;
and a second releasing module, configured to release the work port resources in the target processor in response to receiving a second unloading instruction sent by the system management device.
For specific limitations on the fault handling apparatus of the redundant system, reference may be made to the above limitations on the fault handling method of the redundant system, which are not repeated here. Each module in the above fault handling apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the fault handling method of the redundant system. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a key, trackball, or touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 5 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine some components, or have a different arrangement of components.
In one embodiment, an electronic device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
monitoring all processors in the redundant system according to the monitoring link;
in response to detecting a fault processor, determining whether a working server that is currently working is mounted on a first switch chip corresponding to the fault processor;
if so, sending a first unloading instruction to the first switch chip and sending a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor;
and in response to the fault processor restarting successfully, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip.
In one embodiment, the processor, when executing the computer program, performs the steps of:
the monitoring of all processors in the redundant system according to the monitoring link comprises:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
in response to one monitoring link corresponding to an alarm processor raising an alarm, continuing to monitor the alarm processor through the other two monitoring links;
in response to two monitoring links corresponding to the alarm processor raising alarms, continuing to monitor the alarm processor through the remaining monitoring link;
and in response to determining that all three monitoring links corresponding to the alarm processor have raised alarms, determining that the alarm processor has failed.
In one embodiment, the processor, when executing the computer program, performs the steps of:
sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of working servers;
sending the first unloading instruction to the first switch chip;
in response to the first switch chip completing the unloading of the working server, modifying a register configuration of the first switch chip and sending the first mounting instruction to the second switch chip;
and in response to the second switch chip completing the mounting of the working server, sending reset information to the working server.
In one embodiment, the processor, when executing the computer program, performs the steps of:
determining the target processor from all the processors according to the number of working servers includes:
determining the number of idle servers corresponding to each processor;
and determining a processor whose number of idle servers is not smaller than the number of working servers as the target processor.
In one embodiment, the processor, when executing the computer program, performs the steps of:
sending the second unloading instruction to the second switch chip and sending the second mounting instruction to the first switch chip includes:
sending the second unloading instruction to the second switch chip;
in response to the second switch chip completing the unloading of the working server, sending the second mounting instruction to the first switch chip;
and in response to the first switch chip completing the re-mounting of the working server, modifying the register configuration and sending the reset information to the working server.
In one embodiment, the processor, when executing the computer program, performs the steps of:
in response to receiving a first unloading instruction sent by the system management device, releasing work port resources in the fault processor;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server via a first high-speed serial computer expansion bus between the first switch chip and the fault processor.
In one embodiment, the processor, when executing the computer program, performs the steps of:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the working server via a second high-speed serial computer expansion bus between the second switch chip and the target processor;
and in response to receiving a second unloading instruction sent by the system management device, releasing the work port resources in the target processor.
In one embodiment, a computer readable storage medium is provided having stored thereon a computer program which when executed by a processor performs the steps of:
monitoring all processors in the redundant system according to the monitoring link;
in response to detecting a fault processor, determining whether a working server that is currently working is mounted on a first switch chip corresponding to the fault processor;
if so, sending a first unloading instruction to the first switch chip and sending a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor;
and in response to the fault processor restarting successfully, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip.
In one embodiment, the computer program when executed by a processor performs the steps of:
the monitoring of all processors in the redundant system according to the monitoring link comprises:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
in response to one monitoring link corresponding to an alarm processor raising an alarm, continuing to monitor the alarm processor through the other two monitoring links;
in response to two monitoring links corresponding to the alarm processor raising alarms, continuing to monitor the alarm processor through the remaining monitoring link;
and in response to determining that all three monitoring links corresponding to the alarm processor have raised alarms, determining that the alarm processor has failed.
In one embodiment, the computer program when executed by a processor performs the steps of:
sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of working servers;
sending the first unloading instruction to the first switch chip;
in response to the first switch chip completing the unloading of the working server, modifying a register configuration of the first switch chip and sending the first mounting instruction to the second switch chip;
and in response to the second switch chip completing the mounting of the working server, sending reset information to the working server.
In one embodiment, the computer program when executed by a processor performs the steps of:
determining the target processor from all the processors according to the number of working servers includes:
determining the number of idle servers corresponding to each processor;
and determining a processor whose number of idle servers is not smaller than the number of working servers as the target processor.
In one embodiment, the computer program when executed by a processor performs the steps of:
sending the second unloading instruction to the second switch chip and sending the second mounting instruction to the first switch chip includes:
sending the second unloading instruction to the second switch chip;
in response to the second switch chip completing the unloading of the working server, sending the second mounting instruction to the first switch chip;
and in response to the first switch chip completing the re-mounting of the working server, modifying the register configuration and sending the reset information to the working server.
In one embodiment, the computer program when executed by a processor performs the steps of:
in response to receiving a first unloading instruction sent by the system management device, releasing work port resources in the fault processor;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server via a first high-speed serial computer expansion bus between the first switch chip and the fault processor.
In one embodiment, the computer program when executed by a processor performs the steps of:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the working server via a second high-speed serial computer expansion bus between the second switch chip and the target processor;
and in response to receiving a second unloading instruction sent by the system management device, releasing the work port resources in the target processor.
Those skilled in the art will appreciate that all or part of the processes in the above method embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored on a non-transitory computer readable storage medium and, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above embodiments represent only a few implementations of the application. They are described specifically and in detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A fault handling method for a redundant system, applied to a system management device, the method comprising:
monitoring all processors in the redundant system according to the monitoring link;
in response to detecting a fault processor, determining whether a working server that is currently working is mounted on a first switch chip corresponding to the fault processor;
if so, sending a first unloading instruction to the first switch chip and sending a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor;
and in response to the fault processor restarting successfully, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip, so as to remount the working server.
2. The method of claim 1, wherein monitoring all processors in the redundant system based on the monitoring link comprises:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
in response to one monitoring link corresponding to an alarm processor raising an alarm, continuing to monitor the alarm processor through the other two monitoring links;
in response to two monitoring links corresponding to the alarm processor raising alarms, continuing to monitor the alarm processor through the remaining monitoring link;
and in response to determining that all three monitoring links corresponding to the alarm processor have raised alarms, determining that the alarm processor has failed.
3. The method of claim 1, wherein sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system comprises:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of working servers;
sending the first unloading instruction to the first switch chip;
in response to the first switch chip completing the unloading of the working server, modifying a register configuration of the first switch chip and sending the first mounting instruction to the second switch chip;
and in response to the second switch chip completing the mounting of the working server, sending reset information to the working server.
4. The method of claim 3, wherein the determining the target processor from all the processors according to the number of working servers comprises:
determining the number of idle servers corresponding to each processor;
and determining a processor whose number of idle servers is not smaller than the number of working servers as the target processor.
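The selection rule of claim 4 amounts to a capacity check, which can be sketched as below. The function `select_targets` and its parameter names are hypothetical. Excluding the fault processor itself from the candidates is an assumption of this sketch and is not stated in the claim.

```python
def select_targets(idle_servers, fault_processor, working_count):
    """Return the processors that qualify as target processors.

    idle_servers maps a processor name to its number of idle servers.
    A processor qualifies when that number is not smaller than the number
    of working servers that must be taken over.
    """
    return [
        processor
        for processor, idle in idle_servers.items()
        # Assumption: the fault processor is never its own target.
        if processor != fault_processor and idle >= working_count
    ]
```

With `idle_servers = {"cpu0": 0, "cpu1": 1, "cpu2": 3}`, a fault on `"cpu0"` and two working servers to move, only `"cpu2"` qualifies.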
5. The method of claim 3, wherein the sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip comprises:
sending the second unloading instruction to the second switch chip;
in response to the second switch chip completing the unloading of the working server, sending the second mounting instruction to the first switch chip;
and in response to the first switch chip completing the re-mounting of the working server, modifying the register configuration and sending the reset information to the working server.
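Claims 3 and 5 each describe a chain in which every step is issued only after the previous step is reported complete. The sketch below makes the two orderings explicit. The step tables `FAILOVER` and `FAILBACK`, the function `run_sequence`, and the callable `execute` are hypothetical names. Attributing the register modification in claim 5 to the first switch chip is an assumption based on the antecedent in claim 3, and stopping at the first incomplete step is likewise an assumption.

```python
# Step order of claim 3 (moving the working servers to the second switch chip).
FAILOVER = [
    ("first_chip", "unload"),
    ("first_chip", "modify_register"),
    ("second_chip", "mount"),
    ("working_server", "reset"),
]

# Step order of claim 5 (moving the working servers back to the first switch chip).
FAILBACK = [
    ("second_chip", "unload"),
    ("first_chip", "mount"),
    ("first_chip", "modify_register"),
    ("working_server", "reset"),
]


def run_sequence(steps, execute):
    """Run steps in order and return the list of steps that completed.

    execute is a hypothetical callable (target, action) -> bool that returns
    True once the step has completed. The sequence stops at the first step
    that does not complete, so no later step is ever issued early.
    """
    done = []
    for target, action in steps:
        if not execute(target, action):
            break
        done.append((target, action))
    return done
```

Note the difference between the two tables: in claim 3 the register configuration is modified after the unloading and before the mounting, whereas in claim 5 it is modified after the re-mounting has completed.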
6. A fault handling method for a redundant system, applied to a first switch chip, the method comprising:
in response to receiving a first unloading instruction sent by a system management device, releasing work port resources in a fault processor;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to a working server according to a first high-speed serial computer expansion bus between the first switch chip and the fault processor.
7. A fault handling method for a redundant system, applied to a second switch chip, the method comprising:
in response to receiving a first mounting instruction sent by a system management device, allocating work port resources to a working server according to a second high-speed serial computer expansion bus between the second switch chip and a target processor;
and in response to receiving a second unloading instruction sent by the system management device, releasing the work port resources in the target processor.
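From the switch chip's point of view, claims 6 and 7 reduce to allocating work port resources on a mounting instruction and releasing them on an unloading instruction. The sketch below illustrates that bookkeeping only. The class `SwitchChipPorts`, its methods, and the integer port identifiers are hypothetical, and the model makes no claim about how ports are actually represented on the bus or in the processor.

```python
class SwitchChipPorts:
    """Hypothetical bookkeeping of work port resources behind one switch chip."""

    def __init__(self, ports):
        self.free = list(ports)   # ports not assigned to any working server
        self.assigned = {}        # working server -> port

    def on_mount(self, servers):
        """Handle a mounting instruction: give each server a free port."""
        for server in servers:
            if server in self.assigned:
                continue          # already mounted, nothing to do
            if not self.free:
                raise RuntimeError("no free work port resource")
            self.assigned[server] = self.free.pop(0)

    def on_unload(self, servers):
        """Handle an unloading instruction: release each server's port."""
        for server in servers:
            port = self.assigned.pop(server, None)
            if port is not None:
                self.free.append(port)
```

Under these assumptions, the first switch chip would call `on_unload` for the first unloading instruction and `on_mount` for the second mounting instruction, while the second switch chip would call `on_mount` for the first mounting instruction and `on_unload` for the second unloading instruction.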
8. A fault handling apparatus for a redundant system, applied to a system management device, the apparatus comprising:
a monitoring module, configured to monitor all processors in the redundant system according to a monitoring link;
a determining module, configured to determine, in response to detecting a fault processor, whether a working server that is currently working is mounted on a first switch chip corresponding to the fault processor;
a first sending module, configured to, if so, send a first unloading instruction to the first switch chip and send a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
a second sending module, configured to send a restart instruction to the fault processor in response to the second switch chip completing the mounting of the working server;
and a third sending module, configured to, in response to the fault processor restarting successfully, send a second unloading instruction to the second switch chip and send a second mounting instruction to the first switch chip, so that the first switch chip mounts the working server.
9. An electronic device, comprising:
one or more processors; and a memory associated with the one or more processors, the memory being configured to store program instructions which, when read and executed by the one or more processors, cause the one or more processors to perform the method of any one of claims 1-7.
10. A computer storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN202311013917.0A 2023-08-11 2023-08-11 Fault processing method and device for redundant system, electronic equipment and storage medium Pending CN117112296A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311013917.0A CN117112296A (en) 2023-08-11 2023-08-11 Fault processing method and device for redundant system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117112296A (en) 2023-11-24

Family

ID=88806802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311013917.0A Pending CN117112296A (en) 2023-08-11 2023-08-11 Fault processing method and device for redundant system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117112296A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination