CN117112296A - Fault processing method and device for redundant system, electronic equipment and storage medium - Google Patents
Fault processing method and device for redundant system, electronic equipment and storage medium
- Publication number
- CN117112296A (application CN202311013917.0A)
- Authority
- CN
- China
- Prior art keywords
- switch chip
- processor
- instruction
- sending
- monitoring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/142—Reconfiguring to eliminate the error
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1415—Saving, restoring, recovering or retrying at system level
- G06F11/1438—Restarting or rejuvenating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Hardware Redundancy (AREA)
Abstract
The invention provides a fault handling method and apparatus for a redundant system, an electronic device, and a storage medium, applied to a system management device. The method comprises: monitoring all processors in the redundant system over monitoring links; in response to detecting a fault processor, determining whether a working server that is currently in operation is mounted on a first switch chip corresponding to the fault processor; if so, sending a first unloading instruction to the first switch chip and a first mounting instruction to a second switch chip corresponding to a target processor; in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor; and in response to the fault processor restarting successfully, sending a second unloading instruction to the second switch chip and a second mounting instruction to the first switch chip. When a CPU fails, system services are switched seamlessly to a healthy CPU, which meets the high-reliability requirement of a redundant system.
Description
Technical Field
The present invention relates to the field of fault processing technologies, and in particular, to a fault processing method and apparatus for a redundant system, an electronic device, and a storage medium.
Background
As more and more businesses move online, servers must carry ever more data, and the more data they carry, the greater the risk they bear. Large volumes of data go through continuous interactive computation every day, and data can be lost for many reasons. What matters is that, when a production system fails, data recovery and service takeover can be carried out effectively and quickly so that the system does not stop and business continuity is preserved; this is a problem every enterprise must face. When a server suffers a network attack, an intrusion, a power failure, or an operating error, the data the enterprise has deployed on it may be lost, which has a significant business impact. The purpose of system redundancy is therefore that, when any such accident occurs, the original system can be restored quickly and safely and normal service operation is ensured within a certain range.
Existing dual-CPU or multi-CPU servers are not truly redundant designs. They only ensure that the system does not power off when the main CPU fails, by switching key control rights to the secondary CPU; the devices attached under the main CPU's PCIE data link still go offline, and the user's related processing cannot be completed. One existing scheme uses a CPLD to monitor every CPU module: abnormality monitoring of each CPU module is performed by a third-party CPLD, which controls an electronic switch that switches the management signal link of the intelligent cabinet's management system to the master management module or the slave management module. However, once the master management module fails, the devices attached under it stay offline until the module fault has been handled, and user requests cannot be served. Moreover, while the master device is fault-free the slave device sits idle, which works against device densification and wastes computing resources.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a fault handling method and apparatus, an electronic device, and a storage medium for a redundant system that achieve both high concurrency and high reliability.
In a first aspect, a fault handling method of a redundant system is provided, applied to a system management device, the method including:
monitoring all processors in the redundant system over the monitoring links;
in response to detecting a fault processor, determining whether a working server that is currently in operation is mounted on a first switch chip corresponding to the fault processor;
if so, sending a first unloading instruction to the first switch chip and sending a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor;
and in response to the fault processor restarting successfully, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip so as to re-mount the working server.
In one embodiment, monitoring all processors in the redundant system over the monitoring links includes:
the monitoring links comprise a heartbeat monitoring link, an interrupt alarm link, and an abnormal information routing link;
in response to one monitoring link of an alarm processor raising an alarm, continuing to monitor the alarm processor over the other two monitoring links;
in response to two monitoring links of the alarm processor raising alarms, continuing to monitor the alarm processor over the remaining monitoring link;
and in response to all three monitoring links of the alarm processor raising alarms, determining that the alarm processor has failed.
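The escalation rule above (a processor is declared failed only once all three of its monitoring links have alarmed) can be illustrated with a minimal sketch. The class and member names (`Link`, `ProcessorMonitor`, `report_alarm`) are hypothetical and chosen for illustration; the patent does not specify any software interface.

```python
from dataclasses import dataclass, field
from enum import Enum


class Link(Enum):
    """The three monitoring links named in the text."""
    HEARTBEAT = "heartbeat"
    INTERRUPT_ALARM = "interrupt_alarm"
    ABNORMAL_INFO_ROUTING = "abnormal_info_routing"


@dataclass
class ProcessorMonitor:
    """Hypothetical tracker of which links have alarmed for one processor."""
    alarmed: set = field(default_factory=set)

    def report_alarm(self, link: Link) -> None:
        # Record an alarm raised on one monitoring link.
        self.alarmed.add(link)

    def links_still_watched(self) -> set:
        # Links that have not alarmed yet and continue to be monitored.
        return set(Link) - self.alarmed

    def is_failed(self) -> bool:
        # Only when every link has alarmed is the processor treated as failed.
        return len(self.alarmed) == len(Link)
```

Requiring agreement of all three links trades detection latency for fewer false positives: a single glitching link does not trigger a device switch-over.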
In one embodiment, sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of working servers;
sending the first unloading instruction to the first switch chip;
in response to the first switch chip completing the unloading of the working server, modifying a register configuration of the first switch chip and sending the first mounting instruction to the second switch chip;
and in response to the second switch chip completing the mounting of the working server, sending reset information to the working server.
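The ordering of these steps matters: each instruction is issued only after the previous one has completed. The following is a minimal sketch of that gating, using a fake switch chip that simply records what it is told. `FakeSwitchChip`, `fail_over`, and the register key `"owner"` are assumptions made for illustration, not names from the patent.

```python
from typing import Callable, List


class FakeSwitchChip:
    """Stand-in for a switch chip; records every instruction it receives."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def unload(self, server: str) -> bool:
        self.log.append(f"{self.name}:unload:{server}")
        return True  # the fake always reports completion

    def write_register(self, key: str, value: str) -> None:
        self.log.append(f"{self.name}:register:{key}={value}")

    def mount(self, server: str) -> bool:
        self.log.append(f"{self.name}:mount:{server}")
        return True  # the fake always reports completion


def fail_over(first: FakeSwitchChip, second: FakeSwitchChip, server: str,
              reset_server: Callable[[str], None]) -> bool:
    """Move one working server from the first switch chip to the second."""
    if not first.unload(server):          # first unloading instruction
        return False
    # Register change on the first chip happens only after the unload completes.
    first.write_register("owner", second.name)  # hypothetical register key
    if not second.mount(server):          # first mounting instruction
        return False
    reset_server(server)                  # reset information to the server
    return True
```

With two fakes named `sw1` and `sw2` sharing one log, a call to `fail_over` records the unload, the register change, the mount, and the reset in that order.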
In one embodiment, the determining the target processor from the all processors according to the number of the working servers includes:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
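The selection rule is a simple capacity check. A minimal sketch follows; the function name and the dictionary layout are hypothetical, and excluding the fault processor itself from the candidates is an assumption the text implies but does not state.

```python
from typing import Dict, List


def select_target_processors(idle_servers: Dict[str, int],
                             fault_processor: str,
                             working_count: int) -> List[str]:
    """Return processors whose idle-server count can absorb the workload.

    idle_servers maps a processor name to its number of idle servers.
    The fault processor is skipped (assumption: it cannot be its own target).
    """
    return [
        cpu
        for cpu, idle in idle_servers.items()
        if cpu != fault_processor and idle >= working_count
    ]
```

For example, with idle counts `{"cpu0": 4, "cpu1": 1, "cpu2": 2, "cpu3": 3}`, fault processor `cpu0`, and two working servers to move, the candidates are `cpu2` and `cpu3`.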
In one embodiment, sending the second unloading instruction to the second switch chip and sending the second mounting instruction to the first switch chip includes:
sending the second unloading instruction to the second switch chip;
in response to the second switch chip completing the unloading of the working server, sending the second mounting instruction to the first switch chip;
and in response to the first switch chip completing the re-mounting of the working server, modifying the register configuration and sending the reset information to the working server.
In one embodiment, there is also provided a fault handling method of a redundant system, applied to a first switch chip, the method including:
in response to receiving a first unloading instruction sent by the system management device, releasing the work port resources in the fault processor;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server over the first high-speed serial computer expansion bus (PCIE) between the first switch chip and the fault processor.
In one embodiment, there is also provided a fault handling method of a redundant system, applied to a second switch chip, the method including:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the working server over the second high-speed serial computer expansion bus (PCIE) between the second switch chip and the target processor;
and in response to receiving a second unloading instruction sent by the system management device, releasing the work port resources in the target processor.
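Seen from the switch chip, a mounting instruction allocates a work port to a server and an unloading instruction releases it. The sketch below models only that bookkeeping; `SwitchChipPorts` and its methods are hypothetical names, and real port assignment on a PCIE switch involves hardware configuration that is not modelled here.

```python
from typing import Dict, Iterable, List


class SwitchChipPorts:
    """Hypothetical bookkeeping of work ports on one switch chip."""

    def __init__(self, ports: Iterable[int]) -> None:
        self.free: List[int] = list(ports)
        self.assigned: Dict[str, int] = {}

    def mount(self, server: str) -> int:
        """Handle a mounting instruction: allocate a port to the server."""
        if server in self.assigned:
            return self.assigned[server]  # already mounted, nothing to do
        if not self.free:
            raise RuntimeError("no free work port available")
        port = self.free.pop(0)
        self.assigned[server] = port
        return port

    def unload(self, server: str) -> bool:
        """Handle an unloading instruction: release the server's port."""
        port = self.assigned.pop(server, None)
        if port is None:
            return False  # server was not mounted on this chip
        self.free.append(port)
        return True
```

Making `mount` idempotent and having `unload` report whether anything was released lets the caller gate the next instruction on a clear completion signal.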
In another aspect, there is provided a fault handling apparatus for a redundant system, for use in a system management apparatus, the apparatus comprising:
the monitoring module monitors all processors in the redundant system according to the monitoring link;
the determining module is used for determining whether a working server which works is mounted on a first switch chip corresponding to the fault processor or not in response to the monitoring of the fault processor;
a first sending module, configured to, if so, send a first unloading instruction to the first switch chip and send a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
a second sending module, configured to send a restart instruction to the fault processor in response to the second switch chip completing the mounting of the working server;
and a third sending module, configured to, in response to the fault processor restarting successfully, send a second unloading instruction to the second switch chip and a second mounting instruction to the first switch chip so as to re-mount the working server.
In one embodiment, the monitoring module monitors all processors in the redundant system according to the monitoring link, including:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
And determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
In one embodiment, the sending, by the first sending module, the first offload instruction to the first switch chip and the sending, by the first sending module, the first mount instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
In one embodiment, the determining, by the first sending module, the target processor from the all processors according to the number of working servers includes:
determining the number of idle servers corresponding to each processor;
And determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
In one embodiment, the sending, by the second sending module, of a second unloading instruction to the second switch chip and of a second mounting instruction to the first switch chip includes:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, there is also provided a fault handling apparatus of a redundancy system, applied to a first switch chip, the apparatus including:
the first releasing module is used for responding to the first unloading instruction sent by the system management device and releasing the work port resources in the fault processor;
and the first allocation module is used for, in response to the second mounting instruction sent by the system management device, reallocating the work port resources to the working server over the first high-speed serial computer expansion bus (PCIE) between the first switch chip and the fault processor.
In one embodiment, there is also provided a fault handling apparatus of a redundancy system, applied to a second switch chip, the apparatus including:
the second allocation module is used for responding to the first mounting instruction sent by the system management device and allocating work port resources to the work server according to the second high-speed serial computer expansion bus with the target processor;
and the second releasing module is used for responding to the second unloading instruction sent by the system management device and releasing the work port resources in the target processor.
In yet another aspect, an electronic device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
monitoring all processors in the redundant system according to the monitoring link;
in response to monitoring a fault processor, determining whether a working server which is working is mounted on a first switch chip corresponding to the fault processor;
if yes, a first unloading instruction is sent to the first switch chip, and a first mounting instruction is sent to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor;
and responding to the successful restarting of the fault processor, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip so as to mount the working server.
In one embodiment, the processor, when executing the computer program, performs the steps of:
the monitoring of all processors in the redundant system according to the monitoring link comprises:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
and determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
In one embodiment, the processor, when executing the computer program, performs the steps of:
The sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
In one embodiment, the processor, when executing the computer program, performs the steps of:
said determining a target processor from said all processors according to said number of working servers comprises:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
In one embodiment, the processor, when executing the computer program, performs the steps of:
the sending the second unloading instruction to the second switch chip and the sending the second mounting instruction to the first switch chip include:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, the processor, when executing the computer program, performs the steps of:
releasing work port resources in the fault processor in response to receiving a first unloading instruction sent by the system management device;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server over the first high-speed serial computer expansion bus (PCIE) between the first switch chip and the fault processor.
In one embodiment, the processor, when executing the computer program, performs the steps of:
In response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the work server according to a second high-speed serial computer expansion bus with the target processor;
and responding to receiving a second unloading instruction sent by the system management device, and releasing the work port resources in the target processor.
In yet another aspect, a computer readable storage medium is provided, having stored thereon a computer program which when executed by a processor performs the steps of:
monitoring all processors in the redundant system according to the monitoring link;
in response to monitoring a fault processor, determining whether a working server which is working is mounted on a first switch chip corresponding to the fault processor;
if yes, a first unloading instruction is sent to the first switch chip, and a first mounting instruction is sent to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor;
and responding to the successful restarting of the fault processor, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip so as to mount the working server.
In one embodiment, the computer program when executed by a processor performs the steps of:
the monitoring of all processors in the redundant system according to the monitoring link comprises:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
and determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
In one embodiment, the computer program when executed by a processor performs the steps of:
the sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
Sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
In one embodiment, the computer program when executed by a processor performs the steps of:
said determining a target processor from said all processors according to said number of working servers comprises:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
In one embodiment, the computer program when executed by a processor performs the steps of:
the sending the second unloading instruction to the second switch chip and the sending the second mounting instruction to the first switch chip include:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
And modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, the computer program when executed by a processor performs the steps of:
releasing work port resources in the fault processor in response to receiving a first unloading instruction sent by the system management device;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server over the first high-speed serial computer expansion bus (PCIE) between the first switch chip and the fault processor.
In one embodiment, the computer program when executed by a processor performs the steps of:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the work server according to a second high-speed serial computer expansion bus with the target processor;
and responding to receiving a second unloading instruction sent by the system management device, and releasing the work port resources in the target processor.
In the above method, all processors in the redundant system are monitored over the monitoring links. In response to detecting a fault processor, it is determined whether a working server that is currently in operation is mounted on a first switch chip corresponding to the fault processor. If so, a first unloading instruction is sent to the first switch chip and a first mounting instruction is sent to a second switch chip corresponding to a target processor in the redundant system, so that the working server is mounted to the target processor. In response to the second switch chip completing the mounting of the working server, a restart instruction is sent to the fault processor; in response to the fault processor restarting successfully, a second unloading instruction is sent to the second switch chip and a second mounting instruction is sent to the first switch chip, so that the working server is re-mounted to the repaired fault processor. The CPUs work simultaneously to meet high-concurrency computing demand, and when one CPU goes down, system services can be switched seamlessly to another CPU, so the server system achieves highly concurrent data computation while meeting the high-reliability requirement.
Drawings
FIG. 1 is a system topology of a fault handling method for a redundant system;
FIG. 2 is a schematic diagram illustrating steps of a fault handling method for a redundant system of a system management device;
FIG. 3 is a system topology of a multiple switch chip interconnect system;
FIG. 4 is a schematic diagram of a failure handling device of a redundant system applied to a system management device;
FIG. 5 is an internal structural diagram of a computer device in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The system management device may consist of a BMC, an mCPU (Management CPU, which acts as the management center in the server), and a CPLD. The BMC monitors the working state of the CPU modules and notifies the mCPU of a fault state through an LPC/IIC (inter-integrated circuit bus) signal. The mCPU can then communicate with the PCIE FabricSwitch through UART (universal asynchronous receiver-transmitter, a serial data bus for asynchronous communication whose bidirectional operation supports full-duplex transmission and reception), and the configuration is modified over the PCIE link. A reset is then issued through the CPLD, and once the reset completes the fault state is reported to the user. Whenever a device unload/mount has to be carried out again, the mCPU likewise communicates with the PCIE FabricSwitch through UART and modifies the register configuration; this is not repeated below.
In one embodiment, as shown in FIG. 2, the present invention provides a fault handling method of a redundant system, applied to a system management device, the method including:
S201, monitoring all processors in the redundant system over the monitoring links;
S202, in response to detecting a fault processor, determining whether a working server that is currently in operation is mounted on a first switch chip corresponding to the fault processor;
S203, if so, sending a first unloading instruction to the first switch chip and sending a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
S204, in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor;
S205, in response to the fault processor restarting successfully, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip so as to re-mount the working server.
Specifically, the redundant system may include a plurality of CPUs and a PCIE FabricSwitch corresponding to each CPU. In the normal working state, each CPU is interconnected with devices through PCIE links and its PCIE FabricSwitch, and performs tasks such as data processing, control management and high-performance computing. Because the devices use and access the computing resources of a CPU through its PCIE FabricSwitch, a device mounted on a PCIE FabricSwitch is equivalent to a device mounted on the corresponding CPU. Each CPU is connected to the processing device by one heartbeat monitoring signal (mcpu_heartError), one abnormal-interrupt alarm signal (SMI_GPIO) and one abnormal-information routing signal (MDI), and the processing device monitors the working state of every CPU in real time over the heartbeat monitoring link, the interrupt alarm link and the abnormal information routing link. When the processing device detects that a CPU has failed, it first determines whether a working device is mounted under the failed CPU. If all mounted devices are idle, a restart instruction is sent directly to the failed CPU, and restarting it does not affect the mounted devices. If working server devices are mounted, they must first be switched to another CPU, and they are switched back after the failed CPU has restarted and resumed normal operation. Thus, when there is no fault, the CPUs work simultaneously to provide high computing power for high concurrency, and when one CPU fails, a switchover that is imperceptible to the downstream devices keeps them working normally.
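The order of steps S202 to S205 can be sketched as follows. This is only an illustrative model, not the implementation of the application: the `Switch` class, the `handle_fault` function and the log strings are hypothetical names, and the restart of the failed CPU is assumed to succeed immediately.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Hypothetical model of one PCIE FabricSwitch and its working devices."""
    name: str
    working: list = field(default_factory=list)  # devices currently doing work


def handle_fault(first: Switch, second: Switch, log: list) -> None:
    """Replay steps S202-S205 for a failed CPU behind `first`.

    Assumption: the restart always succeeds, so the switch-back runs at once.
    """
    if not first.working:              # S202: no working server is mounted
        log.append("restart")          # restart the failed CPU directly
        return
    moved = list(first.working)
    first.working.clear()              # S203: first unloading instruction
    log.append(f"unload {first.name}")
    second.working.extend(moved)       # S203: first mounting instruction
    log.append(f"mount {second.name}")
    log.append("restart")              # S204: restart the failed CPU
    for dev in moved:                  # S205: second unloading instruction
        second.working.remove(dev)
    log.append(f"unload {second.name}")
    first.working.extend(moved)        # S205: second mounting instruction
    log.append(f"mount {first.name}")
```

In this sketch the working devices end up back on the first switch, and the log records the instruction order that the method prescribes.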
In one embodiment, monitoring all processors in the redundant system according to the monitoring link includes:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
in response to one monitoring link corresponding to an alarm processor raising an alarm, continuing to monitor the alarm processor over the other two monitoring links;
in response to two monitoring links corresponding to the alarm processor raising alarms, continuing to monitor the alarm processor over the remaining monitoring link;
and in response to determining that all three monitoring links corresponding to the alarm processor have raised alarms, determining that the alarm processor has failed.
Specifically, each CPU is connected to the processing device by one heartbeat monitoring signal (mcpu_heartError), one abnormal-interrupt alarm signal (SMI_GPIO) and one abnormal-information routing signal (MDI), and the processing device monitors the working state of each CPU module in real time over these three links. To prevent a misjudgment caused by interference on a single link, the processing device waits for all three monitoring feedback signals of a CPU, and determines that the CPU is in a fault state only when all three (MDI, mcpu_heartError, SMI_GPIO) have alarmed.
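The three-link confirmation rule can be illustrated with a small sketch. The function names and the dictionary representation of the alarm state are assumptions made for illustration; only the three signal names come from the description.

```python
# The three monitoring signals named in the description.
LINKS = ("mcpu_heartError", "SMI_GPIO", "MDI")


def cpu_failed(alarms: dict) -> bool:
    """Return True only when every one of the three links has alarmed."""
    return all(alarms.get(link, False) for link in LINKS)


def links_still_watched(alarms: dict) -> list:
    """Links that have not alarmed yet and must keep being monitored."""
    return [link for link in LINKS if not alarms.get(link, False)]
```

With one or two alarms the CPU is not yet declared faulty and the remaining links continue to be watched, which is what protects against interference on a single link.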
In one embodiment, the sending the first offload instruction to the first switch chip and sending the first mount instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers of the working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
Specifically, assuming that CPU0 fails, the processing device sends an unloading instruction to PCIE FabricSwitch0 (the first switch chip) and then changes the internal register configuration of PCIE FabricSwitch0, so that PCIE FabricSwitch0 allows the three server devices mounted under it to be mounted under PCIE FabricSwitch1 (the second switch chip) through the interconnected ports. At this point, the three devices originally mounted under CPU0 are mounted under CPU1. After the unloading and mounting operations are completed, the processing device sends a PERST signal to the corresponding working server devices; after the devices are reset, they are all mounted under CPU1 and work normally.
In one embodiment, the determining the target processor from the all processors according to the number of the working servers includes:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
Specifically, because the redundant system includes a plurality of CPUs and a PCIE FabricSwitch (that is, a switch chip) corresponding to each CPU, a target CPU must be determined when CPU0 fails. For example, suppose four devices are mounted on PCIE FabricSwitch0 corresponding to the failed CPU0 and only three of them are working, so the number of working servers is 3. The number of idle servers mounted on the PCIE FabricSwitch of the target CPU must then be not less than 3. The more idle servers a PCIE FabricSwitch has, the more resources its CPU has available to allocate to the three working servers of the failed CPU0. Therefore, when selecting the target CPU, the CPU whose PCIE FabricSwitch has the largest number of mountable servers and the largest number of mounted idle servers can be selected. A larger number of mountable servers indicates stronger CPU performance, and a larger number of currently mounted idle servers indicates that more processor resources can be allocated.
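A minimal sketch of this selection rule follows. The function name, the `(mountable, idle)` tuple layout and the choice to rank by mountable count first and idle count second are assumptions; the description only states that the idle count must not be smaller than the working count and that larger values of both are preferred.

```python
def select_target(working_count: int, candidates: dict):
    """Pick a target CPU for the working servers of a failed CPU.

    candidates maps a CPU name to (mountable_servers, idle_servers).
    Returns None when no CPU has enough idle servers.
    """
    # Keep only CPUs whose idle server count covers the working servers.
    eligible = {cpu: caps for cpu, caps in candidates.items()
                if caps[1] >= working_count}
    if not eligible:
        return None
    # Assumed tie-break: most mountable servers first, then most idle servers.
    return max(eligible, key=lambda cpu: eligible[cpu])
```

For the example in the text, with three working servers, a candidate with only two idle servers is excluded even if it can mount many servers.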
In one embodiment, the sending the second offload instruction to the second switch chip and the sending the second mount instruction to the first switch chip include:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
Specifically, after the working server devices have been mounted, the processing device restarts the failed CPU0, and when CPU0 has restarted successfully, the upper-layer user device is notified that the restart is complete. While CPU0 is restarting, all devices work normally through CPU1 and service processing is not affected. When the failed module has restarted successfully and PCIE FabricSwitch0 has reconnected to CPU0, an unloading instruction is sent to PCIE FabricSwitch1, which releases its port resources so that the three devices mounted under CPU1 are unloaded. A mounting instruction is then sent to PCIE FabricSwitch0, which allocates the task resources in CPU0 to the currently working devices over its PCIE link with CPU0. The register configuration of FabricSwitch0 is then restored, that is, servers attached under CPU0 are no longer allowed to be mounted under CPU1 through the port interconnection between FabricSwitch0 and FabricSwitch1. Finally, reset information is sent to the devices, which formally start working after the reset.
In one embodiment, there is also provided a fault handling method of a redundancy system, applied to a first switch chip, the method including:
releasing work port resources in the fault processor in response to receiving a first unloading instruction sent by the system management device;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server over the first high-speed serial computer expansion bus (PCIE) between the first switch chip and the fault processor.
Specifically, on receiving the first unloading instruction sent by the processing device, PCIE FabricSwitch0 releases the work port resources corresponding to the three ports S4, S5 and S6, thereby unloading the three devices mounted under CPU0. After the processing device modifies the register configuration of PCIE FabricSwitch0, the devices originally mounted under CPU0 are mounted under CPU1 through port communication with PCIE FabricSwitch1. Later, on receiving the second mounting instruction sent by the processing device, PCIE FabricSwitch0 reallocates the work port resources in CPU0 to the working devices over the PCIE link that has been successfully re-established with CPU0.
In one embodiment, there is also provided a fault handling method of a redundant system, applied to a second switch chip, the method including:
In response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the work server according to a second high-speed serial computer expansion bus with the target processor;
and responding to receiving a second unloading instruction sent by the system management device, and releasing the work port resources in the target processor.
Specifically, as described above, on receiving the first mounting instruction sent by the processing device, PCIE FabricSwitch1 mounts the devices originally mounted under CPU0 onto CPU1 through port communication with PCIE FabricSwitch0 and the PCIE link between itself and CPU1, so that the work port resources in CPU1 can be allocated to the three working devices. On receiving the second unloading instruction sent by the processing device, PCIE FabricSwitch1 releases the work port resources originally allocated to the three working devices, thereby unloading them.
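The release and allocation behaviour of the two switch chips can be modelled roughly as below. The `SwitchChip` class, its method names and the device names are hypothetical; real switch chips expose this through register configuration, which the sketch does not attempt to reproduce. Only the port names S4, S5 and S6 are taken from the description.

```python
class SwitchChip:
    """Hypothetical model of a switch chip that tracks port allocations."""

    def __init__(self, name: str, ports: list):
        self.name = name
        self.ports = {port: None for port in ports}  # port -> device or None

    def on_unload(self) -> list:
        """Unloading instruction: release every allocated work port."""
        released = [dev for dev in self.ports.values() if dev is not None]
        for port in self.ports:
            self.ports[port] = None
        return released

    def on_mount(self, devices: list) -> None:
        """Mounting instruction: allocate free work ports to the devices."""
        free = [port for port, dev in self.ports.items() if dev is None]
        if len(free) < len(devices):
            raise ValueError(f"{self.name} has too few free ports")
        for port, dev in zip(free, devices):
            self.ports[port] = dev
```

A switch-over is then an `on_unload` on the first chip followed by an `on_mount` on the second, and the switch-back is the same pair in the opposite direction.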
Fig. 3 is a topology diagram of the interconnection of multiple switch chips. SW0, SW2, SW4 and SW6 in column A are regarded as the upstream SWs in a 2×4 topology, and SW1, SW3, SW5 and SW7 in column B are regarded as the downstream SWs. Through the interconnection of multiple switch chips, more upstream hosts and more downstream devices can be connected, and redundant backup and switching can be performed between hosts and devices. The scheme can be applied to a cluster server or a data center to improve its stability and efficiency.
The scheme of the application has the following beneficial effects:
1) Under the present redundant system, the devices are not divided into master and slave devices, so they can work simultaneously to meet the requirement of high-concurrency computing; meanwhile, when one CPU goes down, the system service can be switched seamlessly to another CPU, so that the server system achieves high-concurrency data computing while also meeting the requirement of high reliability;
2) In a master-slave arrangement, the slave device remains idle until the master device, such as a CPU, fails, which works against the demand for high device density; the present scheme effectively solves this waste of computing resources.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed in sequence, as they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, a fault handling apparatus of a redundant system is applied to a system management apparatus, the apparatus includes:
the monitoring module 401 monitors all processors in the redundant system according to the monitoring link;
a determining module 402, configured to determine, in response to monitoring a fault processor, whether a working server that is working is mounted on a first switch chip corresponding to the fault processor;
a first sending module 403, configured to, if so, send a first unloading instruction to the first switch chip and send a first mounting instruction to a second switch chip corresponding to the target processor in the redundant system;
a second sending module 404, configured to send a restart instruction to the failure processor in response to the second switch chip mounting the working server being completed;
and the third sending module 405 is configured to send a second unloading instruction to the second switch chip and send a second mount instruction to the first switch chip in response to the restart success of the fault processor, so as to mount the working server.
In one embodiment, the monitoring module monitors all processors in the redundant system according to the monitoring link, including:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
and determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
In one embodiment, the sending, by the first sending module, the first offload instruction to the first switch chip and the sending, by the first sending module, the first mount instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers of the working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
In one embodiment, the determining, by the first sending module, the target processor from the all processors according to the number of working servers includes:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
In one embodiment, the sending, by the third sending module, of a second unloading instruction to the second switch chip and of a second mounting instruction to the first switch chip includes:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, there is also provided a fault handling apparatus of a redundancy system, applied to a first switch chip, the apparatus including:
the first releasing module is configured to release the work port resources in the fault processor in response to receiving the first unloading instruction sent by the system management device;
and the first allocation module is configured to, in response to receiving the second mounting instruction sent by the system management device, reallocate the work port resources to the working server over the first high-speed serial computer expansion bus (PCIE) between the first switch chip and the fault processor.
In one embodiment, there is also provided a fault handling apparatus of a redundancy system, applied to a second switch chip, the apparatus including:
the second allocation module is used for responding to the first mounting instruction sent by the system management device and allocating work port resources to the work server according to the second high-speed serial computer expansion bus with the target processor;
and the second releasing module is used for responding to the second unloading instruction sent by the system management device and releasing the work port resources in the target processor.
For specific limitations on the fault handling means of the redundant system, reference may be made to the above limitation on the fault handling method of the redundant system, and no further description is given here. The respective modules in the fault handling apparatus of the redundant system described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a fault handling method of a redundant system. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, may be keys, a trackball or a touch pad arranged on the housing of the computer device, or may be an external keyboard, touch pad or mouse.
It will be appreciated by those skilled in the art that the structure shown in FIG. 5 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, an electronic device is provided that includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of:
monitoring all processors in the redundant system according to the monitoring link;
in response to monitoring a fault processor, determining whether a working server which is working is mounted on a first switch chip corresponding to the fault processor;
if so, sending a first unloading instruction to the first switch chip and sending a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor;
and in response to the fault processor restarting successfully, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip.
In one embodiment, the processor, when executing the computer program, performs the steps of:
the monitoring of all processors in the redundant system according to the monitoring link comprises:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
and determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
In one embodiment, the processor, when executing the computer program, performs the steps of:
the sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers of the working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
In one embodiment, the processor, when executing the computer program, performs the steps of:
said determining a target processor from said all processors according to said number of working servers comprises:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
In one embodiment, the processor, when executing the computer program, performs the steps of:
the sending the second unloading instruction to the second switch chip and the sending the second mounting instruction to the first switch chip include:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, the processor, when executing the computer program, performs the steps of:
releasing work port resources in the fault processor in response to receiving a first unloading instruction sent by the system management device;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server over the first high-speed serial computer expansion bus (PCIE) between the first switch chip and the fault processor.
In one embodiment, the processor, when executing the computer program, performs the steps of:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the work server according to a second high-speed serial computer expansion bus with the target processor;
and responding to receiving a second unloading instruction sent by the system management device, and releasing the work port resources in the target processor.
In one embodiment, a computer readable storage medium is provided having stored thereon a computer program which when executed by a processor performs the steps of:
monitoring all processors in the redundant system according to the monitoring link;
in response to monitoring a fault processor, determining whether a working server which is working is mounted on a first switch chip corresponding to the fault processor;
if so, sending a first unloading instruction to the first switch chip and sending a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor;
and responding to the successful restarting of the fault processor, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip.
In one embodiment, the computer program when executed by a processor performs the steps of:
the monitoring of all processors in the redundant system according to the monitoring link comprises:
the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link and an abnormal information routing link;
responding to the alarm of one monitoring link corresponding to the alarm processor, and continuously monitoring the alarm processor according to the other two monitoring links;
responding to the two monitoring links corresponding to the alarm processor to alarm, and continuously monitoring the alarm processor according to the other monitoring link;
and determining that the alarm processor fails in response to determining that three monitoring links corresponding to the alarm processor alarm.
In one embodiment, the computer program when executed by a processor performs the steps of:
the sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system includes:
determining the number of working servers of the working servers corresponding to the fault processor;
determining a target processor from all the processors according to the number of the working servers;
sending the first unloading instruction to the first switch chip;
responsive to the first switch chip unloading the working server being completed, modifying a register configuration of the first switch chip and sending a first mounting instruction to the second switch chip;
and responding to the completion of the second switch chip mounting of the working server, and sending reset information to the working server.
In one embodiment, the computer program when executed by a processor performs the steps of:
said determining a target processor from said all processors according to said number of working servers comprises:
determining the number of idle servers corresponding to each processor;
and determining the processors with the number of the idle servers not smaller than the number of the working servers as the target processors.
In one embodiment, the computer program when executed by a processor performs the steps of:
the sending the second unloading instruction to the second switch chip and the sending the second mounting instruction to the first switch chip include:
sending the second unloading instruction to the second switch chip;
responding to the completion of unloading the working server by the second switch chip, and sending a second mounting instruction to the first switch chip;
and modifying the register configuration and sending the reset information to the working server in response to the completion of the re-mounting of the working server by the first switch chip.
In one embodiment, the computer program when executed by a processor performs the steps of:
releasing work port resources in the fault processor in response to receiving a first unloading instruction sent by the system management device;
and in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server over the first high-speed serial computer expansion bus (PCIE) between the first switch chip and the fault processor.
In one embodiment, the computer program when executed by a processor performs the steps of:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the work server according to a second high-speed serial computer expansion bus with the target processor;
and responding to receiving a second unloading instruction sent by the system management device, and releasing the work port resources in the target processor.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of this description.
The above examples illustrate only a few embodiments of the application. They are described in some detail, but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (10)
1. A fault handling method for a redundant system, applied to a system management device, the method comprising:
monitoring all processors in the redundant system according to a monitoring link;
in response to detecting a fault processor, determining whether a working server that is in operation is mounted on a first switch chip corresponding to the fault processor;
if so, sending a first unloading instruction to the first switch chip, and sending a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
in response to the second switch chip completing the mounting of the working server, sending a restart instruction to the fault processor; and
in response to the fault processor restarting successfully, sending a second unloading instruction to the second switch chip and sending a second mounting instruction to the first switch chip, so as to mount the working server.
2. The method of claim 1, wherein the monitoring link comprises a heartbeat monitoring link, an interrupt alarm link, and an abnormal information routing link, and monitoring all processors in the redundant system according to the monitoring link comprises:
in response to one monitoring link corresponding to an alarm processor raising an alarm, continuing to monitor the alarm processor according to the other two monitoring links;
in response to two monitoring links corresponding to the alarm processor raising alarms, continuing to monitor the alarm processor according to the remaining monitoring link; and
in response to determining that all three monitoring links corresponding to the alarm processor have raised alarms, determining that the alarm processor has failed.
3. The method of claim 1, wherein sending the first unloading instruction to the first switch chip and sending the first mounting instruction to the second switch chip corresponding to the target processor in the redundant system comprises:
determining the number of working servers corresponding to the fault processor;
determining the target processor from all the processors according to the number of working servers;
sending the first unloading instruction to the first switch chip;
in response to the first switch chip completing the unloading of the working server, modifying a register configuration of the first switch chip and sending the first mounting instruction to the second switch chip; and
in response to the second switch chip completing the mounting of the working server, sending reset information to the working server.
4. The method of claim 3, wherein determining the target processor from all the processors according to the number of working servers comprises:
determining the number of idle servers corresponding to each processor; and
determining a processor whose number of idle servers is not smaller than the number of working servers as the target processor.
5. The method of claim 3, wherein sending the second unloading instruction to the second switch chip and sending the second mounting instruction to the first switch chip comprises:
sending the second unloading instruction to the second switch chip;
in response to the second switch chip completing the unloading of the working server, sending the second mounting instruction to the first switch chip; and
in response to the first switch chip completing the re-mounting of the working server, modifying the register configuration and sending the reset information to the working server.
6. A fault handling method for a redundant system, applied to a first switch chip, the method comprising:
in response to receiving a first unloading instruction sent by the system management device, releasing work port resources in the fault processor; and
in response to receiving a second mounting instruction sent by the system management device, reallocating the work port resources to the working server according to the first high-speed serial computer expansion bus between the work port resources and the fault processor.
7. A fault handling method for a redundant system, applied to a second switch chip, the method comprising:
in response to receiving a first mounting instruction sent by the system management device, allocating work port resources to the working server according to a second high-speed serial computer expansion bus with the target processor; and
in response to receiving a second unloading instruction sent by the system management device, releasing the work port resources in the target processor.
8. A fault handling device for a redundant system, applied to a system management device, the device comprising:
a monitoring module, configured to monitor all processors in the redundant system according to a monitoring link;
a determining module, configured to determine, in response to detecting a fault processor, whether a working server that is in operation is mounted on a first switch chip corresponding to the fault processor;
a first sending module, configured to, if so, send a first unloading instruction to the first switch chip and send a first mounting instruction to a second switch chip corresponding to a target processor in the redundant system;
a second sending module, configured to send a restart instruction to the fault processor in response to the second switch chip completing the mounting of the working server; and
a third sending module, configured to, in response to the fault processor restarting successfully, send a second unloading instruction to the second switch chip and send a second mounting instruction to the first switch chip, so as to mount the working server.
9. An electronic device, comprising:
one or more processors; and a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the method of any one of claims 1-7.
10. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-7.
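The control flow recited in claims 1 to 5 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the applicant's implementation: every name (`Processor`, `is_faulty`, `pick_target`, `fail_over`, `fail_back`, the link labels, and the log strings) is hypothetical, the switch chips are modeled only as log entries, and the register configuration and reset steps of claims 3 and 5 are omitted. The claims do not say what happens when no working server is mounted or when no suitable target exists; the sketch simply returns `None` in those cases, and it also assumes the fault processor itself is never chosen as the target.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the three monitoring links named in claim 2.
LINKS = ("heartbeat", "interrupt", "routing")


def is_faulty(alarms: set) -> bool:
    """Claim 2: a processor is deemed failed only when all three links alarm."""
    return set(LINKS) <= alarms


@dataclass
class Processor:
    name: str
    idle_servers: int = 0
    working_servers: list = field(default_factory=list)


def pick_target(processors, faulty):
    """Claim 4: pick a processor whose idle-server count is not smaller
    than the number of working servers that must be moved."""
    needed = len(faulty.working_servers)
    for p in processors:
        # Assumption: the fault processor is excluded from the candidates.
        if p is not faulty and p.idle_servers >= needed:
            return p
    return None


def fail_over(processors, faulty, log):
    """Claims 1 and 3: unload from the first switch chip, mount on the
    second switch chip, then restart the fault processor."""
    if not faulty.working_servers:
        return None  # assumption: nothing to migrate
    target = pick_target(processors, faulty)
    if target is None:
        return None  # assumption: no processor has enough idle servers
    servers = list(faulty.working_servers)
    log.append(f"unload#1 -> switch({faulty.name})")
    faulty.working_servers.clear()
    log.append(f"mount#1 -> switch({target.name})")
    target.working_servers.extend(servers)
    log.append(f"restart -> {faulty.name}")
    return target, servers


def fail_back(faulty, target, servers, log):
    """Claims 1 and 5: after a successful restart, unload from the second
    switch chip and re-mount on the first switch chip."""
    log.append(f"unload#2 -> switch({target.name})")
    for s in servers:
        target.working_servers.remove(s)
    log.append(f"mount#2 -> switch({faulty.name})")
    faulty.working_servers.extend(servers)
```

In this sketch the ordering of the log entries mirrors the ordering the claims require: the first mounting instruction is issued only after the first unloading instruction, and the restart only after the mount on the second switch chip.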
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311013917.0A CN117112296A (en) | 2023-08-11 | 2023-08-11 | Fault processing method and device for redundant system, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311013917.0A CN117112296A (en) | 2023-08-11 | 2023-08-11 | Fault processing method and device for redundant system, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117112296A (en) | 2023-11-24 |
Family
ID=88806802
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311013917.0A Pending CN117112296A (en) | 2023-08-11 | 2023-08-11 | Fault processing method and device for redundant system, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117112296A (en) |
2023
- 2023-08-11: CN application CN202311013917.0A (CN117112296A), active, Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||