WO2021073105A1 - Dual-computer hot standby system - Google Patents
- Publication number
- WO2021073105A1 (PCT/CN2020/092835)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- host device
- backup
- bus
- data
- party
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/42—Bus transfer protocol, e.g. handshake; Synchronisation
- G06F13/4282—Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
Definitions
- the invention relates to the field of server technology.
- the invention further relates to a dual-machine hot backup system.
- the main and backup equipment are two independent operating entities, and data differences will inevitably occur when the two operate sequentially or at the same time. Even if the LAN returns to normal after a period of time, the two monitors at this time do not know whose data is the latest, and who should synchronize with whom, so a split-brain situation may occur.
- a strategy of secondary verification through the serial port is added, so that the standby machine can still determine the true operating state of the host after the LAN is interrupted.
- the standby machine can still check that the host is alive through the serial port.
- the standby machine does not take over the work of the host and is always in a standby working state.
- data synchronization between the main and standby devices cannot be performed through the LAN, and the user cannot remotely control and access the host device through the LAN.
- the data queried is not the latest data.
- the present invention proposes a dual-machine hot backup system based on the above objective, wherein the system includes:
- a standby device which communicates with the host device through the first I3C bus;
- at least one slave device, which is in communication connection with the host device and the standby device through the second I3C bus;
- the host device is configured to, in response to the startup of the dual-system hot backup system, collect the parameters of the slave device, store the parameter mapping in the database and synchronize it to the standby device via the first I3C bus, and is configured to manage the slave device through the second I3C bus based on a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the standby device through the first I3C bus;
- the standby device is configured to, in response to receiving a management instruction from the outside, forward the received management instruction to the host device through the first I3C bus.
- the backup device is further configured to: in response to receiving an emergency management instruction from the outside, forcibly and temporarily occupy the second I3C bus to manage the slave device, generate a parameter mapping according to the changed parameters, and synchronize the emergency management instruction and the mapping of the changed parameters to the host device through the first I3C bus.
- the host device is further configured to: in response to the host device entering the upgrade mode and/or the resource occupation exceeding the threshold, notify the standby device through the first I3C bus to temporarily take over the management of the slave device; and in response to the host device exiting the upgrade mode and/or the resource occupation no longer exceeding the threshold, notify the standby device through the first I3C bus to stop taking over the management of the slave device.
- the backup device is further configured to: in response to receiving a temporary takeover notification from the host device, manage the slave device through the second I3C bus and generate a parameter mapping according to the changed parameters; and in response to receiving a notification from the host device to stop taking over, stop managing the slave device and synchronize the mapping of the changed parameters to the host device through the first I3C bus.
- the backup device is further configured to: in response to the dual-system hot backup system being activated, actively initiate a clock synchronization request to the host device.
- the host device and the backup device are further configured to: in response to either of the host device and the backup device initiating the synchronization of the parameter mapping, the initiator generates synchronized packaged data including the original data, the modified data and the modification time, and sends it to the other party.
- the host device and the backup device are further configured as:
- the modification time in the received synchronized packaged data is compared with the local modification time, and the modified data with the newer modification time is used as the standard for synchronization.
- the host device and the backup device are further configured to detect the operating state of each other through a two-way physical IO heartbeat detection mechanism.
- the host device and the backup device are further configured to: in response to either of the host device and the backup device detecting that the other party has failed, the non-faulty party records the failure of the faulty party in the log, resets the faulty party through the external dual reset mechanism, and synchronizes the clock and database to the faulty party.
- the host device and the backup device are further configured to: in response to the non-faulty party being unable to reset the faulty party through the external dual reset mechanism and/or the reset failing, the non-faulty party sends an alarm to notify the operation and maintenance personnel to deal with it.
- the present invention has at least the following beneficial effects: the I2C bus and serial port used in current multi-node server solutions have a low communication rate and cannot meet the requirements of interaction and synchronization between master and backup, so data synchronization has to rely heavily on an external LAN and switches. To address this problem, the I3C bus is used to establish the internal communication architecture of the multi-node server dual-system hot standby system, with two I3C buses constructing the master-backup and master-slave communication architectures in the system.
- the host device collects the status of the slave device through the I3C bus, establishes a mapping and stores it in the database, so that, for example, when the user needs the operating parameters of a certain slave device, the host device does not have to fetch them from the slave device in response to the user's instruction, but can directly feed back the relevant information recorded in the database to the user.
- the host device will synchronize the corresponding information to the standby device to ensure the consistency of data in the main and standby devices.
- the backup device is also allowed to be directly accessed by external parties, such as users, to issue instructions.
- the backup device will forward the management instruction to the host device through the I3C bus so that the host device can manage the slave device accordingly.
- using the dual-machine hot backup system of the present invention not only improves the efficiency of the internal communication of the system, but also avoids the reliability problems caused by relying on an external LAN and external switches, ensures the data consistency of the main and backup devices, and to a certain extent develops and utilizes the resources of the backup device, thereby further improving the safety and operating efficiency of the entire multi-node server system.
- Figure 1 shows a schematic diagram of an embodiment of a dual-machine hot backup system according to the present invention
- Figure 2 shows a schematic diagram of another embodiment of the dual-machine hot backup system according to the present invention.
- Fig. 3 shows a schematic structural diagram of an embodiment of the host device and the standby device of the dual-system hot backup system according to the present invention.
- Fig. 1 shows a schematic diagram of an embodiment of a dual-machine hot backup system 100 according to the present invention.
- the dual-system hot backup system 100 according to the present invention is especially used for the control and management of a multi-node server system.
- the dual-machine hot backup system 100 at least includes:
- a backup device 20 which is in communication connection with the host device 10 through the first I3C bus 30;
- at least one slave device 40, which is in communication connection with the host device 10 and the backup device 20 through the second I3C bus 50;
- the host device 10 is configured to collect the parameters of the slave device 40 in response to the startup of the dual-machine hot backup system 100, store the parameter mapping in the database and synchronize it to the backup device 20 through the first I3C bus 30, and is configured to manage the slave device 40 through the second I3C bus 50 based on a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device 20 through the first I3C bus 30;
- the backup device 20 is configured to, in response to receiving a management instruction from the outside, forward the received management instruction to the host device 10 through the first I3C bus 30.
- the host device 10 and the backup device 20 are CMC devices (Chassis Management Controller). Similar to a BMC, a CMC manages and controls the whole machine in multi-node server systems such as blade servers, and can send commands to each node for management.
- the slave device 40 is preferably a BMC (Baseboard Management Controller, baseboard management controller) in a multi-node server system. The BMC can perform some operations on the machine such as firmware upgrade, viewing machine equipment, and so on when the machine is not turned on.
- the dual-machine hot backup system 100 uses an I3C bus for communication connection.
- I3C is a two-wire serial communication bus that integrates the key attributes of the I2C and SPI buses and is compatible with the I2C protocol. It adds features such as multiple masters, slave soft interrupts, dynamic allocation of slave addresses, and support for hot swapping, and its speed can be as high as 33Mbps. It is usually used to connect sensors to an application processor. Further, to separate master-backup communication from master-slave management, avoiding mutual interference and reducing bus pressure, the first I3C bus 30 is used between master and backup (10 and 20), and the second I3C bus 50 is used between master and slave (10 and 40, and 20 and 40).
- the dual-computer hot backup system 100 can complete all the functions of the system originally constructed by the I2C bus. And on top of this, the dual-machine hot backup system 100 according to the present invention adds new functions.
- the host device 10 is configured to collect the parameters of the slave device 40 in response to the startup of the dual-machine hot backup system 100, store the mapping of the parameters in the database, and synchronize the parameters to the backup device 20 through the first I3C bus 30.
- the CMC host device 10 checks the BMC slave device 40 on each node after it is powered on, and establishes a parameter mapping database based on the node device serial number, various operating parameters and other parameters.
- the CMC host device 10 maps the parameters of all node BMC slave devices to the CMC host device 10.
- when the user requests the parameters of a certain slave device 40, the user only needs to access the CMC host device 10 to obtain the corresponding parameters of all the slave devices 40. This avoids the CMC host device 10 having to read the parameters from the corresponding slave device 40 at query time, thereby speeding up the response to user instructions.
- the host device 10 synchronizes the database to the standby device 20 through the first I3C bus 30 to ensure data consistency between the main and standby devices.
- the host device 10 is configured to manage the slave device 40 through the second I3C bus 50 based on management instructions, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the standby device 20 through the first I3C bus 30. That is to say, when, for example, a user accesses the host device 10 from the outside to issue a management instruction and/or the host device 10 generates a management instruction according to a preset control strategy, the host device 10 performs its management and control of the slave device 40 through the second I3C bus 50.
- the host device 10 maps the corresponding parameters to the database of the host device 10, and synchronizes the mapping of the parameters to the backup device 20 in real time through the first I3C bus 30, thereby ensuring real-time data update and master-backup data synchronization after the parameters of the slave device 40 change during management.
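The collect, map and synchronize flow described above can be sketched as follows. This is only an illustrative model, not the patent's implementation: the `Controller` class, the dictionary layout keyed by serial number, and the function names are all assumptions, and both I3C buses are simulated by plain function calls.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical stand-in for a CMC (host or backup)."""
    name: str
    database: dict = field(default_factory=dict)  # serial number -> parameters

    def apply_sync(self, changes: dict) -> None:
        # Merge a mapping received over the (simulated) first I3C bus.
        for serial, params in changes.items():
            self.database.setdefault(serial, {}).update(params)


def collect_at_startup(host, backup, slaves):
    """Host reads every slave once, stores the mapping, and syncs it."""
    for serial, params in slaves.items():
        host.database[serial] = dict(params)
    backup.apply_sync(host.database)


def manage(host, backup, slaves, serial, key, value):
    """Host changes one slave parameter, then syncs only the changed mapping."""
    slaves[serial][key] = value          # simulated second I3C bus write
    change = {serial: {key: value}}
    host.apply_sync(change)
    backup.apply_sync(change)


slaves = {"SN001": {"fan_rpm": 3000}, "SN002": {"fan_rpm": 3200}}
host, backup = Controller("CMC0"), Controller("CMC1")
collect_at_startup(host, backup, slaves)
manage(host, backup, slaves, "SN001", "fan_rpm", 4500)
```

After these calls a user query can be answered from `host.database` (or `backup.database`) without touching the slave devices again.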
- the function of the dual-machine hot backup system 100 also includes the development and utilization of the resources of the backup device 20 to a certain extent, that is, the backup device 20 is also allowed to be directly accessed by external parties, such as users.
- the host device 10 and the standby device 20 respectively have their own communication addresses. Therefore, when the user directly accesses the backup device 20 from the outside according to the communication address and issues a management instruction, the backup device 20 will forward the management instruction to the host device 10 through the first I3C bus 30 after receiving the corresponding management instruction. The host device 10 performs corresponding management on the slave device 40.
- the backup device 20 is further configured to: in response to receiving an emergency management instruction from the outside, forcibly temporarily occupy the second I3C bus 50 to manage the slave device 40, and generate a parameter map according to the changed parameter, and synchronize the emergency management command and the map of the changed parameter to the host device 10 through the first I3C bus 30.
- when the user directly accesses the backup device 20 from the outside according to the communication address and issues a management instruction, and that management instruction is a specific emergency management instruction, the backup device 20 will forcibly and temporarily occupy the second I3C bus 50 to manage the slave device 40 according to the emergency management instruction issued by the user, and synchronize the corresponding information to the host device 10.
- the specific emergency management instructions mentioned here usually refer to management instructions that have very strong timeliness requirements, and/or are closely related to system operation safety and must be processed immediately, and/or that the user requires to be processed immediately.
- the backup device 20 will forcibly occupy the second I3C bus 50 temporarily to manage the slave device 40, eliminating the step of the backup device 20 forwarding the management command to the host device 10 through the first I3C bus 30, which improves the response speed.
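The two paths an externally received instruction can take at the backup device (forwarding versus emergency handling) might be sketched like this. The function name and the `forward_to_host` and `manage_slave` callables are hypothetical placeholders for the first and second I3C buses; how an instruction is recognised as an emergency is not specified in the text and is passed in here as a flag.

```python
def route_on_backup(instruction, emergency, forward_to_host, manage_slave):
    """Decide how the backup handles an externally received instruction.

    forward_to_host and manage_slave are hypothetical callables standing in
    for the first and second I3C buses respectively.
    """
    if not emergency:
        # Normal case: hand the instruction to the host over bus 1.
        forward_to_host({"instruction": instruction})
        return "forwarded"
    # Emergency: the backup temporarily occupies bus 2 and acts itself,
    # then reports the instruction and the changed mapping to the host.
    changed = manage_slave(instruction)
    forward_to_host({"instruction": instruction, "changed": changed})
    return "handled-locally"


bus1_log = []   # messages sent to the host
bus2_log = []   # instructions executed directly on slaves


def fake_manage(instr):
    bus2_log.append(instr)
    return {"SN001": {"power": "off"}}


r1 = route_on_backup("set-fan", False, bus1_log.append, fake_manage)
r2 = route_on_backup("power-off", True, bus1_log.append, fake_manage)
```

The emergency path skips the forwarding round trip before acting, which is the response-time benefit the text describes.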
- the host device 10 is further configured to: in response to the host device 10 entering the upgrade mode and/or the resource occupation exceeding a threshold, notify the backup device 20 through the first I3C bus 30 to temporarily take over the management of the slave device 40; and in response to the host device 10 exiting the upgrade mode and/or the resource occupation no longer exceeding the threshold, notify the backup device 20 through the first I3C bus 30 to stop taking over the management of the slave device 40.
- when the host device needs to upgrade the firmware and/or the system resource occupation exceeds the threshold, the host device cannot continue to support the management of the slave device due to the limited memory space.
- the method adopted is that the operation and maintenance personnel temporarily close some functions, and restart the previously temporarily closed functions when the firmware upgrade ends and/or the resource occupation is relieved.
- the management and control of the slave device by the host device is interrupted, and continuous management and control cannot be achieved. Therefore, the dual-system hot backup system 100 according to the present invention further develops and utilizes the resources of the backup device under the above-mentioned specific circumstances.
- when the host device 10 enters the upgrade mode and/or the resource occupation exceeds the threshold, the host device 10 actively notifies the standby device 20 through the first I3C bus 30 to temporarily take over the management of the slave device 40; and when the host device 10 finishes upgrading and exits the upgrade mode and/or the resource occupation no longer exceeds the threshold, so that it can continue to manage the slave device 40, the host device 10 notifies the standby device 20 through the first I3C bus 30 to stop taking over the management of the slave device 40.
- the backup device 20 is further configured to: in response to receiving the temporary takeover notification from the host device 10, manage the slave device 40 through the second I3C bus 50 and generate a parameter mapping according to the changed parameters; and in response to receiving the notification from the host device 10 to stop taking over, stop managing the slave device 40 and synchronize the mapping of the changed parameters to the host device 10 through the first I3C bus 30.
- the backup device 20 serves as the backup of the host device 10 and is on standby for a long time. In a traditional dual-system hot backup system, once the host device 10 fails, the backup device 20 becomes the host to maintain the normal operation of the system.
- in the dual-machine hot backup system of the present invention, in addition to the above, no matter what state the host device 10 is in, once the backup device 20 receives the temporary takeover notification sent by the host device 10, the backup device 20 will temporarily take over the management of the slave device 40, manage the slave device 40 through the second I3C bus 50, and generate a parameter mapping according to the changed parameters. In addition, once the backup device 20 receives the notification to stop taking over from the host device 10, the backup device 20 will stop managing the slave device 40, return the management work to the host device 10, and synchronize to the host device 10, through the first I3C bus 30, the mapping of the parameters affected by management actions during the backup device 20's management, to ensure that the host device can accurately manage the slave device 40 and that the master and backup data are consistent.
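A minimal sketch of the temporary takeover cycle on the backup side, assuming a hypothetical `BackupController` class that buffers the parameter changes it makes while in charge and hands them back when told to stop. Bus traffic is again simulated by direct calls.

```python
class BackupController:
    """Hypothetical model of the backup's takeover behaviour."""

    def __init__(self):
        self.managing = False
        self.pending = {}        # parameter changes made during takeover

    def on_takeover_notice(self):
        # Host entered upgrade mode or exceeded its resource threshold.
        self.managing = True
        self.pending = {}

    def manage(self, serial, key, value):
        if not self.managing:
            raise RuntimeError("backup is on standby")
        self.pending.setdefault(serial, {})[key] = value

    def on_stop_notice(self):
        """Stop managing and return the mapping to sync back to the host."""
        self.managing = False
        changes, self.pending = self.pending, {}
        return changes


host_db = {"SN001": {"fan_rpm": 3000}}
backup = BackupController()
backup.on_takeover_notice()                    # host enters upgrade mode
backup.manage("SN001", "fan_rpm", 5000)
for serial, params in backup.on_stop_notice().items():   # host is back
    host_db.setdefault(serial, {}).update(params)
```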
- the backup device 20 is further configured to: in response to the dual-machine hot backup system 100 being activated, the backup device 20 actively initiates a clock synchronization request to the host device 10.
- the system time of the active and standby devices plays an important role in many situations and functions, so it is necessary to ensure the clock synchronization between the active and standby devices.
- the strategy adopted for clock synchronization is that after the dual-system hot backup system 100 is started, that is, after the host device 10 and the backup device 20 are turned on, the backup device 20 actively initiates a clock synchronization request to the host device 10 to ensure the consistency of the system time of the two devices.
- the host device 10 and the backup device 20 are further configured to: in response to either of the host device 10 and the backup device 20 initiating the synchronization of the parameter mapping, the initiator (10 or 20) generates synchronized packaged data including the original data, the modified data and the modification time, and sends it to the other party (20 or 10).
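As a rough illustration of backup-initiated clock synchronization, the sketch below has the backup adopt the host's time as a correction offset. The `Clock` class and `sync_from_host` function are assumptions, and bus transfer latency is ignored, which a real implementation would need to account for.

```python
class Clock:
    """Hypothetical device clock: a raw counter plus a correction offset."""

    def __init__(self, raw):
        self.raw = raw
        self.offset = 0.0

    def now(self):
        return self.raw + self.offset


def sync_from_host(backup_clock, host_clock):
    # The backup initiates the request, reads the host's time and adopts it.
    backup_clock.offset = host_clock.now() - backup_clock.raw


host_clock = Clock(1000.0)
backup_clock = Clock(12.5)
sync_from_host(backup_clock, host_clock)
```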
- the same data storage format is used in the host device 10 and the backup device 20.
- the initiator (host device 10 or backup device 20) packages the "original data + modified data + modification time" into synchronized packaged data, and then sends the synchronized packaged data to the other party (the standby device 20 or the host device 10) so that the other party can update its data.
- the host device 10 and the backup device 20 are further configured to:
- compare the original data in the received synchronized packaged data with the local data; and
- if they differ, compare the modification time in the received synchronized packaged data with the local modification time, and use the modified data with the newer modification time as the standard for synchronization.
- this arbitration mechanism specifically includes the following parts. First, when the host device 10 and/or the backup device 20 receives the synchronized packaged data sent by the other party, it parses the synchronized packaged data to extract the original data, the modified data, and the modification time, and compares the extracted original data with the local data. If the original data is the same as the local data, the local data has not changed, so the local data can be directly updated according to the extracted modified data.
- if the original data differs from the local data, the modification time extracted from the received synchronized packaged data is compared with the modification time in the synchronized packaged data generated when the local data changed, and the modified data with the newer modification time is taken as the final valid data for synchronizing the master and backup data. Specifically, if the modification time extracted from the received synchronized packaged data is newer, the modified data extracted from the received synchronized packaged data is the final valid data, and the local data is updated with it to synchronize the master and backup data.
- if the modification time in the synchronized packaged data generated when the local data changed is newer, the local data is the final valid data, so no local update is performed.
- in that case, if the locally generated synchronized packaged data has not yet been sent, it is sent to the other party immediately. If it was already sent to the other party when the local data changed, no further processing is necessary.
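The arbitration rule described above ("original data + modified data + modification time", newer modification wins) can be sketched as below. `SyncPacket` and `arbitrate` are hypothetical names, the clocks of both devices are assumed to be already synchronized, and the handling of exactly equal timestamps (local data is kept) is an assumption the text does not address.

```python
from dataclasses import dataclass


@dataclass
class SyncPacket:
    original: dict      # value before the sender's change
    modified: dict      # value after the sender's change
    mtime: float        # sender's modification time (clocks assumed synced)


def arbitrate(local, local_mtime, packet):
    """Return (new_local_data, new_local_mtime, accepted_remote)."""
    if packet.original == local:
        # Local data is unchanged: take the remote modification directly.
        return packet.modified, packet.mtime, True
    if packet.mtime > local_mtime:
        # Both sides changed; the remote change is newer and wins.
        return packet.modified, packet.mtime, True
    # Local change is newer (or tied, by assumption): keep it.
    # The caller is expected to (re)send its own packet to the peer.
    return local, local_mtime, False


base = {"fan_rpm": 3000}
pkt = SyncPacket(original=base, modified={"fan_rpm": 4000}, mtime=10.0)

unchanged = arbitrate({"fan_rpm": 3000}, 5.0, pkt)     # local not modified
remote_newer = arbitrate({"fan_rpm": 3500}, 8.0, pkt)  # conflict, remote newer
local_newer = arbitrate({"fan_rpm": 3500}, 12.0, pkt)  # conflict, local newer
```

Only the third case leaves the local data untouched; in the first two the local copy becomes `{"fan_rpm": 4000}`.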
- FIG. 2 shows a schematic diagram of another embodiment of a dual-system hot backup system 100' according to the present invention, in which the host device 10 and the backup device 20 are further configured to detect each other's operating status through a two-way physical IO heartbeat detection mechanism 60.
- the embodiment of the dual-machine hot backup system 100' according to the present invention adds a two-way physical IO heartbeat detection mechanism between the host device 10 and the backup device 20 compared with the traditional dual-machine hot backup system.
- while the backup device detects whether the host device is faulty, the host device also checks the operating status of the backup device in real time. This prevents the case where the backup device fails before the host device without the system knowing, so that when the host device later fails and the backup device is needed to take over management, the system loses management completely.
- the host device 10 and the backup device 20 are further configured to: in response to either of the host device 10 and the backup device 20 detecting that the other party has failed, the non-faulty party (10 or 20) records the failure of the faulty party (20 or 10) in the log, resets the faulty party (20 or 10) through the external dual reset mechanism 70, and synchronizes the clock and database to the faulty party (20 or 10).
- the non-faulty party (host device 10 or backup device 20) will record the failure of the failed party (backup device 20 or host device 10) in the local log.
- the non-faulty party (host device 10 or backup device 20) will restart and reset the failed party (backup device 20 or host device 10) through the external dual reset mechanism 70 to ensure that both devices can be online at the same time to prevent redundancy failure. If the reset is successful, the non-faulty party (host device 10 or backup device 20) will synchronize the clock and database of the failed party (backup device 20 or host device 10).
- the host device 10 and the backup device 20 are further configured to: in response to the non-faulty party (10 or 20) being unable to restart and reset the failed party (20 or 10) through the external dual reset mechanism 70 and/or the restart and reset failing, the non-faulty party (10 or 20) issues an alarm to notify the operation and maintenance personnel to deal with it. That is to say, in some situations the non-faulty party (host device 10 or backup device 20) may be unable to reset the failed party (backup device 20 or host device 10) through the external dual reset mechanism 70, or the attempted restart and reset may fail.
- the strategy adopted at this time is that the non-faulty party (host device 10 or backup device 20) issues an alarm to notify the operation and maintenance personnel to deal with the failed party (backup device 20 or host device 10) and eliminate the fault, so as to maintain the effectiveness of the dual-system hot backup.
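The fault-handling sequence above (log, reset, then either resynchronize or raise an alarm) could look roughly like this. All callables are hypothetical; in particular, modelling "unable to reset" as an `OSError` raised by `reset_peer` is an assumption made only for the sketch.

```python
def handle_peer_fault(peer, log, reset_peer, sync_peer, raise_alarm):
    """Non-faulty party's reaction to a detected peer failure.

    reset_peer returns True on success and may raise OSError when the
    reset line cannot be driven at all (an assumed convention).
    """
    log.append(f"{peer} failed")
    try:
        recovered = reset_peer(peer)
    except OSError:
        recovered = False
    if recovered:
        sync_peer(peer)          # resynchronize clock and database
        return "recovered"
    raise_alarm(f"{peer} could not be reset")
    return "alarm"


log, synced, alarms = [], [], []


def broken_reset(peer):
    raise OSError("reset line unavailable")


ok = handle_peer_fault("CMC0", log, lambda p: True, synced.append, alarms.append)
failed = handle_peer_fault("CMC0", log, lambda p: False, synced.append, alarms.append)
blocked = handle_peer_fault("CMC0", log, broken_reset, synced.append, alarms.append)
```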
- Fig. 3 shows a schematic structural diagram of an embodiment of the host device and the standby device of the dual-system hot backup system according to the present invention.
- taking as an example the case where the host device 10 and the backup device 20 are both CMC devices and the slave device 40 is a BMC device, the host device 10 and the backup device 20 have the same structure. The following further explains the specific composition and functions of each module in the main and standby CMC devices.
- the system module on the CMC device is the state machine of the entire system, and is the core scheduling module of the system. The scheduling, state judgment, and data flow among the various modules are all processed by the system module.
- after the CMC host is powered on, it checks the BMC device on each node and establishes a parameter mapping database according to the node device serial number, operating state parameters and other parameters.
- the CMC maps the BMC parameters of all nodes to the CMC. In this way, the user only needs to access the CMC device to obtain the BMC data of all nodes, which avoids the CMC having to read the BMC parameters one by one when the user queries, and speeds up the response to users.
- the present invention adopts I3C bus connection between two CMC devices, which is used for data transmission and synchronization work between the two devices, so as to realize real-time data synchronization between the two devices.
- the I3C bus has a rate of 33Mbps, a soft interrupt mechanism, and a checksum fault tolerance mechanism.
- the principle of synchronization work includes the following: first, clock synchronization; second, whichever device's data changes is responsible for initiating the synchronization; third, if data on both devices changes at the same time, the last setting prevails.
- after power-on, the backup machine actively initiates a time synchronization request to the host to ensure the time consistency of the two devices, which can then be used to determine the final effective data when both devices modify the data at the same time.
- the same data storage format is used in the two systems.
- the data synchronization module packages the "original data + modified data + modification time" and sends it to the other CMC.
- after receiving the synchronized data, the other CMC first determines whether the original data is the same as the local data. If it is the same, the local data has not changed, and the modified data is applied directly. If the original data is different from the local data, the local data has also been modified; in that case the data modification times are compared and the modified data with the newer time is used for synchronization, to maintain data consistency.
- the system adopts a two-way heartbeat packet detection mechanism. After either CMC device fails, the other CMC device will sense this, record it in the log, and force a reset of the failed CMC device to restore normal operation. Take CMC0 (host device 10) as an example: if CMC0 fails, CMC1 (standby device 20) will find in time that the heartbeat signal of CMC0 cannot be detected. CMC1 then determines that CMC0 is faulty, records it in the log, and forcibly resets CMC0 through the reset signal IO line. When the faulty device CMC0 restarts and recovers, it promptly asks CMC1 for its working mode and whether data must be forcibly synchronized.
- CMC0 host device 10
- CMC1 standby device 20
- CMC1 will reply with its own working mode and whether data must be forcibly synchronized. CMC1 will then send all the synchronized data plus a mandatory update flag to CMC0, and CMC0 will update its data to be completely consistent. After that, CMC1 will also trigger the mode switching process.
- both CMCs are in idle mode, and the backup device 20 actively synchronizes data from the host device 10.
- the invention adopts a two-way heartbeat detection mechanism for fault detection to ensure that any problem of the dual machines can be found by the other, reset it in time, restore the backup state, record logs and inform operation and maintenance personnel.
- the detection module in the system is responsible for the generation of the heartbeat signal of one's own device and the monitoring function of the other's device's heartbeat signal.
- the detection module is responsible for collecting the working status of the machine and the abnormal suspension of the thread.
- the fault detection module will generate a continuous pulse signal on the heartbeat signal IO.
- the heartbeat signal will stop outputting.
- the detection module will naturally fail to output signals.
- the detection module will monitor in real time whether there is a heartbeat signal on the other party's CMC device's heartbeat signal IO.
- if the other party's pulse signal is not detected for, for example, more than one second, the other party's device can be preliminarily determined to be abnormal.
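A minimal sketch of the receive side of the heartbeat check, using the one-second example threshold from the text. The `HeartbeatMonitor` class is hypothetical, timestamps are passed in explicitly so the logic can be exercised without real IO, and returning no verdict before the first pulse is an assumption.

```python
class HeartbeatMonitor:
    """Hypothetical watcher for the peer's heartbeat IO line."""

    def __init__(self, timeout_s=1.0):
        self.timeout_s = timeout_s
        self.last_pulse = None

    def on_pulse(self, now):
        # Called whenever a pulse edge is seen on the peer's heartbeat IO.
        self.last_pulse = now

    def peer_suspect(self, now):
        """True when no pulse has been seen within the timeout."""
        if self.last_pulse is None:
            return False      # assumption: no verdict before the first pulse
        return now - self.last_pulse > self.timeout_s


mon = HeartbeatMonitor()
mon.on_pulse(0.0)
alive = mon.peer_suspect(0.5)      # last pulse 0.5 s ago
suspect = mon.peer_suspect(1.5)    # last pulse 1.5 s ago
```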
- the fault handling mechanism of the reset module is that when the fault detection module determines that the other party's device is faulty, the own CMC device immediately takes over all the work, and in addition, restarts the faulty device through the reset signal line.
- the specific mode switching, the data synchronization mode and the working mode switching after the failure recovery are all introduced above, and will not be repeated.
- The fault recovery module of the local machine mainly consists of a watchdog and an external reset signal line.
- Watchdog: when the system runs abnormally, it stops feeding the watchdog. After a period of time, the watchdog times out and restarts the system.
- The other party's device: when the other party's device is the first to detect the failure of the local device, it resets the local device through the reset signal line.
- The watchdog generally has a long delay (4 seconds in this system design, adjustable according to the actual situation). Under normal circumstances, the other party's CMC detects the system fault and issues its reset before the watchdog does.
- The watchdog provides a reset when both CMCs fail at the same time due to external interference. If the watchdog timeout is set short enough, or the heartbeat detection mechanism is slow, the watchdog resets first.
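The timing relationship between the two recovery paths can be illustrated with a small sketch. It assumes, for simplicity, that the heartbeat timeout and the watchdog delay both start counting at the moment of the fault, and that a tie goes to the peer reset. The function name and parameters are hypothetical; the 4-second watchdog default comes from the text, and the 1-second heartbeat default reuses the example threshold given for heartbeat detection.

```python
def first_reset_source(peer_alive: bool,
                       heartbeat_timeout_s: float = 1.0,
                       watchdog_timeout_s: float = 4.0) -> str:
    """Return which mechanism would reset a hung CMC first (illustrative only)."""
    if not peer_alive:
        # Both CMCs are down: only the local watchdog can recover the device.
        return "watchdog"
    if watchdog_timeout_s < heartbeat_timeout_s:
        # Watchdog configured shorter than heartbeat detection: it fires first.
        return "watchdog"
    # Normal case: the peer notices the missing heartbeat and resets first.
    return "peer_reset"


print(first_reset_source(peer_alive=True))    # peer_reset
print(first_reset_source(peer_alive=False))   # watchdog
```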
- the fault module determines that the other party's CMC equipment is faulty
- The local machine immediately records the situation in the log and raises a general alarm through the LEDs and by reporting to the remote server. If the faulty device has two or more faults in a day, a severe alarm is reported. If the faulty device cannot be restored by resetting, the local CMC device continuously reports a fatal fault alarm. The alarm persists and, even if the other CMC device resumes service after a reset, is not cancelled until maintenance personnel clear it manually.
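The three alarm levels described above can be expressed as a simple classification. This is only a sketch: the enum, the function name, and the choice to let a failed reset take precedence over the fault count are assumptions, since the text does not state how the conditions combine.

```python
from enum import Enum


class AlarmLevel(Enum):
    GENERAL = "general"
    SEVERE = "severe"
    FATAL = "fatal"


def alarm_level(faults_in_last_day: int, reset_restored_peer: bool) -> AlarmLevel:
    """Map the fault history of the peer CMC to an alarm level (hypothetical helper)."""
    if not reset_restored_peer:
        # Resetting did not bring the peer back: fatal fault alarm.
        return AlarmLevel.FATAL
    if faults_in_last_day >= 2:
        # Two or more faults within a day: severe alarm.
        return AlarmLevel.SEVERE
    return AlarmLevel.GENERAL


print(alarm_level(1, True).value)   # general
print(alarm_level(2, True).value)   # severe
print(alarm_level(1, False).value)  # fatal
```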
- the CMC equipment is responsible for information collection and cooling fan control on each node of the entire cabinet/box, and the display control and management of the buttons and indicators on the front panel.
- the management module specifically includes at least the following content:
- the CMC device collects data on each node, not in the form of periodic polling, but in the form of active reporting by the BMC of each node.
- When parameters on a node BMC change, the BMC first completes its management of the node and then initiates a communication request to the CMC as soon as possible, synchronizing the parameters to the mapping area of the CMC so that the CMC parameters stay consistent with the BMC parameters. Because the BMC initiates a soft interrupt through the I3C bus, the request can only be initiated when the I3C bus is idle. Therefore, when the bus is detected to be busy, the BMC delays the initiation for a period of time (preferably 10 ms in the present invention, adjustable according to actual conditions).
- When the user configures the BMC of a node through the CMC, the CMC first verifies that the parameters are legal, then initiates communication with the node BMC and writes the parameters to it. After the configuration succeeds, the CMC modifies the parameters in its mapping of that node to keep the parameters consistent.
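The delayed initiation used by the BMC when the bus is busy can be sketched as a retry loop. The function and its callbacks (`bus_is_busy`, `send`) are hypothetical stand-ins for real I3C driver calls, and the attempt limit is an added assumption; only the 10 ms delay comes from the text.

```python
import time
from typing import Callable


def report_when_idle(bus_is_busy: Callable[[], bool],
                     send: Callable[[], None],
                     retry_delay_s: float = 0.010,
                     max_attempts: int = 100,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Send once the bus is idle; return how many times the send was delayed."""
    for delays in range(max_attempts):
        if not bus_is_busy():
            send()
            return delays
        # Bus busy: back off for the configured delay before checking again.
        sleep(retry_delay_s)
    raise TimeoutError("bus stayed busy for all attempts")


# Simulated bus that is busy for the first two checks.
busy_states = iter([True, True, False])
sent = []
delays = report_when_idle(lambda: next(busy_states),
                          lambda: sent.append("params"),
                          sleep=lambda s: None)
print(delays, sent)  # 2 ['params']
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.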
- the CMC device provides external services such as web services, command line and other human-computer interaction interfaces 80, which are used for human-computer interaction such as remote device management, firmware update, or fault reporting to the remote control center.
- The upgrade module in the CMC device is mainly responsible for system upgrades, which fall into two parts: upgrading the CMC device's own firmware, and upgrading the firmware on each node.
- The CMC upgrade module is also responsible for determining the consistency of the upgrade package.
- the user can manage the BMC, BIOS, CPLD and other firmware upgrades of each node through the CMC.
- CMC cannot upgrade the firmware on the node through the internal I2C bus, and must rely on LAN.
- the node cannot be upgraded.
- firmware upgrade through the internal I3C bus becomes a reality. Even if it does not rely on the LAN, the CMC can still upgrade the firmware of the node.
- Since the CMC holds a mapping of the BMC data of each node, the user only needs to log in to the CMC to upgrade the firmware on each node. The user first selects a node whose firmware is to be upgraded. The CMC lists the upgradeable firmware according to the model of the node. The user uploads the firmware upgrade package to the CMC. The CMC judges the compatibility of the firmware upgrade according to the model of the node. If the model or version does not match, the user is prompted and the upgrade operation is terminated; the subsequent upgrade steps are performed only if the upgrade conditions are met.
- There are two ways for the CMC to transmit firmware upgrade package data to the node BMC.
- In one, the CMC transmits the firmware upgrade package to the node BMC via LAN for upgrade; in the other, the CMC transmits the firmware upgrade package data to the node BMC via the internal I3C bus.
- Direct data synchronization and interaction between the CMC and the BMC always take place on the I3C bus, and the firmware upgrade package is relatively large.
- The present invention therefore preferably transmits the firmware upgrade package to the node BMC via LAN; when the LAN link fails, the I3C bus is used for transmission.
- The CMC divides the firmware upgrade package into several small pieces, each with a sequence number and a check code, and transmits them over the I3C bus at intervals. In this way, data synchronization can still be performed during the transmission intervals.
- When the BMC receives a small piece of data, it verifies the data, unpacks it, and stores it. When the verification fails, the BMC asks the CMC to resend that piece. After all the pieces have been transmitted to the BMC, the BMC combines them to restore the complete firmware upgrade package.
- The BMC caches the upgrade package and verifies its integrity. If the verification passes, it returns OK; if the verification fails, it reports an upgrade package verification failure to the CMC.
- The CMC instructs the BMC through the I3C bus to upgrade the firmware. The BMC checks again whether the firmware upgrade package meets the upgrade requirements of the node's firmware. If it does not, the CMC is notified to terminate the upgrade operation. If it does, the BMC starts upgrading the firmware and provides feedback to the user to improve user-friendliness.
- When the CMC receives feedback from the BMC that the firmware upgrade succeeded, it informs the user that the upgrade is complete, and the upgrade process ends.
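The fragmented transfer over the I3C bus described above (numbered pieces, a check code per piece, resend on failure, reassembly on the BMC side) can be sketched as follows. The text does not specify the check code or the piece format, so CRC32 from Python's `zlib` module and the tuple layout are assumptions, and all function names are hypothetical.

```python
import zlib
from typing import Dict, List, Tuple

# (sequence number, CRC32 check code, payload); this layout is an assumption.
Chunk = Tuple[int, int, bytes]


def split_package(package: bytes, chunk_size: int) -> List[Chunk]:
    """CMC side: cut the package into numbered pieces with a check code each."""
    chunks = []
    for seq, start in enumerate(range(0, len(package), chunk_size)):
        payload = package[start:start + chunk_size]
        chunks.append((seq, zlib.crc32(payload), payload))
    return chunks


def receive_chunk(store: Dict[int, bytes], chunk: Chunk) -> bool:
    """BMC side: verify one piece; False means the CMC should resend it."""
    seq, check, payload = chunk
    if zlib.crc32(payload) != check:
        return False
    store[seq] = payload
    return True


def reassemble(store: Dict[int, bytes], total: int) -> bytes:
    """BMC side: join the stored pieces in sequence order."""
    return b"".join(store[i] for i in range(total))


package = bytes(range(256)) * 4
chunks = split_package(package, 100)
store: Dict[int, bytes] = {}
for chunk in chunks:
    assert receive_chunk(store, chunk)
print(reassemble(store, len(chunks)) == package)  # True
```

In the described system, the pieces would be sent at intervals so that ordinary data synchronization can use the bus in between; the sketch leaves scheduling out.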
- The system module, data module, synchronization module, detection module, reset module, alarm module, management module, network module, and upgrade module introduced above constitute the functional structure of the host device 10 and the backup device 20 in the dual-machine hot backup system 100 and/or 100' according to the present invention. The aforementioned embodiments of the dual-machine hot backup system 100 and/or 100' can thus be constructed to complete the corresponding functions and achieve the corresponding technical effects.
- The present invention has at least the following beneficial effects. The I2C bus and serial port used in current multi-node server solutions have a low communication rate and cannot meet the requirements of interaction and synchronization between the active and standby devices, so data synchronization has had to rely heavily on an external LAN and switches. To address this problem, the I3C bus is used to establish the internal communication architecture of the multi-node server dual-machine hot standby system, with two I3C buses forming the communication paths between the host and standby devices and between those devices and the slave devices.
- The host device collects the status of the slave devices through the I3C bus, establishes a mapping, and stores it in the database. When, for example, the user needs the operating parameters of a certain slave device, the host device does not have to fetch them from the slave device in response to the user's instruction; it can directly return the relevant information recorded in the database.
- the host device will synchronize the corresponding information to the standby device to ensure the consistency of data in the main and standby devices.
- The backup device may also be accessed directly from outside, for example by users issuing instructions.
- The backup device forwards the management instruction to the host device through the I3C bus so that the host device can respond by managing the slave device.
- Using the dual-machine hot backup system of the present invention not only improves the efficiency of communication inside the system, but also avoids the reliability problems caused by relying on an external LAN and external switches, ensures the data consistency of the main and backup devices, and to a certain extent develops and utilizes the resources of the backup device, thereby further improving the safety and operating efficiency of the entire multi-node server system.
Claims (10)
- 1. A dual-machine hot backup system, characterized in that the system comprises: a host device; a backup device, the backup device being communicatively connected to the host device through a first I3C bus; and at least one slave device, the at least one slave device being communicatively connected to the host device and the backup device through a second I3C bus; wherein the host device is configured to, in response to startup of the dual-machine hot backup system, collect parameters of the slave device, store a mapping of the parameters in a database, and synchronize the mapping to the backup device through the first I3C bus, and is configured to manage the slave device through the second I3C bus based on a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device through the first I3C bus; and the backup device is configured to, in response to receiving a management instruction from outside, forward the received management instruction to the host device through the first I3C bus.
- 2. The system according to claim 1, characterized in that the backup device is further configured to: in response to receiving an urgent management instruction from outside, forcibly and temporarily occupy the second I3C bus to manage the slave device, generate a mapping of parameters according to the changed parameters, and synchronize the urgent management instruction and the mapping of the changed parameters to the host device through the first I3C bus.
- 3. The system according to claim 1, characterized in that the host device is further configured to: in response to the host device entering an upgrade mode and/or its resource occupation exceeding a threshold, notify the backup device through the first I3C bus to temporarily take over management of the slave device; and in response to the host device exiting the upgrade mode and/or its resource occupation no longer exceeding the threshold, notify the backup device through the first I3C bus to stop taking over management of the slave device.
- 4. The system according to claim 3, characterized in that the backup device is further configured to: in response to receiving the temporary takeover notification from the host device, manage the slave device through the second I3C bus and generate a mapping of parameters according to the changed parameters; and in response to receiving the stop-takeover notification from the host device, stop managing the slave device and synchronize the mapping of the changed parameters to the host device through the first I3C bus.
- 5. The system according to claim 1, characterized in that the backup device is further configured to: in response to startup of the dual-machine hot backup system, actively initiate a clock synchronization request to the host device.
- 6. The system according to claim 1, characterized in that the host device and the backup device are further configured such that: in response to either of the host device and the backup device initiating synchronization of a mapping of parameters, the initiating party generates synchronized packaged data including the original data, the modified data, and the modification time, and sends it to the other party.
- 7. The system according to claim 6, characterized in that the host device and the backup device are further configured to: in response to the host device and/or the backup device receiving the synchronized packaged data sent by the other party, compare the original data therein with the local data; in response to the original data being the same as the local data, modify the local data according to the modified data; and in response to the original data being different from the local data, compare the modification time in the received synchronized packaged data with the local modification time, and synchronize based on the modified data with the newer modification time.
- 8. The system according to claim 1, characterized in that the host device and the backup device are further configured to: detect the operating state of the other party through a two-way physical IO heartbeat detection mechanism.
- 9. The system according to claim 1, characterized in that the host device and the backup device are further configured such that: in response to either of the host device and the backup device detecting that the other party has failed, the non-faulty party records the fault of the faulty party in the log, restarts and resets the faulty party through an external dual reset mechanism, and performs clock and database synchronization for the faulty party.
- 10. The system according to claim 9, characterized in that the host device and the backup device are further configured such that: in response to the non-faulty party being unable to restart and reset the faulty party through the external dual reset mechanism and/or the restart and reset failing, the non-faulty party issues an alarm to notify operation and maintenance personnel.
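As a rough illustration of the synchronization rule in claims 6 and 7, the receiving side's decision can be sketched as follows. The class and field names are hypothetical, values are simplified to strings, timestamps are plain numbers, and keeping the local value when the two modification times are equal is an assumption the claims do not address.

```python
from dataclasses import dataclass


@dataclass
class SyncPacket:
    original: str        # value the sender had before its change
    modified: str        # value the sender has after its change
    modified_at: float   # sender's modification time


@dataclass
class LocalRecord:
    value: str
    modified_at: float


def apply_sync(local: LocalRecord, packet: SyncPacket) -> LocalRecord:
    """Return the record the receiver should hold after processing the packet."""
    if packet.original == local.value:
        # Receiver was in step with the sender: simply apply the change.
        return LocalRecord(packet.modified, packet.modified_at)
    if packet.modified_at > local.modified_at:
        # Both sides diverged: the newer modification wins.
        return LocalRecord(packet.modified, packet.modified_at)
    return local  # local change is newer (or equally new): keep it


local = LocalRecord(value="fan=40", modified_at=100.0)
print(apply_sync(local, SyncPacket("fan=40", "fan=60", 105.0)).value)  # fan=60
print(apply_sync(local, SyncPacket("fan=30", "fan=50", 90.0)).value)   # fan=40
```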
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910995329.9A CN110750480B (en) | 2019-10-18 | 2019-10-18 | Dual-computer hot standby system |
CN201910995329.9 | 2019-10-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021073105A1 true WO2021073105A1 (en) | 2021-04-22 |
Family
ID=69278976
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/092835 WO2021073105A1 (en) | 2019-10-18 | 2020-05-28 | Dual-computer hot standby system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110750480B (en) |
WO (1) | WO2021073105A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113852529A (en) * | 2021-08-11 | 2021-12-28 | 交控科技股份有限公司 | Back board bus system for data communication of trackside equipment and data transmission method thereof |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109491311A (en) * | 2018-11-13 | 2019-03-19 | 江苏常熟发电有限公司 | A kind of CEMS data transmission failure judgment method |
CN110750480B (en) * | 2019-10-18 | 2021-06-29 | 苏州浪潮智能科技有限公司 | Dual-computer hot standby system |
CN111698117A (en) * | 2020-04-01 | 2020-09-22 | 新华三信息安全技术有限公司 | Equipment management method, network equipment, storage medium and router |
CN111736880A (en) * | 2020-05-28 | 2020-10-02 | 苏州浪潮智能科技有限公司 | BMC refreshing method, system, equipment, product and storage medium |
CN111813859A (en) * | 2020-07-14 | 2020-10-23 | 积成电子股份有限公司 | Time slice-based synchronization method for historical items of transformer substation between main machine and auxiliary machine |
CN112398712B (en) * | 2020-09-29 | 2022-01-28 | 卡斯柯信号有限公司 | CAN and MLVDS dual-bus-based communication board active/standby control method |
CN114690857A (en) * | 2020-12-28 | 2022-07-01 | 技嘉科技股份有限公司 | Cabinet management control device and cabinet management control system |
CN113852549B (en) * | 2021-09-27 | 2023-10-17 | 卡斯柯信号有限公司 | Method for realizing independent data receiving and processing of main and standby systems |
CN117032579A (en) * | 2023-08-21 | 2023-11-10 | 上海合芯数字科技有限公司 | Slave starting method, device and storage medium |
CN117533251A (en) * | 2024-01-08 | 2024-02-09 | 知迪汽车技术(北京)有限公司 | Distributed file system for vehicle-mounted bus data recorder |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103199972A (en) * | 2013-03-25 | 2013-07-10 | 成都瑞科电气有限公司 | Double machine warm backup switching method and warm backup system achieved based on SOA and RS485 bus |
US20130293251A1 (en) * | 2012-05-07 | 2013-11-07 | Tesla Motors, Inc. | Wire break detection in redundant communications |
CN109960679A (en) * | 2017-12-14 | 2019-07-02 | 英特尔公司 | For controlling the systems, devices and methods of the duty ratio of the clock signal of multi-point interconnection |
CN110750480A (en) * | 2019-10-18 | 2020-02-04 | 苏州浪潮智能科技有限公司 | Dual-computer hot standby system |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6970961B1 (en) * | 2001-01-02 | 2005-11-29 | Juniper Networks, Inc. | Reliable and redundant control signals in a multi-master system |
CN104679907A (en) * | 2015-03-24 | 2015-06-03 | 新余兴邦信息产业有限公司 | Realization method and system for high-availability and high-performance database cluster |
CN105389231A (en) * | 2015-10-28 | 2016-03-09 | 浪潮(北京)电子信息产业有限公司 | Database dual-computer backup method and system |
CN107634855A (en) * | 2017-09-12 | 2018-01-26 | 天津津航计算技术研究所 | A kind of double hot standby method of embedded system |
CN108090009A (en) * | 2017-11-13 | 2018-05-29 | 北京全路通信信号研究设计院集团有限公司 | A kind of multimachine method, apparatus of falling machine and system |
CN109144913A (en) * | 2018-09-29 | 2019-01-04 | 联想(北京)有限公司 | A kind of data processing method, system and electronic equipment |
CN109815186A (en) * | 2018-12-18 | 2019-05-28 | 北京航天晨信科技有限责任公司 | Dual redundant communication equipment and method |
- 2019-10-18 CN CN201910995329.9A patent/CN110750480B/en active Active
- 2020-05-28 WO PCT/CN2020/092835 patent/WO2021073105A1/en active Application Filing
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113852529A (en) * | 2021-08-11 | 2021-12-28 | 交控科技股份有限公司 | Back board bus system for data communication of trackside equipment and data transmission method thereof |
CN113852529B (en) * | 2021-08-11 | 2023-03-24 | 交控科技股份有限公司 | Back board bus system for data communication of trackside equipment and data transmission method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN110750480B (en) | 2021-06-29 |
CN110750480A (en) | 2020-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021073105A1 (en) | Dual-computer hot standby system | |
CN107733684B (en) | Multi-controller computing redundancy cluster based on Loongson processor | |
US20140095925A1 (en) | Client for controlling automatic failover from a primary to a standby server | |
CN103199972B (en) | The two-node cluster hot backup changing method realized based on SOA, RS485 bus and hot backup system | |
US5875290A (en) | Method and program product for synchronizing operator initiated commands with a failover process in a distributed processing system | |
CN103647781B (en) | Mixed redundancy programmable control system based on equipment redundancy and network redundancy | |
CN105471622B (en) | A kind of high availability method and system of the control node active-standby switch based on Galera | |
US7853767B2 (en) | Dual writing device and its control method | |
US6012150A (en) | Apparatus for synchronizing operator initiated commands with a failover process in a distributed processing system | |
CN107147540A (en) | Fault handling method and troubleshooting cluster in highly available system | |
JP2004532442A (en) | Failover processing in a storage system | |
BR112019027654A2 (en) | train network node and canopen-based train network node monitoring method | |
US20150019671A1 (en) | Information processing system, trouble detecting method, and information processing apparatus | |
JPH03164837A (en) | Spare switching system for communication control processor | |
CN111737037A (en) | Substrate management control method, master-slave heterogeneous BMC control system and storage medium | |
CN107071189B (en) | Connection method of communication equipment physical interface | |
JP5625605B2 (en) | OS operation state confirmation system, device to be confirmed, OS operation state confirmation device, OS operation state confirmation method, and program | |
CN110399254A (en) | A kind of server CMC dual-locomotive heat activating method, system, terminal and storage medium | |
CN114124803B (en) | Device management method and device, electronic device and storage medium | |
CN102638369B (en) | Method, device and system for arbitrating main/standby switch | |
CN116069373A (en) | BMC firmware upgrading method, device and medium thereof | |
JP7328907B2 (en) | control system, control method | |
CN113794765A (en) | Gate load balancing method and device based on file transmission | |
CN107423167A (en) | A kind of ISCSI target redundancy control methods and system based on dual control storage | |
CN106656437A (en) | Redundant hot standby platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20876515 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 20876515 Country of ref document: EP Kind code of ref document: A1 |
122 | Ep: pct application non-entry in european phase | Ref document number: 20876515 Country of ref document: EP Kind code of ref document: A1 |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.11.2022) |
122 | Ep: pct application non-entry in european phase | Ref document number: 20876515 Country of ref document: EP Kind code of ref document: A1 |