WO2021073105A1 - Dual-computer hot standby system - Google Patents

Publication number
WO2021073105A1
Authority
WO
WIPO (PCT)
Prior art keywords
host device
backup
bus
data
party
Prior art date
Application number
PCT/CN2020/092835
Other languages
French (fr)
Chinese (zh)
Inventor
韩红瑞
黄柏学
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2021073105A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus

Definitions

  • the invention relates to the field of server technology.
  • the invention further relates to a dual-machine hot backup system.
  • The main and backup devices are two independent operating entities, so data differences inevitably arise when the two operate one after the other or at the same time. Even if the LAN returns to normal after a period of time, the two devices do not know whose data is the latest or who should synchronize with whom, so a split-brain situation may occur.
  • a strategy of secondary verification of the serial port is added to avoid the failure of the standby machine to determine the true operation of the host after the LAN network is interrupted.
  • the standby machine can still check that the host is alive through the serial port.
  • the standby machine does not take over the work of the host and is always in a standby working state.
  • data synchronization between the main and standby devices cannot be performed through the LAN, and the user cannot remotely control and access the host device through the LAN.
  • the data queried is not the latest data.
  • the present invention proposes a dual-machine hot backup system based on the above objective, wherein the system includes:
  • a standby device which communicates with the host device through the first I3C bus;
  • At least one slave device, which is in communication connection with the host device and the standby device through the second I3C bus;
  • The host device is configured to, in response to the startup of the dual-machine hot backup system, collect the parameters of the slave device, store the mapping of the parameters in the database and synchronize it to the standby device through the first I3C bus; and is configured to manage the slave device through the second I3C bus based on management instructions, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the standby device through the first I3C bus;
  • the standby device is configured to, in response to receiving a management instruction from the outside, forward the received management instruction to the host device through the first I3C bus.
  • The backup device is further configured to: in response to receiving an emergency management instruction from the outside, forcibly and temporarily occupy the second I3C bus to manage the slave device, generate a parameter mapping according to the changed parameters, and synchronize the emergency management instruction and the mapping of the changed parameters to the host device through the first I3C bus.
  • The host device is further configured to: in response to the host device entering the upgrade mode and/or its resource occupation exceeding the threshold, notify the standby device through the first I3C bus to temporarily take over the management of the slave device; and in response to the host device exiting the upgrade mode and/or its resource occupation no longer exceeding the threshold, notify the standby device through the first I3C bus to stop taking over the management of the slave device.
  • The backup device is further configured to: in response to receiving a temporary takeover notification from the host device, manage the slave device through the second I3C bus and generate a parameter mapping according to the changed parameters; and in response to receiving a stop-takeover notification from the host device, stop managing the slave device and synchronize the mapping of the changed parameters to the host device through the first I3C bus.
  • the backup device is further configured to: in response to the dual-system hot backup system being activated, actively initiate a clock synchronization request to the host device.
  • The host device and the backup device are further configured to: in response to either of the host device and the backup device initiating the synchronization of the parameter mapping, the initiator generates synchronized packaged data including the original data, the modified data and the modification time, and sends it to the other party.
  • the host device and the backup device are further configured as:
  • the modification time in the received synchronized packaged data is compared with the local modification time, and the modified data with the newer modification time is used as the standard for synchronization.
  • the host device and the backup device are further configured to detect the operating state of each other through a two-way physical IO heartbeat detection mechanism.
  • The host device and the backup device are further configured to: in response to either of the host device and the backup device detecting that the other party has failed, the non-faulty party records the failure of the faulty party in the log, resets the faulty party through the external dual reset mechanism, and synchronizes the clock and database to the faulty party.
  • The host device and the backup device are further configured to: in response to the non-faulty party being unable to reset the faulty party through the external dual reset mechanism and/or the reset failing, the non-faulty party sends an alarm to notify the operation and maintenance personnel to deal with it.
  • The present invention has at least the following beneficial effects. The I2C bus and serial port used in current multi-node server solutions have a low communication rate and cannot meet the requirements of interaction and synchronization between the active and standby devices, so data synchronization has to rely heavily on an external LAN and switches. To address this problem, the I3C bus is used to establish the internal communication architecture of the multi-node server dual-machine hot backup system, and two I3C buses are used to construct the communication paths between the host and the backup and between the host/backup and the slaves.
  • The host device collects the status of the slave devices through the I3C bus, establishes a mapping and stores it in the database. When, for example, the user needs the operating parameters of a certain slave device, the host device does not have to fetch them from the slave device in response to the user's instruction, but can directly return the relevant information recorded in the database.
  • the host device will synchronize the corresponding information to the standby device to ensure the consistency of data in the main and standby devices.
  • The backup device may also be accessed directly from the outside, for example by users, to issue instructions.
  • The backup device forwards the management instruction to the host device through the I3C bus so that the host device can manage the slave device accordingly.
  • Using the dual-machine hot backup system of the present invention not only improves the efficiency of the internal communication of the system, but also avoids the reliability problems caused by relying on an external LAN and external switches, and ensures the data consistency of the main and backup devices. It also develops and utilizes the resources of the backup device to a certain extent, thereby further improving the safety and operating efficiency of the entire multi-node server system.
  • Figure 1 shows a schematic diagram of an embodiment of a dual-machine hot backup system according to the present invention
  • Figure 2 shows a schematic diagram of another embodiment of the dual-machine hot backup system according to the present invention.
  • Fig. 3 shows a schematic structural diagram of an embodiment of the host device and the standby device of the dual-system hot backup system according to the present invention.
  • Fig. 1 shows a schematic diagram of an embodiment of a dual-machine hot backup system 100 according to the present invention.
  • the dual-system hot backup system 100 according to the present invention is especially used for the control and management of a multi-node server system.
  • the dual-machine hot backup system 100 at least includes:
  • a backup device 20 which is in communication connection with the host device 10 through the first I3C bus 30;
  • At least one slave device 40, which is in communication connection with the host device 10 and the backup device 20 through the second I3C bus 50;
  • The host device 10 is configured to, in response to the startup of the dual-machine hot backup system 100, collect the parameters of the slave device 40, store the mapping of the parameters in the database and synchronize it to the backup device 20 through the first I3C bus 30; and is configured to manage the slave device 40 through the second I3C bus 50 based on management instructions, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device 20 through the first I3C bus 30;
  • the backup device 20 is configured to, in response to receiving a management instruction from the outside, forward the received management instruction to the host device 10 through the first I3C bus 30.
  • The host device 10 and the backup device 20 are CMC devices (Chassis Management Controller). Similar to a BMC, a CMC manages and controls the whole machine in multi-node server systems such as blade servers, and can send commands to each node for management.
  • the slave device 40 is preferably a BMC (Baseboard Management Controller, baseboard management controller) in a multi-node server system. The BMC can perform some operations on the machine such as firmware upgrade, viewing machine equipment, and so on when the machine is not turned on.
  • the dual-machine hot backup system 100 uses an I3C bus for communication connection.
  • I3C is a two-wire serial communication bus that integrates the key attributes of the I2C and SPI buses and is compatible with the I2C protocol. It adds features such as multiple masters, slave soft interrupts, dynamic allocation of slave addresses, and support for hot swapping, and its speed can be as high as 33 Mbps. It is usually used to connect sensors to an application processor. Further, in order to separate master-backup communication from master-slave management, avoid mutual interference and reduce bus pressure, the first I3C bus 30 is used between the master and the backup (10 and 20), and the second I3C bus 50 is used between the master and the slave (10 and 40) and between the backup and the slave (20 and 40).
  • the dual-computer hot backup system 100 can complete all the functions of the system originally constructed by the I2C bus. And on top of this, the dual-machine hot backup system 100 according to the present invention adds new functions.
  • the host device 10 is configured to collect the parameters of the slave device 40 in response to the startup of the dual-machine hot backup system 100, store the mapping of the parameters in the database, and synchronize the parameters to the backup device 20 through the first I3C bus 30.
  • The CMC host device 10 checks the BMC slave device 40 on each node after it is powered on, and establishes a parameter mapping database based on the node device serial number, the various operating parameters and other parameters.
  • The CMC host device 10 maps the parameters of the BMC slave devices of all nodes to the CMC host device 10.
  • When the user requests the parameters of a certain slave device 40, the user only needs to access the CMC host device 10, which holds the corresponding parameters of all the slave devices 40. The CMC host device 10 therefore does not have to read the parameters from the corresponding slave device 40 at query time, which speeds up the response to user instructions.
  • the host device 10 synchronizes the database to the standby device 20 through the first I3C bus 30 to ensure data consistency between the main and standby devices.
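The collect-once, answer-from-the-database behaviour described above can be illustrated with a minimal Python sketch. The class `ParameterDatabase`, the helper `fake_read_slave`, the serial numbers and the parameter values are all hypothetical; the patent does not specify a data model, and real reads would go over the second I3C bus rather than through a Python callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ParameterDatabase:
    """Hypothetical in-memory stand-in for the host's parameter mapping database."""
    # Maps a slave serial number to its latest known parameters.
    mapping: Dict[str, Dict[str, object]] = field(default_factory=dict)
    bus_reads: int = 0  # counts reads over the (simulated) second I3C bus

    def collect(self, read_slave: Callable[[str], Dict[str, object]], serials) -> None:
        """At startup, read every slave once and store the mapping."""
        for serial in serials:
            self.mapping[serial] = dict(read_slave(serial))
            self.bus_reads += 1

    def query(self, serial: str) -> Dict[str, object]:
        """Answer a user query from the database, without touching the bus."""
        return dict(self.mapping[serial])

    def snapshot(self) -> Dict[str, Dict[str, object]]:
        """Copy of the whole mapping, as would be synchronized to the backup."""
        return {s: dict(p) for s, p in self.mapping.items()}


def fake_read_slave(serial: str) -> Dict[str, object]:
    # Placeholder for a real BMC read; the values are invented for illustration.
    return {"serial": serial, "power": "on"}


db = ParameterDatabase()
db.collect(fake_read_slave, ["node-1", "node-2"])
backup_copy = db.snapshot()
```

In this sketch, `query` never increments `bus_reads`, which mirrors the point that user queries are served from the database rather than from the slave device.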
  • The host device 10 is configured to manage the slave device 40 through the second I3C bus 50 based on management instructions, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device 20 through the first I3C bus 30. That is to say, when, for example, a user accesses the host device 10 from the outside to issue a management instruction and/or the host device 10 generates a management instruction according to a preset control strategy, the host device 10 performs its management and control of the slave device 40 through the second I3C bus 50.
  • The host device 10 maps the corresponding parameters to the database of the host device 10, and synchronizes the mapping of the parameters to the backup device 20 in real time through the first I3C bus 30, thereby ensuring real-time data update and master-backup data synchronization after the parameters of the slave device 40 change during management.
  • The function of the dual-machine hot backup system 100 also includes the development and utilization of the resources of the backup device 20 to a certain extent, that is, the backup device 20 may also be accessed directly from the outside, for example by users.
  • the host device 10 and the standby device 20 respectively have their own communication addresses. Therefore, when the user directly accesses the backup device 20 from the outside according to the communication address and issues a management instruction, the backup device 20 will forward the management instruction to the host device 10 through the first I3C bus 30 after receiving the corresponding management instruction. The host device 10 performs corresponding management on the slave device 40.
  • the backup device 20 is further configured to: in response to receiving an emergency management instruction from the outside, forcibly temporarily occupy the second I3C bus 50 to manage the slave device 40, and generate a parameter map according to the changed parameter, and synchronize the emergency management command and the map of the changed parameter to the host device 10 through the first I3C bus 30.
  • When the user directly accesses the backup device 20 from the outside according to the communication address and issues a management instruction, and the management instruction is a specific emergency management instruction, the backup device 20 forcibly and temporarily occupies the second I3C bus 50 to manage the slave device 40 according to the emergency management instruction issued by the user, and synchronizes the corresponding information to the host device 10.
  • the specific emergency management instructions mentioned here usually refer to the management instructions that have very strong timeliness requirements and/or are closely related to system operation safety and must be processed immediately and/or that the user is forced to process immediately.
  • the backup device 20 will forcibly occupy the second I3C bus 50 temporarily to manage the slave device 40, eliminating the step of the backup device 20 forwarding the management command to the host device 10 through the first I3C bus 30, which improves the response speed.
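The two handling paths on the backup device (forwarding an ordinary instruction, acting directly on an emergency one) can be sketched as follows. This is only an illustration: the `Instruction` type, its `emergency` flag and the bus labels "I3C-1" and "I3C-2" are assumptions, since the patent does not define how an emergency instruction is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    target_slave: str
    command: str
    emergency: bool = False  # hypothetical flag marking an emergency instruction


def route_on_backup(instr: Instruction) -> list:
    """Return the ordered bus actions the backup device would take.

    Each action is a (bus, description) tuple; "I3C-1" stands for the first
    (host/backup) bus and "I3C-2" for the second (slave) bus.
    """
    if instr.emergency:
        # Emergency path: manage the slave directly, then inform the host.
        return [
            ("I3C-2", f"manage {instr.target_slave}: {instr.command}"),
            ("I3C-1", "sync emergency instruction and changed-parameter mapping to host"),
        ]
    # Ordinary path: the backup only forwards; the host does the managing.
    return [("I3C-1", f"forward to host: {instr.command}")]
```

The emergency path saves the forwarding hop before the slave is touched, which is the response-time gain described above; the host is still brought up to date afterwards over the first bus.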
  • The host device 10 is further configured to: in response to the host device 10 entering the upgrade mode and/or its resource occupation exceeding a threshold, notify the backup device 20 through the first I3C bus 30 to temporarily take over the management of the slave device 40; and in response to the host device 10 exiting the upgrade mode and/or its resource occupation no longer exceeding the threshold, notify the backup device 20 through the first I3C bus 30 to stop taking over the management of the slave device 40.
  • When the host device needs to upgrade its firmware and/or its system resource occupation exceeds the threshold, the host device cannot continue to support the management of the slave device because of its limited memory space.
  • Conventionally, the operation and maintenance personnel temporarily close some functions, and restart the temporarily closed functions when the firmware upgrade ends and/or the resource occupation is relieved.
  • In that case the management and control of the slave device by the host device is interrupted, and continuous management and control cannot be achieved. Therefore, the dual-machine hot backup system 100 according to the present invention further develops and utilizes the resources of the backup device under these specific circumstances.
  • When the host device 10 enters the upgrade mode and/or its resource occupation exceeds the threshold, the host device 10 actively notifies the backup device 20 through the first I3C bus 30 to temporarily take over the management of the slave device 40. When the host device 10 finishes upgrading and exits the upgrade mode and/or its resource occupation no longer exceeds the threshold, so that it can continue to manage the slave device 40, the host device 10 notifies the backup device 20 through the first I3C bus 30 to stop taking over the management of the slave device 40.
  • The backup device 20 is further configured to: in response to receiving the temporary takeover notification from the host device 10, manage the slave device 40 through the second I3C bus 50 and generate a parameter mapping according to the changed parameters; and in response to receiving the stop-takeover notification from the host device 10, stop managing the slave device 40 and synchronize the mapping of the changed parameters to the host device 10 through the first I3C bus 30.
  • the backup device 20 serves as the backup of the host device 10 and is on standby for a long time. In a traditional dual-system hot backup system, once the host device 10 fails, the backup device 20 becomes the host to maintain the normal operation of the system.
  • In the dual-machine hot backup system of the present invention, in addition to the above, no matter what state the host device 10 is in, once the backup device 20 receives the temporary takeover notification sent to it by the host device 10, the backup device 20 temporarily takes over the management of the slave device 40, manages the slave device 40 through the second I3C bus 50, and generates a parameter mapping according to the changed parameters. Once the backup device 20 receives the stop-takeover notification from the host device 10, the backup device 20 stops managing the slave device 40, returns the management work to the host device 10, and synchronizes to the host device 10, through the first I3C bus 30, the mapping of the parameters affected by its management actions during the takeover. This ensures that the host device can accurately manage the slave device 40 and that the master and backup data are consistent.
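The temporary takeover and hand-back can be pictured as a small state holder on the backup side. The following Python sketch is an assumption-laden illustration: the class `BackupDevice`, its method names and the dictionary used for the changed-parameter mapping are hypothetical and not taken from the patent.

```python
class BackupDevice:
    """Tracks whether the backup is temporarily managing the slaves."""

    def __init__(self):
        self.managing = False
        self.pending_changes = {}  # parameter changes made while in charge

    def on_takeover_notice(self):
        """Host asked (over the first I3C bus) for a temporary takeover."""
        self.managing = True
        self.pending_changes = {}

    def manage(self, slave, key, value):
        """Apply a management action and remember the changed parameter."""
        if not self.managing:
            raise RuntimeError("backup is not currently in charge of the slaves")
        self.pending_changes.setdefault(slave, {})[key] = value

    def on_stop_notice(self):
        """Stop managing and return the mapping to synchronize to the host."""
        self.managing = False
        changes, self.pending_changes = self.pending_changes, {}
        return changes
```

The value returned by `on_stop_notice` corresponds to the mapping of changed parameters that the backup hands to the host when management is returned.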
  • the backup device 20 is further configured to: in response to the dual-machine hot backup system 100 being activated, the backup device 20 actively initiates a clock synchronization request to the host device 10.
  • the system time of the active and standby devices plays an important role in many situations and functions, so it is necessary to ensure the clock synchronization between the active and standby devices.
  • The strategy adopted for clock synchronization is that after the dual-machine hot backup system 100 is started, that is, after the host device 10 and the backup device 20 are turned on, the backup device 20 actively initiates a clock synchronization request to the host device 10 to ensure the consistency of the system time of the two devices.
  • The host device 10 and the backup device 20 are further configured to: in response to either of the host device 10 and the backup device 20 initiating the synchronization of the parameter mapping, the initiator (10 or 20) generates synchronized packaged data including the original data, the modified data and the modification time, and sends it to the other party (20 or 10).
  • the same data storage format is used in the host device 10 and the backup device 20.
  • The initiator (host device 10 or backup device 20) packages "original data + modified data + modification time" into synchronized packaged data, and then sends the synchronized packaged data to the other party (the backup device 20 or the host device 10) so that the other party can update its data.
  • the host device 10 and the backup device 20 are further configured as:
  • the original data therein is compared with the local data
  • the modification time in the received synchronized packaged data is compared with the local modification time, and the modified data with the newer modification time is used as the standard for synchronization.
  • This arbitration mechanism specifically includes the following parts. First, when the host device 10 and/or the backup device 20 receives the synchronized packaged data sent by the other party, it parses the synchronized packaged data to extract the original data, the modified data and the modification time, and compares the extracted original data with its local data. If the original data is the same as the local data, the local data has not changed, so the local data can be updated directly with the extracted modified data.
  • If the original data differs from the local data, the modification time extracted from the received synchronized packaged data is compared with the modification time in the synchronized packaged data generated when the local data changed, and the modified data with the newer modification time is taken as the final valid data for synchronizing the master and backup data. Specifically, if the modification time extracted from the received synchronized packaged data is newer, the modified data extracted from the received package is the final valid data, and the local data is updated with it to synchronize the master and backup data.
  • If the modification time in the synchronized packaged data generated when the local data changed is newer, the local data is the final valid data, so no local update is performed.
  • In that case, the locally generated synchronized packaged data is sent to the other party immediately if it has not yet been sent. If it was already sent to the other party when the local data changed, no further processing is necessary.
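The arbitration rule above can be condensed into a short Python sketch. The `SyncPackage` type and the `arbitrate` function are hypothetical names introduced here. Two behaviours are assumptions not settled by the text: when the modification times are equal the local value is kept, and when the local data differs from the sender's original but no local package exists, the received modification is accepted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SyncPackage:
    original: str       # value before the sender's change
    modified: str       # value after the sender's change
    modified_at: float  # sender's modification time (clocks assumed synchronized)


def arbitrate(local_value: str,
              local_pending: Optional[SyncPackage],
              received: SyncPackage) -> str:
    """Return the value the receiver should hold after processing `received`.

    `local_pending` is the package generated by the receiver's own most
    recent change, or None if the receiver has not changed the data.
    """
    # Case 1: the receiver's data still matches the sender's original, so
    # nothing changed locally and the modification can be applied directly.
    if received.original == local_value:
        return received.modified
    # Case 2: both sides changed the data; the newer modification wins.
    # (Assumption: with no local package on record, accept the received one.)
    if local_pending is None or received.modified_at > local_pending.modified_at:
        return received.modified
    # Local change is newer (or equally new, by assumption): keep local data.
    return local_value
```

The comparison is only meaningful because the backup device synchronizes its clock with the host device at startup, so the two modification times are on the same time base.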
  • FIG. 2 shows a schematic diagram of another embodiment of a dual-machine hot backup system 100' according to the present invention, in which the host device 10 and the backup device 20 are further configured to detect each other's operating status through a two-way physical IO heartbeat detection mechanism 60.
  • the embodiment of the dual-machine hot backup system 100' according to the present invention adds a two-way physical IO heartbeat detection mechanism between the host device 10 and the backup device 20 compared with the traditional dual-machine hot backup system.
  • While the backup device detects whether the host device is faulty, the host device also checks the operating status of the backup device in real time. This prevents the case in which the backup device fails before the host device without the system knowing, so that the system loses management completely when the host device later fails and the backup device is needed to take over the management functions.
  • The host device 10 and the backup device 20 are further configured to: in response to either of the host device 10 and the backup device 20 detecting that the other party has failed, the non-faulty party (10 or 20) records the failure of the faulty party (20 or 10) in the log, resets the faulty party (20 or 10) through the external dual reset mechanism 70, and synchronizes the clock and database to the faulty party (20 or 10).
  • That is, the non-faulty party (host device 10 or backup device 20) records the failure of the faulty party (backup device 20 or host device 10) in the local log.
  • The non-faulty party (host device 10 or backup device 20) restarts and resets the faulty party (backup device 20 or host device 10) through the external dual reset mechanism 70, so that both devices can be online at the same time and redundancy is not lost. If the reset is successful, the non-faulty party (host device 10 or backup device 20) synchronizes the clock and database to the faulty party (backup device 20 or host device 10).
  • The host device 10 and the backup device 20 are further configured to: in response to the non-faulty party (10 or 20) being unable to restart and reset the faulty party (20 or 10) through the external dual reset mechanism 70 and/or the restart and reset failing, the non-faulty party (10 or 20) issues an alarm to notify the operation and maintenance personnel to deal with it. That is to say, in some situations the non-faulty party (host device 10 or backup device 20) cannot reset the faulty party (backup device 20 or host device 10) through the external dual reset mechanism 70, or tries to restart and reset it but fails.
  • The strategy adopted in this case is that the non-faulty party (host device 10 or backup device 20) issues an alarm to notify the operation and maintenance personnel to deal with the faulty party (backup device 20 or host device 10) and eliminate the fault, so as to maintain the effectiveness of the dual-machine hot backup.
  • Fig. 3 shows a schematic structural diagram of an embodiment of the host device and the standby device of the dual-system hot backup system according to the present invention.
  • Taking as an example the case in which the host device 10 and the backup device 20 are both CMC devices and the slave device 40 is a BMC device, the host device 10 and the backup device 20 have the same structure. The following further explains the specific composition and functions of each module in the main and standby CMC devices.
  • the system module on the CMC device is the state machine of the entire system, and is the core scheduling module of the system. The scheduling, state judgment, and data flow among the various modules are all processed by the system module.
  • After the CMC host is powered on, it checks the BMC device on each node, and establishes a parameter mapping database according to the node device serial number, the operating state parameters and other parameters.
  • The CMC maps the BMC parameters of all nodes to itself. In this way, the user only needs to access the CMC device to obtain the BMC data of all nodes, which avoids the CMC having to read the BMC parameters one by one when the user queries and speeds up the response to users.
  • the present invention adopts I3C bus connection between two CMC devices, which is used for data transmission and synchronization work between the two devices, so as to realize real-time data synchronization between the two devices.
  • the I3C bus has a rate of 33Mbps, a soft interrupt mechanism, and a checksum fault tolerance mechanism.
  • The synchronization follows three principles: first, the clocks are synchronized; second, whichever side's data changes is responsible for initiating the synchronization; third, if the data on both sides changes at the same time, the last setting prevails.
  • After power-on, the backup machine actively initiates a time synchronization request to the host to ensure the time consistency of the two devices. The synchronized time can then be used to determine the final valid modification when the host and the backup modify the data at the same time.
  • the same data storage format is used in the two systems.
  • The data synchronization module packages "original data + modified data + modification time" and sends it to the other CMC.
  • After receiving the synchronized data, the other CMC first determines whether the original data is the same as its local data. If they are the same, the local data has not changed, and the CMC directly applies the modified data. If the original data differs from the local data, the local data has also been modified; the CMC then compares the modification times and keeps the modified data with the newer time to maintain data consistency.
  • The system adopts a two-way heartbeat detection mechanism. After either CMC device fails, the other CMC device senses the failure, records it in the log, and forcibly resets the failed CMC device to restore normal operation. Take CMC0 (host device 10) as an example: if CMC0 fails, CMC1 (standby device 20) promptly finds that the heartbeat signal of CMC0 can no longer be detected. CMC1 then determines that CMC0 is faulty, records the failure in the log, and forcibly resets CMC0 through the reset signal IO line. When the faulty CMC0 restarts and recovers, it promptly asks CMC1 for its working mode and whether forced data synchronization is necessary.
  • CMC1 (standby device 20) replies to CMC0 (host device 10) with its own working mode and with the need for forced data synchronization. CMC1 then sends all the synchronized data together with a mandatory update flag to CMC0, and CMC0 updates its data until the two are completely consistent. After that, CMC1 also triggers the mode switching process.
  • both CMCs are in idle mode, and the backup device 20 actively synchronizes data from the host device 10.
  • the invention adopts a two-way heartbeat detection mechanism for fault detection to ensure that any problem of the dual machines can be found by the other, reset it in time, restore the backup state, record logs and inform operation and maintenance personnel.
  • the detection module in the system is responsible for the generation of the heartbeat signal of one's own device and the monitoring function of the other's device's heartbeat signal.
  • the detection module is responsible for collecting the working status of the machine and the abnormal suspension of the thread.
  • the fault detection module will generate a continuous pulse signal on the heartbeat signal IO.
  • the heartbeat signal will stop outputting.
  • the detection module will naturally fail to output signals.
  • the detection module will monitor in real time whether there is a heartbeat signal on the other party's CMC device's heartbeat signal IO.
  • If the other party's pulse signal is not detected for more than, for example, one second, the other party's device can be preliminarily determined to be abnormal.
  • The fault handling mechanism of the reset module is as follows: when the fault detection module determines that the other party's device is faulty, the local CMC device immediately takes over all the work and restarts the faulty device through the reset signal line.
  • the specific mode switching, the data synchronization mode and the working mode switching after the failure recovery are all introduced above, and will not be repeated.
  • The fault recovery module of the local machine mainly consists of a watchdog and an external reset signal line.
  • Watchdog: when the system runs abnormally, it stops feeding the watchdog. After the timeout elapses, the watchdog expires and restarts the system.
  • External reset: when the other party's device detects the failure of the local device first, it resets the local device through the reset signal line.
  • The watchdog generally has a long timeout (4 seconds in this system design, adjustable according to the actual situation). Under normal circumstances, the other party's CMC therefore detects the fault and resets the device before the watchdog does.
  • The watchdog provides a reset when both CMCs fail at the same time, for example due to external interference. If the watchdog timeout is set short enough, or the heartbeat detection mechanism is slow, the watchdog resets first.
  • When the fault detection module determines that the other party's CMC device is faulty, the local machine immediately records the situation in the log and raises a general alarm through the LEDs and by reporting to the remote server. If the faulty device fails two or more times in one day, a severe alarm is reported. If the faulty device cannot be restored by resetting, the local CMC device keeps reporting a fatal fault alarm. An alarm persists even if the other CMC device resumes service after a reset; it is cleared only when maintenance personnel dismiss it manually.
  • the CMC equipment is responsible for information collection and cooling fan control on each node of the entire cabinet/box, and the display control and management of the buttons and indicators on the front panel.
  • the management module specifically includes at least the following content:
  • the CMC device collects data on each node, not in the form of periodic polling, but in the form of active reporting by the BMC of each node.
  • When a parameter on a node BMC changes, the BMC first completes its own management of the node, then initiates a communication request to the CMC as soon as possible and synchronizes the parameter to the CMC's mapping area, so that the CMC parameters stay consistent with the BMC parameters. Because the BMC initiates a soft interrupt over the I3C bus, it can do so only when the bus is idle. When the bus is detected to be busy, the BMC therefore delays the request for a period of time (preferably 10 ms in the present invention, adjustable according to actual conditions).
  • When the user configures the BMC of a node through the CMC, the CMC first verifies that the parameters are legal, then initiates communication with the node BMC and writes the parameters to it. After the configuration succeeds, the CMC updates the parameter mapping of that node to ensure the consistency of parameters.
  • the CMC device provides external services such as web services, command line and other human-computer interaction interfaces 80, which are used for human-computer interaction such as remote device management, firmware update, or fault reporting to the remote control center.
  • The upgrade module in the CMC device handles system upgrades in two parts: the upgrade of the CMC device's own firmware, and the firmware upgrade on each node.
  • The CMC upgrade module is also responsible for determining the consistency of the upgrade package.
  • the user can manage the BMC, BIOS, CPLD and other firmware upgrades of each node through the CMC.
  • CMC cannot upgrade the firmware on the node through the internal I2C bus, and must rely on LAN.
  • the node cannot be upgraded.
  • firmware upgrade through the internal I3C bus becomes a reality. Even if it does not rely on the LAN, the CMC can still upgrade the firmware of the node.
  • Since the CMC holds a mapping of the BMC data of each node, the user only needs to log in to the CMC to upgrade the firmware on any node. The user first selects a node whose firmware is to be upgraded, and the CMC lists the upgradeable firmware according to the model of the node. The user then uploads the firmware upgrade package to the CMC, which judges the compatibility of the package according to the model of the node. If the model or version is incorrect, the user is prompted and the upgrade operation is terminated; the subsequent upgrade steps are performed only if the upgrade conditions are met.
  • There are two ways for the CMC to transmit firmware upgrade package data to a node BMC: one is to transmit the firmware upgrade package to the node BMC via LAN; the other is to transmit it via the internal I3C bus.
  • Data synchronization and interaction between the CMC and the BMC are always performed on the I3C bus, and the firmware upgrade package is relatively large. The present invention therefore preferably transmits the firmware upgrade package to the node BMC via LAN; when the LAN link fails, the I3C bus is used instead.
  • For I3C transmission, the CMC divides the firmware upgrade package into several small fragments, each with a number and a check code, and transmits them over the I3C bus at intervals. In this way, data synchronization can still be performed during the transmission intervals.
  • When the BMC receives a fragment, it verifies the data, unpacks it, and stores it. If the verification fails, the BMC asks the CMC to resend that fragment. After all fragments have been transmitted to the BMC, the BMC combines them to restore the complete firmware upgrade package.
  • The BMC caches the upgrade package and verifies its integrity. If the verification passes, it returns OK; if the verification fails, it reports an upgrade package verification failure to the CMC.
  • The CMC then tells the BMC over the I3C bus to upgrade the firmware. The BMC checks again whether the firmware upgrade package meets the upgrade requirements of the node's firmware. If it does not, the BMC notifies the CMC to terminate the upgrade operation. If it does, the BMC starts to upgrade the firmware. Feedback to users and improve friendliness.
  • When the CMC receives feedback from the BMC that the firmware upgrade succeeded, it informs the user that the upgrade is complete, and the upgrade process ends.
  • The system module, data module, synchronization module, detection module, reset module, alarm module, management module, network module and upgrade module introduced above constitute the functional structure of the host device 10 and the backup device 20 in the dual-machine hot backup system 100 and/or 100' according to the present invention. The aforementioned embodiments of the dual-machine hot backup system 100 and/or 100' can thus be constructed to complete the corresponding functions and achieve the corresponding technical effects.
  • The present invention has at least the following beneficial effects. The I2C bus and serial ports used in current multi-node server solutions have low communication rates and cannot meet the requirements of interaction and synchronization between master and backup, so data synchronization has to rely heavily on an external LAN and switches. To address this problem, the I3C bus is used to establish the internal communication architecture of a dual-machine hot backup system for multi-node servers, with two I3C buses carrying the master-backup and master-slave communication respectively.
  • When the system starts, the host device collects the status of the slave devices through the I3C bus, establishes a mapping, and stores it in the database. When, for example, a user needs the operating parameters of a slave device, the host device does not have to fetch them from the slave device in response to the user's instruction; it can return the information recorded in the database directly.
  • Both at system startup and whenever the host device manages a slave device, the host device synchronizes the corresponding information to the backup device to ensure data consistency between the main and backup devices.
  • The backup device can also be accessed directly from outside, for example by a user, to issue instructions.
  • After receiving a management instruction, the backup device forwards it to the host device through the I3C bus so that the host device can manage the slave device accordingly.
  • The dual-machine hot backup system of the present invention not only improves the efficiency of communication inside the system but also avoids the reliability problems caused by relying on an external LAN and external switches. While ensuring data consistency between the main and backup devices, it also makes some use of the resources of the backup device, further improving the safety and operating efficiency of the entire multi-node server system.
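The fragmented I3C transfer of a firmware upgrade package described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `Fragment` type, the function names, the fragment size, and the choice of CRC-32 as the check code are all hypothetical. The source only states that each fragment carries a number and a check code and that a fragment failing verification is resent.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Fragment:
    index: int      # sequence number of the fragment
    payload: bytes  # slice of the firmware upgrade package
    check: int      # check code; CRC-32 is an assumption


def split_package(package: bytes, size: int) -> List[Fragment]:
    """CMC side: cut the package into numbered, checksummed fragments."""
    fragments = []
    for index, offset in enumerate(range(0, len(package), size)):
        payload = package[offset:offset + size]
        fragments.append(Fragment(index, payload, zlib.crc32(payload)))
    return fragments


def receive_package(fragments: List[Fragment],
                    resend: Callable[[int], Fragment]) -> bytes:
    """BMC side: verify each fragment, request a resend on failure, reassemble."""
    stored: Dict[int, bytes] = {}
    for fragment in fragments:
        while zlib.crc32(fragment.payload) != fragment.check:
            # Hypothetical resend request sent back to the CMC.
            fragment = resend(fragment.index)
        stored[fragment.index] = fragment.payload
    return b"".join(stored[i] for i in sorted(stored))
```

A real system would probably bound the number of resend attempts and pause between fragments so that parameter synchronization can use the bus in the gaps; both are left out here for brevity.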

Abstract

A dual-computer hot standby system, comprising: a host device, a standby device which is in communication connection with the host device by means of a first I3C bus, and at least one slave device which is in communication connection with the host device and the standby device by means of a second I3C bus; wherein the host device is configured to respond to the startup of the dual-computer hot standby system to collect parameters of the slave device, store the mapping of the parameters into a database, and synchronize the mapping of the parameters to the standby device by means of the first I3C bus, and the host device is configured to manage the slave device by means of the second I3C bus on the basis of a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the standby device by means of the first I3C bus; the standby device is configured to forward the received management instruction to the host device by means of the first I3C bus in response to the reception of the management instruction from the outside. By utilizing the system of the present invention, the problem of insufficient reliability caused by excessive dependence on an LAN is solved, and the safety and the operation efficiency of the whole server system are improved.

Description

A dual-machine hot backup system
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 18, 2019, with application number 201910995329.9 and the invention title "A dual-machine hot backup system", the entire content of which is incorporated into this application by reference.
Technical field
The invention relates to the field of server technology. The invention further relates to a dual-machine hot backup system.
Background art
With the rapid development of AI and Internet technology, a single server unit can no longer meet the processing needs of massive data. Massively parallel computer system architectures, which offer strong scalability, high computing power, and support for unified management, increasingly match the demand for server products in the era of big data. As a result, current server systems are growing physically larger, their module composition is becoming more complex, and their degree of integration is increasing. As the functions and node counts of servers grow, monitoring and management become more challenging, and the requirements for system redundancy become higher.
In existing redundant monitoring and management systems for multi-node, large-scale, high-density servers, communication buses such as I2C or serial ports are usually used inside the system. Their rates are low and cannot meet the requirements of interaction and synchronization, especially between master and backup. Data synchronization between master and backup therefore has to rely heavily on an external LAN and switches. The biggest disadvantage of external network cables is the reliability risk: a cable may have poor contact or be disconnected by a person, or the switch may even restart. Once the LAN fails, effective communication between master and backup is interrupted, and the management system may become confused and impossible to control remotely. In addition, the master and backup devices are two independently running entities, and their data will inevitably diverge once they run one after the other or at the same time. Even if the LAN returns to normal after a period of time, the two monitors do not know whose data is the latest or who should synchronize to whom, so a split-brain situation can occur.
For this reason, some schemes add a secondary verification over a serial port when detecting a host device failure, so that the backup machine can still determine the true running state of the host after the LAN is interrupted. However, when the host's LAN is disconnected, the backup machine can still see through the serial port that the host is alive; it then does not take over the host's work and stays in standby. At this point, master and backup cannot synchronize data through the LAN, and the user cannot remotely control or access the host device through the LAN. Even if the user logs in to the slave device through the LAN, the data queried is not the latest.
Therefore, an improvement is needed for the problem that the I2C bus, serial ports, and similar links used in current multi-node server solutions have low communication rates and cannot meet the requirements of interaction and synchronization between master and backup, so that data synchronization has to rely heavily on an external LAN and switches. A mechanism for establishing safe and reliable internal communication between master and backup is proposed.
Summary of the invention
In one aspect, based on the above objective, the present invention proposes a dual-machine hot backup system, wherein the system includes:
a host device;
a backup device, which is communicatively connected to the host device through a first I3C bus;
at least one slave device, which is communicatively connected to the host device and the backup device through a second I3C bus;
wherein the host device is configured to, in response to the startup of the dual-machine hot backup system, collect the parameters of the slave device, store the mapping of the parameters in a database, and synchronize it to the backup device through the first I3C bus; and is configured to manage the slave device through the second I3C bus based on management instructions, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device through the first I3C bus;
the backup device is configured to, in response to receiving a management instruction from outside, forward the received management instruction to the host device through the first I3C bus.
According to an embodiment of the dual-machine hot backup system of the present invention, the backup device is further configured to: in response to receiving an urgent management instruction from outside, forcibly and temporarily occupy the second I3C bus to manage the slave device, generate a mapping of the parameters according to the changed parameters, and synchronize the urgent management instruction and the mapping of the changed parameters to the host device through the first I3C bus.
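The two paths an externally received instruction can take on the backup device (forwarding by default, direct handling when urgent) can be sketched as below. This is only an illustrative sketch: the `urgent` field, the dictionary-shaped instruction, and the three callbacks standing in for bus operations are assumptions introduced here, since the source does not define how an instruction is marked as urgent.

```python
from typing import Callable, Dict


def handle_on_backup(instruction: Dict,
                     forward_to_host: Callable[[Dict], None],
                     manage_slave: Callable[[Dict], Dict],
                     sync_to_host: Callable[[Dict, Dict], None]) -> str:
    """Route a management instruction that arrived at the backup device."""
    if not instruction.get("urgent", False):
        # Normal case: pass the instruction on over the first I3C bus
        # and let the host device manage the slave device.
        forward_to_host(instruction)
        return "forwarded"
    # Urgent case: the backup temporarily occupies the second I3C bus,
    # manages the slave itself, and collects the changed parameters.
    changed = manage_slave(instruction)
    # Report both the instruction and the changed mapping to the host.
    sync_to_host(instruction, changed)
    return "handled-locally"
```

Keeping the bus operations behind callbacks makes the routing rule easy to exercise without hardware; on a real controller they would wrap the I3C driver.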
According to an embodiment of the dual-machine hot backup system of the present invention, the host device is further configured to: in response to the host device entering upgrade mode and/or its resource occupation exceeding a threshold, notify the backup device through the first I3C bus to temporarily take over the management of the slave device; and in response to the host device exiting upgrade mode and/or its resource occupation no longer exceeding the threshold, notify the backup device through the first I3C bus to stop taking over the management of the slave device.
According to an embodiment of the dual-machine hot backup system of the present invention, the backup device is further configured to: in response to receiving the temporary takeover notification from the host device, manage the slave device through the second I3C bus and generate a mapping of the parameters according to the changed parameters; and in response to receiving the stop-takeover notification from the host device, stop managing the slave device and synchronize the mapping of the changed parameters to the host device through the first I3C bus.
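The host-side decision about when to hand management over and when to take it back can be written as a small pure function. The sketch below is an assumption-laden illustration: the notice names, the numeric load value, and the `backup_in_charge` flag are hypothetical, and the "and/or" of the text is read here as a plain logical or.

```python
from typing import Optional


def takeover_notice(upgrading: bool, load: float, threshold: float,
                    backup_in_charge: bool) -> Optional[str]:
    """Host side: decide which notice, if any, to send over the first I3C bus."""
    host_busy = upgrading or load > threshold
    if host_busy and not backup_in_charge:
        return "take_over"        # ask the backup to manage the slaves for now
    if not host_busy and backup_in_charge:
        return "stop_take_over"   # host is free again and resumes management
    return None                   # nothing changes
```

A practical design might add hysteresis around the threshold so that a load hovering near it does not cause repeated handovers.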
According to an embodiment of the dual-machine hot backup system of the present invention, the backup device is further configured to: in response to the startup of the dual-machine hot backup system, actively initiate a clock synchronization request to the host device.
According to an embodiment of the dual-machine hot backup system of the present invention, the host device and the backup device are further configured to: in response to either the host device or the backup device initiating synchronization of a parameter mapping, have the initiator generate a synchronization packet containing the original data, the modified data, and the modification time, and send it to the other party.
According to an embodiment of the dual-machine hot backup system of the present invention, the host device and the backup device are further configured to:
in response to the host device and/or the backup device receiving the synchronization packet sent by the other party, compare the original data in it with the local data;
in response to the original data being the same as the local data, modify the local data according to the modified data;
in response to the original data being different from the local data, compare the modification time in the received synchronization packet with the local modification time, and synchronize to the modified data with the newer modification time.
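The receive-side rule above (apply directly when the original data matches, otherwise let the newer modification time win) can be sketched as follows. The `SyncPacket` and `LocalEntry` types, the string-valued parameters, and the return labels are assumptions made for illustration. Two cases the text does not cover are also resolved by assumption: an entry missing locally is simply accepted, and on equal timestamps the local value is kept.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SyncPacket:
    key: str            # which parameter the packet refers to
    original: str       # value before the sender's modification
    modified: str       # value after the sender's modification
    modified_at: float  # sender's modification time


@dataclass
class LocalEntry:
    value: str
    modified_at: float


def apply_sync(local: Dict[str, LocalEntry], packet: SyncPacket) -> str:
    """Merge one received synchronization packet into the local database."""
    entry = local.get(packet.key)
    if entry is None or entry.value == packet.original:
        # Local data is unchanged (or absent): take the modification directly.
        local[packet.key] = LocalEntry(packet.modified, packet.modified_at)
        return "applied"
    if packet.modified_at > entry.modified_at:
        # Both sides changed the value; the remote change is newer.
        local[packet.key] = LocalEntry(packet.modified, packet.modified_at)
        return "took-remote"
    # Both sides changed the value; the local change is newer (or equally old).
    return "kept-local"
```

When the result is `"kept-local"`, the receiver would presumably send its own value back so that both sides converge. The comparison is only meaningful if the two clocks agree, which is what the clock synchronization request at startup provides.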
According to an embodiment of the dual-machine hot backup system of the present invention, the host device and the backup device are further configured to detect each other's running state through a two-way physical IO heartbeat detection mechanism.
According to an embodiment of the dual-machine hot backup system of the present invention, the host device and the backup device are further configured to: in response to either party detecting that the other has failed, have the non-faulty party record the fault in the log, restart and reset the faulty party through the external dual reset mechanism, and synchronize the clock and database to the faulty party.
According to an embodiment of the dual-machine hot backup system of the present invention, the host device and the backup device are further configured to: in response to the non-faulty party being unable to restart and reset the faulty party through the external dual reset mechanism and/or the restart and reset failing, have the non-faulty party raise an alarm to notify operation and maintenance personnel.
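The detect, log, reset, and alarm sequence of these three embodiments can be sketched as a small monitor that each device runs against its peer. Everything concrete here is an assumption: the class and method names, the one-second timeout (given only as an example elsewhere in the description), the limit of reset attempts, and the `reset_peer` callback standing in for the reset signal line. Time is passed in explicitly so the logic can be followed without real hardware.

```python
from typing import Callable, List, Tuple


class HeartbeatMonitor:
    """Watches the peer's heartbeat pulses and reacts when they stop."""

    def __init__(self, timeout_s: float = 1.0, max_reset_attempts: int = 3):
        self.timeout_s = timeout_s
        self.max_reset_attempts = max_reset_attempts
        self.last_pulse = 0.0
        self.reset_attempts = 0
        self.log: List[Tuple[float, str]] = []

    def on_pulse(self, now: float) -> None:
        """Called whenever a pulse is seen on the peer's heartbeat IO."""
        self.last_pulse = now
        self.reset_attempts = 0

    def poll(self, now: float, reset_peer: Callable[[], None]) -> str:
        """Check the peer; reset it on timeout, alarm if resets do not help."""
        if now - self.last_pulse <= self.timeout_s:
            return "ok"
        if self.reset_attempts >= self.max_reset_attempts:
            self.log.append((now, "alarm: peer not recovered by reset"))
            return "alarm"
        self.reset_attempts += 1
        self.log.append((now, "peer heartbeat lost, reset issued"))
        reset_peer()
        # Give the restarting peer a fresh timeout window.
        self.last_pulse = now
        return "reset"
```

After a successful reset, the non-faulty side would go on to synchronize clock and database to the recovered peer, which is outside this sketch.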
By adopting the above technical solution, the present invention has at least the following beneficial effects. The I2C bus, serial ports, and similar links used in current multi-node server solutions have low communication rates and cannot meet the requirements of interaction and synchronization between master and backup, so data synchronization has to rely heavily on an external LAN and switches. To address this problem, the I3C bus is used to establish the internal communication architecture of a dual-machine hot backup system for multi-node servers, with two I3C buses carrying the master-backup and master-slave communication respectively. When the system starts, the host device collects the status of the slave devices through the I3C bus, establishes a mapping, and stores it in the database. When, for example, a user needs the operating parameters of a slave device, the host device therefore does not have to fetch them from the slave device in response to the user's instruction; it can return the information recorded in the database directly. Both at system startup and whenever the host device manages a slave device, the host device synchronizes the corresponding information to the backup device to ensure data consistency between the main and backup devices. In addition, the backup device can also be accessed directly from outside, for example by a user, to issue instructions. After receiving a management instruction, the backup device forwards it to the host device through the I3C bus so that the host device can manage the slave device accordingly. The dual-machine hot backup system of the present invention not only improves the efficiency of communication inside the system but also avoids the reliability problems caused by relying on an external LAN and external switches. While ensuring data consistency between the main and backup devices, it also makes some use of the resources of the backup device, further improving the safety and operating efficiency of the entire multi-node server system.
The present invention provides aspects of the embodiments, which should not be used to limit the protection scope of the present invention. Other implementations are conceivable based on the technology described herein, as will be apparent to those of ordinary skill in the art after studying the following drawings and detailed description, and these implementations are intended to be included within the scope of this application.
The embodiments of the present invention are explained and described in more detail below with reference to the accompanying drawings, but they should not be construed as limiting the present invention.
Description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the prior art and the embodiments are briefly introduced below. The components in the drawings are not necessarily drawn to scale; related elements may be omitted, and in some cases proportions may have been exaggerated, in order to emphasize and clearly illustrate the novel features described herein. In addition, as is known in the art, structural positions may be arranged differently.
Fig. 1 shows a schematic diagram of an embodiment of a dual-machine hot backup system according to the present invention;
Fig. 2 shows a schematic diagram of another embodiment of the dual-machine hot backup system according to the present invention;
Fig. 3 shows a schematic structural diagram of an embodiment of the host device and the backup device of the dual-machine hot backup system according to the present invention.
Detailed description
Although the present invention may be implemented in various forms, some exemplary and non-limiting embodiments are shown in the drawings and described below. It should be understood that the present disclosure is to be regarded as an illustration of the present invention and is not intended to limit the invention to the specific embodiments described.
Fig. 1 shows a schematic diagram of an embodiment of a dual-machine hot backup system 100 according to the present invention. The dual-machine hot backup system 100 is used in particular for the control and management of a multi-node server system. In the embodiment shown in Fig. 1, the dual-machine hot backup system 100 includes at least:
a host device 10;
a backup device 20, which is communicatively connected to the host device 10 through a first I3C bus 30;
at least one slave device 40, which is communicatively connected to the host device 10 and the backup device 20 through a second I3C bus 50;
wherein the host device 10 is configured to, in response to the startup of the dual-machine hot backup system 100, collect the parameters of the slave device 40, store the mapping of the parameters in a database, and synchronize it to the backup device 20 through the first I3C bus 30; and is configured to manage the slave device 40 through the second I3C bus 50 based on management instructions, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device 20 through the first I3C bus 30;
the backup device 20 is configured to, in response to receiving a management instruction from outside, forward the received management instruction to the host device 10 through the first I3C bus 30.
Since the dual-machine hot backup system 100 according to the present invention is used in particular for the control and management of a multi-node server system, the host device 10 and the backup device 20 are preferably CMC devices (Chassis Management Controller). A CMC is similar in function to a BMC: it manages and controls the whole machine in multi-node server systems such as blade servers, and can send commands to each node for management. The slave device 40 is preferably a BMC (Baseboard Management Controller) in the multi-node server system. A BMC can perform operations on the machine such as upgrading firmware and inspecting machine devices even while the machine is powered off.
In addition, the dual-machine hot backup system 100 according to the present invention uses I3C buses for its internal communication connections. I3C is a two-wire serial communication bus that combines key attributes of the I2C and SPI buses. It is compatible with the I2C protocol and adds features such as multiple masters, slave soft interrupts, dynamic assignment of slave addresses, and hot-plug support, with speeds of up to 33 Mbps; it is commonly used to connect sensors to an application processor. Further, in order to separate master-backup communication from master-slave management, avoiding mutual interference and reducing bus pressure, the first I3C bus 30 is used between master and backup (10 and 20), and the second I3C bus 50 is used between master and slave (10 and 40, as well as 20 and 40).
Since the I3C bus is compatible with the I2C bus protocol, the dual-machine hot backup system 100 according to the present invention can perform all the functions of a system originally built on the I2C bus, and it adds new functions on top of this. First, the host device 10 is configured to, in response to the startup of the dual-machine hot backup system 100, collect the parameters of the slave device 40, store the mapping of the parameters in the database, and synchronize it to the backup device 20 through the first I3C bus 30. That is, when the dual-machine hot backup system 100 is first powered on, the CMC host device 10 checks the BMC slave device 40 on each node and builds a parameter mapping database from parameters such as the node device serial number and the various operating parameters. The CMC host device 10 thus holds a mapped copy of the parameters of the BMC slave devices of all nodes. When a user asks for the parameters of a slave device 40, accessing the CMC host device 10 is enough to obtain the corresponding parameters of any slave device 40; the CMC host device 10 does not have to read them from the slave device 40 at query time, which speeds up the response to user instructions. In addition, the host device 10 synchronizes the database to the backup device 20 through the first I3C bus 30 to ensure data consistency between the main and backup devices.
Another added function is that the host device 10 is configured to manage the slave devices 40 through the second I3C bus 50 based on management instructions, generate a mapping from the parameters that changed, and synchronize the mapping of the changed parameters to the backup device 20 through the first I3C bus 30. That is, when, for example, a user accesses the host device 10 from outside and issues a management instruction, and/or the host device 10 generates a management instruction according to a preset control strategy, the host device 10 carries out its management and control of the slave devices 40 through the second I3C bus 50. Moreover, the host device 10 maps the corresponding parameters into its own database and synchronizes the parameter mapping to the backup device 20 in real time through the first I3C bus 30, which ensures real-time data updates and host-backup data synchronization whenever parameters of a slave device 40 change during management.
In addition, the dual-machine hot backup system 100 according to the present invention also makes some use of the resources of the backup device 20: the backup device 20 may likewise be accessed directly from outside, for example by a user. In the concept of the present invention, the host device 10 and the backup device 20 each have their own communication address. Therefore, when a user accesses the backup device 20 directly from outside by its communication address and issues a management instruction, the backup device 20, upon receiving the instruction, forwards it to the host device 10 through the first I3C bus 30 so that the host device 10 can manage the slave devices 40 accordingly.
In one or more embodiments of the dual-machine hot backup system 100 of the present invention, the backup device 20 is further configured to: in response to receiving an urgent management instruction from outside, forcibly and temporarily occupy the second I3C bus 50 to manage the slave devices 40, generate a parameter mapping from the parameters that changed, and synchronize the urgent management instruction and the mapping of the changed parameters to the host device 10 through the first I3C bus 30. That is, to make further use of the resources of the backup device 20, in these embodiments, when a user accesses the backup device 20 directly from outside by its communication address and issues a management instruction that is a specific urgent management instruction, the backup device 20 forcibly and temporarily occupies the second I3C bus 50, manages the slave devices 40 according to that instruction, and synchronizes the corresponding information to the host device 10. The specific urgent management instructions referred to here are generally management instructions that are highly time-critical, and/or are so closely tied to the operational safety of the system that they must be handled immediately, and/or that the user explicitly requires to be handled immediately. Because the backup device 20 forcibly and temporarily occupies the second I3C bus 50 to manage the slave devices 40, the step of forwarding the management instruction to the host device 10 through the first I3C bus 30 is eliminated, which improves the response speed.
In one or more embodiments of the dual-machine hot backup system of the present invention, the host device 10 is further configured to: in response to the host device 10 entering an upgrade mode and/or its resource usage exceeding a threshold, notify the backup device 20 through the first I3C bus 30 to temporarily take over management of the slave devices 40; and, in response to the host device 10 exiting the upgrade mode and/or its resource usage no longer exceeding the threshold, notify the backup device 20 through the first I3C bus 30 to stop the takeover. In a traditional master-slave multi-node server control system, when the host device needs a firmware upgrade and/or its system resource usage exceeds a threshold, the host device, having limited memory, cannot continue to manage the slave devices. The usual approach is for operation and maintenance personnel to temporarily shut down some functions and to restart them once the firmware upgrade has finished and/or the resource usage has eased. The drawback is obvious: in these situations the host device's management and control of the slave devices is interrupted, and continuous control is not possible. The dual-machine hot backup system 100 according to the present invention therefore makes further use of the resources of the backup device in these situations. When the host device 10 enters the upgrade mode and/or its resource usage exceeds the threshold, the host device 10 actively notifies the backup device 20 through the first I3C bus 30 to temporarily take over management of the slave devices 40; and when the host device 10 finishes upgrading and exits the upgrade mode, and/or its resource usage no longer exceeds the threshold, so that it can resume managing and controlling the slave devices 40, the host device 10 notifies the backup device 20 through the first I3C bus 30 to stop the takeover.
In some embodiments of the dual-machine hot backup system 100 of the present invention, the backup device 20 is further configured to: in response to receiving the temporary-takeover notification from the host device 10, manage the slave devices 40 through the second I3C bus 50 and generate a parameter mapping from the relevant parameters that changed; and, in response to receiving the stop-takeover notification from the host device 10, stop managing the slave devices and synchronize the mapping of the changed parameters to the host device 10 through the first I3C bus 30. The backup device 20 serves as the backup of the host device 10 and remains on standby for long periods. In a traditional dual-machine hot backup system, the backup device 20 becomes the host and keeps the system running only once the host device 10 fails. In the dual-machine hot backup system of the present invention, in addition to that case, whatever state the host device 10 is in, as soon as the backup device 20 receives the temporary-takeover notification from the host device 10, it temporarily takes over management of the slave devices 40, manages them through the second I3C bus 50, and generates a parameter mapping from the relevant parameters that changed. Likewise, as soon as the backup device 20 receives the stop-takeover notification from the host device 10, it stops managing the slave devices 40, hands management back to the host device 10, and synchronizes to the host device 10, through the first I3C bus 30, the mapping of the parameters affected by management actions during its period of management. This ensures that the host device can manage the slave devices 40 accurately and that the host and backup data remain consistent.
In several embodiments of the dual-machine hot backup system 100 of the present invention, the backup device 20 is further configured to: in response to startup of the dual-machine hot backup system 100, actively initiate a clock synchronization request to the host device 10. In a dual-machine hot backup architecture, the system time of the host and backup devices plays an important role in many situations and functions, so clock synchronization between host and backup must be ensured. In the embodiments of the present invention, the clock synchronization strategy is that, after the dual-machine hot backup system 100 has started, that is, after the host device 10 and the backup device 20 have been switched on, the backup device 20 actively initiates a clock synchronization request to the host device 10 to ensure that the system times of the two devices agree.
In a further embodiment of the dual-machine hot backup system 100 of the present invention, the host device 10 and the backup device 20 are further configured to: in response to either the host device 10 or the backup device 20 initiating synchronization of a parameter mapping, have the initiator (10 or 20) generate synchronization package data comprising the original data, the modified data, and the modification time, and send it to the other party (20 or 10). The host device 10 and the backup device 20 use the same data storage format. When data changes and host-backup synchronization is required, the principle is that the party whose data changed is responsible for initiating the synchronization. That is, when either the host device 10 or the backup device 20 initiates synchronization of a parameter mapping because its data has changed, the initiator (host device 10 or backup device 20) packs "original data + modified data + modification time" into synchronization package data and sends it to the other party (backup device 20 or host device 10) so that the other party can update its data.
In several embodiments of the dual-machine hot backup system 100 of the present invention, the host device 10 and the backup device 20 are further configured to:
in response to the host device 10 and/or the backup device 20 receiving synchronization package data sent by the other party, compare the original data therein with the local data;
in response to the original data being the same as the local data, modify the local data according to the modified data;
in response to the original data differing from the local data, compare the modification time in the received synchronization package data with the local modification time, and synchronize on the modified data with the newer modification time.
Because both the host device 10 and the backup device 20 in the dual-machine hot backup system 100 of the present invention may be accessed directly from outside, data may change on both devices at the same time, in which case an arbitration mechanism is needed to determine the finally valid data. In the embodiments above, this arbitration mechanism works as follows. First, when the host device 10 and/or the backup device 20 receives synchronization package data sent by the other party, it parses the package, extracts the original data, the modified data, and the modification time, and compares the extracted original data with its local data. If the original data is the same as the local data, the local data has not changed, so the local data is simply updated with the extracted modified data. If the original data differs from the local data, the local data has also been modified, so it must be decided whether the other party's modification or the local modification is the finally valid one. To do so, the modification time extracted from the received synchronization package data is compared with the modification time in the synchronization package data generated when the local data changed, and the modified data with the newer modification time is taken as the finally valid data for host-backup synchronization. Specifically, if the modification time extracted from the received package is newer, the modified data extracted from the received package is the finally valid data, and the local data is updated with it to bring host and backup into agreement. If the modification time in the locally generated package is newer, the local data is the finally valid data, and no local update is made. In that case, if the locally generated synchronization package data has not yet been sent to the other party, it is sent immediately; if it has already been sent, no further processing is needed.
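The arbitration rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `SyncPacket`, `Replica`, `modify`, and `receive` are hypothetical and do not come from the patent, each parameter is modeled as a single key/value pair, and the case of two identical modification times, which the description does not address, is resolved here by keeping the local value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncPacket:
    """Hypothetical model of 'original data + modified data + modification time'."""
    key: str
    original: Optional[str]
    modified: str
    mtime: float


class Replica:
    """Hypothetical model of one CMC's parameter mapping database."""

    def __init__(self, data):
        self.data = dict(data)
        self.last_packet = {}  # key -> packet generated by the latest local change

    def modify(self, key, value, now):
        """Apply a local change and return the packet the initiator would send."""
        pkt = SyncPacket(key, self.data.get(key), value, now)
        self.data[key] = value
        self.last_packet[key] = pkt
        return pkt

    def receive(self, pkt):
        """Arbitrate an incoming packet against the local data."""
        if self.data.get(pkt.key) == pkt.original:
            # Local data is unchanged: take the peer's modification directly.
            self.data[pkt.key] = pkt.modified
            return "applied"
        mine = self.last_packet.get(pkt.key)
        if mine is None or pkt.mtime > mine.mtime:
            # Both sides changed and the peer's change is newer.
            self.data[pkt.key] = pkt.modified
            return "applied-newer"
        # Local change is newer (or equally new, an assumption): keep it.
        return "kept-local"


host = Replica({"fan": "auto"})
backup = Replica({"fan": "auto"})
p_host = host.modify("fan", "high", now=10.0)
p_backup = backup.modify("fan", "low", now=12.0)
backup_result = backup.receive(p_host)   # backup's own change is newer
host_result = host.receive(p_backup)     # host adopts the newer change
```

After the exchange both replicas hold the value written last, which is why the description requires the two clocks to be synchronized at startup.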
FIG. 2 shows a schematic diagram of a further embodiment of the dual-machine hot backup system 100' according to the present invention, in which the host device 10 and the backup device 20 are further configured to detect each other's operating state through a bidirectional physical IO heartbeat detection mechanism 60. Compared with a traditional dual-machine hot backup system, these embodiments add a bidirectional physical IO heartbeat detection mechanism between the host device 10 and the backup device 20: not only does the backup device detect whether the host device has failed, but the host device also checks the operating state of the backup device in real time. This avoids the situation in which the backup device fails before the host device without the system being aware of it, so that when the host device later fails and the backup device is required to take over management, the system loses management entirely.
In the embodiment of the dual-machine hot backup system 100' according to the present invention shown in FIG. 2, the host device 10 and the backup device 20 are further configured to: in response to either the host device 10 or the backup device 20 detecting that the other has failed, have the non-faulty party (10 or 20) record the fault of the faulty party (20 or 10) in a log, restart and reset the faulty party (20 or 10) through an external dual reset mechanism 70, and synchronize the clock and database of the faulty party (20 or 10). That is, as soon as either the host device 10 or the backup device 20 detects that the other has failed, the non-faulty party (host device 10 or backup device 20) records the fault of the faulty party (backup device 20 or host device 10) in its local log. The non-faulty party also restarts and resets the faulty party through the external dual reset mechanism 70, so that both devices are kept online together and redundancy is not lost. If the restart and reset succeed, the non-faulty party synchronizes the clock and database of the faulty party.
In a further embodiment of the dual-machine hot backup system 100' of the present invention, the host device 10 and the backup device 20 are further configured to: in response to the non-faulty party (10 or 20) being unable to restart and reset the faulty party (20 or 10) through the external dual reset mechanism 70 and/or the restart and reset failing, have the non-faulty party (10 or 20) issue an alarm to notify operation and maintenance personnel. That is, in some situations the non-faulty party (host device 10 or backup device 20) cannot restart and reset the faulty party (backup device 20 or host device 10) through the external dual reset mechanism 70, or attempts the restart and reset without success. The strategy in that case is for the non-faulty party to issue an alarm so that operation and maintenance personnel can deal with the faulty party, clear the fault, and keep the dual-machine hot backup effective.
FIG. 3 shows a schematic structural diagram of an embodiment of the host device and the backup device of the dual-machine hot backup system according to the present invention. In these embodiments, the case in which the host device 10 and the backup device 20 are both CMC devices and the slave devices 40 are BMC devices is preferably taken as an example, and the host device 10 and the backup device 20 have the same structure. The specific composition and function of each module in the host and backup CMC devices are described further below.
- System module:
The system module on the CMC device is the state machine of the entire system and its core scheduling module. Scheduling among the modules, state decisions, and data flow are all handled by the system module.
- Data module:
After the CMC host is powered on, it checks the BMC device on each node and builds a parameter mapping database from parameters such as the node device serial number and the operating state parameters. The CMC keeps a mapped copy of the parameters of all node BMCs. The user therefore only needs to access the CMC device to obtain the data of all node BMCs, which avoids the CMC having to read the BMC parameters node by node at query time and speeds up the response to the user.
- Synchronization module:
The present invention connects the two CMC devices with an I3C bus, which carries data transmission and synchronization between the two devices so that their data is synchronized in real time. The I3C bus offers a rate of 33 Mbps and a soft interrupt mechanism, together with verification and fault-tolerance mechanisms.
In the concept of the present invention, synchronization follows these principles: first, the clocks are synchronized; second, the party whose data changes is responsible for initiating synchronization; third, if both copies of the data change at the same time, the most recent setting prevails.
In the present invention, because both CMCs can be accessed by the user, the user may modify data on both CMCs while synchronization is in progress. To prevent the CMCs from being unable to tell the order of the modifications, a modification time is added to the synchronization data packet.
1. Clock synchronization: after power-on, the backup actively initiates a time synchronization request to the host to ensure that the time on the two devices agrees. The synchronized time can also be used to determine the finally valid modification when host and backup modify data at the same time.
2. Data modification synchronization: the two systems use the same data storage format. When the stored data of one CMC changes, its data synchronization module packs "original data + modified data + modification time" and sends it to the other CMC. On receiving the synchronization data, the other CMC first checks whether the original data is the same as its local data. If it is, the local data has not changed, and the modified data is applied directly. If the original data differs from the local data, the local data has also been modified; the modification times are then compared, and synchronization proceeds on the modified data with the newer time, which keeps the data consistent.
3. Data synchronization after recovery from a device fault: the system uses a bidirectional heartbeat detection mechanism. When either CMC device fails, the other CMC device detects it, records it in the log, and forcibly resets the faulty CMC device to restore normal operation. Take a fault of CMC0 (host device 10) as an example. When CMC0 fails, CMC1 (backup device 20) promptly finds that it can no longer detect the heartbeat signal of CMC0, determines that CMC0 has failed, records the fault in the log, and forcibly resets CMC0 through the reset signal IO line. Once the faulty CMC0 has restarted and recovered, it immediately asks CMC1 for its working mode and whether a forced data synchronization is required. CMC1 replies with its working mode and confirms that a forced data synchronization is required. CMC1 then sends all of the synchronization data, together with a forced-update flag, to CMC0, and CMC0 updates its data until the two are fully consistent. CMC1 then also triggers the mode switching process.
4. If both CMCs have just been powered on, both are in idle mode, and the backup device 20 actively synchronizes data from the host device 10.
- Detection module:
The present invention uses a bidirectional heartbeat detection mechanism for fault detection, which ensures that a problem on either of the two devices is discovered by the other, so that the faulty device can be reset promptly and the backup state restored, with the event logged and operation and maintenance personnel informed. The detection module is responsible for generating the heartbeat signal of its own device and for monitoring the heartbeat signal of the peer device.
1. The detection module collects the working state of the local device and any abnormal suspension of its threads. While the local device is working normally, the detection module produces a continuous pulse signal on the heartbeat signal IO. When an abnormality such as a suspended local thread is detected, the heartbeat signal output stops. When the device crashes or the program runs away, the detection module dies as well and naturally cannot output the signal.
2. Monitoring the peer's heartbeat signal: the detection module monitors in real time whether a heartbeat signal is present on the heartbeat signal IO of the peer CMC device. When no pulse from the peer has been detected for more than, for example, 1 second, the peer device is provisionally judged to be abnormal.
3. Confirming the peer's heartbeat signal: the detection module immediately queries the peer over the bus. If the peer does not answer, the peer device is judged to have failed. If the peer answers, the heartbeat signal is checked again; if it has recovered, the peer is considered to have been in a false-death state and can continue working. If it still cannot be detected, the peer's detection module is abnormal, the peer device is likewise judged to have failed, and the peer is restarted after the data has been synchronized.
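The three detection steps above amount to a small decision procedure, sketched below in Python. The function `classify_peer`, the `PeerState` names, and the two callables standing in for the bus query and the heartbeat re-check are hypothetical illustrations; only the 1-second example threshold and the order of the checks come from the description.

```python
from enum import Enum


class PeerState(Enum):
    HEALTHY = "healthy"
    FALSE_DEATH = "false death, may continue working"
    FAULTY = "faulty, reset required"


HEARTBEAT_TIMEOUT_S = 1.0  # example value given in the description


def classify_peer(silence_s, bus_query, recheck_heartbeat):
    """Classify the peer CMC from its heartbeat silence and two follow-up checks.

    silence_s: seconds since the last pulse was seen on the peer's heartbeat IO.
    bus_query: callable returning True if the peer answers a query over the bus.
    recheck_heartbeat: callable returning True if pulses are seen again.
    """
    if silence_s <= HEARTBEAT_TIMEOUT_S:
        return PeerState.HEALTHY
    # Preliminary suspicion: confirm over the bus before declaring a fault.
    if not bus_query():
        return PeerState.FAULTY
    # The peer answered; check whether its heartbeat output has recovered.
    if recheck_heartbeat():
        return PeerState.FALSE_DEATH
    # Peer answers but emits no heartbeat: its detection module is abnormal.
    return PeerState.FAULTY


no_answer = classify_peer(1.5, lambda: False, lambda: False)
recovered = classify_peer(1.5, lambda: True, lambda: True)
```

Passing the checks in as callables keeps the sketch independent of any particular bus or GPIO driver, which the description does not specify.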
- Reset module:
The fault handling mechanism of the reset module is as follows: when the fault detection module determines that the peer device has failed, the local CMC device immediately takes over all work and, in addition, restarts the faulty device through the reset signal line. The specific mode switching, the data synchronization after fault recovery, and the subsequent switching of working modes have all been described above and are not repeated here.
The local fault recovery module consists mainly of a watchdog and an external reset signal line. If an abnormality during operation causes the system to stop feeding the watchdog, the watchdog times out after a period of time and restarts the system. If the peer device detects the fault of the local device first, the peer resets the local device through the reset signal line. The watchdog delay is generally long (preferably 4 seconds in this system design, adjustable according to the actual situation), so the peer CMC will normally detect the fault and reset the device before the watchdog does. The watchdog provides the reset when external interference causes both CMCs to fail at the same time. If the watchdog timeout is set short enough, or the heartbeat detection mechanism is slow, the watchdog may also be the first to reset the device.
- Alarm module:
In addition to logging and raising alarms for abnormal conditions of its own functions and services during normal operation, the CMC device also records and raises alarms for faults of the other CMC device, so that maintenance personnel can learn the operating state of both CMC devices promptly.
When the fault detection module determines that the peer CMC device has failed, the local device immediately records the event in the log and raises a general alarm through the LEDs and by reporting to the remote server. If the faulty device fails twice or more within one day, a severe-level alarm is reported. If the faulty device cannot be brought back into operation by a reset, the CMC device goes on to report a fatal-level fault alarm. This alarm persists: even if the other CMC device later recovers service through a reset, the alarm is not cleared unless maintenance personnel clear it manually.
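The escalation rule in this paragraph can be written as a small function. The sketch below is an assumption-laden illustration: the function name `alarm_level`, the string labels, and the reading of "within one day" as a rolling 24-hour window ending at the current fault are not specified in the description.

```python
SECONDS_PER_DAY = 24 * 3600


def alarm_level(fault_times, now, reset_recovered):
    """Return the alarm level for the peer CMC after its latest fault.

    fault_times: timestamps (seconds) of the peer's faults, including the latest.
    now: current timestamp in seconds.
    reset_recovered: False if the peer could not be recovered by a reset.
    """
    if not reset_recovered:
        return "fatal"
    # Assumption: 'within one day' means the 24 hours preceding 'now'.
    recent = [t for t in fault_times if now - t < SECONDS_PER_DAY]
    if len(recent) >= 2:
        return "severe"
    return "general"


first_fault = alarm_level([100.0], now=100.0, reset_recovered=True)
second_fault_same_day = alarm_level([100.0, 5000.0], now=5000.0, reset_recovered=True)
```

The persistence of the fatal alarm until it is cleared manually would be handled by whatever stores the alarm state, which is outside the scope of this sketch.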
- 管理模块:-Management module:
The CMC device is responsible for collecting information from each node of the whole cabinet/chassis, regulating the cooling fans, and controlling and managing the front-panel buttons and indicator display. The management module specifically includes at least the following:
1. The CMC device no longer collects data from each node by periodic polling; instead, the BMC of each node actively reports it.
2. When a parameter on a node's BMC changes, the BMC first completes its own node management work and then immediately initiates a communication request to the CMC, synchronizing the parameter into the CMC's mapping area so that the CMC's parameters stay consistent with the BMC's. Because a BMC can initiate a soft interrupt over the I3C bus only when the bus is idle, when the BMC detects that the bus is busy it waits for a period before initiating the operation again (preferably 10 ms in the present invention, adjustable as appropriate).
3. When a user configures a node's BMC device through the CMC, the CMC verifies that the parameters are valid, then initiates communication with that node's BMC and writes the parameters to it. After the configuration succeeds, the CMC updates the parameters in its own mapping area for that node to keep the parameters consistent.
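The busy-bus back-off described in item 2 can be sketched as follows. This is a minimal Python illustration, not the firmware's actual interface: `bus_is_idle`, `send`, and the `max_attempts` limit are hypothetical, and only the 10 ms delay comes from the text.

```python
import time

RETRY_DELAY_S = 0.010  # preferred 10 ms back-off from the description


def push_parameters(bus_is_idle, send, params, max_attempts=5, sleep=time.sleep):
    """BMC-side push of changed parameters to the CMC mapping area.

    bus_is_idle: callable returning True when the I3C bus is free (assumed).
    send: callable that transmits the parameters to the CMC (assumed).
    max_attempts: illustrative bound; the source does not specify one.
    Returns True if the parameters were sent, False if the bus stayed busy.
    """
    for _ in range(max_attempts):
        if bus_is_idle():
            send(params)
            return True
        # Bus busy: wait before trying to initiate the transfer again.
        sleep(RETRY_DELAY_S)
    return False
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.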
- Network module:
The CMC device externally provides services 80, such as web services and command-line and other human-computer interaction interfaces, which are used for remote device management, firmware updates, fault reporting to the remote control center, and similar interactions.
- Upgrade module:
The upgrade module in the CMC device is mainly responsible for system upgrades, which fall into two parts: upgrading the CMC device's own firmware, and upgrading the firmware on each node. The CMC upgrade module is also responsible for determining the consistency of the upgrade package. In the present invention, the user can manage upgrades of the BMC, BIOS, CPLD and other firmware of each node through the CMC. In existing designs, the I2C bus rate is too low for the CMC to upgrade node firmware over the internal I2C bus, so upgrades must rely on the LAN; when a node's LAN has a problem, that node cannot be upgraded. In the present invention, the high-speed communication of I3C makes firmware upgrade over the internal I3C bus practical, so the CMC can still upgrade a node's firmware without relying on the LAN.
1. Because the CMC holds a BMC data mapping for each node, the user only needs to log in to the CMC to upgrade firmware on any node. The user first selects a node for firmware upgrade, and the CMC lists the upgradable firmware according to that node's model. The user uploads the firmware upgrade package to the CMC, and the CMC checks the compatibility of the upgrade against the node model. If the model or version does not match, the CMC notifies the user and terminates the upgrade; the subsequent upgrade steps proceed only if the upgrade conditions are met.
2. The CMC can transmit firmware upgrade package data to a node's BMC in two ways: over the LAN, or over the internal I3C bus. In the present invention the I3C bus continuously carries data synchronization and interaction between the CMC and the BMCs, and firmware upgrade packages are large, so to reduce the load on the I3C bus the package is preferably transmitted to the node's BMC over the LAN, with the I3C bus used when the LAN link is down. To prevent the transfer of a firmware upgrade package of tens of megabytes from occupying the bus for too long and delaying data synchronization between the CMC and the BMCs, the CMC splits the package into a number of small fragments, each carrying a sequence number and a check code, and transmits them over the I3C bus at intervals, so that data synchronization can still take place in the gaps between transmissions. Each time the BMC receives a fragment, it verifies the data, unpacks it and stores it. If verification fails, the BMC notifies the CMC to resend that fragment. After all fragments have been transmitted to the BMC, the BMC reassembles them to restore the complete firmware upgrade package. Whether the CMC transmits the data over the LAN or over the internal I3C bus, the BMC caches the upgrade package and then verifies its integrity. If verification passes, it returns OK; if it fails, it returns an upgrade-package verification failure to the CMC.
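The fragment, verify, resend and reassemble cycle can be sketched in Python as follows. The source does not name the check code, so CRC32 from the standard `zlib` module is an assumption here, as are the function names, the `channel` callable standing in for the I3C transfer, and the retry limit. The interval between fragment transmissions is omitted for brevity.

```python
import zlib


def split_package(data: bytes, chunk_size: int):
    """CMC side: split a package into (sequence number, payload, CRC32) fragments."""
    fragments = []
    for seq, offset in enumerate(range(0, len(data), chunk_size)):
        payload = data[offset:offset + chunk_size]
        fragments.append((seq, payload, zlib.crc32(payload)))
    return fragments


def fragment_ok(fragment) -> bool:
    """BMC side: verify a received fragment against its check code."""
    _seq, payload, crc = fragment
    return zlib.crc32(payload) == crc


def transfer(fragments, channel, max_retries=3) -> bytes:
    """Send each fragment through `channel`, resending on verification failure.

    channel: callable modelling the bus; takes a fragment and returns the
        fragment as received by the BMC (hypothetical stand-in).
    """
    received = {}
    for fragment in fragments:
        for _ in range(max_retries + 1):
            delivered = channel(fragment)
            if fragment_ok(delivered):
                received[delivered[0]] = delivered[1]
                break
            # Verification failed: the BMC asks the CMC to resend this fragment.
        else:
            raise IOError(f"fragment {fragment[0]} failed after retries")
    # Reassemble in sequence order to restore the complete package.
    return b"".join(received[seq] for seq in sorted(received))
```

A whole-package integrity check after reassembly, as the paragraph describes, would follow the same pattern with a digest computed over the full image.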
3. The CMC notifies the BMC over the I3C bus to upgrade the firmware. The BMC checks again whether the firmware upgrade package meets the upgrade requirements of this node's firmware. If it does not, the BMC notifies the CMC to terminate the upgrade; if it does, the BMC begins the upgrade and, during the process, returns the upgrade progress to the CMC so that the CMC can relay it to the user, making the process more user-friendly.
When the CMC receives feedback from the BMC that the firmware upgrade succeeded, it notifies the user that the upgrade is complete. This ends the upgrade process.
The system module, data module, synchronization module, detection module, reset module, alarm module, management module, network module and upgrade module described above constitute the functional structure of the host device 10 and the backup device 20 in the dual-machine hot backup system 100 and/or 100' according to the present invention, so that the aforementioned embodiments of the dual-machine hot backup system 100 and/or 100' can be constructed to perform the corresponding functions and achieve the corresponding technical effects.
By adopting the above technical solution, the present invention has at least the following beneficial effects. The I2C bus, serial port and similar links used in current multi-node server solutions have low communication rates and cannot meet the requirements of interaction and synchronization between the host and backup devices, so data synchronization has to rely heavily on an external LAN and switch. To address this problem, the invention uses the I3C bus to build the internal communication architecture of a dual-machine hot backup system for multi-node servers, with two I3C buses forming the host-backup and master-slave communication paths respectively. When the system starts, the host device collects the status of the slave devices over the I3C bus, builds a mapping and stores it in a database, so that when, for example, a user needs the operating parameters of a slave device, the host device does not have to fetch them from the slave device in response to the user's instruction but can return the relevant information recorded in the database directly. Both at system startup and whenever the host device manages a slave device, the host device synchronizes the corresponding information to the backup device, ensuring data consistency between the host and backup devices. In addition, the backup device may also be accessed directly from outside, for example by a user, to receive instructions; in that case, after receiving a management instruction, the backup device forwards it to the host device over the I3C bus so that the host device can manage the slave device accordingly. The dual-machine hot backup system of the present invention not only improves the efficiency of communication within the system but also avoids the reliability problems caused by dependence on an external LAN and external switch, and, while ensuring data consistency between the host and backup devices, makes use of the backup device's resources to a certain extent, thereby further improving the security and operating efficiency of the whole multi-node server system.
It should be understood that, where technically feasible, the technical features listed above for different embodiments can be combined with one another to form further embodiments within the scope of the present invention. In addition, the specific examples and embodiments described herein are non-limiting, and corresponding modifications may be made to the structures, positions and sequences set forth above without departing from the protection scope of the present invention.
In this application, the use of disjunctive conjunctions is intended to include conjunctive ones. The use of definite or indefinite articles is not intended to indicate cardinality. Specifically, references to "the" object or "a" and "an" object are intended to denote one of possibly a plurality of such objects. However, although elements disclosed in the embodiments of the present invention may be described or claimed in the singular, they may also be understood as plural unless explicitly limited to the singular. In addition, the conjunction "or" may be used to convey features that are present simultaneously rather than mutually exclusive alternatives. In other words, the conjunction "or" should be understood to include "and/or". The term "including" is inclusive and has the same scope as "comprising".
The above embodiments, in particular any "preferred" embodiments, are possible examples of implementations and are presented only for a clear understanding of the principles of the present invention. Many variations and modifications can be made to the above embodiments without substantially departing from the spirit and principles of the technology described herein. All such modifications are intended to be included within the scope of this disclosure.

Claims (10)

  1. A dual-machine hot backup system, characterized in that the system comprises:
    a host device;
    a backup device, the backup device being communicatively connected to the host device through a first I3C bus;
    at least one slave device, the at least one slave device being communicatively connected to the host device and the backup device through a second I3C bus;
    wherein the host device is configured to, in response to startup of the dual-machine hot backup system, collect parameters of the slave device, store a mapping of the parameters in a database and synchronize it to the backup device through the first I3C bus, and is configured to manage the slave device through the second I3C bus based on a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device through the first I3C bus;
    the backup device is configured to, in response to receiving a management instruction from outside, forward the received management instruction to the host device through the first I3C bus.
  2. The system according to claim 1, characterized in that the backup device is further configured to:
    in response to receiving an urgent management instruction from outside, forcibly and temporarily occupy the second I3C bus to manage the slave device, generate a mapping of parameters according to the changed parameters, and synchronize the urgent management instruction and the mapping of the changed parameters to the host device through the first I3C bus.
  3. The system according to claim 1, characterized in that the host device is further configured to:
    in response to the host device entering an upgrade mode and/or its resource usage exceeding a threshold, notify the backup device through the first I3C bus to temporarily take over management of the slave device; and
    in response to the host device exiting the upgrade mode and/or its resource usage no longer exceeding the threshold, notify the backup device through the first I3C bus to stop taking over management of the slave device.
  4. The system according to claim 3, characterized in that the backup device is further configured to:
    in response to receiving the temporary-takeover notification from the host device, manage the slave device through the second I3C bus and generate a mapping of parameters according to the related parameters that have changed; and
    in response to receiving the stop-takeover notification from the host device, stop managing the slave device and synchronize the mapping of the changed parameters to the host device through the first I3C bus.
  5. The system according to claim 1, characterized in that the backup device is further configured to:
    in response to startup of the dual-machine hot backup system, actively initiate a clock synchronization request to the host device.
  6. The system according to claim 1, characterized in that the host device and the backup device are further configured such that:
    in response to either of the host device and the backup device initiating synchronization of a mapping of parameters, the initiating party generates synchronization package data comprising the original data, the modified data and the modification time, and sends it to the other party.
  7. The system according to claim 6, characterized in that the host device and the backup device are further configured to:
    in response to the host device and/or the backup device receiving synchronization package data sent by the other party, compare the original data therein with the local data;
    in response to the original data being the same as the local data, modify the local data according to the modified data;
    in response to the original data differing from the local data, compare the modification time in the received synchronization package data with the local modification time, and synchronize to the modified data that has the more recent modification time.
  8. The system according to claim 1, characterized in that the host device and the backup device are further configured to:
    detect each other's operating status through a bidirectional physical IO heartbeat detection mechanism.
  9. The system according to claim 1, characterized in that the host device and the backup device are further configured such that:
    in response to either of the host device and the backup device detecting that the other party has failed, the non-faulty party records the faulty party's failure in the log, restarts and resets the faulty party through an external dual reset mechanism, and synchronizes the clock and database to the faulty party.
  10. The system according to claim 9, characterized in that the host device and the backup device are further configured such that:
    in response to the non-faulty party being unable to restart and reset the faulty party through the external dual reset mechanism and/or the restart and reset failing, the non-faulty party issues an alarm to notify operation and maintenance personnel to handle it.
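
The reconciliation rule recited in claims 6 and 7 can be illustrated with a minimal Python sketch. The `SyncPacket` fields mirror the original data, modified data and modification time named in claim 6; the class name, the per-key dictionaries and the function name are assumptions made for illustration and are not part of the claims.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SyncPacket:
    key: str            # which mapped parameter is being synchronized
    original: Any       # value the sender held before the change
    modified: Any       # value the sender holds after the change
    modified_at: float  # sender's modification time


def apply_sync(values: Dict[str, Any], times: Dict[str, float], pkt: SyncPacket) -> bool:
    """Receiver side: apply a packet; return True if local data was updated."""
    same_base = values.get(pkt.key) == pkt.original
    newer = pkt.modified_at > times.get(pkt.key, float("-inf"))
    if same_base or newer:
        # Either both sides started from the same value, or the sender's
        # change is more recent than the local one.
        values[pkt.key] = pkt.modified
        times[pkt.key] = pkt.modified_at
        return True
    # Local data diverged and is newer: keep the local value.
    return False
```

Comparing timestamps across the two devices presupposes the clock synchronization of claim 5.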
PCT/CN2020/092835 2019-10-18 2020-05-28 Dual-computer hot standby system WO2021073105A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910995329.9A CN110750480B (en) 2019-10-18 2019-10-18 Dual-computer hot standby system
CN201910995329.9 2019-10-18

Publications (1)

Publication Number Publication Date
WO2021073105A1 true WO2021073105A1 (en) 2021-04-22

Family

ID=69278976

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092835 WO2021073105A1 (en) 2019-10-18 2020-05-28 Dual-computer hot standby system

Country Status (2)

Country Link
CN (1) CN110750480B (en)
WO (1) WO2021073105A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113852529A (en) * 2021-08-11 2021-12-28 交控科技股份有限公司 Back board bus system for data communication of trackside equipment and data transmission method thereof

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109491311A (en) * 2018-11-13 2019-03-19 江苏常熟发电有限公司 A kind of CEMS data transmission failure judgment method
CN110750480B (en) * 2019-10-18 2021-06-29 苏州浪潮智能科技有限公司 Dual-computer hot standby system
CN111698117A (en) * 2020-04-01 2020-09-22 新华三信息安全技术有限公司 Equipment management method, network equipment, storage medium and router
CN111736880A (en) * 2020-05-28 2020-10-02 苏州浪潮智能科技有限公司 BMC refreshing method, system, equipment, product and storage medium
CN111813859A (en) * 2020-07-14 2020-10-23 积成电子股份有限公司 Time slice-based synchronization method for historical items of transformer substation between main machine and auxiliary machine
CN112398712B (en) * 2020-09-29 2022-01-28 卡斯柯信号有限公司 CAN and MLVDS dual-bus-based communication board active/standby control method
CN114690857A (en) * 2020-12-28 2022-07-01 技嘉科技股份有限公司 Cabinet management control device and cabinet management control system
CN113852549B (en) * 2021-09-27 2023-10-17 卡斯柯信号有限公司 Method for realizing independent data receiving and processing of main and standby systems
CN117032579A (en) * 2023-08-21 2023-11-10 上海合芯数字科技有限公司 Slave starting method, device and storage medium
CN117533251A (en) * 2024-01-08 2024-02-09 知迪汽车技术(北京)有限公司 Distributed file system for vehicle-mounted bus data recorder

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103199972A (en) * 2013-03-25 2013-07-10 成都瑞科电气有限公司 Double machine warm backup switching method and warm backup system achieved based on SOA and RS485 bus
US20130293251A1 (en) * 2012-05-07 2013-11-07 Tesla Motors, Inc. Wire break detection in redundant communications
CN109960679A (en) * 2017-12-14 2019-07-02 英特尔公司 For controlling the systems, devices and methods of the duty ratio of the clock signal of multi-point interconnection
CN110750480A (en) * 2019-10-18 2020-02-04 苏州浪潮智能科技有限公司 Dual-computer hot standby system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6970961B1 (en) * 2001-01-02 2005-11-29 Juniper Networks, Inc. Reliable and redundant control signals in a multi-master system
CN104679907A (en) * 2015-03-24 2015-06-03 新余兴邦信息产业有限公司 Realization method and system for high-availability and high-performance database cluster
CN105389231A (en) * 2015-10-28 2016-03-09 浪潮(北京)电子信息产业有限公司 Database dual-computer backup method and system
CN107634855A (en) * 2017-09-12 2018-01-26 天津津航计算技术研究所 A kind of double hot standby method of embedded system
CN108090009A (en) * 2017-11-13 2018-05-29 北京全路通信信号研究设计院集团有限公司 A kind of multimachine method, apparatus of falling machine and system
CN109144913A (en) * 2018-09-29 2019-01-04 联想(北京)有限公司 A kind of data processing method, system and electronic equipment
CN109815186A (en) * 2018-12-18 2019-05-28 北京航天晨信科技有限责任公司 Dual redundant communication equipment and method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113852529A (en) * 2021-08-11 2021-12-28 交控科技股份有限公司 Back board bus system for data communication of trackside equipment and data transmission method thereof
CN113852529B (en) * 2021-08-11 2023-03-24 交控科技股份有限公司 Back board bus system for data communication of trackside equipment and data transmission method thereof

Also Published As

Publication number Publication date
CN110750480B (en) 2021-06-29
CN110750480A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
WO2021073105A1 (en) Dual-computer hot standby system
CN107733684B (en) Multi-controller computing redundancy cluster based on Loongson processor
US20140095925A1 (en) Client for controlling automatic failover from a primary to a standby server
CN103199972B (en) The two-node cluster hot backup changing method realized based on SOA, RS485 bus and hot backup system
US5875290A (en) Method and program product for synchronizing operator initiated commands with a failover process in a distributed processing system
CN103647781B (en) Mixed redundancy programmable control system based on equipment redundancy and network redundancy
CN105471622B (en) A kind of high availability method and system of the control node active-standby switch based on Galera
US7853767B2 (en) Dual writing device and its control method
US6012150A (en) Apparatus for synchronizing operator initiated commands with a failover process in a distributed processing system
CN107147540A (en) Fault handling method and troubleshooting cluster in highly available system
JP2004532442A (en) Failover processing in a storage system
BR112019027654A2 (en) train network node and canopen-based train network node monitoring method
US20150019671A1 (en) Information processing system, trouble detecting method, and information processing apparatus
JPH03164837A (en) Spare switching system for communication control processor
CN111737037A (en) Substrate management control method, master-slave heterogeneous BMC control system and storage medium
CN107071189B (en) Connection method of communication equipment physical interface
JP5625605B2 (en) OS operation state confirmation system, device to be confirmed, OS operation state confirmation device, OS operation state confirmation method, and program
CN110399254A (en) A kind of server CMC dual-locomotive heat activating method, system, terminal and storage medium
CN114124803B (en) Device management method and device, electronic device and storage medium
CN102638369B (en) Method, device and system for arbitrating main/standby switch
CN116069373A (en) BMC firmware upgrading method, device and medium thereof
JP7328907B2 (en) control system, control method
CN113794765A (en) Gate load balancing method and device based on file transmission
CN107423167A (en) A kind of ISCSI target redundancy control methods and system based on dual control storage
CN106656437A (en) Redundant hot standby platform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20876515

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20876515

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.11.2022)