WO2021073105A1

WO2021073105A1 - Dual-computer hot standby system

Info

Publication number: WO2021073105A1
Application number: PCT/CN2020/092835
Authority: WO
Inventors: 韩红瑞; 黄柏学
Original assignee: 苏州浪潮智能科技有限公司
Priority date: 2019-10-18
Filing date: 2020-05-28
Publication date: 2021-04-22
Also published as: CN110750480B; CN110750480A

Abstract

A dual-computer hot standby system, comprising: a host device, a standby device which is in communication connection with the host device by means of a first I3C bus, and at least one slave device which is in communication connection with the host device and the standby device by means of a second I3C bus; wherein the host device is configured to respond to the startup of the dual-computer hot standby system by collecting parameters of the slave device, storing the mapping of the parameters in a database, and synchronizing the mapping of the parameters to the standby device by means of the first I3C bus, and the host device is configured to manage the slave device by means of the second I3C bus on the basis of a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the standby device by means of the first I3C bus; the standby device is configured to forward a management instruction received from the outside to the host device by means of the first I3C bus. By utilizing the system of the present invention, the problem of insufficient reliability caused by excessive dependence on a LAN is solved, and the safety and the operation efficiency of the whole server system are improved.

Description

A dual-machine hot backup system

This application claims priority to Chinese patent application No. 201910995329.9, filed with the Chinese Patent Office on October 18, 2019 and entitled "a dual-system hot backup system", the entire content of which is incorporated into this application by reference.

Technical field

The invention relates to the field of server technology. The invention further relates to a dual-machine hot backup system.

Background art

With the rapid development of AI and Internet technology, a single server can no longer meet the processing needs of massive data. Massively parallel computer system architectures, which offer strong scalability, high computing power, and unified management, increasingly meet the needs of server products in the era of big data. As a result, server systems are becoming physically larger, their module composition more complicated, and their degree of integration higher. As server functions and node counts grow, monitoring and management become more challenging, and the requirements for system redundancy keep rising.

In existing redundant monitoring and management systems for multi-node, large-scale, high-density servers, low-speed communication buses such as I2C or serial ports are usually used inside the system. Their speed cannot meet the requirements of interaction and synchronization, especially between the active and standby devices. Data synchronization between active and standby therefore has to rely heavily on external LANs and switches. The biggest disadvantage of an external network cable is its reliability risk: the cable may be poorly connected or disconnected by a person, or the switch may restart. Once the LAN fails, effective communication between the active and standby devices is interrupted, and the management system may become confused and remotely uncontrollable. Furthermore, the active and standby devices are two independently operating entities, so their data will inevitably diverge when the two operate one after the other or at the same time. Even if the LAN recovers after a period of time, neither of the two monitors knows whose data is the latest or who should synchronize with whom, so a split-brain situation may occur.

For this reason, some schemes add a secondary verification over the serial port to the host failure detection process, so that the standby machine can still determine the true operating state of the host after the LAN is interrupted. However, when the host's LAN connection is lost, the standby machine can still confirm over the serial port that the host is alive. The standby machine therefore does not take over the work of the host and remains in standby. In this state, data cannot be synchronized between the main and standby devices through the LAN, and the user cannot remotely control or access the host device through the LAN. Even if the user logs in to the standby device through the LAN, the data queried is not the latest data.

Therefore, an improvement is needed for the following problem: the I2C bus and serial ports used in current multi-node server solutions have a low communication rate and cannot meet the requirements of interaction and synchronization between active and standby, so data synchronization has to rely heavily on external LANs and switches. A mechanism is proposed for establishing safe and reliable internal communication between the main and standby devices.

Summary of the invention

In one aspect, to achieve the above objective, the present invention proposes a dual-machine hot backup system, wherein the system includes:

A host device;

A standby device, which communicates with the host device through the first I3C bus;

At least one slave device, the at least one slave device is in communication connection with the host device and the standby device through the second I3C bus;

Wherein the host device is configured to, in response to the startup of the dual-system hot backup system, collect the parameters of the slave device, store the parameter mapping in the database and synchronize it to the standby device via the first I3C bus, and is configured to manage the slave device through the second I3C bus based on a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the standby device through the first I3C bus;

The standby device is configured to, in response to receiving a management instruction from the outside, forward the received management instruction to the host device through the first I3C bus.

According to an embodiment of the dual-system hot backup system of the present invention, the backup device is further configured to: in response to receiving an emergency management instruction from the outside, forcibly and temporarily occupy the second I3C bus to manage the slave device, generate a parameter mapping according to the changed parameters, and synchronize the emergency management instruction and the mapping of the changed parameters to the host device through the first I3C bus.

According to an embodiment of the dual-system hot backup system of the present invention, the host device is further configured to: in response to the host device entering the upgrade mode and/or its resource occupation exceeding a threshold, notify the standby device through the first I3C bus to temporarily take over the management of the slave device; and in response to the host device exiting the upgrade mode and/or its resource occupation no longer exceeding the threshold, notify the standby device through the first I3C bus to stop taking over the management of the slave device.

According to an embodiment of the dual-system hot backup system of the present invention, the backup device is further configured to: in response to receiving a temporary-takeover notification from the host device, manage the slave device through the second I3C bus and generate a parameter mapping according to the changed parameters; and in response to receiving a stop-takeover notification from the host device, stop managing the slave device and synchronize the mapping of the changed parameters to the host device through the first I3C bus.

According to an embodiment of the dual-system hot backup system of the present invention, the backup device is further configured to: in response to the dual-system hot backup system being activated, actively initiate a clock synchronization request to the host device.

According to an embodiment of the dual-system hot backup system of the present invention, the host device and the backup device are further configured to: in response to either the host device or the backup device initiating synchronization of the parameter mapping, have the initiator generate synchronized packaged data that includes the original data, the modified data and the modification time, and send it to the other party.

According to the embodiment of the dual-system hot backup system of the present invention, the host device and the backup device are further configured as:

In response to the host device and/or the standby device receiving the synchronized packaged data sent by the other party, compare the original data therein with the local data;

In response to the original data being the same as the local data, modify the local data according to the modified data;

In response to the difference between the original data and the local data, the modification time in the received synchronized packaged data is compared with the local modification time, and the modified data with the newer modification time is used as the standard for synchronization.

According to an embodiment of the dual-system hot backup system of the present invention, the host device and the backup device are further configured to detect the operating state of each other through a two-way physical IO heartbeat detection mechanism.

According to an embodiment of the dual-system hot backup system of the present invention, the host device and the backup device are further configured to: in response to either the host device or the backup device detecting that the other party has failed, have the non-faulty party record the failure of the faulty party in the log, reset the faulty party through the external dual reset mechanism, and synchronize the clock and database of the faulty party.

According to an embodiment of the dual-system hot backup system of the present invention, the host device and the backup device are further configured to: in response to the non-faulty party being unable to reset the faulty party through the external dual reset mechanism and/or the reset failing, have the non-faulty party send an alarm to notify the operation and maintenance personnel to deal with it.

By adopting the above technical solution, the present invention has at least the following beneficial effects. The I2C bus and serial ports used in current multi-node server solutions have a low communication rate and cannot meet the requirements of interaction and synchronization between active and standby, so data synchronization has to rely heavily on external LANs and switches. To address this problem, the I3C bus is used to establish the internal communication architecture of the dual-machine hot backup system for multi-node servers, and two I3C buses are used to construct the communication architecture between the master and the backup and between the master and the slaves. When the system is started, the host device collects the status of the slave devices through the I3C bus, establishes a mapping and stores it in the database. Then, for example, when the user needs the operating parameters of a certain slave device, the host device does not have to fetch them from the slave device in response to the user's instruction, but can directly return the relevant information recorded in the database. At the same time, both at system startup and whenever the host device manages a slave device, the host device synchronizes the corresponding information to the standby device to ensure that the data in the main and standby devices is consistent. In addition, the backup device may also be accessed directly from outside, for example by users issuing instructions. After receiving a management instruction, the backup device forwards it to the host device through the I3C bus so that the host device can manage the slave device accordingly. The dual-machine hot backup system of the present invention not only improves the efficiency of the system's internal communication, but also avoids the reliability problems caused by relying on an external LAN and external switches, ensures the data consistency of the main and backup devices, and to a certain extent develops and utilizes the resources of the backup device, thereby further improving the safety and operating efficiency of the entire multi-node server system.

The present invention provides various aspects of the embodiments, which should not be used to limit the protection scope of the present invention. Other embodiments can be envisaged based on the technology described herein, which will be obvious to those of ordinary skill in the art after studying the following drawings and specific embodiments, and these embodiments are intended to be included in the scope of the present application.

The embodiments of the present invention are explained and described in more detail below with reference to the accompanying drawings, but they should not be construed as limiting the present invention.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the prior art and the embodiments are briefly introduced below. The components in the drawings are not necessarily drawn to scale; related elements may be omitted, or in some cases the scale may have been exaggerated in order to emphasize and clearly illustrate the novel features described herein. In addition, as is known in the art, the structural positions can be arranged differently.

Figure 1 shows a schematic diagram of an embodiment of a dual-machine hot backup system according to the present invention;

Figure 2 shows a schematic diagram of another embodiment of the dual-machine hot backup system according to the present invention;

Fig. 3 shows a schematic structural diagram of an embodiment of the host device and the standby device of the dual-system hot backup system according to the present invention.

Detailed description

Although the present invention may be implemented in various forms, some exemplary and non-limiting embodiments are shown in the drawings and described below. It should be understood that the present disclosure is to be regarded as an example of the present invention and is not intended to limit the invention to the specific embodiments described.

Fig. 1 shows a schematic diagram of an embodiment of a dual-machine hot backup system 100 according to the present invention. The dual-system hot backup system 100 according to the present invention is especially used for the control and management of a multi-node server system. In the embodiment shown in FIG. 1, the dual-machine hot backup system 100 at least includes:

Host device 10;

A backup device 20, which is in communication connection with the host device 10 through the first I3C bus 30;

At least one slave device 40, the at least one slave device 40 is in communication connection with the host device 10 and the backup device 20 through the second I3C bus 50;

Wherein, the host device 10 is configured to collect the parameters of the slave device 40 in response to the startup of the dual-machine hot backup system 100, store the parameter mapping in the database and synchronize it to the backup device 20 through the first I3C bus 30, and is configured to manage the slave device 40 through the second I3C bus 50 based on a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device 20 through the first I3C bus 30;

The backup device 20 is configured to, in response to receiving a management instruction from the outside, forward the received management instruction to the host device 10 through the first I3C bus 30.

Since the dual-machine hot backup system 100 according to the present invention is particularly used for the control and management of a multi-node server system, the host device 10 and the backup device 20 are preferably CMC (Chassis Management Controller) devices. Similar to a BMC, a CMC manages and controls the whole machine in multi-node server systems such as blade servers, and can send commands to each node for management. In addition, the slave device 40 is preferably a BMC (Baseboard Management Controller) in the multi-node server system. The BMC can perform operations on the machine, such as upgrading firmware and viewing machine equipment, even when the machine is not powered on.

In addition, the dual-machine hot backup system 100 according to the present invention uses I3C buses for its communication connections. I3C is a two-wire serial communication bus that integrates the key attributes of the I2C and SPI buses and is compatible with the I2C protocol. It adds features such as multiple masters, slave soft interrupts, dynamic allocation of slave addresses, and support for hot swapping, and its speed can reach 33 Mbps. It is usually used to connect sensors to an application processor. Further, to separate master-backup communication from master-slave management, avoiding mutual interference and reducing bus pressure, the first I3C bus 30 is used between the master and the backup (10 and 20), and the second I3C bus 50 is used between the master and the slaves (10 and 40) and between the backup and the slaves (20 and 40).

Since the I3C bus is compatible with the I2C bus protocol, the dual-machine hot backup system 100 according to the present invention can perform all the functions of a system originally built on the I2C bus. On top of this, the dual-machine hot backup system 100 adds new functions. The host device 10 is configured to collect the parameters of the slave device 40 in response to the startup of the dual-machine hot backup system 100, store the mapping of the parameters in the database, and synchronize it to the backup device 20 through the first I3C bus 30. That is to say, when the dual-machine hot backup system 100 is initially powered on, the CMC host device 10 checks the BMC slave device 40 on each node and establishes a parameter mapping database based on parameters such as the node device serial number and various operating parameters. The CMC host device 10 thus maps the parameters of all node BMC slave devices into the CMC host device 10. When the user requests the parameters of a certain slave device 40, the user only needs to access the CMC host device 10, which holds the corresponding parameters of all the slave devices 40. The CMC host device 10 does not have to read the parameters from the corresponding slave device 40 at query time, which speeds up the response to user instructions. In addition, the host device 10 synchronizes the database to the backup device 20 through the first I3C bus 30 to ensure data consistency between the main and backup devices.
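The startup behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `ParameterDatabase`, its method names, the parameter fields and the `fake_bus_read` callback (a stand-in for a read over the second I3C bus) are hypothetical and do not come from the patent.

```python
from typing import Callable, Dict, Iterable

Params = Dict[str, object]


class ParameterDatabase:
    """Hypothetical in-memory mapping: slave serial number -> parameters."""

    def __init__(self) -> None:
        self._mapping: Dict[str, Params] = {}

    def collect_at_startup(self, serials: Iterable[str],
                           read_params: Callable[[str], Params]) -> None:
        # At power-on the host polls every slave once. In the patent this
        # happens over the second I3C bus; read_params is a stand-in.
        for serial in serials:
            self._mapping[serial] = dict(read_params(serial))

    def query(self, serial: str) -> Params:
        # User queries are answered from the mapping, without a bus access.
        return dict(self._mapping[serial])

    def snapshot(self) -> Dict[str, Params]:
        # Full copy that would be sent to the backup over the first I3C bus.
        return {s: dict(p) for s, p in self._mapping.items()}


bus_reads = []  # records every simulated bus access


def fake_bus_read(serial: str) -> Params:
    bus_reads.append(serial)
    return {"power": "on", "fan_rpm": 4200}


host_db = ParameterDatabase()
host_db.collect_at_startup(["BMC-01", "BMC-02"], fake_bus_read)
backup_mapping = host_db.snapshot()
```

Calling `host_db.query("BMC-01")` afterwards does not add an entry to `bus_reads`, which is the point of keeping the mapping on the host.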

Another newly added function is that the host device 10 is configured to manage the slave device 40 through the second I3C bus 50 based on management instructions, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device 20 through the first I3C bus 30. That is to say, when, for example, a user accesses the host device 10 from the outside to issue a management instruction and/or the host device 10 generates a management instruction according to a preset control strategy, the host device 10 performs its management and control of the slave device 40 through the second I3C bus 50. In addition, the host device 10 maps the corresponding parameters into its own database and synchronizes the mapping of the parameters to the backup device 20 in real time through the first I3C bus 30, thereby ensuring real-time data update and synchronization of the main and backup data after the parameters of the slave device 40 change during management.
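A minimal sketch of this manage-then-synchronize sequence is given below. The `Host` class, its `manage` method and the two log lists (which stand in for traffic on the second and first I3C buses) are assumptions made for illustration; the patent does not define a software interface.

```python
from typing import Dict, List, Tuple

Mapping = Dict[str, Dict[str, object]]


class Host:
    """Hypothetical host-side handler; the lists stand in for the two buses."""

    def __init__(self, mapping: Mapping) -> None:
        self.mapping = mapping
        self.slave_bus_log: List[Tuple[str, str, object]] = []  # second I3C bus
        self.sync_bus_log: List[Mapping] = []                    # first I3C bus

    def manage(self, serial: str, param: str, value: object) -> None:
        # 1. Apply the instruction to the slave over the management bus.
        self.slave_bus_log.append((serial, param, value))
        # 2. Record the changed parameter in the local mapping.
        self.mapping.setdefault(serial, {})[param] = value
        # 3. Push only the changed part of the mapping to the backup.
        self.sync_bus_log.append({serial: {param: value}})


host = Host({"BMC-01": {"fan_rpm": 4200}})
host.manage("BMC-01", "fan_rpm", 6000)
```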

In addition, the dual-machine hot backup system 100 according to the present invention also develops and utilizes the resources of the backup device 20 to a certain extent, that is, the backup device 20 may also be accessed directly from the outside, for example by users. In the concept of the present invention, the host device 10 and the backup device 20 each have their own communication address. Therefore, when the user directly accesses the backup device 20 from the outside at its communication address and issues a management instruction, the backup device 20 forwards the received management instruction to the host device 10 through the first I3C bus 30, and the host device 10 performs the corresponding management of the slave device 40.

In one or more embodiments of the dual-system hot backup system 100 of the present invention, the backup device 20 is further configured to: in response to receiving an emergency management instruction from the outside, forcibly and temporarily occupy the second I3C bus 50 to manage the slave device 40, generate a parameter mapping according to the changed parameters, and synchronize the emergency management instruction and the mapping of the changed parameters to the host device 10 through the first I3C bus 30. That is, in order to further develop and utilize the resources of the backup device 20, in these embodiments, when the user directly accesses the backup device 20 from the outside at its communication address and issues a management instruction, and the management instruction is a specific emergency management instruction, the backup device 20 forcibly and temporarily occupies the second I3C bus 50 to manage the slave device 40 according to the emergency management instruction issued by the user, and synchronizes the corresponding information to the host device 10. The specific emergency management instructions mentioned here usually refer to management instructions that have very strong timeliness requirements, that are closely related to system operation safety and must be processed immediately, and/or that the user forces to be processed immediately. Because the backup device 20 forcibly and temporarily occupies the second I3C bus 50 to manage the slave device 40, the step of forwarding the management instruction to the host device 10 through the first I3C bus 30 is eliminated, which improves the response speed.
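The two paths on the backup device, forwarding ordinary instructions and directly executing emergency ones, can be sketched as follows. The `Instruction` fields, the `emergency` flag and the three lists standing in for bus traffic are hypothetical; how an instruction is classified as an emergency is not specified here and is left to the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instruction:
    target: str            # slave the instruction is addressed to
    action: str
    emergency: bool = False


class Backup:
    """Hypothetical backup-side dispatcher; lists stand in for bus traffic."""

    def __init__(self) -> None:
        self.forwarded_to_host: List[Instruction] = []   # first I3C bus
        self.executed_on_slaves: List[Instruction] = []  # second I3C bus
        self.synced_to_host: List[Instruction] = []      # first I3C bus

    def receive(self, instr: Instruction) -> str:
        if instr.emergency:
            # Emergency path: act on the slave directly, then report the
            # instruction (and, in a real system, the changed mapping) to
            # the host so both databases stay consistent.
            self.executed_on_slaves.append(instr)
            self.synced_to_host.append(instr)
            return "executed"
        # Normal path: the backup only relays; the host manages the slave.
        self.forwarded_to_host.append(instr)
        return "forwarded"


backup = Backup()
r1 = backup.receive(Instruction("BMC-01", "set_fan_rpm"))
r2 = backup.receive(Instruction("BMC-02", "power_off", emergency=True))
```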

In one or more embodiments of the dual-system hot backup system of the present invention, the host device 10 is further configured to: in response to the host device 10 entering the upgrade mode and/or its resource occupation exceeding a threshold, notify the backup device 20 through the first I3C bus 30 to temporarily take over the management of the slave device 40; and in response to the host device 10 exiting the upgrade mode and/or its resource occupation no longer exceeding the threshold, notify the backup device 20 through the first I3C bus 30 to stop taking over the management of the slave device 40. In the traditional master-slave multi-node server control system, when the host device needs to upgrade its firmware and/or its system resource occupation exceeds the threshold, the host device cannot continue to manage the slave devices because of its limited memory space. The usual approach is for the operation and maintenance personnel to temporarily close some functions and restart them when the firmware upgrade ends and/or the resource occupation is relieved. The shortcoming is obvious: under these circumstances, the management and control of the slave devices by the host device is interrupted, and continuous management and control cannot be achieved. Therefore, the dual-system hot backup system 100 according to the present invention further develops and utilizes the resources of the backup device under these circumstances. When the host device 10 enters the upgrade mode and/or its resource occupation exceeds the threshold, the host device 10 actively notifies the backup device 20 through the first I3C bus 30 to temporarily take over the management of the slave device 40. When the host device 10 finishes upgrading and exits the upgrade mode and/or its resource occupation no longer exceeds the threshold, so that it can resume managing and controlling the slave device 40, the host device 10 notifies the backup device 20 through the first I3C bus 30 to stop taking over the management of the slave device 40.
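The notification logic on the host side can be sketched as a small state holder. One interpretation is assumed here and is not stated in the patent: the host requests a takeover when either condition holds (upgrade mode or resource usage above the threshold) and releases it only when neither holds. The class name, message strings and the `notifications` list (a stand-in for the first I3C bus) are hypothetical.

```python
from typing import List, Optional


class HostLoadMonitor:
    """Hypothetical helper deciding when the host asks the backup to take over."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.backup_in_charge = False
        self.notifications: List[str] = []  # stand-in for the first I3C bus

    def update(self, upgrading: bool, resource_usage: float) -> Optional[str]:
        # Assumed policy: either condition alone is enough to hand over.
        busy = upgrading or resource_usage > self.threshold
        if busy and not self.backup_in_charge:
            self.backup_in_charge = True
            self.notifications.append("TAKE_OVER")
            return "TAKE_OVER"
        if not busy and self.backup_in_charge:
            self.backup_in_charge = False
            self.notifications.append("STOP_TAKE_OVER")
            return "STOP_TAKE_OVER"
        return None  # no state change, so nothing is sent


monitor = HostLoadMonitor(threshold=0.9)
events = [
    monitor.update(upgrading=False, resource_usage=0.5),
    monitor.update(upgrading=True, resource_usage=0.5),
    monitor.update(upgrading=True, resource_usage=0.95),
    monitor.update(upgrading=False, resource_usage=0.95),
    monitor.update(upgrading=False, resource_usage=0.4),
]
```

Sending a message only on a state change keeps the first I3C bus free of repeated notifications while the host stays busy.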

In some embodiments of the dual-system hot backup system 100 of the present invention, the backup device 20 is further configured to: in response to receiving the temporary-takeover notification from the host device 10, manage the slave device 40 through the second I3C bus 50 and generate a parameter mapping according to the changed parameters; and in response to receiving the stop-takeover notification from the host device 10, stop managing the slave device 40 and synchronize the mapping of the changed parameters to the host device 10 through the first I3C bus 30. The backup device 20 serves as the backup of the host device 10 and is on standby for long periods. In a traditional dual-system hot backup system, once the host device 10 fails, the backup device 20 becomes the host to maintain the normal operation of the system. In the dual-system hot backup system of the present invention, in addition to the above, no matter what state the host device 10 is in, once the backup device 20 receives the temporary-takeover notification sent by the host device 10, the backup device 20 temporarily takes over the management of the slave device 40, manages the slave device 40 through the second I3C bus 50, and generates a parameter mapping according to the changed parameters. Once the backup device 20 receives the stop-takeover notification from the host device 10, the backup device 20 stops managing the slave device 40 and returns the management work to the host device 10. It then synchronizes, through the first I3C bus 30, the mapping of the parameters affected by its management actions during the takeover to the host device 10, so that the host device can manage the slave device 40 accurately and the master and backup data remain consistent.
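The backup side of the temporary takeover can be sketched as follows: changes made while in charge are accumulated and handed back to the host in one batch when the takeover ends. The class, the message strings and the `sent_to_host` list (a stand-in for the first I3C bus) are hypothetical names chosen for this illustration.

```python
from typing import Dict, List

Mapping = Dict[str, Dict[str, object]]


class BackupTakeover:
    """Hypothetical backup-side state for a temporary takeover."""

    def __init__(self) -> None:
        self.active = False
        self.pending: Mapping = {}            # parameters changed while in charge
        self.sent_to_host: List[Mapping] = [] # stand-in for the first I3C bus

    def on_notification(self, message: str) -> None:
        if message == "TAKE_OVER":
            self.active = True
        elif message == "STOP_TAKE_OVER" and self.active:
            self.active = False
            if self.pending:
                # Hand the accumulated changes back to the host.
                self.sent_to_host.append(self.pending)
            self.pending = {}

    def manage(self, serial: str, param: str, value: object) -> bool:
        if not self.active:
            return False  # outside a takeover the backup does not manage slaves
        self.pending.setdefault(serial, {})[param] = value
        return True


bk = BackupTakeover()
before = bk.manage("BMC-01", "fan_rpm", 5000)
bk.on_notification("TAKE_OVER")
during = bk.manage("BMC-01", "fan_rpm", 6000)
bk.on_notification("STOP_TAKE_OVER")
after = bk.manage("BMC-01", "fan_rpm", 7000)
```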

In several embodiments of the dual-machine hot backup system 100 of the present invention, the backup device 20 is further configured to: in response to the dual-machine hot backup system 100 being started, actively initiate a clock synchronization request to the host device 10. In the dual-system hot backup architecture, the system time of the active and backup devices plays an important role in many situations and functions, so clock synchronization between the two must be ensured. In the embodiment of the present invention, the clock synchronization strategy is that after the dual-system hot backup system 100 is started, that is, after the host device 10 and the backup device 20 are powered on, the backup device 20 actively initiates a clock synchronization request to the host device 10 to ensure that the system time of the two devices is consistent.
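The patent does not say how the backup adjusts its clock once the host answers the request. One common approach, shown here purely as an assumption, is a Cristian-style estimate that takes the host's reported time and compensates for half of the measured round trip. The function name and its parameters are hypothetical.

```python
def estimate_clock_offset(t_request_sent: float,
                          host_time: float,
                          t_reply_received: float) -> float:
    """Return the amount to add to the backup's clock.

    t_request_sent and t_reply_received are read from the backup's own
    clock; host_time is the timestamp carried in the host's reply.
    Assumes the request and the reply take about the same time on the bus.
    """
    round_trip = t_reply_received - t_request_sent
    # The host's timestamp is roughly half a round trip old on arrival.
    estimated_host_now = host_time + round_trip / 2
    return estimated_host_now - t_reply_received


# Backup sends at 100.0, host replies with 205.0, reply arrives at 110.0.
offset = estimate_clock_offset(100.0, 205.0, 110.0)
```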

In a further embodiment of the dual-system hot backup system 100 of the present invention, the host device 10 and the backup device 20 are further configured to: in response to either the host device 10 or the backup device 20 initiating the synchronization of the parameter mapping, have the initiator (10 or 20) generate synchronized packaged data including the original data, the modified data and the modification time and send it to the other party (20 or 10). The same data storage format is used in the host device 10 and the backup device 20. When data changes and needs to be synchronized between master and backup, the principle followed is that the party whose data has changed is responsible for initiating the synchronization. That is to say, when either the host device 10 or the backup device 20 initiates the synchronization of the parameter mapping because of a data change, the initiator (host device 10 or backup device 20) packages "original data + modified data + modification time" into synchronized packaged data and then sends it to the other party (the backup device 20 or the host device 10) so that the other party can update its data.
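The "original data + modified data + modification time" package can be sketched as a small record with a wire encoding. The field names, the `key` field identifying which parameter changed, and the choice of JSON as the encoding are assumptions for illustration; the patent only requires that both devices use the same storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SyncPackage:
    key: str            # hypothetical identifier of the changed parameter
    original: object    # value before the change, as seen by the initiator
    modified: object    # value after the change
    modified_at: float  # modification time on the initiator's clock

    def to_bytes(self) -> bytes:
        # JSON is an assumed encoding; any format shared by both sides works.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @staticmethod
    def from_bytes(raw: bytes) -> "SyncPackage":
        return SyncPackage(**json.loads(raw.decode("utf-8")))


pkg = SyncPackage(key="BMC-01/fan_rpm", original=4200, modified=6000,
                  modified_at=1700000000.0)
wire = pkg.to_bytes()          # what the initiator would put on the first I3C bus
received = SyncPackage.from_bytes(wire)
```

Carrying the original value lets the receiver detect whether its own copy changed in the meantime, which is what the arbitration step relies on.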

In several embodiments of the dual-machine hot backup system 100 of the present invention, the host device 10 and the backup device 20 are further configured as:

In response to the host device 10 and/or the backup device 20 receiving the synchronized packaged data sent by the other party, the original data therein is compared with the local data;

In the dual-system hot backup system 100 of the present invention, both the host device 10 and the backup device 20 can be accessed directly from the outside, so data may change on both machines at the same time. An arbitration mechanism is therefore required to determine the final effective data. In the foregoing embodiments, this mechanism works as follows. First, when the host device 10 and/or the backup device 20 receives synchronized packaged data from the other party, it parses the package to extract the original data, the modified data, and the modification time, and compares the extracted original data with the local data. If the original data is the same as the local data, the local data has not changed, so the local data is updated directly with the extracted modified data. If the original data differs from the local data, the local data has also been modified, and it must be decided whether the other party's modification or the local modification is the final effective one. To decide, the modification time extracted from the received synchronized packaged data is compared with the modification time in the synchronized packaged data generated when the local data changed; the modified data with the newer modification time is the final valid data, and the primary and backup data are synchronized to it. Specifically, if the modification time extracted from the received package is newer, the modified data extracted from the received package is the final valid data, and the local data is updated with it to synchronize the primary and backup data.
If the modification time in the synchronized packaged data generated when the local data changed is newer, the local data is the final valid data and no local update is performed. In that case, if the package generated for the local change has not yet been sent to the other party, it is sent immediately; if it has already been sent, no further processing is necessary.
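
By way of non-limiting illustration, the arbitration described above may be sketched as follows. The names `SyncPacket`, `ParameterStore`, `local_modify` and `receive` are hypothetical and do not appear in this application; the treatment of two identical modification times (the local value is kept) is likewise an assumption, since the text does not address a tie.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SyncPacket:
    """Synchronized packaged data: original value, modified value, modification time."""
    key: str
    original: Any
    modified: Any
    modified_at: float  # seconds on the synchronized clock


class ParameterStore:
    """Hypothetical local data store of one device (host or backup)."""

    def __init__(self, initial):
        self.data = dict(initial)
        self.local_changes = {}  # key -> packet generated by the last local change
        self.sent = set()        # keys whose local packet was already sent

    def local_modify(self, key, value, now):
        """Change local data and build the packet that describes the change."""
        packet = SyncPacket(key, self.data.get(key), value, now)
        self.data[key] = value
        self.local_changes[key] = packet
        self.sent.discard(key)
        return packet

    def receive(self, packet) -> Optional[SyncPacket]:
        """Arbitrate a received packet; return a packet to send to the peer, if any."""
        if packet.original == self.data.get(packet.key):
            # Local data unchanged: apply the peer's modification directly.
            self.data[packet.key] = packet.modified
            return None
        mine = self.local_changes.get(packet.key)
        if mine is None or packet.modified_at > mine.modified_at:
            # The peer's modification is newer: it is the final valid data.
            self.data[packet.key] = packet.modified
            self.local_changes.pop(packet.key, None)
            return None
        # The local modification is newer (or equal, by assumption): keep it,
        # and send the local packet if that has not happened yet.
        if packet.key not in self.sent:
            self.sent.add(packet.key)
            return mine
        return None
```

In this sketch, when both devices change the same parameter before either packet arrives, each device reaches the same final value after exchanging packets, because both compare the same pair of modification times.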

Fig. 2 shows a schematic diagram of another embodiment of a dual-system hot backup system 100' according to the present invention, in which the host device 10 and the backup device 20 are further configured to detect each other's operating status through a two-way physical IO heartbeat detection mechanism 60. Compared with a traditional dual-machine hot backup system, this embodiment adds a two-way physical IO heartbeat detection mechanism between the host device 10 and the backup device 20. Not only does the backup device detect whether the host device is faulty; the host device also checks the operating status of the backup device in real time. This prevents the situation in which the backup device fails before the host device without the system noticing, so that management is lost completely when the host device later fails and the backup device is needed to take over the management functions.

In the embodiment of the dual-machine hot backup system 100' shown in FIG. 2, the host device 10 and the backup device 20 are further configured so that, in response to either of them detecting that the other party has failed, the non-faulty party (10 or 20) records the failure of the faulty party (20 or 10) in the log, resets the faulty party (20 or 10) through the external dual reset mechanism 70, and synchronizes the clock and database to the faulty party (20 or 10). That is to say, once either the host device 10 or the backup device 20 detects that the other party has failed, the non-faulty party (host device 10 or backup device 20) records the failure of the faulty party (backup device 20 or host device 10) in its local log. In addition, the non-faulty party restarts and resets the faulty party through the external dual reset mechanism 70, so that both devices are online at the same time and redundancy is not lost. If the reset succeeds, the non-faulty party synchronizes the clock and database to the faulty party.

In a further embodiment of the dual-system hot backup system 100' of the present invention, the host device 10 and the backup device 20 are further configured so that, in response to the non-faulty party (10 or 20) being unable to restart and reset the faulty party (20 or 10) through the external dual reset mechanism 70 and/or the restart and reset failing, the non-faulty party (10 or 20) issues an alarm to notify operation and maintenance personnel. That is to say, the non-faulty party (host device 10 or backup device 20) may be unable to reset the faulty party (backup device 20 or host device 10) through the external dual reset mechanism 70, or the attempted restart and reset may not succeed. In that case, the non-faulty party issues an alarm so that operation and maintenance personnel can deal with the faulty party and eliminate the fault, thereby maintaining the effectiveness of the dual-system hot backup.
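
The failure handling of the two preceding paragraphs (log, reset, then either synchronize or alarm) may be sketched as follows. This is only an illustrative outline: the function `handle_peer_failure` and its four callable parameters are hypothetical stand-ins for the external dual reset mechanism 70, the synchronization over the first I3C bus, and the alarm channel, none of which is specified at this level in the application.

```python
import logging
from typing import Callable

log = logging.getLogger("hot_standby")


def handle_peer_failure(
    peer_name: str,
    reset_peer: Callable[[], bool],
    sync_clock_and_database: Callable[[], None],
    raise_alarm: Callable[[str], None],
) -> str:
    """Run by the non-faulty party once the other party is judged to have failed.

    All callables are hypothetical hooks; `reset_peer` returns True when the
    faulty party came back online after the reset.
    """
    # Record the failure of the faulty party in the local log.
    log.error("peer %s failed", peer_name)
    try:
        reset_ok = reset_peer()
    except OSError:
        # The reset mechanism itself could not be operated.
        reset_ok = False
    if reset_ok:
        # Reset succeeded: bring the recovered party's clock and database up to date.
        sync_clock_and_database()
        return "recovered"
    # Reset impossible or unsuccessful: notify operation and maintenance personnel.
    raise_alarm(f"{peer_name} could not be reset; manual intervention required")
    return "alarm"
```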

Fig. 3 shows a schematic structural diagram of an embodiment of the host device and the standby device of the dual-system hot backup system according to the present invention. In these embodiments, the host device 10 and the backup device 20 are preferably both CMC devices and the slave device 40 is, by way of example, a BMC device; the host device 10 and the backup device 20 have the same structure. The specific composition and functions of each module in the primary and standby CMC devices are explained below.

-System module:

The system module on the CMC device is the state machine of the entire system, and is the core scheduling module of the system. The scheduling, state judgment, and data flow among the various modules are all processed by the system module.

-Data module:

After the CMC host is powered on, it checks the BMC device on each node and establishes a parameter mapping database from the node device serial number, operating state parameters, and other parameters. The CMC maps all BMC parameters of all nodes to itself. In this way, the user only needs to access the CMC device to obtain the BMC data of all nodes; the CMC does not have to read the BMC parameters one by one when the user queries, which speeds up the response to users.
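
A minimal sketch of such a parameter mapping is given below for illustration only. The class `ParameterMap` and its methods are hypothetical; the reader callables stand in for a read of one BMC over the second I3C bus, and an in-memory dictionary is assumed in place of whatever database the CMC actually uses.

```python
class ParameterMap:
    """Mapping of node serial number -> BMC parameters, held on the CMC."""

    def __init__(self):
        self._nodes = {}

    def load_at_startup(self, bmcs):
        """Collect the parameters of every BMC once, when the CMC powers on.

        `bmcs` maps a node serial number to a callable that reads that BMC
        (a hypothetical stand-in for a bus transaction).
        """
        for serial, read_parameters in bmcs.items():
            self._nodes[serial] = dict(read_parameters())

    def query(self, serial, name):
        """Answer a user query from the local mapping; no BMC is contacted."""
        return self._nodes[serial][name]

    def update(self, serial, changes):
        """Apply parameter changes reported for one node."""
        self._nodes.setdefault(serial, {}).update(changes)
```

The point of the design is visible in `query`: once the mapping is loaded, user queries are served from the CMC's own copy, so the number of bus reads does not grow with the number of queries.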

-Synchronization module:

The present invention connects the two CMC devices with an I3C bus, which carries data transmission and synchronization between them so that their data is synchronized in real time. The I3C bus has a rate of 33Mbps, a soft interrupt mechanism, and a checksum fault tolerance mechanism.

In the concept of the present invention, synchronization follows three principles: first, the clocks are synchronized; second, whichever device's data changes is responsible for initiating synchronization; third, if data changes on both devices at the same time, the last setting prevails.

In the present invention, since both CMCs can be accessed by the user, the user may modify the data of both CMCs while synchronization is in progress. So that each CMC knows the order of the modifications, the modification time is added to the synchronization data package.

1. Regarding clock synchronization, the backup machine, on power-on, actively initiates a time synchronization request to the host to ensure that the two devices keep consistent time. The synchronized time is used to determine the final effective modification when the host and the backup modify the data at the same time.

2. Regarding data modification synchronization, the two systems use the same data storage format. When the stored data of one CMC changes, its data synchronization module packages "original data + modified data + modification time" and sends the package to the other CMC. After receiving the synchronized data, the other CMC first determines whether the original data is the same as its local data. If it is the same, the local data has not changed, and the CMC applies the modified data directly. If the original data differs from the local data, the local data has also been modified; the CMC then compares the modification times and synchronizes to the modified data with the newer time to maintain data consistency.

3. Regarding data synchronization after device failure recovery, the system adopts a two-way heartbeat packet detection mechanism. After either CMC device fails, the other CMC device detects the failure, records it in the LOG log, and forcibly resets the failed CMC device to restore normal operation. Take CMC0 (host device 10) as an example: if CMC0 fails, CMC1 (standby device 20) promptly finds that the heartbeat signal of CMC0 can no longer be detected. CMC1 then determines that CMC0 is faulty, records the LOG log, and forcibly resets CMC0 through the reset signal IO line. When the faulty CMC0 restarts and recovers, it promptly asks CMC1 for its working mode and whether data must be forcibly synchronized. CMC1 replies with its own working mode and the need for forced synchronization, and sends all the synchronized data plus a mandatory update flag to CMC0; CMC0 then updates its data to be completely consistent. After that, CMC1 also triggers the mode switching process.

4. If two CMCs are just powered on and started, both CMCs are in idle mode, and the backup device 20 actively synchronizes data from the host device 10.
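
For illustration, the packaged data of item 2 and the mandatory update flag of item 3 could be carried in a frame such as the following. The byte layout, the JSON payload, the flag value `FLAG_FORCE_UPDATE` and the trailing CRC-32 are all assumptions made for this sketch; the application only states that the package contains the original data, the modified data and the modification time, and that a mandatory update flag accompanies a full synchronization after recovery.

```python
import json
import struct
import zlib

FLAG_FORCE_UPDATE = 0x01  # hypothetical flag bit: receiver must overwrite local data

# Assumed header layout: flags (1 byte), modification time (double), payload length (4 bytes).
_HEADER = struct.Struct("<BdI")


def pack_sync(original, modified, modified_at, force=False):
    """Build one frame: header + JSON payload + CRC-32 of everything before it."""
    payload = json.dumps({"original": original, "modified": modified}).encode("utf-8")
    flags = FLAG_FORCE_UPDATE if force else 0
    body = _HEADER.pack(flags, modified_at, len(payload)) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_sync(frame):
    """Verify the check code and recover the fields of a frame."""
    body = frame[:-4]
    (crc,) = struct.unpack("<I", frame[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch")
    flags, modified_at, length = _HEADER.unpack_from(body)
    start = _HEADER.size
    payload = json.loads(body[start:start + length].decode("utf-8"))
    return {
        "original": payload["original"],
        "modified": payload["modified"],
        "modified_at": modified_at,
        "force": bool(flags & FLAG_FORCE_UPDATE),
    }
```

A receiver that sees `force` set would skip the comparison of original and local data and simply adopt the received data, which matches the behavior described for CMC0 after it has been reset.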

-Detection module:

The invention adopts a two-way heartbeat detection mechanism for fault detection, so that a problem on either machine is found by the other, which then resets it in time, restores the backup state, records logs, and informs operation and maintenance personnel. The detection module in the system is responsible for generating the heartbeat signal of its own device and for monitoring the heartbeat signal of the other party's device.

1. The detection module is responsible for collecting the working status of the local machine and detecting abnormally suspended threads. When the machine is working normally, the fault detection module generates a continuous pulse signal on the heartbeat signal IO. When an abnormality such as a hung local thread is detected, the heartbeat signal output stops. In the event of a crash, a runaway program, or the like, the detection module naturally fails to output the signal.

2. Measuring the other party's heartbeat signal: the detection module monitors in real time whether a heartbeat signal is present on the heartbeat signal IO of the other party's CMC device. When the other party's pulse signal has not been detected for more than, for example, one second, the other party's device is preliminarily determined to be abnormal.

3. Confirming the other party's heartbeat signal: the detection module immediately initiates an inquiry to the other party through the bus. If the other party does not respond, the other party's device is determined to be faulty. If the other party responds, the heartbeat signal is checked again; if it has recovered, the other party is considered to have been temporarily suspended and can continue to work. If the heartbeat is still not detected, the other party's detection module is abnormal, and the other party's device is likewise judged to be faulty; after the data is synchronized, the other party is restarted.
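
The measuring and confirming steps above may be sketched as a small decision routine. The class `HeartbeatMonitor`, the result strings, and the hooks `inquire_peer` and `recheck_pulse` are hypothetical; the one-second timeout is the example value given in the text, and the clock is injectable only so that the sketch can be exercised without waiting.

```python
import time

HEARTBEAT_TIMEOUT_S = 1.0  # example value from the text ("more than one second")


class HeartbeatMonitor:
    """Watches the other party's heartbeat IO line and classifies its state."""

    def __init__(self, clock=time.monotonic, timeout=HEARTBEAT_TIMEOUT_S):
        self._clock = clock
        self._timeout = timeout
        self._last_pulse = clock()

    def on_pulse(self):
        """Call whenever a pulse is seen on the other party's heartbeat line."""
        self._last_pulse = self._clock()

    def pulse_missing(self):
        return self._clock() - self._last_pulse > self._timeout

    def assess(self, inquire_peer, recheck_pulse):
        """Return 'ok', 'suspended' or 'faulty'.

        `inquire_peer` asks the other party over the bus and returns True if
        it replies; `recheck_pulse` samples the heartbeat line again and
        returns True if pulses are back. Both are hypothetical hooks.
        """
        if not self.pulse_missing():
            return "ok"
        if not inquire_peer():
            return "faulty"     # no heartbeat and no reply on the bus
        if recheck_pulse():
            self.on_pulse()
            return "suspended"  # temporary stall; the peer can continue to work
        return "faulty"         # replies on the bus, but its detection module is abnormal
```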

-Reset module:

The fault handling mechanism of the reset module is that when the fault detection module determines that the other party's device is faulty, the own CMC device immediately takes over all the work, and in addition, restarts the faulty device through the reset signal line. The specific mode switching, the data synchronization mode and the working mode switching after the failure recovery are all introduced above, and will not be repeated.

The fault recovery module of the local machine consists mainly of a watchdog and an external reset signal line. When the system runs abnormally, it stops feeding the watchdog; after a period of time the watchdog times out and restarts the system. If the other party's device detects the failure of the local device first, the other party resets the local device through the reset signal line. The watchdog generally has a long delay (4 seconds in this system design, adjustable according to the actual situation). Under normal circumstances, the other party's CMC detects the system fault and resets the device before the watchdog does. The watchdog provides a reset when both CMCs fail at the same time, for example due to external interference. If the watchdog timeout is short enough, or the heartbeat detection mechanism is slow, the watchdog resets first.
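
The timing relationship between the watchdog and the reset by the other party can be made concrete with a short model. The sketch below is illustrative only: `Watchdog` and `first_reset_source` are hypothetical names, the 4-second timeout is the design value given above, and the 1-second detection delay of the other party is an assumed figure based on the heartbeat example.

```python
class Watchdog:
    """Software model of a watchdog timer that must be fed periodically."""

    def __init__(self, timeout_s=4.0):
        self.timeout_s = timeout_s
        self._last_feed = 0.0

    def feed(self, now):
        self._last_feed = now

    def expired(self, now):
        return now - self._last_feed >= self.timeout_s


def first_reset_source(hang_time, watchdog_timeout_s=4.0, peer_detect_delay_s=1.0):
    """Return (source, time) of the first reset after the device hangs at `hang_time`.

    The device is assumed to have fed the watchdog right up to the hang.
    `peer_detect_delay_s` is the time the other party needs to notice and
    confirm the fault; None models the case where the other party has
    failed as well.
    """
    watchdog_at = hang_time + watchdog_timeout_s
    if peer_detect_delay_s is None:
        return "watchdog", watchdog_at
    peer_at = hang_time + peer_detect_delay_s
    if peer_at <= watchdog_at:
        return "peer", peer_at
    return "watchdog", watchdog_at
```

With the default values the other party resets the hung device after one second, well before the watchdog expires; the watchdog becomes the effective mechanism only when the other party is unavailable or slower than the watchdog timeout.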

-Alarm module:

In addition to logging and alarming on abnormal conditions of various functions and services during normal operation of the CMC device, the alarm module also records and alarms on the failure of the other CMC device, so that maintenance personnel can learn the operating status of both CMC devices in time.

When the fault detection module determines that the other party's CMC device is faulty, the local machine immediately records the situation in the LOG log and raises a general alarm through LEDs and by reporting to the remote server. If the faulty device has two or more faults in a day, a severe alarm is reported. If the faulty device cannot be restored by resetting, the CMC device continuously reports a fatal fault alarm. This alarm persists: even if the other CMC device resumes service after a reset, the alarm is not cancelled until maintenance personnel clear it manually.
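
The three alarm levels described above may be sketched as follows. The class `AlarmPolicy` and the level names are hypothetical, and "in a day" is read here as a sliding 24-hour window, which is an assumption; the application does not say whether a calendar day is meant.

```python
from collections import deque

DAY_S = 24 * 60 * 60


class AlarmPolicy:
    """Chooses the alarm level for faults observed on the other CMC device."""

    def __init__(self):
        self._fault_times = deque()
        self.latched_fatal = False  # stays set until maintenance personnel clear it

    def on_peer_fault(self, now, reset_succeeded):
        """Return 'general', 'severe' or 'fatal' for a fault seen at `now` (seconds)."""
        self._fault_times.append(now)
        # Forget faults that are a full day old or older (assumed sliding window).
        while self._fault_times and now - self._fault_times[0] >= DAY_S:
            self._fault_times.popleft()
        if not reset_succeeded:
            self.latched_fatal = True
        if self.latched_fatal:
            return "fatal"
        return "severe" if len(self._fault_times) >= 2 else "general"

    def clear_by_maintenance(self):
        """Manual elimination of the fatal alarm by maintenance personnel."""
        self.latched_fatal = False
```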

-Management module:

The CMC equipment is responsible for information collection and cooling fan control on each node of the entire cabinet/box, and the display control and management of the buttons and indicators on the front panel. The management module specifically includes at least the following content:

1. The CMC device collects data on each node, not in the form of periodic polling, but in the form of active reporting by the BMC of each node.

2. When the parameters on a node BMC change, the BMC, after handling the management of the node, initiates a communication request to the CMC as soon as possible and synchronizes the parameters to the mapping area of the CMC, so that the CMC parameters stay consistent with the BMC parameters. Since the BMC initiates a soft interrupt through the I3C bus, the request can only be initiated when the I3C bus is idle. Therefore, when the bus is detected to be busy, the BMC delays the initiation for a period of time (preferably 10ms in the present invention, adjustable according to actual conditions).

3. When the user configures the BMC device of a node through the CMC, the CMC first verifies that the parameters are legal, then initiates communication with the node BMC and configures the parameters on the BMC. After the configuration succeeds, the CMC modifies the parameters in its mapping of the node to ensure the consistency of parameters.
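
The active reporting of item 2, including the delay applied while the bus is busy, may be sketched as follows. The function `report_change` and the hooks `bus_is_idle` and `send_request` are hypothetical stand-ins for the I3C driver; the 10 ms delay is the preferred value given in the text, whereas the upper bound `max_attempts` is an added safeguard that the application does not mention.

```python
import time

BUS_BUSY_DELAY_S = 0.010  # 10 ms, the preferred delay named in the text


def report_change(bus_is_idle, send_request, changes,
                  delay_s=BUS_BUSY_DELAY_S, max_attempts=50, sleep=time.sleep):
    """BMC side: push changed parameters to the CMC once the bus is idle.

    Returns the number of delays that were needed. Raises TimeoutError if
    the bus never became idle within `max_attempts` checks (assumed bound).
    """
    for attempt in range(max_attempts):
        if bus_is_idle():
            # Bus idle: initiate the communication request to the CMC.
            send_request(changes)
            return attempt
        # Bus busy: postpone the request and check again later.
        sleep(delay_s)
    raise TimeoutError("I3C bus stayed busy")
```

Because the BMC reports only when something changes, the bus carries no periodic polling traffic, which is the stated reason for preferring active reporting in item 1.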

-Network module:

The CMC device provides external human-computer interaction interfaces 80, such as web services and a command line, which are used for remote device management, firmware update, or fault reporting to the remote control center.

-Upgrade module:

The upgrade module in the CMC device is mainly responsible for upgrading the system, in two parts: one is the upgrade of the CMC device's own firmware, and the other is the firmware upgrade on each node. The CMC upgrade module is also responsible for determining the consistency of the upgrade package. In the present invention, the user can manage the upgrade of the BMC, BIOS, CPLD, and other firmware of each node through the CMC. In current designs, because of the low I2C bus rate, the CMC cannot upgrade the firmware on a node through the internal I2C bus and must rely on the LAN; when a node has a problem with the LAN, the node cannot be upgraded. In the present invention, high-speed I3C communication makes firmware upgrade through the internal I3C bus practical: even without the LAN, the CMC can still upgrade the firmware of a node.

1. Since there is BMC data mapping of each node on the CMC, the user only needs to log in to the CMC to upgrade the firmware on each node. The user first selects a node to upgrade the firmware. CMC will list the upgradeable firmware according to the model of the node. The user uploads the firmware upgrade package to the CMC. CMC will judge the compatibility of the firmware upgrade according to the model of the node. If the model version is incorrect, the user will be prompted to terminate the upgrade operation, and the subsequent upgrade operation will only be performed if the upgrade conditions are met.

2. There are two ways for the CMC to transmit firmware upgrade package data to the node BMC: via the LAN, or via the internal I3C bus. In the present invention, direct data synchronization and interaction between the CMC and the BMC always take place on the I3C bus, and the firmware upgrade package is relatively large. To reduce the load on the I3C bus, the present invention preferably transmits the firmware upgrade package to the node BMC via the LAN, and uses the I3C bus only when the LAN link fails. To prevent the transmission of a firmware upgrade package of tens of megabytes from occupying the bus for too long and delaying the synchronization of CMC and BMC data, the CMC divides the firmware upgrade package into several small pieces, each with a number and a check code, and transmits them via the I3C bus at intervals. In this way, data synchronization can still be performed during the transmission intervals. Each time the BMC receives a piece of data, it verifies the data, unpacks it, and stores it. When the verification fails, the BMC informs the CMC to resend that piece. After all the pieces have been transmitted to the BMC, the BMC combines them to restore the complete firmware upgrade package. Regardless of whether the CMC transmits the data via the LAN or the internal I3C bus, the BMC caches the upgrade package and verifies its integrity. If the verification passes, it returns OK to the CMC; if the verification fails, it returns an upgrade package verification failure.

3. The CMC informs the BMC through the I3C bus to upgrade the firmware. The BMC again determines whether the firmware upgrade package meets the upgrade requirements of the node's firmware. If it does not, the BMC notifies the CMC to terminate the upgrade operation. If it does, the BMC starts to upgrade the firmware, and feedback is given to the user to improve user-friendliness.

When the CMC receives feedback from the BMC that the firmware upgrade has succeeded, it prompts the user that the upgrade is complete, and the upgrade process ends.
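
The piecewise transmission described in item 2 (numbered pieces, a check code per piece, resending on verification failure, and reassembly on the BMC) may be sketched as follows. The names `split_firmware`, `ChunkReceiver` and `transfer` are hypothetical, CRC-32 is assumed as the check code, and the `channel` callable merely models the bus by returning the bytes that arrive at the BMC; the pauses between pieces that leave room for synchronization traffic are omitted for brevity.

```python
import zlib


def split_firmware(image, chunk_size=1024):
    """Yield (number, check code, data) for each piece of the upgrade package."""
    for number, offset in enumerate(range(0, len(image), chunk_size)):
        data = image[offset:offset + chunk_size]
        yield number, zlib.crc32(data), data


class ChunkReceiver:
    """BMC side: verify each piece, store it, and reassemble at the end."""

    def __init__(self):
        self._pieces = {}

    def accept(self, number, check_code, data):
        """Return True if the piece was stored, False to request a resend."""
        if zlib.crc32(data) != check_code:
            return False
        self._pieces[number] = data
        return True

    def assemble(self, total):
        """Combine the pieces into the complete package."""
        if sorted(self._pieces) != list(range(total)):
            raise ValueError("missing pieces")
        return b"".join(self._pieces[n] for n in range(total))


def transfer(image, channel, receiver, chunk_size=1024, max_retries=3):
    """CMC side: send the pieces one by one, resending any that fail verification."""
    total = 0
    for number, code, data in split_firmware(image, chunk_size):
        for _ in range(max_retries + 1):
            if receiver.accept(number, code, channel(data)):
                break
        else:
            raise OSError(f"piece {number} failed after {max_retries} retries")
        total += 1
    return receiver.assemble(total)
```

A whole-package integrity check on the reassembled result, as the text requires for both the LAN and the I3C path, would follow `assemble`; it is left out here because the application does not say which check is used.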

The system module, data module, synchronization module, detection module, reset module, alarm module, management module, network module and upgrade module introduced above constitute the functional structure of the host device 10 and the backup device 20 in the dual-system hot backup system 100 and/or 100' according to the present invention. The aforementioned embodiments of the dual-machine hot backup system 100 and/or 100' can thus be constructed to complete the corresponding functions and achieve the corresponding technical effects.

It should be understood that, where technically feasible, the technical features listed above for different embodiments can be combined with each other to form another embodiment within the scope of the present invention. In addition, the specific examples and embodiments described herein are non-limiting, and corresponding modifications may be made to the structure, position, and sequence set forth above without departing from the protection scope of the present invention.

In this application, the use of disjunctive conjunctions is intended to include the conjunctive. The use of definite or indefinite articles is not intended to indicate a cardinal number. Specifically, references to "the" object or "a" and "an" objects are intended to indicate a possible one of a plurality of such objects. However, although the elements disclosed in the embodiments of the present invention may be described or required in an individual form, they may also be understood as plural unless explicitly limited to a singular number. In addition, the conjunction "or" can be used to convey co-existing features, rather than mutually exclusive solutions. In other words, the conjunction "or" should be understood to include "and/or". The term "including" is inclusive and has the same scope as "comprising".

The above-mentioned embodiments, especially any "preferred" embodiments are possible examples of implementations, and are presented only for a clear understanding of the principles of the present invention. Many changes and modifications can be made to the above-mentioned embodiment without basically departing from the spirit and principle of the technology described herein. All modifications are intended to be included within the scope of this disclosure.

Claims

1. A dual-machine hot backup system, characterized in that the system includes:

A host device;

A backup device, where the backup device communicates with the host device through a first I3C bus;

At least one slave device, the at least one slave device being in communication connection with the host device and the backup device through a second I3C bus;

Wherein the host device is configured to collect the parameters of the slave device in response to the startup of the dual-machine hot backup system, store the mapping of the parameters in a database, and synchronize the mapping of the parameters to the backup device through the first I3C bus; and the host device is configured to manage the slave device through the second I3C bus based on a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the backup device through the first I3C bus;

The backup device is configured to, in response to receiving a management instruction from the outside, forward the received management instruction to the host device through the first I3C bus.

2. The system according to claim 1, wherein the backup device is further configured to:

In response to receiving an emergency management instruction from the outside, forcibly occupy the second I3C bus to manage the slave device, generate a parameter mapping according to the changed parameters, and synchronize the emergency management instruction and the mapping of the changed parameters to the host device.

3. The system according to claim 1, wherein the host device is further configured to:

In response to the host device entering the upgrade mode and/or the resource occupation exceeding the threshold, notify the backup device through the first I3C bus to temporarily take over the management of the slave device; and

In response to the host device exiting the upgrade mode and/or the resource occupation no longer exceeding the threshold, notify the backup device through the first I3C bus to stop taking over the management of the slave device.

4. The system according to claim 3, wherein the backup device is further configured to:

In response to receiving the temporary takeover notification from the host device, manage the slave device through the second I3C bus, and generate a parameter mapping according to the changed related parameters; and

In response to receiving the stop-takeover notification from the host device, stop managing the slave device and synchronize the mapping of the changed parameters to the host device through the first I3C bus.

5. The system according to claim 1, wherein the backup device is further configured to:

In response to the startup of the dual-system hot backup system, actively initiate a clock synchronization request to the host device.

6. The system according to claim 1, wherein the host device and the backup device are further configured to:

In response to the synchronization of the mapping of parameters initiated by either of the host device and the backup device, have the initiator generate and send synchronized packaged data including original data, modified data, and modification time to the other party.

7. The system according to claim 6, wherein the host device and the backup device are further configured to:

In response to the host device and/or the backup device receiving the synchronized packaged data sent by the other party, compare the original data therein with the local data;

In response to the original data being the same as the local data, modify the local data according to the modified data;

In response to the original data being different from the local data, compare the modification time in the received synchronized packaged data with the local modification time, and synchronize using the modified data with the newer modification time as the standard.

8. The system according to claim 1, wherein the host device and the backup device are further configured to:

Detect the other party's operating status through a two-way physical IO heartbeat detection mechanism.

9. The system according to claim 1, wherein the host device and the backup device are further configured so that:

In response to either the host device or the backup device detecting that the other party has failed, the non-faulty party records the failure of the faulty party in the log, resets the faulty party through the external dual reset mechanism, and performs clock and database synchronization on the faulty party.

10. The system according to claim 9, wherein the host device and the backup device are further configured so that:

In response to the non-faulty party being unable to restart and reset the faulty party through the external dual reset mechanism and/or the restart and reset failing, the non-faulty party issues an alarm to notify the operation and maintenance personnel to handle it.