CN110750480B

CN110750480B - Dual-computer hot standby system

Info

Publication number: CN110750480B
Application number: CN201910995329.9A
Authority: CN
Inventors: 韩红瑞; 黄柏学
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2019-10-18
Filing date: 2019-10-18
Publication date: 2021-06-29
Anticipated expiration: 2039-10-18
Also published as: CN110750480A; WO2021073105A1

Abstract

The invention relates to a dual-computer hot standby system comprising: a host device; a standby device communicatively connected to the host device through a first I3C bus; and at least one slave device communicatively connected to the host device and the standby device through a second I3C bus. The host device is configured to collect parameters of the slave device in response to startup of the dual-computer hot standby system, store a mapping of the parameters in a database, and synchronize the mapping to the standby device through the first I3C bus; and to manage the slave device through the second I3C bus based on a management instruction, generate a mapping from the changed parameters, and synchronize that mapping to the standby device through the first I3C bus. The standby device is configured to forward a management instruction received from outside to the host device through the first I3C bus. The system addresses the insufficient reliability caused by excessive dependence on the LAN and improves the safety and operating efficiency of the whole server system.

Description

Dual-computer hot standby system

Technical Field

The invention relates to the technical field of servers, and in particular to a dual-computer hot standby system.

Background

With the rapid development of AI and internet technology, a single server unit can no longer meet the processing requirements of massive data. Large-scale parallel computer system architectures offer strong scalability, high computing capability, and support for unified management, and therefore increasingly meet the requirements that the big data era places on server products. As a result, current server systems are growing in physical volume, module complexity, and integration level. As server functions and node counts increase, monitoring and management become more challenging, and the redundancy requirements on the system become ever higher.

In existing redundant monitoring and management systems for multi-node, large-scale, high-density servers, low-speed communication buses such as I2C or serial ports are usually used inside the system, and they cannot meet the requirements of interaction and synchronization between the host and the standby. Data synchronization between the host and the standby therefore has to rely heavily on an external LAN and switch. The biggest disadvantage of an external network cable is its reliability risk: the cable may make poor contact or be unplugged by hand, and the switch may even be restarted. Once the LAN fails, effective communication between the host and the standby is interrupted, and the management system may become inconsistent and impossible to control remotely. Furthermore, the host and the standby are two independently operating entities, so their data inevitably diverge after they have run one after the other or simultaneously. Even if the LAN recovers after a period of time, neither the host nor the standby knows whose data is the most recent or which side should synchronize to the other, and a split-brain condition may therefore occur.

Therefore, some schemes add a secondary check over a serial port when detecting a host fault, so that the standby can still judge the real operating state of the host after the LAN is interrupted. However, when the host's LAN connection is lost, the standby device can still confirm over the serial port that the host is alive; the standby therefore does not take over the host's work and remains in the standby state. In this situation, data cannot be synchronized between the host and the standby through the LAN, and the user cannot remotely control or access the host device through the LAN. Even if the user logs in to the standby device through the LAN, the data returned is not the latest data.

It is therefore necessary to address the problem that the I2C buses, serial ports, and similar links used in current multi-node server schemes are too slow to support interaction and synchronization between the host and the standby, which forces data synchronization to rely heavily on an external LAN and switch, and to provide a mechanism for safe and reliable internal communication between the host and the standby.

Disclosure of Invention

In one aspect, based on the above object, the present invention provides a dual-computer hot standby system, which includes:

a host device;

a standby device communicatively connected to the host device through a first I3C bus; and

at least one slave device communicatively connected to the host device and the standby device through a second I3C bus;

wherein the host device is configured to: in response to startup of the dual-computer hot standby system, collect parameters of the slave device, store a mapping of the parameters in a database, and synchronize the mapping to the standby device through the first I3C bus; and, based on a management instruction, manage the slave device through the second I3C bus, generate a mapping from the changed parameters, and synchronize the mapping of the changed parameters to the standby device through the first I3C bus; and

the standby device is configured to, in response to receiving a management instruction from outside, forward the received management instruction to the host device through the first I3C bus.

In an embodiment of the dual-computer hot-standby system according to the present invention, the standby device is further configured to: in response to receiving an emergency management instruction from outside, forcibly and temporarily occupy the second I3C bus to manage the slave device, generate a mapping from the changed parameters, and synchronize the emergency management instruction and the mapping of the changed parameters to the host device through the first I3C bus.

In an embodiment of the dual-server hot-standby system according to the invention, the host device is further configured to: in response to the host device entering an upgrade mode and/or the resource occupancy exceeding a threshold, notifying the standby device via the first I3C bus to temporarily take over management of the slave device; and in response to the master device exiting the upgrade mode and/or the resource occupancy no longer exceeding the threshold, notify the standby device via the first I3C bus to stop taking over management of the slave device.

In an embodiment of the dual-computer hot-standby system according to the present invention, the standby device is further configured to: in response to receiving a notification of temporary takeover of the master device, managing the slave device through the second I3C bus and generating a mapping of parameters according to the changed related parameters; and in response to receiving a notification of the master device to stop takeover, stopping managing the slave device and synchronizing the mapping of the changed parameters to the master device over the first I3C bus.

In an embodiment of the dual-computer hot-standby system according to the present invention, the standby device is further configured to: in response to startup of the dual-computer hot standby system, actively initiate a clock synchronization request to the host device.

In an embodiment of the dual-computer hot-standby system according to the present invention, the host device and the standby device are further configured to: in response to either of the host device and the standby device initiating synchronization of the parameter mapping, have the initiator generate synchronous packed data comprising the original data, the modified data, and the modification time, and send the synchronous packed data to the other party.

In an embodiment of the dual-computer hot-standby system according to the present invention, the host device and the standby device are further configured to:

in response to the host device and/or the standby device receiving synchronous packed data sent by the other party, compare the original data in the synchronous packed data with the local data;

in response to the original data being the same as the local data, modify the local data according to the modified data; and

in response to the original data being different from the local data, compare the modification time in the received synchronous packed data with the local modification time, and synchronize using the modified data with the newer modification time.

In an embodiment of the dual-computer hot-standby system according to the present invention, the host device and the standby device are further configured to: detect the running state of the other party through a bidirectional physical IO heartbeat detection mechanism.

In an embodiment of the dual-computer hot-standby system according to the present invention, the host device and the standby device are further configured to: in response to either of the host device and the standby device detecting that the other party has failed, have the non-faulty party log the fault condition of the faulty party, reset the faulty party through an external dual reset mechanism, and synchronize the clock and database of the faulty party.

In an embodiment of the dual-computer hot-standby system according to the present invention, the host device and the standby device are further configured to: in response to the non-faulty party being unable to restart and reset the faulty party through the external dual reset mechanism and/or the restart and reset failing, have the non-faulty party raise an alarm to notify operation and maintenance personnel.

By adopting the above technical scheme, the invention has at least the following beneficial effects. The I2C buses, serial ports, and similar links used in current multi-node server schemes are too slow to support interaction and synchronization between the host and the standby, so data synchronization has to rely heavily on an external LAN and switch. To address this, the invention builds the internal communication architecture of a multi-node server dual-computer hot standby system on I3C buses, using two I3C buses to construct, respectively, the host-standby communication path and the communication path between the host or standby and the slaves. At startup, the host device collects the state of the slave devices through the I3C bus, builds the mapping, and stores it in the database. When a user requests the operating parameters of a slave device, the host device therefore does not have to fetch them from the slave device after receiving the request; it can return the information recorded in the database directly. Whenever the system starts or the host device manages a slave device, the host device synchronizes the corresponding information to the standby device, which keeps the data in the standby device consistent. The standby device may also be accessed directly from outside, for example by a user issuing a command; on receiving a management instruction, the standby device forwards it to the host device through the I3C bus, and the host device manages the slave device accordingly. The dual-computer hot standby system of the invention thus improves the efficiency of communication inside the system, avoids the reliability problems of depending on an external LAN and switch, and, while keeping host and standby data consistent, puts the resources of the standby device to some use, further improving the safety and operating efficiency of the whole multi-node server system.

The present invention provides aspects of embodiments, which should not be used to limit the scope of the present invention. Other embodiments are contemplated in accordance with the techniques described herein, as will be apparent to one of ordinary skill in the art upon study of the following figures and detailed description, and are intended to be included within the scope of the present application.

Embodiments of the invention are explained and described in more detail below with reference to the drawings, but they should not be construed as limiting the invention.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the description of the prior art and the embodiments will be briefly described below, parts in the drawings are not necessarily drawn to scale, and related elements may be omitted, or in some cases the scale may have been exaggerated in order to emphasize and clearly show the novel features described herein. In addition, the structural positions may be arranged differently, as is known in the art.

Fig. 1 shows a schematic diagram of an embodiment of a dual-computer hot-standby system according to the present invention;

Fig. 2 shows a schematic diagram of a further embodiment of a dual-computer hot-standby system according to the present invention;

Fig. 3 shows a schematic block diagram of an embodiment of a host device and a standby device of a dual-computer hot-standby system according to the present invention.

Detailed Description

While the present invention may be embodied in various forms, there is shown in the drawings and will hereinafter be described some exemplary and non-limiting embodiments, with the understanding that the present disclosure is to be considered an exemplification of the invention and is not intended to limit the invention to the specific embodiments illustrated.

Fig. 1 shows a schematic diagram of an embodiment of a dual-server hot-standby system 100 according to the present invention. The dual-server hot-standby system 100 according to the present invention is particularly used for control management of a multi-node server system. In the embodiment shown in fig. 1, the dual-computer hot-standby system 100 at least includes:

a host device 10;

a standby device 20, the standby device 20 being communicatively connected to the host device 10 via a first I3C bus 30;

at least one slave device 40, the at least one slave device 40 being communicatively connected to the host device 10 and the standby device 20 via a second I3C bus 50;

the host device 10 is configured to collect parameters of the slave device 40 in response to the startup of the dual-computer hot standby system 100, store the mapping of the parameters in a database, and synchronize the mapping to the standby device 20 through the first I3C bus 30, and is configured to manage the slave device 40 through the second I3C bus 50 based on a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the standby device 20 through the first I3C bus 30;

the standby device 20 is configured to forward a received management instruction to the host device 10 through the first I3C bus 30 in response to receiving the management instruction from the outside.

Since the dual-computer hot-standby system 100 according to the present invention is particularly used for control and management of a multi-node server system, the host device 10 and the standby device 20 are preferably CMCs (Chassis Management Controllers). A CMC functions similarly to a BMC and, in a multi-node server system such as a blade server, can send commands to each node to manage the entire system. The slave device 40 is preferably a BMC (Baseboard Management Controller) of the multi-node server system; a BMC can perform operations such as firmware upgrades and device inspection even when the machine is not powered on.

In addition, the dual-computer hot-standby system 100 according to the present invention uses I3C buses for its communication connections. I3C is a two-wire serial communication bus that combines key attributes of the I2C and SPI buses and is compatible with the I2C protocol. It adds features such as multi-master operation, slave-initiated in-band (soft) interrupts, dynamic slave address allocation, and hot-plug support, reaches speeds of up to 33 Mbps, and is generally used to connect sensors to an application processor. Further, to isolate host-standby communication from slave management, so that the two do not interfere with each other and the load on each bus is reduced, a first I3C bus 30 is used between the host and the standby (10 and 20), and a second I3C bus 50 is used between the host or standby and the slaves (10 and 40, and 20 and 40).

Since the I3C bus is compatible with the I2C protocol, the dual-computer hot-standby system 100 according to the present invention can provide all the functions of a system originally built on an I2C bus, and it adds new functions as well. The host device 10 is configured to collect parameters of the slave device 40 in response to startup of the dual-computer hot-standby system 100, store a mapping of the parameters in a database, and synchronize the mapping to the standby device 20 through the first I3C bus 30. That is, when the dual-computer hot-standby system 100 is first powered on, the CMC host device 10 checks the BMC slave device 40 on each node and builds a parameter mapping database from the node device serial numbers, the various operating parameters, and other parameters. The CMC host device 10 thus keeps a mapped copy of the parameters of all node BMC slave devices. When a user requests a parameter of a slave device 40, the user only needs to access the CMC host device 10, which holds the corresponding parameters of all slave devices 40; the CMC host device 10 does not have to read the parameter from the slave device 40 at query time, which speeds up the response to the user's instruction. In addition, the host device 10 synchronizes the database to the standby device 20 through the first I3C bus 30 to keep host and standby data consistent.
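The startup behaviour described above can be illustrated with a minimal sketch. The patent does not define data structures, so the `Slave` and `Controller` classes, the serial-number keys, and the parameter names are hypothetical, and an in-memory copy stands in for the transfer over the first I3C bus.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Hypothetical stand-in for a node BMC reachable over the second I3C bus."""
    serial: str
    params: dict


@dataclass
class Controller:
    """Hypothetical stand-in for a CMC (host or standby) holding the mapping database."""
    db: dict = field(default_factory=dict)


def collect_and_sync(host, standby, slaves):
    # The host polls every slave once at startup and builds serial -> parameters.
    for slave in slaves:
        host.db[slave.serial] = dict(slave.params)
    # Assumption: a deep-enough copy models the synchronization over the
    # first I3C bus, so the standby holds its own independent mapping.
    standby.db = {serial: dict(p) for serial, p in host.db.items()}


def query(controller, serial, name):
    # User queries are answered from the local mapping, not from the slave.
    return controller.db[serial][name]
```

After `collect_and_sync` has run, `query` can be served by either controller without touching the second bus, which is the response-time benefit the paragraph describes.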

Another added function is that the host device 10 is configured to manage the slave device 40 through the second I3C bus 50 based on a management instruction, generate a mapping from the changed parameters, and synchronize the mapping of the changed parameters to the standby device 20 through the first I3C bus 30. That is, when, for example, a user accesses the host device 10 from outside and issues a management instruction, and/or the host device 10 generates a management instruction according to a preset control policy, the host device 10 manages and controls the slave device 40 through the second I3C bus 50. The host device 10 also writes the mapping of the affected parameters into its database and synchronizes that mapping to the standby device 20 in real time through the first I3C bus 30, so that when parameters of the slave device 40 change during management, the data is updated in real time and the standby's copy stays synchronized.

In addition, the functions of the dual-computer hot-standby system 100 according to the present invention also include developing and utilizing the resources of the standby device 20 to a certain extent, that is, the standby device 20 is also allowed to be directly accessed by the outside, for example, by the user. In the concept of the present invention, the host device 10 and the standby device 20 have respective communication addresses. Therefore, when the user directly accesses the standby device 20 from the outside according to the communication address and issues the management instruction, the standby device 20, after receiving the corresponding management instruction, forwards the management instruction to the host device 10 through the first I3C bus 30 so that the host device 10 performs corresponding management on the slave device 40.
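A minimal sketch of this forwarding path follows. It is illustrative only: the `Host` and `Standby` classes, the `(serial, name, value)` instruction tuple, and the direct method calls that stand in for traffic on the two I3C buses are all assumptions not found in the patent.

```python
class Host:
    """Hypothetical host CMC: applies instructions and keeps the mapping current."""

    def __init__(self, slaves):
        self.slaves = slaves  # serial -> live parameters (reached via the second bus)
        self.db = {s: dict(p) for s, p in slaves.items()}
        self.standby = None

    def handle(self, instr):
        serial, name, value = instr
        self.slaves[serial][name] = value   # manage the slave (second I3C bus)
        self.db[serial][name] = value       # update the local mapping
        if self.standby is not None:
            # Synchronize only the changed mapping (first I3C bus).
            self.standby.receive_sync(serial, name, value)


class Standby:
    """Hypothetical standby CMC: forwards ordinary instructions to the host."""

    def __init__(self, host):
        self.host = host
        self.db = {s: dict(p) for s, p in host.db.items()}
        host.standby = self

    def receive_sync(self, serial, name, value):
        self.db[serial][name] = value

    def handle(self, instr):
        # Not executed locally: forwarded to the host over the first I3C bus.
        self.host.handle(instr)


slaves = {"N1": {"fan_rpm": 5000}}
host = Host(slaves)
standby = Standby(host)
standby.handle(("N1", "fan_rpm", 6000))  # user addressed the standby directly
```

In this sketch the standby never writes to a slave itself; its database changes only through the synchronization call issued by the host.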

In one or more embodiments of the dual-computer hot standby system 100 of the present invention, the standby device 20 is further configured to: in response to receiving an emergency management instruction from outside, forcibly and temporarily occupy the second I3C bus 50 to manage the slave device 40, generate a mapping from the changed parameters, and synchronize the emergency management instruction and the mapping of the changed parameters to the host device 10 through the first I3C bus 30. That is, to make further use of the resources of the standby device 20, in these embodiments, when a user accesses the standby device 20 directly from outside via its communication address and issues a management instruction that is a specific emergency management instruction, the standby device 20 may forcibly and temporarily occupy the second I3C bus 50, manage the slave device 40 according to that instruction, and synchronize the corresponding information to the host device 10. The emergency management instructions referred to here are generally management instructions that are highly time-critical, that bear so closely on the operational safety of the system that they must be processed immediately, and/or that the user requires to be executed. Because the standby device 20 manages the slave device 40 directly over the second I3C bus 50, the step of forwarding the management instruction to the host device 10 through the first I3C bus 30 is eliminated, and the response is faster.
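The difference between the ordinary and the emergency path on the standby can be sketched as below. The patent does not say how the bus is seized or how an instruction is marked urgent, so the `Bus2` ownership flag, the `urgent` key, and the `to_host` outbox are hypothetical.

```python
class Bus2:
    """Hypothetical model of the second I3C bus with a single current owner."""

    def __init__(self):
        self.owner = "host"
        self.log = []

    def write(self, who, serial, name, value):
        if who != self.owner:
            raise PermissionError(f"{who} does not own the bus")
        self.log.append((who, serial, name, value))


def standby_handle(instr, bus2, standby_db, to_host):
    serial, name, value = instr["target"]
    if not instr.get("urgent", False):
        # Ordinary instruction: forward to the host over the first bus.
        to_host.append(("forward", instr))
        return "forwarded"
    previous = bus2.owner
    bus2.owner = "standby"            # forcibly and temporarily occupy bus 2
    try:
        bus2.write("standby", serial, name, value)
    finally:
        bus2.owner = previous         # always hand the bus back
    standby_db.setdefault(serial, {})[name] = value
    # Synchronize both the instruction and the changed mapping to the host.
    to_host.append(("sync", instr, {serial: {name: value}}))
    return "handled locally"
```

The `try`/`finally` is a design choice of the sketch: it keeps the occupation temporary even if the write to the slave fails.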

In one or more embodiments of the dual-computer hot-standby system of the present invention, the host device 10 is further configured to: in response to the host device 10 entering an upgrade mode and/or its resource occupancy exceeding a threshold, notify the standby device 20 through the first I3C bus 30 to temporarily take over management of the slave device 40; and in response to the host device 10 exiting the upgrade mode and/or the resource occupancy no longer exceeding the threshold, notify the standby device 20 through the first I3C bus 30 to stop taking over management of the slave device 40. In a conventional master-slave multi-node server control system, when the host device needs a firmware upgrade and/or its system resource occupancy exceeds a threshold, the host device cannot continue managing the slave devices because of its limited memory. The usual approach is for operation and maintenance personnel to temporarily disable some functions and restore them once the firmware upgrade has finished and/or the resource pressure has eased. The drawback is obvious: in these situations, the host device's management and control of the slave devices is interrupted, so continuous management and control cannot be achieved. The dual-computer hot-standby system 100 according to the present invention therefore makes further use of the resources of the standby device in these situations. When the host device 10 enters the upgrade mode and/or its resource occupancy exceeds the threshold, the host device 10 actively notifies the standby device 20 through the first I3C bus 30 to temporarily take over management of the slave device 40; and when the host device 10 finishes the upgrade and exits the upgrade mode and/or its resource occupancy no longer exceeds the threshold, so that it can resume managing and controlling the slave device 40, the host device 10 notifies the standby device 20 through the first I3C bus 30 to stop taking over management of the slave device 40.
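The host-side decision can be sketched as a small pure function. The function name, the notice strings, and the interpretation of "and/or" (takeover starts when either condition holds and stops only when neither holds) are assumptions made for illustration; the patent does not fix these details.

```python
def takeover_notice(upgrading, usage, threshold, standby_active):
    """Return the notice the host would send on the first I3C bus, or None.

    Assumption: the host is treated as unable to manage slaves when it is
    upgrading OR its resource usage exceeds the threshold.
    """
    unavailable = upgrading or usage > threshold
    if unavailable and not standby_active:
        return "takeover"   # ask the standby to temporarily manage the slaves
    if not unavailable and standby_active:
        return "stop"       # host can resume; ask the standby to hand back
    return None             # no state change, nothing to send


# Host enters upgrade mode while the standby is idle.
notice = takeover_notice(upgrading=True, usage=0.2, threshold=0.9, standby_active=False)
```

Passing `standby_active` in makes the function edge-triggered, so a notice is produced only when the takeover state actually needs to change.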

In some embodiments of the dual-computer hot standby system 100 of the present invention, the standby device 20 is further configured to: in response to receiving a temporary takeover notification from the host device 10, manage the slave device 40 through the second I3C bus 50 and generate a mapping from the changed parameters; and in response to receiving a stop-takeover notification from the host device 10, stop managing the slave device 40 and synchronize the mapping of the changed parameters to the host device 10 through the first I3C bus 30. The standby device 20 spends long periods in the standby state as a backup of the host device 10; in a conventional dual-computer hot standby system, the standby device 20 becomes the host and keeps the system running only once the host device 10 fails. In the dual-computer hot standby system of the present invention, in addition to that case, whenever the standby device 20 receives a temporary takeover notification from the host device 10, whatever state the host device 10 is in, the standby device 20 temporarily takes over management of the slave device 40, manages it through the second I3C bus 50, and generates a mapping from the changed parameters. When the standby device 20 later receives a stop-takeover notification from the host device 10, it stops managing the slave device 40, hands management back to the host device 10, and synchronizes to the host device 10, through the first I3C bus 30, the mapping of the parameters affected by management operations during the takeover, so that the host device 10 can manage the slave device 40 accurately and host and standby data remain consistent.
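On the standby side, the takeover can be sketched as a two-state object that records what changed while it was in charge. The class name, the notice strings, and the `(serial, name)` keys are hypothetical; the returned dictionary stands in for the mapping sent back over the first I3C bus.

```python
class StandbyTakeover:
    """Hypothetical standby-side state for temporary takeover."""

    def __init__(self):
        self.active = False
        self.changed = {}  # (serial, name) -> latest value set during takeover

    def on_notice(self, notice):
        if notice == "takeover":
            self.active = True
            return None
        if notice == "stop":
            self.active = False
            # Hand back everything that changed, then clear the record.
            delta, self.changed = self.changed, {}
            return delta
        raise ValueError(f"unknown notice: {notice}")

    def manage(self, serial, name, value):
        if not self.active:
            raise RuntimeError("standby is not in takeover")
        # A write to the slave over the second I3C bus would happen here.
        self.changed[(serial, name)] = value


s = StandbyTakeover()
s.on_notice("takeover")
s.manage("N1", "fan_rpm", 6000)
s.manage("N1", "fan_rpm", 6500)
delta = s.on_notice("stop")  # mapping to synchronize back to the host
```

Only the latest value per parameter is kept, which is one possible reading of "the mapping of the changed parameters".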

In several embodiments of the dual-computer hot standby system 100 of the present invention, the standby device 20 is further configured to: in response to the dual-computer hot-standby system 100 starting up, the standby device 20 actively initiates a clock synchronization request to the host device 10. In the dual-host hot-standby framework, the system time of the main and standby devices plays an important role in many situations and functions, so that it is necessary to ensure the clock synchronization between the main and standby devices. In the embodiment of the present invention, the strategy adopted for clock synchronization is that after the dual-computer hot-standby system 100 is started, that is, after the host device 10 and the standby device 20 are turned on, the standby device 20 actively initiates a clock synchronization request to the host device 10, so as to ensure the consistency of the system time of the two devices.
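The patent states only that the standby initiates the clock synchronization request; it does not specify how the offset is computed. The sketch below swaps in a Cristian-style estimate, which assumes the request and the reply take equal time on the bus. The function and parameter names are hypothetical.

```python
def clock_offset(t_send, host_time, t_recv):
    """Offset to add to the standby clock after one request/reply exchange.

    t_send:    standby clock when the request left
    host_time: host clock value carried in the reply
    t_recv:    standby clock when the reply arrived
    Assumption (not from the patent): symmetric one-way delay.
    """
    round_trip = t_recv - t_send
    # The host reading is about half a round trip old when it arrives.
    return host_time + round_trip / 2 - t_recv


# Standby sent at 100.0, received at 102.0; the host reported 501.0.
offset = clock_offset(100.0, 501.0, 102.0)
```

With these values the round trip is 2.0, so the host clock is estimated at 502.0 when the reply arrives and the standby would advance its clock by the returned offset.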

In a further embodiment of the dual-computer hot-standby system 100 of the present invention, the host device 10 and the standby device 20 are further configured to: in response to either of the host device 10 and the standby device 20 initiating synchronization of the parameter mapping, the initiator (10 or 20) generates synchronous packed data including the original data, the modified data, and the modification time, and transmits it to the other party (20 or 10). The host device 10 and the standby device 20 use the same data storage format. When data changes and host-standby synchronization is needed, the principle followed is that the party whose data changed is responsible for initiating the synchronization. That is, when either the host device 10 or the standby device 20 initiates synchronization of the parameter mapping because its data has changed, the initiator (the host device 10 or the standby device 20) packages "original data + modified data + modification time" into synchronous packed data and transmits it to the other party (the standby device 20 or the host device 10), which uses it to update its own data.
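One possible shape for the synchronous packed data is sketched below. The patent names only the three contents (original data, modified data, modification time); the `key` field, the JSON encoding, and the `pack`/`unpack` helpers are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SyncPacket:
    key: str            # hypothetical: which mapped parameter this refers to
    original: object    # value before the change, as seen by the initiator
    modified: object    # value after the change
    modified_at: float  # modification time on the (synchronized) clock


def pack(packet):
    # Serialize to bytes for transmission over the first I3C bus.
    return json.dumps(asdict(packet), sort_keys=True).encode("utf-8")


def unpack(raw):
    return SyncPacket(**json.loads(raw.decode("utf-8")))


pkt = SyncPacket("N1.fan_rpm", 5000, 6000, 1700000000.0)
roundtrip = unpack(pack(pkt))
```

Carrying the original value alongside the modified one is what lets the receiver detect that its own copy has diverged before applying the change.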

In several embodiments of the dual-computer hot-standby system 100 of the present invention, the host device 10 and the standby device 20 are further configured to:

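in response to the host device 10 and/or the standby device 20 receiving the synchronous packed data sent by the other party, compare the original data in the synchronous packed data with the local data;

in response to the original data being the same as the local data, modify the local data according to the modified data; and

in response to the original data being different from the local data, compare the modification time in the received synchronous packed data with the local modification time, and synchronize using the modified data with the newer modification time.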

Since both the host device 10 and the standby device 20 can be accessed directly from outside in the dual-computer hot-standby system 100 of the present invention, data may change on both devices at the same time, and an arbitration mechanism is required to determine the final valid data. In the above embodiments, the arbitration mechanism works as follows. First, when the host device 10 and/or the standby device 20 receives synchronous packed data from the other party, it parses the data to extract the original data, the modified data, and the modification time, and compares the extracted original data with its local data. If the original data is the same as the local data, the local data has not been changed, so the local data is updated directly with the extracted modified data. If the original data differs from the local data, the local data has also been modified, and it must be decided whether the other party's modification or the local modification is the final valid one. To do so, the modification time extracted from the received synchronous packed data is compared with the modification time in the synchronous packed data generated when the local data changed, and the modified data with the newer modification time is taken as the final valid data for host-standby synchronization. Specifically, if the modification time from the received synchronous packed data is newer, the modified data it carries is the final valid data, and the local data is updated with it, which synchronizes the host and standby data. If the modification time in the locally generated synchronous packed data is newer, the local data is the final valid data and is not updated. In that case, if the synchronous packed data generated when the local data changed has not yet been sent to the other party, it is sent immediately; if it has already been sent, no further processing is needed.
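The arbitration rules above can be condensed into a short function. This is a sketch under stated assumptions: the packet is modeled as an `(original, modified, modified_at)` tuple, the third return value signals that the local change wins and its own packet must reach the other party, and equal timestamps keep the local value, a tie-break the patent does not specify.

```python
def arbitrate(local_value, local_modified_at, packet):
    """Apply one received sync packet to one local value.

    Returns (new_value, new_modified_at, local_wins).
    """
    original, modified, modified_at = packet
    if original == local_value:
        # Local copy untouched since the peer read it: accept the change.
        return modified, modified_at, False
    if modified_at > local_modified_at:
        # Both sides changed; the peer's change is newer and wins.
        return modified, modified_at, False
    # Local change is newer (or tied, by assumption): keep it and make
    # sure the local sync packet is sent if it has not been already.
    return local_value, local_modified_at, True


# Peer changed 5000 -> 6000 at t=12.0; local copy was changed to 5500 at t=13.0.
result = arbitrate(5500, 13.0, (5000, 6000, 12.0))
```

Because the comparison relies on modification times from two devices, it depends on the host and standby clocks having been synchronized beforehand.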

Fig. 2 shows a schematic diagram of a further embodiment of the dual-computer hot-standby system 100' according to the present invention, wherein the host device 10 and the standby device 20 are further configured to detect the operating state of the other party through a bidirectional physical IO heartbeat detection mechanism 60. In these embodiments, compared with a conventional dual-computer hot-standby system, the dual-computer hot-standby system 100' according to the embodiment of the present invention adds a bidirectional physical IO heartbeat detection mechanism between the host device 10 and the standby device 20, so that not only does the standby device detect whether the host device has failed, but the host device also checks the running state of the standby device in real time. This avoids the situation in which the standby device has already failed before the host device fails without the system knowing it, so that the system completely loses management when the host device fails and the standby device is supposed to start performing the management functions.

In the embodiment of the dual-computer hot-standby system 100' shown in Fig. 2, the host device 10 and the standby device 20 are further configured to: in response to either one of the host device 10 and the standby device 20 detecting that the other party has failed, the non-failed party (10 or 20) logs the failure condition of the failed party (20 or 10), restarts and resets the failed party (20 or 10) through the external dual reset mechanism 70, and synchronizes the clock and the database of the failed party (20 or 10). That is, once either one of the host device 10 and the standby device 20 detects that the other has failed, the non-failed party (the host device 10 or the standby device 20) records the failure condition of the failed party (the standby device 20 or the host device 10) in a local log. The non-failed party (the host device 10 or the standby device 20) can then restart and reset the failed party (the standby device 20 or the host device 10) through the external dual reset mechanism 70, so that both devices are online at the same time and loss of redundancy is prevented. If the reset succeeds, the non-failed party (the host device 10 or the standby device 20) performs clock and database synchronization on the failed party (the standby device 20 or the host device 10).

In a further embodiment of the dual-computer hot-standby system 100' of the present invention, the host device 10 and the standby device 20 are further configured to: in response to the non-failed party (10 or 20) being unable to restart and reset the failed party (20 or 10) through the external dual reset mechanism 70 and/or the restart and reset failing, the non-failed party (10 or 20) issues an alarm to notify operation and maintenance personnel to handle the failure. That is, in some situations the non-failed party (the host device 10 or the standby device 20) cannot reset the failed party (the standby device 20 or the host device 10) through the external dual reset mechanism 70, or the restart and reset is attempted but does not succeed. The strategy adopted in that case is that the non-failed party (the host device 10 or the standby device 20) issues an alarm to notify operation and maintenance personnel to service the failed party (the standby device 20 or the host device 10) and remove the failure, so as to keep the dual-computer hot standby effective.

Fig. 3 shows a schematic block diagram of an embodiment of the host device and the standby device of a dual-computer hot-standby system according to the present invention. In these embodiments, the case in which the host device 10 and the standby device 20 are both CMC devices and the slave device 40 is a BMC device is preferably taken as an example, and the host device 10 and the standby device 20 have the same structure. The specific structure and function of each module in the main and standby CMC devices are further described below.

-a system module:

The system module on the CMC device is the state machine of the whole system and its core scheduling module; scheduling, state judgment, and data flow among all the modules are handled by the system module.

-a data module:

After the CMC host is powered on, it checks the BMC device on each node and builds a parameter mapping database from parameters such as the serial number of the node device and the running state parameters. The CMC thus keeps a mapped copy of the BMC parameters of all nodes. A user therefore only needs to access the CMC device to obtain the BMC data of all nodes, which avoids the CMC having to read BMC parameters node by node when the user queries and increases the speed of responding to the user.

-a synchronization module:

The invention uses an I3C bus connection between the two CMC devices for data transmission and synchronization between them, so as to achieve real-time data synchronization between the two devices. The I3C bus has a rate of 33 Mbps, a soft interrupt mechanism, and a checksum fault-tolerance mechanism.

The principles of the synchronization work in the concept of the present invention are as follows: first, clock synchronization; second, whichever device has a data change is responsible for initiating the synchronization; and third, when data changes on both devices at the same time, the latest setting prevails.

In the invention, because both CMCs can be accessed by the user, the user may modify data on both CMCs while synchronization is in progress. So that the CMCs can determine the order of the modifications, the modification time is added to the synchronization packet.

1. Regarding clock synchronization, the standby device actively initiates a time synchronization request to the host device on startup, so as to ensure that the time of the two devices is consistent; this time can then be used to determine which modification is finally valid when the host device and the standby device modify data simultaneously.

2. Regarding data modification synchronization, the two devices adopt the same data storage format. When the stored data of one CMC changes, its data synchronization module packs the original data, the modified data, and the modification time and sends the packed data to the other CMC. After receiving the synchronization data, the other CMC first judges whether the original data is the same as its local data; if so, it applies the modified data directly. If the original data differs from the local data, which indicates that the local data has also been modified, the modification times are compared and the modified data with the newer time is used for synchronization, so that data consistency is maintained.

3. Regarding data synchronization after recovery from a device failure, the system adopts a bidirectional heartbeat packet detection mechanism: after either CMC device fails, the other CMC device senses the failure, records it in the log, and forcibly resets the failed CMC device so that it resumes normal operation. Taking a failure of CMC0 (host device 10) as an example: when CMC0 fails, CMC1 (the standby device 20) promptly finds that the heartbeat signal of CMC0 can no longer be detected. CMC1 then determines that CMC0 has failed, records the log, and forcibly resets CMC0 through the reset signal IO line. Once the failed CMC0 has restarted and recovered, it immediately asks CMC1 for its operating mode and whether forced data synchronization is required. CMC1 replies with its own operating mode and indicates that forced data synchronization is needed. CMC1 then sends all of the synchronization data together with a forced-update flag to CMC0, and CMC0 updates its data so that the two are completely consistent. CMC1 may then also trigger the mode switching flow.

4. If both CMCs have just been powered up, both CMCs are in idle mode and the standby device 20 actively synchronizes data from the host device 10.
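
The recovery exchange in item 3 above can be sketched, purely for illustration, as the following Python fragment. The function names, the dictionary layout of the reply, and the mode name used in the usage note are hypothetical; the text only states that the surviving CMC replies with its operating mode, sends all synchronization data with a forced-update flag, and that the recovered CMC updates its data to match.

```python
from typing import Dict


def build_recovery_reply(mode: str, database: Dict[str, str]) -> dict:
    """Surviving CMC: report its operating mode and request a forced full sync."""
    return {
        "mode": mode,            # operating mode of the surviving CMC
        "force_update": True,    # forced-update flag
        "data": dict(database),  # copy of all synchronization data
    }


def apply_recovery_reply(local_db: Dict[str, str], reply: dict) -> None:
    """Recovered CMC: overwrite local data when the forced-update flag is set."""
    if reply["force_update"]:
        local_db.clear()
        local_db.update(reply["data"])
```

After `apply_recovery_reply(cmc0_db, build_recovery_reply("active", cmc1_db))`, the recovered database holds exactly the entries of the surviving one.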

-a detection module:

The invention uses a bidirectional heartbeat detection mechanism for fault detection, which ensures that a problem in either of the two machines is found, so that the faulty machine can be reset in time, the backup state restored, the event logged, and operation and maintenance personnel notified. The detection module in the system is responsible for generating the heartbeat signal of its own device and monitoring the heartbeat signal of the peer device.

1. The detection module monitors the working state of the local machine and abnormal thread hangs. When the local machine works normally, the detection module generates a continuous pulse signal on the heartbeat signal IO. When abnormal operation such as a hung local thread is detected, the heartbeat signal output stops. When a crash, program runaway, or similar condition occurs, the detection module itself also stops running and naturally cannot output the signal.

2. To detect the peer's heartbeat signal, the detection module monitors in real time whether the heartbeat signal IO of the peer CMC device is present; when no pulse signal from the peer has been detected for more than 1 second, the peer device is preliminarily judged to be abnormal.

3. To confirm the peer's heartbeat state, the detection module immediately sends a query to the peer over the bus. If the peer does not respond, the peer device is judged to have failed. If the peer responds, the heartbeat signal is checked again: if it has recovered, the peer is considered to have been in a false-dead state and operation continues; if it is still not detected, the peer's detection module is abnormal, so the peer device is likewise judged to have failed and is restarted after data has been synchronized.
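
The three detection steps above amount to a small decision procedure, sketched below in Python for illustration. The names `PeerState`, `assess_peer`, `query_peer`, and `heartbeat_recovered` are hypothetical; the two callables stand in for the bus query and the repeated heartbeat check, whose actual interfaces the text does not describe. Only the 1-second threshold is taken from the text.

```python
from enum import Enum
from typing import Callable


class PeerState(Enum):
    HEALTHY = "healthy"
    FALSE_ALARM = "false_alarm"  # peer was only temporarily silent
    FAILED = "failed"


HEARTBEAT_TIMEOUT_S = 1.0  # "more than 1 second" without a pulse


def assess_peer(
    seconds_since_last_pulse: float,
    query_peer: Callable[[], bool],
    heartbeat_recovered: Callable[[], bool],
) -> PeerState:
    """Classify the peer CMC from its heartbeat and a follow-up bus query."""
    # Step 2: a pulse seen within the timeout means nothing is wrong.
    if seconds_since_last_pulse <= HEARTBEAT_TIMEOUT_S:
        return PeerState.HEALTHY
    # Step 3: no answer to the bus query means the peer has failed.
    if not query_peer():
        return PeerState.FAILED
    # The peer answered: check the heartbeat once more.
    if heartbeat_recovered():
        return PeerState.FALSE_ALARM
    # It answers on the bus but still sends no heartbeat: its detection
    # module is abnormal, so it is also treated as failed.
    return PeerState.FAILED
```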

-a reset module:

The failure handling mechanism of the reset module is as follows: when the detection module judges that the peer device has failed, the local CMC device immediately takes over all work and, in addition, restarts the failed device through the reset signal line. The mode switching and the data synchronization after failure recovery are described above and are not repeated here.

The fault recovery module of the local machine mainly comprises a watchdog and an external reset signal line. When an abnormality occurs while the system is running, the watchdog stops being fed; after a period of time the watchdog times out and the system restarts. When the peer device detects a failure of the local device, the peer can pull the local device into reset through the reset signal line. The watchdog generally has a long delay (preferably 4 seconds in this design, adjustable according to actual conditions), so in general the peer CMC detects the fault and resets the local device before the watchdog does. The watchdog provides the reset when both CMCs fail simultaneously due to external interference. If the watchdog timeout is set short enough or the heartbeat detection mechanism is slow, the watchdog resets the system first.

-an alert module:

In addition to logging and raising alarms for abnormal conditions of its various functions and services during normal operation, the CMC device also records and raises alarms for the fault condition of the other CMC device, so that maintenance personnel can learn the operating states of both CMC devices in time.

When the detection module judges that the peer CMC device has failed, the local machine immediately records the condition in the log and raises a general alarm by means of an LED and by reporting to a remote server. If the failed device fails twice or more in one day, a severe-level alarm is reported. If the failed device cannot be restored to operation by a reset, the CMC device goes on to report a fatal-level failure alarm. An alarm persists and is not cancelled unless it is manually cleared by maintenance personnel, even if the other CMC device resumes service after being reset.
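
The escalation rules above can be summarized in a short Python sketch, given for illustration only. The `AlarmLog` class and its method names are hypothetical, and the sketch assumes that a failed reset takes precedence over the daily fault count, which the text does not state explicitly.

```python
from typing import List


class AlarmLog:
    """Latched alarms raised by one CMC about its peer (illustrative only)."""

    def __init__(self) -> None:
        self.active: List[str] = []

    def report_peer_fault(self, faults_today: int, reset_recovered: bool) -> str:
        """Record a peer fault and return the alarm level that was raised."""
        if not reset_recovered:
            level = "fatal"    # peer could not be restored by a reset
        elif faults_today >= 2:
            level = "severe"   # two or more faults within one day
        else:
            level = "general"  # first fault of the day
        self.active.append(level)
        return level

    def peer_recovered(self) -> None:
        """Peer is back in service: alarms stay latched on purpose."""

    def clear_by_maintainer(self) -> None:
        """Only a manual action by maintenance personnel removes alarms."""
        self.active.clear()
```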

-a management module:

The CMC device is responsible for information acquisition and cooling fan control for each node of the whole cabinet/box, display control and management of the front panel keys and indicator lamps, and the like. The management module specifically includes at least the following:

1. The CMC device collects the data of each node by having each node's BMC report actively, rather than by periodic polling.

2. When parameters on a node's BMC change, the BMC first completes the management work of its own node and then immediately initiates a communication request to the CMC and synchronizes the parameters to the CMC's mapping area, ensuring that the parameters held by the CMC are consistent with those of the BMC. Because the BMC can initiate a soft interrupt over the I3C bus only when the bus is idle, when the BMC detects that the bus is busy it delays the request for a period of time (preferably 10 ms in the present invention, adjustable according to actual conditions).

3. When a user configures the BMC device of a certain node through the CMC, the CMC verifies that the parameters are legal, then initiates communication with the BMC of that node and configures the parameters on the BMC. After the configuration succeeds, the CMC updates the parameters in its mapping area for that node, thereby ensuring parameter consistency.
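
The busy-bus delay in item 2 above can be pictured with the following Python sketch, offered as an illustration under stated assumptions. The function `report_change` and its callable parameters are hypothetical stand-ins for bus operations that the text does not specify, and the retry limit `max_attempts` is an added assumption; only the 10 ms delay comes from the text.

```python
from typing import Callable


def report_change(
    bus_is_idle: Callable[[], bool],
    send_soft_interrupt: Callable[[], None],
    sleep: Callable[[float], None],
    delay_s: float = 0.010,   # preferred 10 ms delay from the text
    max_attempts: int = 100,  # assumption: give up after this many tries
) -> int:
    """Send the soft interrupt once the bus is idle; return the retry count."""
    for attempt in range(max_attempts):
        if bus_is_idle():
            send_soft_interrupt()
            return attempt
        # Bus busy: wait before checking again.
        sleep(delay_s)
    raise TimeoutError("I3C bus stayed busy")
```

In real firmware `sleep` would be a timer delay; passing it in as a parameter keeps the sketch testable without waiting.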

-a network module:

The CMC device provides services 80, such as web services and human-computer interaction interfaces such as a command line, which are used for remote device management, firmware update, fault reporting to a remote control center, and the like.

-an upgrade module:

The upgrade module in the CMC device is mainly responsible for the upgrade work of the system, which has two parts: upgrading the firmware of the CMC device itself and upgrading the firmware on each node. The CMC upgrade module is also responsible for judging the consistency of the upgrade package. In the invention, a user can manage the upgrading of firmware such as the BMC, BIOS, and CPLD of each node through the CMC. In existing designs, because the I2C bus rate is too low, the CMC cannot upgrade the firmware on a node through the internal I2C bus and must rely on the LAN for the upgrade operation; when the LAN of a certain node has a problem, the CMC cannot upgrade that node. In the invention, the high-speed communication of I3C makes firmware upgrade over the internal I3C bus possible, so the CMC can still upgrade the firmware of a node without depending on the LAN.

1. Because the CMC holds the BMC data mapping of each node, a user only needs to log in to the CMC to update the firmware on each node. The user selects a node whose firmware is to be upgraded, the CMC lists the upgradable firmware according to the model of the node, and the user uploads the firmware upgrade package to the CMC. The CMC judges the compatibility of the firmware upgrade according to the model of the node: if the model version does not match, the user is prompted and the upgrade operation is terminated; if the upgrade conditions are met, the subsequent upgrade operations are carried out.

2. The CMC can transmit the firmware upgrade package data to the node BMC in two ways: through the LAN, or through the internal I3C bus. In the invention, direct data synchronization and interaction between the CMC and the BMC take place on the I3C bus all the time, and a firmware upgrade package is large. To relieve the pressure on the I3C bus, the firmware upgrade package is preferably transmitted to the node BMC through the LAN, and the I3C bus is used for transmission when the LAN link is down. To prevent the transmission of a firmware upgrade package of more than ten megabytes from occupying the bus for too long and affecting the timely synchronization of data between the CMC and the BMC, the CMC divides the firmware upgrade package into a number of small pieces, each carrying a number and a check code, and then transmits the pieces at intervals over the I3C bus, so that data synchronization can still take place in the gaps between transmissions. Each time the BMC receives a piece, it checks and unpacks the data. When the check fails, the BMC notifies the CMC to resend that piece. When all the pieces have been transmitted to the BMC, the BMC combines them and recovers the complete firmware upgrade package. Whether the CMC transmits the data via the LAN or the internal I3C bus, the BMC buffers the upgrade package and performs an integrity check on it. If the check passes, the BMC returns OK to the CMC; if the check fails, the BMC returns an upgrade-package check failure to the CMC.

3. The CMC notifies the BMC via the I3C bus to perform the upgrade operation on the firmware. The BMC judges once more whether the firmware upgrade package meets the upgrade requirements of the firmware of its node. If not, it informs the CMC and the upgrade operation is terminated; if so, the BMC starts upgrading the firmware and, during the process, returns the upgrade progress to the CMC so that the CMC can feed it back to the user, which improves user-friendliness.

When the CMC receives feedback from the BMC that the firmware upgrade has succeeded, it prompts the user that the upgrade is complete, and the upgrade process ends.
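
The transport choice and the piece-wise transfer described in the upgrade steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the piece size is left to the caller, and CRC-32 from the standard `zlib` module is assumed as the check code because the text does not name one.

```python
import zlib
from typing import Dict, List, Tuple

Chunk = Tuple[int, int, bytes]  # (number, check code, payload)


def choose_transport(lan_link_up: bool) -> str:
    """Prefer the LAN; fall back to the I3C bus when the LAN link is down."""
    return "LAN" if lan_link_up else "I3C"


def split_package(package: bytes, chunk_size: int) -> List[Chunk]:
    """CMC side: cut the package into numbered pieces with a check code."""
    chunks = []
    for number, start in enumerate(range(0, len(package), chunk_size)):
        payload = package[start:start + chunk_size]
        chunks.append((number, zlib.crc32(payload), payload))
    return chunks


def receive_chunk(store: Dict[int, bytes], chunk: Chunk) -> bool:
    """BMC side: keep a piece if its check passes, else request a resend."""
    number, check_code, payload = chunk
    if zlib.crc32(payload) != check_code:
        return False  # caller notifies the CMC to resend this piece
    store[number] = payload
    return True


def reassemble(store: Dict[int, bytes], total: int) -> bytes:
    """BMC side: join the pieces in numeric order into the full package."""
    return b"".join(store[n] for n in range(total))
```

Numbering the pieces lets the BMC request a single piece again and reassemble in order, while the pauses between pieces leave room on the bus for ordinary parameter synchronization.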

The system module, the data module, the synchronization module, the detection module, the reset module, the alarm module, the management module, the network module, and the upgrade module together form the functional structure of the host device 10 and the standby device 20 in the dual-computer hot-standby system 100 and/or 100' according to the present invention, so that the foregoing embodiments of the dual-computer hot-standby system 100 and/or 100' can be constructed and the corresponding functions and technical effects achieved.

It is to be understood that the features listed above for the different embodiments may be combined with each other to form further embodiments within the scope of the invention, where technically feasible. Furthermore, the particular examples and embodiments described herein are non-limiting, and various modifications may be made in the structure, location and order set forth above without departing from the scope of the invention.

In this application, the use of the disjunctive is intended to include the conjunctive. The use of definite or indefinite articles is not intended to indicate cardinality. In particular, a reference to "the" object or "a" and "an" object is intended to denote also one of a possible plurality of such objects. However, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated. Furthermore, the conjunction "or" may be used to convey features that are simultaneously present, rather than mutually exclusive alternatives. In other words, the conjunction "or" should be understood to include "and/or". The term "comprising" is inclusive and has the same scope as "including".

The above-described embodiments, particularly any "preferred" embodiments, are possible examples of implementations, and are presented merely for a clear understanding of the principles of the invention. Many variations and modifications may be made to the above-described embodiments without departing substantially from the spirit and principles of the technology described herein. All such modifications are intended to be included within the scope of this disclosure.

Claims

1. A dual-computer hot-standby system, the system comprising:

a host device;

a standby device communicatively connected with the host device through a first I3C bus;

at least one slave device communicatively connected with the host device and the standby device through a second I3C bus;

wherein the host device is configured to collect parameters of the slave device in response to startup of the dual-computer hot-standby system, store a mapping of the parameters in a database, and synchronize the mapping of the parameters to the standby device through the first I3C bus, and is configured to manage the slave device through the second I3C bus based on a management instruction, generate a mapping according to the changed parameters, and synchronize the mapping of the changed parameters to the standby device through the first I3C bus;

and wherein the standby device is configured to, in response to receiving a management instruction from the outside, forward the received management instruction to the host device through the first I3C bus.

2. The system of claim 1, wherein the standby device is further configured to:

in response to receiving an emergency management instruction from the outside, manage the slave device by forcibly and temporarily occupying the second I3C bus, generate a mapping of parameters according to the changed parameters, and synchronize the emergency management instruction and the mapping of the changed parameters to the host device through the first I3C bus.

3. The system of claim 1, wherein the host device is further configured to:

in response to the host device entering an upgrade mode and/or its resource occupancy exceeding a threshold, notify the standby device through the first I3C bus to temporarily take over management of the slave device; and

in response to the host device exiting the upgrade mode and/or the resource occupancy no longer exceeding the threshold, notify the standby device through the first I3C bus to stop taking over management of the slave device.

4. The system of claim 3, wherein the standby device is further configured to:

in response to receiving a temporary takeover notification from the host device, manage the slave device through the second I3C bus and generate a mapping of parameters according to the related parameters that change; and

in response to receiving a stop-takeover notification from the host device, stop managing the slave device and synchronize the mapping of the changed related parameters to the host device through the first I3C bus.

5. The system of claim 1, wherein the standby device is further configured to:

in response to startup of the dual-computer hot-standby system, actively initiate a clock synchronization request to the host device.

6. The system of claim 1, wherein the host device and the standby device are further configured to:

in response to either one of the host device and the standby device initiating synchronization of the mapping of parameters, generate, by the initiating party, synchronous packed data comprising original data, modified data, and modification time, and send the synchronous packed data to the other party.

7. The system of claim 6, wherein the host device and the standby device are further configured to:

in response to the host device and/or the standby device receiving synchronous packed data sent by the other party, compare the original data with local data;

8. The system of claim 1, wherein the host device and the standby device are further configured to:

detect the running state of the other party through a bidirectional physical IO heartbeat detection mechanism.

9. The system of claim 1, wherein the host device and the standby device are further configured to:

in response to either one of the host device and the standby device detecting a fault of the other party, the non-faulty party logs the fault condition of the faulty party, restarts and resets the faulty party through an external dual reset mechanism, and synchronizes the clock and database of the faulty party.

10. The system of claim 9, wherein the host device and the standby device are further configured to:

in response to the non-faulty party being unable to restart and reset the faulty party through the external dual reset mechanism and/or the restart and reset failing, the non-faulty party issues an alarm to notify operation and maintenance personnel to handle the fault.