CN112068991B - High-reliability dual-management system based on master-slave synchronization - Google Patents

High-reliability dual-management system based on master-slave synchronization

Info

Publication number
CN112068991B
Authority
CN
China
Legal status
Active
Application number
CN202010778099.3A
Other languages
Chinese (zh)
Other versions
CN112068991A (en)
Inventor
李倩倩
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1479 Generic software techniques for error detection or fault masking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)
  • Safety Devices In Control Systems (AREA)

Abstract

The invention discloses a high-reliability dual-management system based on master-slave synchronization, comprising a master management control module and a slave management control module. The two modules offer a choice of two working modes: a dual-machine hot standby mode and a dual-control mode. In the dual-control mode, a master-slave synchronization check is performed on the master and slave management control modules, and the system can be powered on only when their information is consistent. In the dual-control mode the programmable logic device monitors both management control modules simultaneously; when one of them fails, the failed module is taken offline and restarted by the programmable logic device, and if it resumes normal operation after the restart, the system is switched back to the dual-control mode. In this way the system continues to work normally when a single management control module fails, which ensures the stability and reliability of the system, and the master and slave management control modules are kept synchronized.

Description

High-reliability dual-management system based on master-slave synchronization
Technical Field
The invention relates to the field of server systems, in particular to a high-reliability dual-management system based on master-slave synchronization.
Background
Existing server systems typically use a single management board, and when that board fails, the entire system loses management control. A dual-management-board design can address this problem, but existing dual management control module designs cannot guarantee master-slave synchronization. As for the heat dissipation strategy, fan speed has conventionally been controlled by a single BMC; when the BMC fails, the CPLD drives the fan at a fixed speed, so no heat dissipation strategy is in effect. For these reasons, the stability and reliability of such systems are poor.
Disclosure of Invention
The main technical problem solved by the invention is to provide a high-reliability dual-management system based on master-slave synchronization. The system keeps working normally when a single management control module fails, without having to be powered off to swap in a new management control module; it keeps the master and slave management control modules synchronized; and when a single management control module fails, the other management control module regulates the fan speed, which ensures the stability and reliability of the system.
To solve the above technical problems, the invention adopts the following technical solution: a high-reliability dual-management system based on master-slave synchronization, comprising a signal interaction module, a system fan, a plurality of computing nodes, a master management control module and a slave management control module. A programmable logic device is arranged on the signal interaction module and is used to acquire system information, and the master and slave management control modules read the system information from the programmable logic device. The master and slave management control modules offer a choice of two working modes: a dual-machine hot standby mode and a dual-control mode. In the dual-control mode, a master-slave synchronization check is first performed on the information in the master and slave management control modules, and the system can be powered on only after the check passes. In the dual-control mode the programmable logic device monitors both management control modules simultaneously; when one of them fails, the failed module is taken offline and the system exits the dual-control mode, the programmable logic device restarts the failed module, and if the module resumes normal operation after the restart, the programmable logic device switches the system back to the dual-control mode. During normal operation the speed of the system fan is regulated by the master management control module; when the master management control module fails, the programmable logic device switches fan-speed regulation to the slave management control module.
Further, the master-slave synchronization check comprises: before power-on, the programmable logic device reads the information in the master management control module and in the slave management control module and compares the two; if the information is consistent, the system can start normally; if the information monitored by the master management control module is inconsistent with that monitored by the slave management control module, the programmable logic device notifies both modules to reload simultaneously and re-acquire the state information of the whole system from the programmable logic device, and the system cannot be started until the information monitored by the two modules is completely consistent.
Further, the master-slave synchronization check also comprises: during normal operation of the system, the programmable logic device reads and compares the information in the master and slave management control modules at regular intervals, and when the information is inconsistent, the master and slave management control modules are reloaded simultaneously.
Further, in the dual-machine hot standby mode the master management control module works online and the slave management control module is in a standby state; when the master management control module fails, the slave management control module goes online and takes over its work.
Further, in the dual-machine hot standby mode, the programmable logic device receives the watchdog signal of the master management control module in real time. When the programmable logic device no longer receives that watchdog signal, it judges the master management control module to be faulty; at the same time it sends an enable signal to the slave management control module, the slave management control module acquires the system information, and the slave management control module goes online in place of the master management control module.
Further, in the dual-control mode, the programmable logic device receives the watchdog signals of the master and slave management control modules simultaneously. When the watchdog feed signal from one of them is no longer received, that management control module is judged to be faulty and is taken offline, and the system enters a single management control module mode; at the same time the programmable logic device sends a reset signal to the faulty management control module, and if the module works normally after the reset, the programmable logic device switches the system back to the dual-control mode.
Further, the programmable logic device acquires the system information over an I2C two-wire synchronous serial bus; the system information comprises the configuration of each computing node, the temperature of each computing node, the speed of the system fan, and power supply information.
Further, the signal interaction module is a backplane, the programmable logic device is a CPLD, and the master management control module and the slave management control module are a master CMC and a slave CMC, respectively.
The beneficial effects of the invention are as follows: when one management control module fails, the other management board takes over management seamlessly, which ensures normal operation of the whole system; the customer can choose the active-standby mode or the active-active mode according to their own needs; the invention provides master-slave CMC information verification to ensure that master and slave information is synchronized; and when a single management board fails, the other management board regulates the fan speed, which ensures the stability and reliability of the system.
Drawings
Fig. 1 is an architecture diagram of a preferred embodiment of a high-reliability dual management system based on master-slave synchronization according to the present invention.
Detailed Description
The preferred embodiments of the invention are described in detail below with reference to the accompanying drawing, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention is defined more clearly.
Referring to fig. 1, an embodiment of the present invention includes:
A high-reliability dual-management system based on master-slave synchronization comprises a backplane, a system fan, a plurality of computing nodes, CMC0 and CMC1. The backplane serves as the signal path. A CPLD on the backplane acquires system information over I2C, including the configuration of each computing node, the temperature of each node, the speed of the system fan, PSU information and the like, and CMC0 and CMC1 read the system information from the CPLD. CMC stands for Chassis Management Controller, referred to in Chinese as the chassis fan management controller; CPLD stands for complex programmable logic device.
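The data flow in this embodiment can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the names `SystemInfo`, `Cpld`, `Cmc`, `poll_sensors` and `read_system_info`, as well as the field layout, are hypothetical and do not come from the patent, and the I2C access is replaced by a plain dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SystemInfo:
    """Snapshot of chassis state gathered by the CPLD (hypothetical layout)."""
    node_config: Dict[int, str]    # node id -> configuration
    node_temp_c: Dict[int, float]  # node id -> temperature in Celsius
    fan_rpm: int                   # system fan speed
    psu_status: str                # PSU information


class Cpld:
    """Stands in for the backplane CPLD; real hardware would read over I2C."""

    def __init__(self) -> None:
        self._latest: Optional[SystemInfo] = None

    def poll_sensors(self, raw: dict) -> None:
        # Assumption: `raw` is whatever an I2C scan of the chassis returned.
        self._latest = SystemInfo(
            node_config=dict(raw["config"]),
            node_temp_c=dict(raw["temps"]),
            fan_rpm=raw["fan_rpm"],
            psu_status=raw["psu"],
        )

    def read_system_info(self) -> SystemInfo:
        if self._latest is None:
            raise RuntimeError("no system information collected yet")
        return self._latest


class Cmc:
    """A management controller that only ever reads state from the CPLD."""

    def __init__(self, name: str, cpld: Cpld) -> None:
        self.name = name
        self.cpld = cpld
        self.info: Optional[SystemInfo] = None

    def load(self) -> None:
        self.info = self.cpld.read_system_info()


cpld = Cpld()
cpld.poll_sensors({"config": {0: "2U"}, "temps": {0: 41.5},
                   "fan_rpm": 6000, "psu": "ok"})
cmc0, cmc1 = Cmc("CMC0", cpld), Cmc("CMC1", cpld)
cmc0.load()
cmc1.load()
```

Because both controllers read from the same CPLD snapshot, comparing `cmc0.info` with `cmc1.info` is a simple equality test in this model.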
CMC0 and CMC1 provide two working modes: an Active-Active (dual-active) mode, and an Active-standby mode in which a single CMC is working.
In the Active-Active mode, CMC0 and CMC1 could end up holding inconsistent information while reading data. To prevent this and so guarantee master-slave synchronization, a master-slave synchronization check is performed on the information in CMC0 and CMC1, and the system can be powered on only after the check passes. The check works as follows: before Main Power is applied, the CPLD on the backplane reads the information in CMC0 and in CMC1 and compares the two. If they are consistent, the system can start normally. If they are not, the CPLD notifies CMC0 and CMC1 to reload simultaneously and re-acquire the state information of the whole system from the CPLD; the system cannot be started until the information monitored by the two is completely consistent. During normal operation the CPLD also reads and compares the information in CMC0 and CMC1 at regular intervals, and a reload is required whenever they are inconsistent.
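The compare-and-reload loop of the synchronization check might be sketched as follows. This is an illustrative model, not the patented implementation: `sync_check`, `FakeCmc` and `reload_both` are hypothetical names, and the `max_attempts` bound is an assumption added so the sketch terminates, since the text does not state a retry limit.

```python
from typing import Callable


def sync_check(read_cmc0: Callable[[], object],
               read_cmc1: Callable[[], object],
               reload_both: Callable[[], None],
               max_attempts: int = 5) -> bool:
    """Return True once both CMCs report identical information.

    max_attempts is an assumption; the description simply repeats the
    reload until the two sides agree.
    """
    for _ in range(max_attempts):
        if read_cmc0() == read_cmc1():
            return True       # consistent: power-on may proceed
        reload_both()         # both CMCs re-read state from the CPLD
    return read_cmc0() == read_cmc1()


class FakeCmc:
    """Toy controller holding whatever it last loaded."""

    def __init__(self, info: object) -> None:
        self.info = info


cpld_state = ("node0:ok", 6000)       # what the CPLD currently holds
cmc0 = FakeCmc(cpld_state)
cmc1 = FakeCmc(("node0:ok", 5800))    # stale reading on CMC1


def reload_both() -> None:
    # Both controllers reload at the same time from the single source.
    cmc0.info = cpld_state
    cmc1.info = cpld_state


may_power_on = sync_check(lambda: cmc0.info, lambda: cmc1.info, reload_both)
```

The same function could be called on a timer during normal operation to model the periodic comparison; only the trigger differs.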
In the Active-Active mode, the CPLD receives the watchdog signals of CMC0 and CMC1 simultaneously. When the watchdog feed signal from one of them is no longer received, the CPLD judges that CMC to be faulty and takes it offline; at the same time the CPLD sends a reset signal to the faulty CMC, and if that CMC works normally after the reset, the CPLD switches the system back to the Active-Active mode.
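One way to picture this watchdog supervision is a small state tracker. The sketch below is a software model under assumptions: the class `DualControlMonitor`, the timeout value and the explicit timestamps are hypothetical, a real CPLD would implement this in logic, and "works normally after the reset" is approximated here by the offline CMC feeding its watchdog again.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE_ACTIVE = "active-active"
    SINGLE = "single-cmc"


class DualControlMonitor:
    """Tracks watchdog feeds from two CMCs (timing model is hypothetical).

    Assumption: at most one CMC fails at a time, as in the scenario
    described in the text.
    """

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_feed = {"CMC0": now, "CMC1": now}
        self.offline = set()
        self.resets_sent = []   # record of reset signals, for illustration

    @property
    def mode(self) -> Mode:
        return Mode.SINGLE if self.offline else Mode.ACTIVE_ACTIVE

    def feed(self, cmc: str, now: float) -> None:
        """A CMC fed its watchdog; an offline CMC that feeds again rejoins."""
        self.last_feed[cmc] = now
        self.offline.discard(cmc)

    def tick(self, now: float) -> None:
        """Periodic check: take a silent CMC offline and reset it."""
        for cmc, last in self.last_feed.items():
            if cmc not in self.offline and now - last > self.timeout_s:
                self.offline.add(cmc)
                self.resets_sent.append(cmc)


monitor = DualControlMonitor(timeout_s=2.0)
monitor.feed("CMC0", 1.0)
monitor.feed("CMC1", 1.0)
monitor.feed("CMC0", 3.0)    # CMC1 has gone silent
monitor.tick(3.5)            # CMC1 exceeds the timeout and is taken offline
mode_after_fault = monitor.mode
monitor.feed("CMC1", 4.0)    # CMC1 recovers after the reset
mode_after_recovery = monitor.mode
```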
In the Active-standby mode, CMC0 works online and manages the whole system while CMC1 is in standby. When CMC0 fails, the CPLD no longer receives the watchdog signal from CMC0 and judges CMC0 to be faulty; at the same time it sends an enable signal to CMC1, and CMC1 re-acquires the system information from the CPLD and goes online in place of CMC0. Thus, when one management control module fails, the other still works normally, and so does the system as a whole.
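The takeover sequence in the Active-standby mode could be modeled along these lines. The class `StandbyFailover` and its method names are hypothetical, and the callable passed in merely stands in for CMC1 reading the system information from the CPLD.

```python
from typing import Callable, Optional


class StandbyFailover:
    """Active-standby model: CMC0 is online until its watchdog goes silent."""

    def __init__(self, read_system_info: Callable[[], dict]) -> None:
        self.active = "CMC0"
        self.read_system_info = read_system_info  # stands in for a CPLD read
        self.cmc1_info: Optional[dict] = None

    def on_watchdog_timeout(self, cmc: str) -> None:
        # Only a fault on the online controller triggers a takeover.
        if cmc == "CMC0" and self.active == "CMC0":
            # Enable signal reaches CMC1, which refreshes its view of the
            # system from the CPLD before going online.
            self.cmc1_info = self.read_system_info()
            self.active = "CMC1"


failover = StandbyFailover(lambda: {"fan_rpm": 6000})
failover.on_watchdog_timeout("CMC1")   # standby fault: nothing changes
active_before = failover.active
failover.on_watchdog_timeout("CMC0")   # online controller fault: CMC1 takes over
active_after = failover.active
```

Refreshing the information before switching `active` mirrors the order given in the text, where CMC1 re-acquires the system information and then replaces CMC0.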
Whether the user selects the Active-Active mode or the Active-standby mode, CMC0 regulates the fan speed while the system is working normally; when CMC0 fails, the CPLD switches fan-speed regulation to CMC1, so a heat dissipation strategy remains in effect and the reliability of the system is greatly improved.
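The fan-control handover can be illustrated with two small functions. Both are assumptions for illustration only: `fan_controller` models which CMC the CPLD routes fan control to, and `fan_duty_percent` is a made-up linear fan curve, since the patent does not specify the heat dissipation strategy itself.

```python
def fan_controller(cmc0_healthy: bool) -> str:
    """CMC0 regulates the fan unless it has failed; then control goes to CMC1."""
    return "CMC0" if cmc0_healthy else "CMC1"


def fan_duty_percent(max_node_temp_c: float) -> int:
    """Hypothetical linear curve: 30% at or below 30 C, 100% at or above 80 C."""
    if max_node_temp_c <= 30.0:
        return 30
    if max_node_temp_c >= 80.0:
        return 100
    return round(30 + (max_node_temp_c - 30.0) * 70 / 50)


owner_normal = fan_controller(cmc0_healthy=True)
owner_after_fault = fan_controller(cmc0_healthy=False)
duty = fan_duty_percent(55.0)
```

The point of the sketch is that a temperature-dependent policy keeps running on whichever CMC owns the fan, in contrast to the fixed-speed fallback described in the Background.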
The above is only an embodiment of the invention and does not limit the scope of the invention. Any equivalent structure or equivalent process derived from the contents of this specification and drawing, or any direct or indirect application in other related technical fields, likewise falls within the scope of protection of the invention.

Claims (9)

1. A high-reliability dual-management system based on master-slave synchronization, comprising a signal interaction module, a system fan and a plurality of computing nodes, characterized by further comprising a master management control module and a slave management control module, wherein a programmable logic device is arranged on the signal interaction module and is used to acquire system information, and the master management control module and the slave management control module read the system information from the programmable logic device; the master management control module and the slave management control module offer a choice of two working modes, a dual-machine hot standby mode and a dual-control mode; in the dual-control mode, a master-slave synchronization check is performed on the master management control module and the slave management control module, and the system can be powered on only after the master-slave synchronization check passes; and in the dual-control mode, the programmable logic device simultaneously monitors the master management control module and the slave management control module, when one management control module fails, the failed management control module is taken offline and the system exits the dual-control mode, the programmable logic device restarts the failed management control module, and if the failed management control module resumes normal operation after the restart, the programmable logic device switches the system back to the dual-control mode.
2. The high-reliability dual-management system based on master-slave synchronization according to claim 1, characterized in that the master-slave synchronization check comprises: before power-on, the programmable logic device reads the information in the master management control module and in the slave management control module and compares the two; if the information is consistent, the system can start normally; if the information monitored by the master management control module is inconsistent with that monitored by the slave management control module, the programmable logic device notifies the master management control module and the slave management control module to reload simultaneously and re-acquire the state information of the whole system from the programmable logic device, and the system cannot be started until the information monitored by the two modules is completely consistent.
3. The high-reliability dual-management system based on master-slave synchronization according to claim 2, characterized in that the master-slave synchronization check further comprises: during normal operation of the system, the programmable logic device reads and compares the information in the master management control module and the slave management control module at regular intervals, and when the information is inconsistent, the master management control module and the slave management control module are reloaded simultaneously.
4. The high-reliability dual-management system based on master-slave synchronization according to claim 3, characterized in that: in the dual-control mode, the programmable logic device receives the watchdog signals of the master management control module and the slave management control module simultaneously; when the watchdog feed signal from one of them is no longer received, that management control module is judged to be faulty and is taken offline, and the system enters a single management control module mode; at the same time the programmable logic device sends a reset signal to the faulty management control module, and if the module works normally after the reset, the programmable logic device switches the system back to the dual-control mode.
5. The high-reliability dual-management system based on master-slave synchronization according to claim 1, characterized in that: in the dual-machine hot standby mode the master management control module works online and the slave management control module is in a standby state, and when the master management control module fails, the slave management control module goes online and takes over the work of the master management control module.
6. The high-reliability dual-management system based on master-slave synchronization according to claim 5, characterized in that: in the dual-machine hot standby mode, the programmable logic device receives the watchdog signal of the master management control module in real time; when the programmable logic device no longer receives the watchdog signal of the master management control module, it judges the master management control module to be faulty; at the same time the programmable logic device sends an enable signal to the slave management control module, the slave management control module acquires the system information, and the slave management control module goes online in place of the master management control module.
7. The high-reliability dual-management system based on master-slave synchronization according to claim 4 or 6, characterized in that: during normal operation of the system the speed of the system fan is regulated by the master management control module, and when the master management control module fails, the programmable logic device switches regulation of the system fan speed to the slave management control module.
8. The high-reliability dual-management system based on master-slave synchronization according to claim 7, characterized in that: the programmable logic device acquires the system information over an I2C two-wire synchronous serial bus, and the system information comprises the configuration of each computing node, the temperature of each computing node, the speed of the system fan, and power supply information.
9. The high-reliability dual-management system based on master-slave synchronization according to claim 7, characterized in that: the signal interaction module is a backplane, the programmable logic device is a CPLD, the master management control module and the slave management control module are a master CMC and a slave CMC respectively, and the CMC is a chassis fan management controller.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010778099.3A CN112068991B (en) 2020-08-05 2020-08-05 High-reliability dual-management system based on master-slave synchronization

Publications (2)

Publication Number Publication Date
CN112068991A (en) 2020-12-11
CN112068991B (en) 2022-09-20

Family

ID=73657003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010778099.3A Active CN112068991B (en) 2020-08-05 2020-08-05 High-reliability dual-management system based on master-slave synchronization

Country Status (1)

Country Link
CN (1) CN112068991B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN203786723U (en) * 2014-04-18 2014-08-20 北京盛博协同科技有限责任公司 Dual redundant system based on X86 PC/104 embedded CPU modules
CN106648997A (en) * 2016-12-23 2017-05-10 北京航天测控技术有限公司 Master-salve switching method based on non-real-time operating system
CN109857614A (en) * 2018-12-28 2019-06-07 曙光信息产业(北京)有限公司 A kind of disaster tolerance device and method of rack server

Also Published As

Publication number Publication date
CN112068991A (en) 2020-12-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant