CN104035831A

CN104035831A - High-end fault-tolerant computer management system and method

Info

Publication number: CN104035831A
Application number: CN201410309564.3A
Authority: CN
Inventors: 贡维; 吴孝磊
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-07-01
Filing date: 2014-07-01
Publication date: 2014-09-10

Abstract

The invention discloses a high-end fault-tolerant computer management system and method and relates to the field of computers. The system comprises system power supplies, fans, a switch, a plurality of computing nodes and a system management controller (SMC). The SMC receives, through the switch, the collected information reported by the computing nodes. When the collected information meets a preset computing node management policy, the SMC sends a corresponding management operation instruction to the computing node through the switch; when the collected information meets a preset system power supply and temperature management policy, the SMC performs corresponding management operations on the system power supplies and/or the fans. The invention further discloses a high-end fault-tolerant computer management method. The technical scheme lets each level of the management hierarchy play its full role and achieves centralized power supply, centralized heat dissipation and centralized management under this management architecture.

Description

High-end fault-tolerant computer management system and method

Technical field

The present invention relates to the field of computers, and specifically to a management scheme for high-end server systems.

Background technology

At present, high-end fault-tolerant computers are widely used in key fields such as high-performance computing and banking because of their strong real-time computing capability and their RAS (reliability, availability, serviceability) characteristics. A high-end server system is complex and generally comprises several node types, such as computing nodes, interconnect nodes, I/O expansion nodes and storage nodes. How to manage the whole system effectively, and how to improve the efficiency of its power supply and heat dissipation, are major technical problems for high-end servers. A traditional server generally adopts a BMC (Baseboard Management Controller) management scheme in which all management functions, such as monitoring of the system power supply, fans and temperature, are concentrated on the BMC. When the BMC fails, the whole management function fails. In addition, in a traditional server the power supply and heat dissipation of each component are handled separately, with no unified scheduling mechanism, so power supply and heat dissipation resources cannot be fully utilized.

Summary of the invention

The technical problem to be solved by the present invention is to provide a high-end fault-tolerant computer management system and method that address the complexity of managing high-end server systems.

To solve the above technical problem, the invention discloses a high-end fault-tolerant computer management system comprising system power supplies and fans, and further comprising a switch, a plurality of computing nodes, and a system management controller (SMC) connected to all of the system power supplies and fans, wherein:

The SMC receives, through the switch, the collected information reported by each computing node. When the collected information reported by a computing node meets a predefined computing node management policy, the SMC sends a corresponding management operation instruction to that computing node through the switch; when the collected information meets a predefined system power supply and temperature management policy, the SMC performs a corresponding management operation on the system power supplies and/or fans;

Each computing node reports its collected information to the SMC through the switch and, on receiving a management operation instruction forwarded by the switch, performs the corresponding management operation on itself according to that instruction.

Optionally, in the above system, the computing node comprises a baseboard management controller (BMC) and a complex programmable logic device (CPLD), wherein:

The BMC obtains the collected information of the computing node and provides it to the SMC through the switch connection; it also receives management operation instructions through the switch and issues them to the CPLD;

The CPLD performs the corresponding management operation on the computing node according to the management operation instruction issued by the BMC.

Optionally, in the above system, the collected information reported by the computing node comprises at least one or more of the following:

the temperature in the computing node, the voltage in the computing node, and key register information.

Optionally, in the above system, the management operations on the computing node comprise power-on, power-off and reset of the computing node.

Optionally, in the above system, the management operations on the system power supplies comprise: increasing the number of system power supplies, reducing the number of system power supplies, and reading the power consumption, output voltage, current and temperature of the system power supplies.

Optionally, in the above system, the management operations on the system fans comprise: increasing the fan speed and reducing the fan speed.

Optionally, in the above system, the SMC comprises two SMC chips. Both SMC chips are connected to the switch and to all of the system power supplies and fans, and the two SMC chips are connected to each other by a redundant interconnect bus;

Of the two SMC chips, one is the master SMC, which is in normal working mode, and the other is the slave SMC, which is in standby, wherein:

The slave SMC monitors the state of the master SMC in real time by heartbeat; if it detects that the master SMC has failed, the slave SMC is activated into normal working mode to replace the master SMC.

The invention also discloses a high-end fault-tolerant computer management method, comprising:

The system management controller (SMC) in the high-end fault-tolerant computer management system described above receives, through the switch, the collected information reported by each computing node. When the collected information reported by a computing node meets a predefined computing node management policy, the SMC sends a corresponding management operation instruction to that computing node through the switch, and the computing node performs the corresponding management operation on itself according to the received instruction;

When the collected information reported by a computing node meets a predefined system power supply and temperature management policy, the SMC performs a corresponding management operation on the system power supplies and/or fans.

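Optionally, in the above method, the collected information reported by the computing node comprises at least one or more of the following: the temperature in the computing node, the voltage in the computing node, and key register information.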

Optionally, in the above method, the computing node performing the corresponding management operation on itself according to the received management operation instruction means:

The computing node powers itself on, powers itself off, or resets itself according to the management operation instruction.

Optionally, in the above method, the SMC performing the corresponding management operation on the system power supplies means:

The SMC increases the number of system power supplies, reduces the number of system power supplies, or reads the power consumption, output voltage, current and temperature of the system power supplies.

Optionally, in the above method, the SMC performing the corresponding management operation on the fans means:

The SMC raises or lowers the fan speed.

Optionally, in the above method, the high-end fault-tolerant computer management system uses two SMC chips, one of which is the master SMC, in normal working mode, and the other of which is the slave SMC, in standby:

The slave SMC monitors the state of the master SMC in real time by heartbeat; when the slave SMC detects that the master SMC has failed, the slave SMC is activated into normal working mode to replace the master SMC.

The technical scheme of the present application provides a hierarchical management system for high-end fault-tolerant computers. It gives full play to the role of each level in management and preferably adopts multiple redundancy measures to ensure management reliability. It also proposes a method for centralized power supply, centralized heat dissipation and centralized management under this management architecture, which helps maximize system resource utilization and is a significant improvement over the prior art.

Brief description of the drawings

Fig. 1 is a topology diagram of the hierarchical management system of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the technical solution is described in further detail below with reference to the accompanying drawing. It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another arbitrarily.

Embodiment 1

This embodiment describes a high-end fault-tolerant computer management system that adopts a top-down hierarchical management architecture and comprises at least: an SMC (System Management Controller) connected to all of the system power supplies and fans, a switch, and a plurality of computing nodes (also called server nodes);

The SMC receives, through the switch, the collected information reported by each computing node. When the collected information reported by a computing node meets a predefined computing node management policy, the SMC sends a corresponding management operation instruction to that computing node through the switch; when the collected information meets a predefined system power supply and temperature management policy, the SMC performs a corresponding management operation on the system power supplies and/or fans;

Specifically, the SMC can monitor and manage the system power supplies and fans over a control bus.

Each computing node reports its collected information to the SMC through the switch and, on receiving a management operation instruction forwarded by the switch, performs the corresponding management operation on itself according to that instruction.
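The policy check performed by the SMC can be illustrated with a minimal sketch. The field names, the `Policy` structure and the threshold values below are assumptions chosen for illustration; the patent does not specify how policies are encoded or which thresholds apply.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Report:
    """Collected information reported by one computing node (hypothetical fields)."""
    node_id: int
    temperature_c: float
    voltage_v: float


@dataclass
class Policy:
    """A preset policy: a condition on a report and the action it triggers."""
    name: str
    target: str                          # "node" or "system"
    condition: Callable[[Report], bool]
    action: str


def evaluate(report: Report, policies: List[Policy]) -> List[Tuple[str, str]]:
    """Return (target, action) pairs for every policy the report satisfies."""
    return [(p.target, p.action) for p in policies if p.condition(report)]


# Illustrative thresholds only; the patent does not give concrete values.
POLICIES = [
    Policy("node over-temperature", "node",
           lambda r: r.temperature_c >= 95.0, "shutdown"),
    Policy("system warm", "system",
           lambda r: r.temperature_c >= 70.0, "raise_fan_speed"),
]
```

A "node" action would be sent to the reporting node through the switch, while a "system" action would be applied by the SMC to the power supplies or fans over its control bus.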

It should be noted that each computing node in this embodiment comprises a BMC and a CPLD (Complex Programmable Logic Device) deployed on the computing node. The CPLD and the BMC are connected by signals such as SMBUS (System Management Bus) and GPIO (General Purpose Input/Output), wherein:

The BMC communicates with the SMC through the switch: it obtains the collected information of the computing node and provides it to the SMC through the switch connection, and it receives management operation instructions through the switch and issues them to the CPLD;

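Specifically, the collected information reported by the computing node comprises at least one or more of the following: the temperature in the computing node, the voltage in the computing node, and key register information.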

The BMC may obtain the collected information directly from sensors inside the BMC; for example, the ADC (analog-to-digital converter) module integrated in the BMC can sample the mainboard voltage in real time. The information may also be obtained from other sensors in the system; for example, a temperature sensor can be connected to the BMC so that the BMC reads the temperature in the computing node. This embodiment places no restriction on how the BMC obtains the collected information.

The key register information can generally be regarded as information from the key chips (such as the CPU and PCH) to which the BMC is connected via SMBUS. Which register information counts as key can be determined by the actual application scenario or manually, and is not restricted here.

The CPLD performs the corresponding management operation on the computing node according to the management operation instruction issued by the BMC.

The management operations performed by the CPLD comprise power-on, power-off and reset of the computing node.
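The relay from BMC to CPLD can be sketched as two small classes. This is only a software model under stated assumptions: the class names, the `Command` values and the rule that a reset is ignored while the node is off are hypothetical, and a real CPLD would drive hardware power and reset signals rather than set attributes.

```python
from enum import Enum


class Command(Enum):
    POWER_ON = "power_on"
    POWER_OFF = "power_off"
    RESET = "reset"


class Cpld:
    """Stands in for the CPLD, which drives the node's power and reset signals."""

    def __init__(self) -> None:
        self.powered = False
        self.reset_count = 0

    def execute(self, command: Command) -> None:
        if command is Command.POWER_ON:
            self.powered = True
        elif command is Command.POWER_OFF:
            self.powered = False
        elif command is Command.RESET and self.powered:
            # Assumption: a reset is only meaningful while the node is powered.
            self.reset_count += 1


class Bmc:
    """Stands in for the BMC, which relays SMC instructions to the CPLD."""

    def __init__(self, cpld: Cpld) -> None:
        self.cpld = cpld

    def on_instruction(self, command: Command) -> None:
        # The BMC does not act on the node itself; it hands the instruction down.
        self.cpld.execute(command)
```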

In addition, the SMC's management operations on the system power supplies comprise: increasing the number of system power supplies, reducing the number of system power supplies, and reading the power consumption, output voltage, current and temperature of the system power supplies.
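One way the SMC might decide how many power supplies to keep enabled is sketched below. The sizing rule (enough supplies for the measured load plus one spare) is an assumption for illustration; the patent states only that the number of supplies can be increased or reduced, not how the number is chosen.

```python
import math


def required_psus(load_w: float, psu_capacity_w: float, spares: int = 1) -> int:
    """Number of power supplies to keep enabled for the given load, plus spares."""
    if psu_capacity_w <= 0:
        raise ValueError("psu_capacity_w must be positive")
    # At least one supply carries the load even when the measured load is zero.
    return max(1, math.ceil(load_w / psu_capacity_w)) + spares


def target_active_psus(load_w: float, psu_capacity_w: float, installed: int) -> int:
    """Target number of active supplies, bounded by how many are installed."""
    return min(installed, required_psus(load_w, psu_capacity_w))
```

For example, a 2500 W load on 1000 W supplies needs three supplies to carry the load, so four would be enabled under this rule if at least four are installed.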

The SMC's management operations on the system fans comprise: increasing the fan speed and reducing the fan speed.

Taking the SMC's management and control of the fans as an example: the system power supply and temperature management policies predefined in the SMC include a policy of raising the fan speed as the temperature rises. Thus, when the BMC reports the temperature information of its computing node to the SMC, the SMC can raise or lower the fan speed according to the temperature.
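A temperature-driven fan policy of this kind can be sketched as a linear mapping from temperature to fan duty cycle. The temperature range, the duty limits and the choice to follow the hottest node are all assumptions for illustration; the patent says only that fan speed rises with temperature.

```python
from typing import Iterable


def fan_duty_percent(temp_c: float, low_c: float = 30.0, high_c: float = 80.0,
                     min_duty: float = 20.0, max_duty: float = 100.0) -> float:
    """Map a reported temperature to a fan duty cycle by linear interpolation."""
    if temp_c <= low_c:
        return min_duty
    if temp_c >= high_c:
        return max_duty
    fraction = (temp_c - low_c) / (high_c - low_c)
    return min_duty + fraction * (max_duty - min_duty)


def system_fan_duty(node_temps_c: Iterable[float]) -> float:
    """Set the shared fan speed from the hottest reporting node."""
    return fan_duty_percent(max(node_temps_c))
```

Because the fans are shared by all nodes, this sketch lets the hottest node determine the speed; other aggregation rules would fit the description equally well.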

It should also be noted that a client host connects to the SMC over the management network to access it. That is, a user can log in to the management interface (web page) of the SMC to manually check system power supply information, control the fan speed, and so on.

Preferably, the above management system may also adopt multiple redundancy mechanisms:

The first redundancy mechanism uses two SMCs designed to operate in master and slave modes. The overall system architecture is then as shown in Fig. 1: normally only the master SMC (SMC0) works, and the slave SMC (SMC1) is in standby. The two SMCs perform status monitoring and data synchronization over two or more redundant heartbeat buses. Specifically, the heartbeat bus may be SMBUS, RS232 or another bus type. At a fixed interval, the heartbeat mechanism checks whether the master SMC0 is normal and synchronizes the data of the slave SMC1 with the master SMC0; if the master SMC0 is found to have failed, the heartbeat mechanism switches the management function to the slave SMC1. The two SMCs are also responsible for monitoring the voltage, current, power consumption and temperature of the server system's power supplies via PMBUS, and for monitoring and adjusting the system fan speed.

The second redundancy mechanism is a redundant management network: the two network ports of each BMC are connected to two switches respectively, and the two networks back each other up.

In this case, the switches carry the communication between the two SMCs and all BMCs, and all IPMI-compliant commands are forwarded by the switches. The downstream interfaces of switch 0 and switch 1 are connected to the two management network ports of each BMC respectively, and their upstream interfaces are connected to SMC0 and SMC1 respectively.
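How a sender might use the two management networks as mutual backups can be sketched as follows. The function name and the convention that each path is a callable returning True on success are assumptions for illustration; the patent does not describe the failover logic at this level.

```python
from typing import Callable, Sequence


def send_with_failover(command: bytes,
                       paths: Sequence[Callable[[bytes], bool]]) -> int:
    """Try each management path in order; return the index of the one that worked.

    Each path is a callable that returns True on successful delivery.
    Raises ConnectionError if no path delivers the command.
    """
    for index, send in enumerate(paths):
        try:
            if send(command):
                return index
        except OSError:
            # Treat a transport error like a failed delivery and try the next path.
            continue
    raise ConnectionError("no management path delivered the command")
```

With two paths, index 0 would correspond to the network through switch 0 and index 1 to the network through switch 1.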

Embodiment 2

This embodiment provides a high-end fault-tolerant computer management method, which relies on the high-end fault-tolerant computer management system provided in Embodiment 1. The method comprises the following operations:

The system management controller (SMC) in the high-end fault-tolerant computer management system receives, through the switch, the collected information reported by each computing node. When the collected information reported by a computing node meets a predefined computing node management policy, the SMC sends a corresponding management operation instruction to that computing node through the switch, and the computing node performs the corresponding management operation on itself according to the received instruction;

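The collected information reported by the computing node comprises at least one or more of the following: the temperature in the computing node, the voltage in the computing node, and key register information.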

The management operations that the computing node performs on itself according to the received management operation instruction generally comprise power-on, power-off and reset of the computing node.

The management operations that the SMC performs on the system power supplies comprise increasing the number of system power supplies, reducing the number of system power supplies, and reading the power consumption, output voltage, current and temperature of the system power supplies.

The management operations that the SMC performs on the fans comprise raising and lowering the fan speed.

Preferably, the above management method may also adopt multiple redundancy mechanisms:

One redundancy mechanism uses two SMC chips in the high-end fault-tolerant computer management system, one of which is the master SMC, in normal working mode, and the other of which is the slave SMC, in standby:

The slave SMC monitors the state of the master SMC in real time by heartbeat; when the slave SMC detects that the master SMC has failed, the slave SMC is activated into normal working mode to replace the master SMC.

The other redundancy mechanism connects the two network ports of each BMC to two switches respectively, so that the two networks back each other up.

Specifically, in this embodiment the slave SMC (SMC1) detects the working state of the master SMC (SMC0) by heartbeat, with a period on the order of microseconds or milliseconds. If SMC0 responds normally, SMC1 synchronizes SMC0's status information and remains in standby. If SMC0 does not respond, SMC1 sends the detection command again over the heartbeat bus; if SMC0 still does not respond, SMC1 enters active mode and takes over the work of SMC0. The redundant heartbeat buses in this embodiment effectively avoid the problem of a single heartbeat bus failing suddenly.
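The probe, retry and takeover sequence can be modelled with a short sketch. The class name, the `probe` callable and the dictionary used for status are hypothetical; a real implementation would exchange messages over the redundant heartbeat buses rather than call a function.

```python
from typing import Callable, Optional


class StandbySmc:
    """Sketch of the slave SMC (SMC1) monitoring the master SMC (SMC0)."""

    def __init__(self, probe: Callable[[], Optional[dict]], retries: int = 1) -> None:
        self.probe = probe        # returns the master's status, or None if no reply
        self.retries = retries    # extra probes sent before declaring a fault
        self.active = False
        self.synced_state: dict = {}

    def tick(self) -> bool:
        """Run one heartbeat cycle; return True if this SMC is active afterwards."""
        if self.active:
            return True
        for _ in range(1 + self.retries):
            status = self.probe()
            if status is not None:
                # Master replied: copy its state and remain in standby.
                self.synced_state = dict(status)
                return False
        # Master stayed silent through the retry: take over its work.
        self.active = True
        return True
```

Calling `tick()` once per heartbeat period reproduces the behaviour described above: one missed reply triggers a second probe, and only two consecutive misses activate the standby controller.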

SMC0 is connected to switch 0 by Ethernet, and the first network port of the BMC in every computing node is also connected to switch 0 by Ethernet, which establishes the communication mechanism between SMC0 and the BMCs. SMC1 is connected to switch 1 by Ethernet, and the second network port of the BMC in every computing node is also connected to switch 1 by Ethernet, which establishes the communication mechanism between SMC1 and the BMCs. With the two network ports of each BMC connected to two switches respectively, a redundant management network is formed.

SMC0 and SMC1 are also connected to the system power supplies and system fans by a control bus, so that the SMC manages and controls them centrally, for example by reading power module current and power consumption and by controlling fan speed. The advantages of centralized power supply, centralized heat dissipation and centralized control are better use of resources and lower cost: centralized power supply can effectively reduce the number of power modules and the supply voltage drop, and centralized heat dissipation allows the air duct design to be optimized and air resistance to be reduced.

As the above embodiments show, the notable features of the technical scheme of the present application are a top-down hierarchical management architecture with multiple redundancy functions and a modular division of management functions, together with a scheme for centralized power supply, centralized heat dissipation and centralized management under this architecture. In summary, the technical scheme offers redundancy and reliability, high execution efficiency, and reduced system energy consumption.

Centralized power supply and heat dissipation: the whole server system adopts centralized power supply and centralized heat dissipation. The power for all computing nodes (or modules) is delivered through a busbar or backplane, which effectively reduces the number of PSUs (power supply units) and the supply voltage drop. All system power supplies are managed and scheduled by the SMC. For heat dissipation, the traditional approach in which each module cools itself is abandoned in favor of a fan wall that cools the system centrally, which optimizes the airflow path and reduces air resistance. All system fans are likewise managed and scheduled by the SMC.

A person of ordinary skill in the art will appreciate that all or some of the steps of the above method may be carried out by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented with one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in hardware or as a software function module. The present application is not limited to any particular combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A high-end fault-tolerant computer management system comprising system power supplies and fans, characterized in that it further comprises a switch, a plurality of computing nodes, and a system management controller (SMC) connected to all of the system power supplies and fans, wherein:

2. The system according to claim 1, characterized in that the computing node comprises a baseboard management controller (BMC) and a complex programmable logic device (CPLD), wherein:

3. The system according to claim 2, characterized in that the collected information reported by the computing node comprises at least one or more of the following:

4. The system according to claim 2, characterized in that

the management operations on the computing node comprise power-on, power-off and reset of the computing node.

5. The system according to claim 1, characterized in that

the management operations on the system power supplies comprise: increasing the number of system power supplies, reducing the number of system power supplies, and reading the power consumption, output voltage, current and temperature of the system power supplies.

6. The system according to claim 1, characterized in that

the management operations on the system fans comprise: increasing the fan speed and reducing the fan speed.

7. The system according to any one of claims 1 to 6, characterized in that

the SMC comprises two SMC chips, both SMC chips are connected to the switch and to all of the system power supplies and fans, and the two SMC chips are connected to each other by a redundant interconnect bus;

8. A high-end fault-tolerant computer management method, characterized by comprising:

the system management controller (SMC) in the high-end fault-tolerant computer management system according to any one of claims 1 to 7 receives, through the switch, the collected information reported by each computing node; when the collected information reported by a computing node meets a predefined computing node management policy, the SMC sends a corresponding management operation instruction to that computing node through the switch, and the computing node performs the corresponding management operation on itself according to the received instruction;

9. The method according to claim 8, characterized in that the collected information reported by the computing node comprises at least one or more of the following:

10. The method according to claim 8, characterized in that the computing node performing the corresponding management operation on itself according to the received management operation instruction means:

11. The method according to claim 8, characterized in that the SMC performing the corresponding management operation on the system power supplies means:

12. The method according to claim 8, characterized in that the SMC performing the corresponding management operation on the fans means:

the SMC raises or lowers the fan speed.

13. The method according to any one of claims 8 to 12, characterized in that

the high-end fault-tolerant computer management system uses two SMC chips, one of which is the master SMC, in normal working mode, and the other of which is the slave SMC, in standby: