WO2013051145A1 - Computer system, management device, management method, and program

Info

Publication number: WO2013051145A1
Application number: PCT/JP2011/073148
Authority: WO
Inventors: 喜田剛啓
Original assignee: 富士通株式会社
Priority date: 2011-10-06
Filing date: 2011-10-06
Publication date: 2013-04-11

Abstract

One system to which the present invention is applied comprises a plurality of computers each of which is equipped with a power source device. The system comprises the following: a storage unit for storing status information which indicates the free status of a supplementary power source device equipped on one or more computers from among the plurality of computers; and a specification unit that, in a case where failure occurs in the power source device of any one of the plurality of computers, refers to the status information stored in the storage unit, and specifies a computer that is different from the computer in which failure occurred and that is equipped with a power source device which can substitute for the power source device that failed. A worker who responds to the occurred failure is informed of the different computer specified by the specification unit.

Description

Computer system, management apparatus, management method, and program

The present invention relates to a technique for dealing with a failure of a power supply device.

Electrical products are required to have high reliability. A server that is a computer connected to a network is required to be able to be used by many people in a timely manner through the network. For this reason, the server is required to have particularly high reliability.

The server is equipped with one or more power supply devices. The power supply device is a maintenance component that may fail. When a failure occurs in an installed power supply device, the power supply from that device stops, leaving the server short of power. Therefore, a failure in a power supply device is very likely to stop the server.

In a server (blade server) equipped with a plurality of server blades, each capable of functioning as a server, the server blades can be operated individually. As a result, when a failure occurs in a power supply device, the blade server can cope with the power shortage by stopping some of the operating server blades to reduce power consumption.

Both stoppage of the server due to power shortage and a decrease in the number of operating server blades impede the user's comfortable use. Therefore, the power shortage state must be resolved quickly. For this reason, many servers now support hot replacement, in which a power supply device is replaced while the server main unit remains powered on, and are also equipped with spare power supply devices (maintenance replacement parts). A spare power supply device is hereinafter referred to as a “redundant power supply device” to distinguish it from the power supply devices that are operating or are to be operated.

In a server equipped with a redundant power supply device, when a power supply device fails, the redundant power supply device is operated in place of the failed power supply device, so the power shortage that would accompany stopping the failed power supply device can be avoided. Therefore, when a redundant power supply device is installed, the reliability of the server can be improved and a higher operation rate can be realized. The operation rate is the value obtained by dividing the operation time, during which the server actually operates within the target time, by the target time and multiplying the result by 100 (= operation time · 100 / target time).
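The operation rate defined above can be written as a small function. This is a minimal sketch; the function name and the sample figures (8,750 operating hours in an 8,760-hour target period) are illustrative assumptions, not values from this document.

```python
def operation_rate(operating_hours: float, target_hours: float) -> float:
    """Return the operation rate in percent: operation time * 100 / target time."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    return operating_hours * 100 / target_hours


# Hypothetical example: the server ran for 8,750 hours of an 8,760-hour target period.
rate = operation_rate(8750, 8760)
```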

In a server that has put its redundant power supply device into operation, the failed power supply device is repaired or replaced. However, another power supply device may fail before that repair or replacement is completed. If no replaceable redundant power supply device exists at that point, the server runs short of power, and either the entire system stops or some of the operating server blades are stopped. Thus, even for a server on which a redundant power supply device can be mounted, the situation in which no redundant power supply device can replace the failed one should be considered.

Among conventional server (computer) systems that include a plurality of servers, there is a system in which each server can be equipped with a redundant power supply device and is connected to other servers by a power cable, so that it can supply power to the other servers and receive power from them. In this conventional server system, a server whose power supply device fails while no redundant power supply device is available can avoid a power shortage by receiving power from other servers. Thereby, higher reliability of each server is realized.

However, if the servers are connected by a power cable to enable mutual power supply, each server must be provided with the facilities necessary for connecting with the power cable. Each server must be equipped with a function for responding to requests from other servers connected by a power cable. For this reason, in this conventional server system, both the manufacturing cost and installation cost of the server itself are greatly increased. When modifying an existing server system, the cost of the modification is high.

Also, even in the conventional server system, there may be a situation in which none of the plurality of servers connected by power cables has a replaceable redundant power supply device. This means that a server may still run short of power when a failure occurs in a power supply device. Therefore, the absence of a replaceable redundant power supply device should be considered even in the conventional server system.

A worker must respond to a failure that occurs in a power supply device when no replaceable redundant power supply device exists. The worker must quickly install a substitute power supply device in the server whose power supply device has failed in order to eliminate the power shortage state. However, a substitute power supply device is not always on hand. In a computer system, if a plurality of power supply devices, including redundant power supply devices, fail in succession, or if procurement of a power supply device is delayed for some reason, it is likely that no substitute power supply device is available. For this reason, when dealing with a failure in a power supply device, it is desirable to consider the situation in which no substitute power supply device has been prepared, that is, the situation in which every operable power supply device is already installed in one of the servers.

JP 2009-169874 A JP 2008-83841 A

In one aspect, an object of the present invention is to easily find a power supply device that can be used as an alternative when a failure occurs in the power supply device of the device.

One system to which the present invention is applied includes a plurality of computers each equipped with a power supply device, a storage unit that stores state information representing the free status of a spare power supply device installed in one or more of the plurality of computers, and a specifying unit that, when a failure occurs in the power supply device of one of the plurality of computers, refers to the state information stored in the storage unit and specifies another computer, different from the computer in which the failure occurred, that is equipped with a power supply device capable of replacing the failed power supply device.

In one system to which the present invention is applied, when a failure occurs in the power supply device of the device, it is possible to easily find an alternative power supply device.

The drawings show the following:
A diagram illustrating a configuration example of the computer system according to the present embodiment.
A diagram illustrating a more detailed configuration of a blade server.
A diagram illustrating the functional configuration of the maintenance substitute part management device according to the present embodiment.
A diagram illustrating a configuration example of the management server table.
A diagram illustrating a configuration example of the component management table.
A flowchart of the redundant power supply extraction process.
A flowchart of the substitute part determination process.
A flowchart of the recovery procedure performed by the worker when a redundant power supply device mounted on another blade server is determined to be the maintenance substitute part.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram illustrating a configuration example of a computer system according to the present embodiment. As shown in FIG. 1, the computer system has a configuration in which a plurality of blade servers 1 are connected to a network 2. A terminal device (for example, a console) 3 used by an operator or a worker is connected to the network 2.

FIG. 1 shows three blade servers 1-1 to 1-3, but the number of blade servers 1 connected to the network 2 is not limited to three. The “1” in “Blade Server 1” and the number “1” following the hyphen in reference numeral 1-1 in FIG. 1 represent the number assigned to that blade server 1 as identification information (ID: IDentifier). Similarly, the “2” in “Blade Server 2” and the number “2” following the hyphen in reference numeral 1-2 represent the number assigned to that blade server 1 as its ID.

The three blade servers 1-1 to 1-3 shown in FIG. 1 are computers according to this embodiment. As shown in FIG. 1, each blade server 1 includes a plurality of server blades 11 (11-1 to 11-10), a plurality of power supply devices 12 (12-1 to 12-3), and a management blade 13.

The three power supply devices 12 (12-1 to 12-3) are mounted to realize, for example, a 2 + 1 redundant configuration. Each blade server 1 can obtain the necessary power by operating two power supply devices 12. Therefore, one power supply device 12 is a spare power supply device (redundant power supply device), that is, a maintenance substitute part that takes over when a failure occurs in one of the other two power supply devices 12. The number of server blades 11 mounted on the blade server 1 is not limited to ten. Likewise, the number of power supply devices 12 that can be mounted and the redundant configuration are not limited to three and 2 + 1, respectively.

FIG. 2 is a diagram illustrating a more detailed configuration of the blade server.
Each power supply device 12 includes a control device 12a that starts and stops that power supply device 12, and each server blade 11 likewise includes a control device 11a that starts and stops that server blade 11. The control device 12a of each power supply device 12 is connected to the management blade 13 by a bus 21, and the control device 11a of each server blade 11 is connected to the management blade 13 by a bus 22. Unless the power supply from all the power supply devices 12 has stopped, power is supplied from one of the power supply devices 12 to the control device 12a of each power supply device 12 and the control device 11a of each server blade 11. Thereby, each power supply device 12 and each server blade 11 switches between operation and stop in accordance with an instruction from the management blade 13.

The management blade 13 manages the operation of the entire blade server 1. As shown in FIG. 2, it includes an arithmetic device (for example, a CPU (Central Processing Unit)) 13a, a storage device 13b, and an interface (denoted as “I / F” in the figure) 13c.

The storage device 13b is a holding unit that holds, for example, programs executed by the arithmetic device 13a and various data. The arithmetic device 13a performs control for managing the entire blade server 1 by reading the program stored in the storage device 13b into, for example, a built-in memory and executing it. The interface 13c enables the arithmetic device 13a to communicate with each power supply device 12 via the bus 21 and with each server blade 11 via the bus 22.

The control device 12a of each power supply device 12 detects a failure that has occurred in the power supply device 12 and notifies the management blade 13 of the detected failure. In response to the notification, when a replaceable redundant power supply device 12 exists, the management blade 13 instructs the control device 12a of the redundant power supply device 12 to start operating and instructs the control device 12a of the power supply device 12 that detected the failure to stop. In this way, the management blade 13 substitutes the redundant power supply device 12 for the power supply device 12 in which the failure was detected.

On the other hand, if there is no replaceable redundant power supply device 12, the management blade 13 instructs the control device 12a of the power supply device 12 that detected the failure to stop. Since stopping that power supply device 12 causes a power shortage, the management blade 13 determines which server blades 11 to stop and instructs the control device 11a of each of those server blades 11 to stop. In this way, only as many server blades 11 as the operating power supply devices 12 can power are kept running.

The management blade 13 notifies the terminal device 3 of a failure that has occurred in a power supply device 12, together with whether a redundant power supply device 12 was able to take over from the failed power supply device 12. The worker can therefore monitor each blade server 1 through the terminal device 3 and take the necessary action.

In FIG. 1, two power supply devices 12 and five server blades 11-6 to 11-10 of the blade server 1-2 are marked with a cross. A cross on a power supply device 12 represents the occurrence of a failure, and a cross on a server blade 11 represents a state in which the server blade 11 has been stopped because of a failure in a power supply device 12. As shown in FIG. 1, when only one power supply device 12 is operable, the management blade 13 operates only the five server blades 11-1 to 11-5 and stops the other five server blades 11-6 to 11-10. When no operable power supply device 12 remains, the blade server 1-2 enters a system stop state in which all the server blades 11 and the management blade 13 are stopped.
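The behavior shown in FIG. 1 can be sketched as follows, assuming a 2 + 1 configuration in which each operating power supply device can power five server blades and the highest-numbered blades are stopped first. Both assumptions are inferred from the FIG. 1 example; how the management blade 13 actually chooses which server blades to stop is not specified here, and the function and constant names are hypothetical.

```python
BLADES_PER_PSU = 5  # assumption from the FIG. 1 example: one PSU keeps five blades running


def respond_to_failure(operating_psus, standby_spares, blades):
    """Handle the failure of one operating power supply device.

    Returns (operating_psus, standby_spares, running_blades) after the failure.
    """
    operating_psus -= 1              # the failed device is stopped
    if standby_spares > 0:           # a replaceable redundant device exists
        standby_spares -= 1
        operating_psus += 1          # the redundant device takes over
    capacity = operating_psus * BLADES_PER_PSU
    # Assumed policy: keep the lowest-numbered blades and stop the rest.
    return operating_psus, standby_spares, blades[:capacity]


blades = [f"11-{n}" for n in range(1, 11)]
state = respond_to_failure(2, 1, blades)                # first failure: the spare takes over
state = respond_to_failure(state[0], state[1], blades)  # second failure: no spare is left
```

After the second call, `state` holds one operating power supply device, no spare, and the blades 11-1 to 11-5, which matches the situation drawn for the blade server 1-2.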

Thus, the management blade 13 responds to a power shortage state in which only one power supply device 12 operates by stopping some of the server blades 11. However, stopping some of the server blades 11 reduces the processing capacity of the blade server 1, which increases the possibility that users cannot use the blade server 1 comfortably. Therefore, the power shortage state needs to be resolved quickly. In the present embodiment, quick and more reliable resolution of the power shortage state is realized as follows.

Each of the three blade servers 1 connected to the network 2 can be equipped with one or more spare power supply devices (redundant power supply devices) 12. Therefore, the computer system can hold at least as many redundant power supply devices 12 as there are blade servers 1. Focusing on this, the present embodiment treats the redundant power supply devices 12 mounted on the blade servers 1 of the computer system as shared maintenance substitute parts. Accordingly, when a failure occurs in a power supply device 12 of a blade server 1 that has no replaceable redundant power supply device 12, a redundant power supply device 12 that can be mounted on that blade server 1 is extracted (specified) from among the other blade servers 1. The worker is then notified of, for example, the extracted redundant power supply device 12 and the blade server 1 on which it is mounted. The situation in which a blade server 1 has no replaceable redundant power supply device 12 is one in which no redundant power supply device 12 is installed (all installed power supply devices 12 are operating) or the redundant power supply device 12 has already failed.

When the redundant power supply devices 12 mounted on the blade servers 1 of the computer system are treated as shared maintenance substitute parts, a failure in a power supply device 12 can be dealt with even if the computer system has no unused spare power supply device 12. Even in that situation, if any blade server 1 constituting the computer system has a replaceable power supply device 12, the power shortage state of the blade server 1 that has no replaceable redundant power supply device 12 can be resolved. For this reason, the power shortage state can be resolved more reliably. Because the worker is notified of whether a replaceable power supply device 12 exists and, if one exists, of the blade server 1 on which it is mounted, the worker can respond quickly.

The extraction of the replaceable power supply device 12 in the entire computer system and the notification of the extraction result are performed by the maintenance substitute part management device according to the present embodiment. In the present embodiment, the maintenance substitute part management apparatus is mounted on one of the blade servers 1. Here, it is assumed that the maintenance substitute part management apparatus according to the present embodiment is mounted on the blade server 1-1. The maintenance substitute part management device can be mounted on any computer (data processing device) capable of communicating with each blade server 1.

FIG. 3 is a diagram illustrating a functional configuration of the maintenance substitute part management apparatus according to the present embodiment.
As shown in FIG. 3, the maintenance substitute part management device according to the present embodiment is mounted on the management blade 13 and includes a failure trap receiving unit 31, a substitute part extraction unit 32, a data holding unit 33, and a data output unit 34.

The data holding unit 33 corresponds to the storage device 13b shown in FIG. 2. The failure trap receiving unit 31, the substitute part extraction unit 32, and the data output unit 34 are realized by the arithmetic device 13a executing a maintenance substitute part management program (hereinafter referred to as a “component management program”) stored in the storage device 13b and thereby controlling the interface 13c.

When a blade server 1 other than the blade server 1-1 enters a power shortage state because of a failure in a power supply device 12, the management blade 13 mounted on that blade server 1 generates a message to that effect and sends it to the blade server 1-1. The failure trap receiving unit 31 receives and processes the message. Thereby, the substitute part extraction unit 32 is notified of the blade server 1 that is short of power.

SNMP (Simple Network Management Protocol) can be used to send and receive such messages. SNMP is a protocol for monitoring and controlling, via the network, devices connected to a network, such as computers, routers, and terminal devices. The notification of the power shortage state can be performed using an “SNMP trap”. The SNMP trap is a function by which, when a preset abnormal value is detected, the SNMP agent notifies the SNMP manager of that fact. An object ID (OID) is used to convey the notification content, including the type of abnormal value detected and the content of the request. The SNMP manager corresponds to the management blade 13 of the blade server 1-1, and the SNMP agent corresponds to the management blade 13 of each blade server 1 other than the blade server 1-1. Here, the term “SNMP trap” is also used to refer to the message transmitted by this function.

When the failure trap receiving unit 31 notifies the substitute part extraction unit 32 of a blade server 1 in a power shortage state, the substitute part extraction unit 32 queries each of the other blade servers 1 for the number of redundant power supply devices 12, their state, and the average power consumption of the server blades 11. In this way, the substitute part extraction unit 32 identifies the replaceable redundant power supply devices 12 in the entire computer system and checks the state of the power supply devices 12 in each blade server 1 on which an identified redundant power supply device 12 is mounted. Based on this check, when there are a plurality of replaceable redundant power supply devices 12, the substitute part extraction unit 32 extracts from among them the redundant power supply device 12 considered optimal. The data output unit 34 outputs the extraction result of the substitute part extraction unit 32 via the terminal device 3 by transmitting data representing the extracted redundant power supply device 12 and the blade server 1 on which it is mounted.

The substitute part extraction unit 32 refers to the management server table 33a stored in the data holding unit 33 and extracts the redundant power supply device 12. The data holding unit 33 stores a component management table 33b, and the component management table 33b is referred to as necessary. Hereinafter, the extraction method will be specifically described.

FIG. 4 is a diagram illustrating a configuration example of the management server table.
The management server table 33a is a table that represents the priority order among the blade servers 1 when extracting a replaceable redundant power supply device 12, and is used to specify the blade server 1 from which a redundant power supply device 12 is to be removed for use in another blade server 1. As shown in FIG. 4, the management server table 33a stores an ID, an IP (Internet Protocol) address, and a priority for each blade server 1. FIG. 4 shows an example of the contents of the management server table 33a when there are four blade servers 1 in the computer system.

The priority is represented by a number from 1 to 3, and a larger number means a lower priority. Thus, the blade server 1 with priority 1 is the first choice as the blade server 1 from which a redundant power supply device 12 is extracted.

This priority is set according to the operation rate guaranteed for the blade server 1. For example, when three classes of guaranteed operation rate are used, namely operation rate ≤ 99%, 99% < operation rate < 99.99%, and 99.99% ≤ operation rate, the blade servers 1 in the lowest class are assigned priority 1, those in the middle class are assigned priority 2, and those in the highest class are assigned priority 3.
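The mapping from guaranteed operation rate to priority, together with a management server table holding the ID, IP address, and priority of each blade server, might look like the following. This is only a sketch: the class and function names are hypothetical, and the rows use made-up rates and addresses from the 192.0.2.0/24 documentation range rather than the contents of FIG. 4.

```python
from dataclasses import dataclass


def extraction_priority(guaranteed_rate_percent: float) -> int:
    """Map a guaranteed operation rate (%) to an extraction priority (1 = first choice)."""
    if guaranteed_rate_percent <= 99.0:
        return 1
    if guaranteed_rate_percent < 99.99:
        return 2
    return 3


@dataclass
class ManagedServer:
    server_id: int
    ip_address: str
    priority: int


# Hypothetical rows, not the values shown in FIG. 4.
table = [
    ManagedServer(2, "192.0.2.12", extraction_priority(99.999)),
    ManagedServer(3, "192.0.2.13", extraction_priority(98.5)),
    ManagedServer(4, "192.0.2.14", extraction_priority(99.9)),
]
# Blade servers with the lowest guaranteed rate come first as donors.
by_priority = sorted(table, key=lambda row: row.priority)
```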

In this embodiment, the redundant power supply device 12 is extracted preferentially from the blade server 1 with the higher priority. Accordingly, the higher the operation rate guaranteed for a blade server 1, the less likely its redundant power supply device 12 is to be removed. This limits the impact when a power supply device 12 fails in a blade server 1 whose redundant power supply device 12 has been removed.

The management blade 13 manages each maintenance part using the component management table 33b shown in FIG. 5. Here, the component management table 33b will be specifically described with reference to FIG. 5.

As shown in FIG. 5, the component management table 33b stores, for each maintenance part, a part ID, type, state, operating time, and power consumption.

The component management table 33b is stored in the storage device 13b. The extracted redundant power supply device 12 may be the one mounted on the blade server 1-1 itself. Therefore, the component management table 33b is also data that the maintenance substitute part management device needs in order to extract the redundant power supply device 12.

In FIG. 5, “server blade”, “power supply”, and “redundant power supply” appear as data representing the type of maintenance part, and “drive” and “standby” appear as data representing its state. Other states of a maintenance part include “stop” and “failure”. Maintenance parts in the “standby” and “stop” states are both operable, but “standby” means the part is stopped because it does not need to operate, whereas “stop” means the part is stopped even though it needs to operate. A replaceable redundant power supply device 12 is a power supply device 12 whose type is “redundant power supply” and whose state is “standby”. When a redundant power supply device 12 is put into operation, its state is updated from “standby” to “drive” and its type is updated from “redundant power supply” to “power supply”.
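One way to picture a row of the component management table and the update applied when a redundant power supply device is put into operation is sketched below. The field names, the helper functions, and the sample rows are hypothetical; only the five columns and the type and state labels come from the description of FIG. 5.

```python
from dataclasses import dataclass


@dataclass
class MaintenancePart:
    part_id: str
    kind: str               # "server blade", "power supply", or "redundant power supply"
    state: str              # "drive", "standby", "stop", or "failure"
    operating_hours: float
    power_w: float


def replaceable_spares(table):
    """Return the parts that count as replaceable redundant power supplies."""
    return [p for p in table
            if p.kind == "redundant power supply" and p.state == "standby"]


def activate(part):
    """Put a standby redundant power supply into operation and relabel it."""
    if part.kind != "redundant power supply" or part.state != "standby":
        raise ValueError(f"{part.part_id} is not a replaceable redundant power supply")
    part.kind, part.state = "power supply", "drive"


# Hypothetical table contents for one blade server.
table = [
    MaintenancePart("12-1", "power supply", "failure", 9000.0, 0.0),
    MaintenancePart("12-2", "power supply", "drive", 9000.0, 800.0),
    MaintenancePart("12-3", "redundant power supply", "standby", 10.0, 0.0),
]
spares = replaceable_spares(table)
activate(spares[0])
```

Once `activate` has run, the table no longer contains a replaceable redundant power supply, which is the condition under which a blade server has to ask for a substitute from elsewhere.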

The operating time (h) is the total time for which the maintenance part has actually operated, and is used to determine when to perform an operation check, adjustment, replacement, or the like. The operating time is measured by the arithmetic device 13a using, for example, a built-in hardware timer. The power consumption (W) is the value reported by the maintenance part or an average of such values. The power consumption value of a server blade 11 is reported by its control device 11a. The average power consumption value returned in response to an inquiry from the maintenance substitute part management device is, for example, the average over the server blades 11 of the entire blade server 1.

The management blade 13 of each blade server 1 can determine from the component management table 33b whether a replaceable redundant power supply device 12 exists. When no redundant power supply device 12 can replace the failed power supply device 12, the management blade 13 of each blade server 1 other than the blade server 1-1 notifies the management blade 13 of the blade server 1-1 to that effect. When a redundant power supply device 12 can replace the failed power supply device 12, the management blade 13 of each blade server 1 operates the redundant power supply device 12 and stops the failed power supply device 12.

The maintenance substitute part management device, that is, the management blade 13 of the blade server 1-1, confirms by inquiry the number of replaceable redundant power supply devices 12 in each blade server 1 other than the blade server 1-1 and the average power consumption value of its server blades 11. Each management blade 13 can answer the inquiry using its component management table 33b.

Next, the maintenance substitute part management device sorts the blade servers 1 that have a replaceable redundant power supply device 12 in priority order and identifies the blade server 1 with the highest priority. When only one blade server 1 has the highest priority, the maintenance substitute part management device extracts the redundant power supply device 12 mounted on that blade server 1 as the replaceable redundant power supply device.

On the other hand, when a plurality of blade servers 1 share the highest priority, the maintenance substitute part management device sorts those blade servers 1 by average power consumption value and identifies the blade server 1 with the smallest value. When a plurality of blade servers 1 share the smallest average power consumption value, the maintenance substitute part management device selects one of them. One conceivable selection method is to refer to the operating time of each redundant power supply device 12 and select the blade server 1 on which the redundant power supply device 12 with the shortest operating time is mounted. The maintenance substitute part management device extracts the redundant power supply device 12 mounted on the selected blade server 1 as the replaceable redundant power supply device.
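The three-stage choice described above (priority first, then smallest average power consumption, then shortest operating time of the redundant power supply device) can be expressed as a single ordered comparison. The sketch below is illustrative only: the `Candidate` type, its field names, and the sample values are assumptions, and the operating-time tie-break is the method the text offers as one possibility.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    server_id: int
    priority: int                 # 1 is the first choice for extraction
    avg_blade_power_w: float      # average power consumption of the server blades
    spare_operating_hours: float  # operating time of the redundant power supply


def select_donor(candidates):
    """Pick the blade server whose redundant power supply should be extracted."""
    if not candidates:
        return None  # no replaceable redundant power supply anywhere
    # Tuples compare left to right, so later fields only break ties in earlier ones.
    return min(candidates, key=lambda c: (c.priority,
                                          c.avg_blade_power_w,
                                          c.spare_operating_hours))


# Hypothetical candidates: servers 3 and 4 tie on priority and average power.
candidates = [
    Candidate(2, 1, 300.0, 100.0),
    Candidate(3, 1, 250.0, 500.0),
    Candidate(4, 1, 250.0, 120.0),
    Candidate(5, 2, 100.0, 0.0),
]
donor = select_donor(candidates)
```

Here server 5 has the lowest power consumption but loses on priority, and server 4 beats server 3 only on the operating time of its redundant power supply.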

The redundant power supply device 12 removed from a blade server 1 is used in the blade server 1 on which it is newly installed, either temporarily or until it fails there. The service life of a power supply device 12 tends to be shorter as the power it supplies is larger. A redundant power supply device 12 has not necessarily been left entirely unused. For this reason, assuming that the same number of server blades 11 is mounted on each blade server 1, the time until a redundant power supply device 12 is expected to fail can be considered shorter as the average power consumption value of the server blades 11 is larger. Therefore, in the present embodiment, the redundant power supply device 12 considered least likely to fail is used as the substitute, so that the time for which a particular blade server 1 remains in a power shortage state does not become long.

The failure trap receiving unit 31, the substitute part extraction unit 32, and the data output unit 34 shown in FIG. 3 are realized by the arithmetic device 13a executing the redundant power supply extraction process. This redundant power supply extraction process deals with a failure that has occurred in a power supply device 12 of a blade server 1 that has no replaceable redundant power supply device 12, and is realized by the arithmetic device 13a executing the component management program stored in the storage device 13b. Next, the redundant power supply extraction process will be described in detail with reference to its flowchart.

First, the arithmetic device 13a monitors reception of an SNMP trap by the interface 13c (S1). When the interface 13c receives an SNMP trap, the arithmetic device 13a next determines whether the OID stored in the SNMP trap is the target OID, that is, an OID representing a power shortage state (S2). When the received SNMP trap stores the target OID, the determination in S2 is Yes and the process proceeds to S3. When the received message does not store the target OID, the determination in S2 is No and the process returns to S1. As a result, the interface 13c waits for reception of an SNMP trap storing the target OID. The failure trap receiving unit 31 shown in FIG. 3 is realized by the arithmetic device 13a executing the processes of S1 and S2.
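Steps S1 and S2 amount to a filter over incoming traps. The sketch below models a trap as a plain record instead of using an SNMP library; the `Trap` type, the function name, and in particular the OID value are placeholders, since the actual OID for the power shortage state is not given in this document.

```python
from dataclasses import dataclass

# Placeholder OID for "power shortage"; the real value is not stated in this document.
POWER_SHORTAGE_OID = "1.3.6.1.4.1.99999.1.1"


@dataclass
class Trap:
    sender_id: int  # ID of the blade server that sent the trap
    oid: str


def wait_for_power_shortage(traps):
    """Consume traps until one carries the target OID, then return it."""
    for trap in traps:                      # S1: a trap has been received
        if trap.oid == POWER_SHORTAGE_OID:  # S2: is it the target OID?
            return trap                     # Yes: continue with S3
        # No: go back to S1 and keep waiting
    return None


incoming = [Trap(3, "1.3.6.1.4.1.99999.9.9"), Trap(2, POWER_SHORTAGE_OID)]
hit = wait_for_power_shortage(incoming)
```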

In S3, the arithmetic device 13a selects one blade server 1, excluding its own blade server 1-1 and the blade server 1 that transmitted the SNMP trap, and queries the selected blade server 1 for its number of redundant power supply devices 12. The blade server 1 is selected by referring to the management server table 33a. The inquiry is made by sending an SNMP message storing the corresponding OID.

Next, the arithmetic device 13a determines whether the number of redundant power supply devices 12 reported in response to the inquiry is zero (S4). If the queried blade server 1 has no redundant power supply device 12, the determination in S4 is Yes and the process proceeds to S8. If a redundant power supply device 12 is mounted on the blade server 1, the determination in S4 is No and the process proceeds to S5.

In S5, the arithmetic device 13a makes a further inquiry to the same blade server 1 to confirm the state of the redundant power supply device 12. This inquiry is also made by transmitting an SNMP message storing the corresponding OID. After the inquiry, the arithmetic device 13a waits for a reply and determines whether the state reported in the reply represents a usable state (denoted as “ok” in FIG. 8) (S6). If the reported state is “standby”, the determination in S6 is Yes and the process proceeds to S7. If the reported state is “failure” or “drive”, the determination in S6 is No and the process proceeds to S8. A determination of No in S6 means that the queried blade server 1 has no replaceable redundant power supply device 12.

In S7, the arithmetic device 13a further inquires the selected blade server 1 about the average power consumption value (denoted as “average power” in FIG. 8) of the server blade 11. This inquiry is made by transmitting an SNMP message storing the corresponding OID, as with other inquiries. After making the inquiry, it waits to receive a reply, saves the average power consumption value of the server blade 11 notified by the reply in the storage device 13b, and then proceeds to S8.

In S8, the arithmetic device 13a determines whether or not the selected blade server 1 is the last blade server 1. If there is no other blade server 1 to be inquired, the determination in S8 is Yes and the process proceeds to S9. If there is another blade server 1 to be inquired, the determination in S8 is No and the process returns to S3. In S3, an inquiry is made by selecting another blade server 1 anew.
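The loop over S3 to S8 can be sketched as below. The `query` callable stands in for the SNMP inquiries and is hypothetical, as are the item names `"redundant_count"`, `"redundant_state"`, and `"average_power"`; the description only says that each inquiry sends an SNMP message storing the corresponding OID.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    server: str
    average_power: float  # average power consumption value of the server blade


def collect_candidates(servers: Iterable[str], own: str, trap_sender: str,
                       query: Callable[[str, str], object]) -> List[Candidate]:
    """Ask every other blade server about its redundant power supply."""
    candidates: List[Candidate] = []
    for server in servers:                       # S3, repeated until S8 = Yes
        if server in (own, trap_sender):
            continue                             # excluded from the inquiry
        if query(server, "redundant_count") == 0:
            continue                             # S4 = Yes: nothing mounted
        if query(server, "redundant_state") != "standby":
            continue                             # S6 = No: failed or in operation
        power = float(query(server, "average_power"))   # S7
        candidates.append(Candidate(server, power))
    return candidates
```

The returned list is the input that the substitute part determination process in S9 would work from.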

In S9, the arithmetic device 13a executes an alternative component determination process for determining the redundant power supply device 12 to be a maintenance alternative component using the result of the inquiry. Next, the arithmetic device 13a performs screen output to the terminal device 3 for notifying the worker of the determined redundant power supply device 12 (S10). Thereafter, the redundant power supply extraction process is terminated. Screen output to the terminal device 3 is performed by transmitting a message storing data of a screen (image) to be output.

As described above, in this embodiment, each of the other blade servers 1 (management blades 13) is checked for a usable redundant power supply device 12, and the necessary information (the average power consumption value of the server blade 11) is collected. The substitute part determination process is performed with reference to these confirmation results and the collected information.

FIG. 7 is a flowchart of the substitute part determination process. Next, the substitute part determination process will be described in detail with reference to FIG. 7.

First, the arithmetic device 13a determines whether or not any other blade server 1 has a usable redundant power supply device 12 (S20). If any of the inquired blade servers 1 is equipped with a redundant power supply device 12 in the standby state, the determination in S20 is Yes and the process proceeds to S21. If every inquired blade server 1 either has no redundant power supply device 12, has a failed redundant power supply device 12, or has its redundant power supply device 12 in operation, the determination in S20 is No and the process proceeds to S23.

In this case, in S23, the arithmetic device 13a determines that there is no redundant power supply device 12 that can serve as a maintenance substitute part. After making this determination, the substitute part determination process ends. Thereby, when the above-described S10 is executed, the terminal device 3 outputs a screen indicating that fact.

In S21, the arithmetic device 13a extracts the blade servers 1 in which a usable redundant power supply device 12 has been confirmed, and sorts them in priority order with reference to the management server table 33a. Next, the arithmetic device 13a determines whether or not exactly one of the sorted blade servers 1 has the highest priority. If only one blade server 1 has the highest priority, the determination in S21 is Yes and the process proceeds to S23, where the redundant power supply device 12 mounted on that blade server 1 is determined as the maintenance substitute part. When a plurality of blade servers 1 share the highest priority, the determination in S21 is No and the process proceeds to S25. In the example of the management server table shown in FIG. 4, only one blade server 1 has the highest priority in the following cases: only one blade server 1 has priority 1; no blade server 1 has priority 1 and only one has priority 2; or no blade server 1 has priority 1 or 2 and only one has priority 3.

In S25, the arithmetic device 13a sorts the blade servers 1 having the highest priority according to the average power value. Next, the arithmetic device 13a selects the blade server 1 having the smallest average power value among the sorted blade servers 1. Thereafter, the process proceeds to S23, and the redundant power supply device 12 mounted on the selected blade server 1 is determined as a maintenance substitute part.

In this way, in this embodiment, the blade servers 1 equipped with a redundant power supply device 12 that can serve as the maintenance substitute part are narrowed down first by the priority order of the blade servers 1 and then by the average power value of the server blades 11. Accordingly, in the present embodiment, a redundant power supply device 12 that is considered less likely to fail is preferentially selected, while the possibility of selecting a redundant power supply device 12 mounted on a blade server 1 with a high guaranteed operation rate is suppressed.
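The two-stage narrowing (priority order first, then average power value) can be sketched as follows. The `Candidate` type and `choose_substitute` function are hypothetical names, and the sketch assumes that a smaller priority number means a higher priority, which the FIG. 4 example suggests but does not state outright.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    server: str
    priority: int         # assumption: 1 is the highest priority
    average_power: float  # average power value of the server blade


def choose_substitute(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the blade server whose redundant power supply should be used."""
    if not candidates:                       # S20 = No: no usable supply exists
        return None
    top = min(c.priority for c in candidates)
    best = [c for c in candidates if c.priority == top]   # S21: highest priority
    if len(best) == 1:
        return best[0]                       # a single server has top priority
    # S25: several servers share the top priority; take the smallest average power
    return min(best, key=lambda c: c.average_power)
```

A result of `None` corresponds to the case in which the screen output reports that no maintenance substitute part is available.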

The average power consumption value of the server blades 11 may be the same for a plurality of blade servers 1. In that case, one blade server 1 may be selected arbitrarily from among them. Alternatively, to select a redundant power supply device 12 that is considered less likely to fail, the operation time of each redundant power supply device 12 may be referred to and the blade server 1 equipped with the redundant power supply device 12 with the shortest operation time may be selected. To suppress a decrease in the operation rate, the number of times the power supply device 12 has failed in a situation where no substitute redundant power supply device 12 was available may be counted for each blade server 1, and the blade server 1 with the smallest count may be selected.
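The two optional tie-breaking rules can be sketched as below. The `TiedServer` type and its field names are hypothetical; the description does not say how operation time or failure counts are stored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TiedServer:
    server: str
    operation_hours: float     # operation time of its redundant power supply
    unprotected_failures: int  # failures that occurred with no substitute available


def pick_by_operation_time(tied: List[TiedServer]) -> TiedServer:
    """Prefer the redundant power supply with the shortest operation time."""
    return min(tied, key=lambda s: s.operation_hours)


def pick_by_failure_count(tied: List[TiedServer]) -> TiedServer:
    """Prefer the blade server with the smallest counted number of failures."""
    return min(tied, key=lambda s: s.unprotected_failures)
```

Either function would be applied only to blade servers that remain tied after the comparison of average power values.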

FIG. 8 is a flowchart showing the recovery procedure performed by a worker when a redundant power supply device mounted on another blade server is determined as the maintenance substitute part. The recovery procedure performed by the worker will now be described in detail with reference to FIG. 8.

“System stop due to hardware failure” shown in FIG. 8 means that the entire system of the blade server 1 has stopped, that is, the server blade 11 has stopped, because a failure occurred in the power supply device 12 in a situation where no redundant power supply device 12 is available as a replacement. The “maintenance part” and the “failed part” correspond to the power supply device 12, and the “maintenance substitute part” corresponds to the redundant power supply device 12.

When the worker recognizes an abnormality in any blade server 1 using the terminal device 3 or the like, the worker identifies the failed maintenance part in that blade server 1 (S100). The abnormality assumed here is one caused by a power shortage state, which arises because the failed maintenance part is the power supply device 12 and no replaceable redundant power supply device 12 is mounted on the blade server 1. When such an abnormality occurs, the maintenance substitute part management device determines a redundant power supply device 12 to serve as the maintenance substitute part and displays the determination result on the terminal device 3. Accordingly, the worker removes the redundant power supply device 12 from the blade server 1 presented by the maintenance substitute part management device and mounts it on the blade server 1 in which the abnormality occurred (S200). This completes the recovery of the blade server 1 in which the abnormality occurred. If that blade server 1 has no free place for mounting the redundant power supply device 12, the worker replaces the failed power supply device 12 with the redundant power supply device 12.

The occurrence of the above abnormality means that the computer system has fewer power supply devices 12 than it needs. This is because not every blade server 1 is equipped with a substitute redundant power supply device 12. Therefore, the worker or the person in charge orders a power supply device 12 from the supplier (S300) and receives the power supply device 12 delivered in response to the order (S310). The worker then either newly mounts the delivered power supply device 12, as a redundant power supply device 12, on the blade server 1 from which the redundant power supply device 12 was removed, or uses it to replace the redundant power supply device 12 mounted on the blade server 1 in which the abnormality occurred (S320). In the latter case, the redundant power supply device 12 removed from the blade server 1 in which the abnormality occurred may be mounted on its original blade server 1 again.

A certain amount of time is required from when the power supply device 12 is ordered until it is actually delivered. Therefore, mounting the power supply device 12 delivered by ordering on the blade server 1 in which an abnormality (power shortage state) has occurred is an action that should be avoided to maintain a high operating rate. However, when the redundant power supply device 12 is handled as a shared maintenance substitute part in the entire computer system, recovery using the power supply device 12 delivered by ordering can be minimized. For this reason, more rapid recovery is possible, and the time during which the power is insufficient is minimized. As a result, a reduction in operating rate due to a power shortage state can be minimized.

When the redundant power supply device 12 mounted on the blade server 1 is mounted on another blade server 1, it is necessary to update the component management tables 33b of the two blade servers 1 respectively. The management blade 13 of each blade server 1 recognizes the removal of the component connected to the bus 21 or 22 and the connection of a new component to the bus 21 or 22, and updates the component management table 33b. Thereby, even if the redundant power supply device 12 is moved between the blade servers 1, the component management table 33b is updated in accordance with the movement in each blade server 1.

In the blade server 1 on which the redundant power supply device 12 is newly mounted, the update sets the operation time of that redundant power supply device 12 to 0. To prepare for the occurrence of a failure, it is desirable to manage the operation time accurately. For this reason, it is desirable to notify the blade server 1 to which the redundant power supply device 12 is moved of its operation time up to that point. To this end, the maintenance substitute part management device may be provided with a function of notifying the operation time.
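The table update on a move, with and without carrying over the operation time, can be sketched as follows. The `Tables` mapping is a hypothetical stand-in for the component management tables 33b, and the slot names are illustrative only.

```python
from typing import Dict

# Hypothetical stand-in for the component management tables 33b:
# blade server name -> {slot name -> operation time in hours}.
Tables = Dict[str, Dict[str, float]]


def move_redundant_supply(tables: Tables, src: str, src_slot: str,
                          dst: str, dst_slot: str,
                          notify_operation_time: bool = True) -> None:
    """Reflect the move of one power supply in both tables."""
    hours = tables[src].pop(src_slot)   # removal recognized on the source server
    # Without notification, the destination starts counting again from 0.
    tables[dst][dst_slot] = hours if notify_operation_time else 0.0
```

With `notify_operation_time=True`, the destination keeps the accumulated operation time, which is what a tie-break on shortest operation time would rely on.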

In the present embodiment, only the redundant power supply devices 12 mounted on the other blade servers 1 are candidates for selection, but when there is no replaceable redundant power supply device 12, a power supply device 12 that is in operation may also be targeted. This is because the operation rate to be guaranteed may differ from one blade server 1 to another. Accordingly, the power supply device 12 to be mounted on the blade server 1 in the power shortage state may be selected from a blade server 1 whose guaranteed operation rate is lower than that of the blade server 1 in the power shortage state. This means that a power supply device 12 operating in another blade server 1 is regarded as a spare power supply device 12 depending on the situation.
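This fallback can be sketched as below. The function name and the mapping of guaranteed operation rates are hypothetical, and choosing the donor with the lowest rate among the eligible servers is an assumption of this sketch; the description only requires that the donor's rate be lower than that of the blade server in the power shortage state.

```python
from typing import Dict, Optional


def pick_operating_donor(short_server: str,
                         guaranteed_rate: Dict[str, float]) -> Optional[str]:
    """Pick a blade server that may give up an operating power supply."""
    target = guaranteed_rate[short_server]
    lower = {server: rate for server, rate in guaranteed_rate.items()
             if server != short_server and rate < target}
    if not lower:
        return None   # no server has a lower guaranteed operation rate
    # Assumption of this sketch: prefer the lowest guaranteed operation rate.
    return min(lower, key=lambda server: lower[server])
```

This path would be tried only after the search for a standby redundant power supply device has found nothing.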

In this embodiment, the maintenance substitute part management device is realized on the management blade 13 of one blade server 1-1. However, the maintenance substitute part management device may also be realized on a server blade 11. Since the management blade 13 may fail, it is desirable that the program (component management program) for realizing the maintenance substitute part management device be executable by the management blade 13 of any blade server 1.

Claims

A computer system comprising a plurality of computers each equipped with a power supply device, the computer system comprising:
a storage unit for storing state information indicating a free state of a spare power supply device mounted on one or more of the plurality of computers; and
a specifying unit that, when a failure occurs in the power supply device of any one of the plurality of computers, refers to the state information stored in the storage unit and specifies another computer that is different from the computer in which the failure has occurred and that is equipped with a power supply device that can substitute for the power supply device in which the failure has occurred.
The computer system according to claim 1,
further comprising a notification unit for giving notification of the other computer specified by the specifying unit.
The computer system according to claim 1 or 2,
wherein the specifying unit further refers to at least one of information indicating a priority order among the computers for use of the spare power supply device as a substitute and information indicating power consumption of each computer, and specifies the other computer equipped with a power supply device that can substitute for the power supply device in which the failure has occurred.
The computer system according to claim 3,
wherein, when both the information indicating the priority order and the information indicating the power consumption are referred to, the specifying unit extracts other computers as candidates based on the information indicating the priority order and, when two or more other computers are extracted as candidates, selects one other computer using the information indicating the power consumption.
A maintenance substitute parts management device capable of communicating with a plurality of computers each equipped with a power supply device, the maintenance substitute parts management device comprising:
a storage unit for storing state information indicating a free state of a spare power supply device mounted on one or more of the plurality of computers; and
a specifying unit that, when a failure occurs in the power supply device of any one of the plurality of computers, refers to the state information stored in the storage unit and specifies another computer that is different from the computer in which the failure has occurred and that is equipped with a power supply device that can substitute for the power supply device in which the failure has occurred.
A maintenance substitute part management method executed by a data processing device capable of communicating with a plurality of computers each equipped with a power supply device, the method comprising:
when a failure occurs in the power supply device of a computer, among the plurality of computers, that has no spare power supply device, checking for spare power supply devices present in the other computers; and
selecting, from the spare power supply devices identified by the checking, a spare power supply device to be used as a maintenance substitute part for the failed power supply device.
The maintenance substitute part management method according to claim 6,
wherein the data processing device gives notification of the selected spare power supply device.
A program for causing a data processing device capable of communicating with a plurality of computers each equipped with a power supply device to execute a process comprising:
when a failure occurs in the power supply device of a computer, among the plurality of computers, that has no spare power supply device, checking the state of the spare power supply devices in the other computers; and
selecting, based on the result of checking the state, a spare power supply device to be used in place of the failed power supply device.