WO2013051145A1 - Computer system, management device, management method, and program - Google Patents

Publication number
WO2013051145A1
Authority
WO
WIPO (PCT)
Prior art keywords
power supply
supply device
blade
server
computers
Application number
PCT/JP2011/073148
Other languages
French (fr)
Japanese (ja)
Inventor
喜田剛啓
Original Assignee
富士通株式会社
Application filed by 富士通株式会社
Priority to PCT/JP2011/073148
Publication of WO2013051145A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00: Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26: Power supply means, e.g. regulation thereof
    • G06F1/263: Arrangements for using multiple switchable power supplies, e.g. battery and AC
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00: Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26: Power supply means, e.g. regulation thereof
    • G06F1/28: Supervision thereof, e.g. detecting power-supply failure by out of limits supervision
    • H: ELECTRICITY
    • H02: GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J: CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J1/00: Circuit arrangements for dc mains or dc distribution networks
    • H02J1/001: Hot plugging or unplugging of load or power modules to or from power distribution networks
    • H: ELECTRICITY
    • H02: GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J: CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J9/00: Circuit arrangements for emergency or stand-by power supply, e.g. for emergency lighting
    • H02J9/04: Circuit arrangements for emergency or stand-by power supply, e.g. for emergency lighting in which the distribution system is disconnected from the normal source and connected to a standby source
    • H02J9/06: Circuit arrangements for emergency or stand-by power supply, e.g. for emergency lighting in which the distribution system is disconnected from the normal source and connected to a standby source with automatic change-over, e.g. UPS systems

Definitions

  • the present invention relates to a technique for dealing with a failure of a power supply device.
  • a server that is a computer connected to a network is required to be able to be used by many people in a timely manner through the network. For this reason, the server is required to have particularly high reliability.
  • the server is equipped with one or more power supply units.
  • The power supply device is a maintenance component that may fail.
  • When a failure occurs in an installed power supply device, the power supply from that device stops and the server may run short of power. A failure in a power supply device is therefore very likely to stop the server.
  • In a server equipped with a plurality of server blades, each capable of functioning as a server (a blade server), the server blades can be operated individually. As a result, when a failure occurs in a power supply device, the blade server can cope with the power shortage by stopping some of the operating server blades to reduce power consumption.
  • A spare power supply device is hereinafter referred to as a “redundant power supply device” to distinguish it from the power supply devices that are in operation.
  • a failure of another power supply device may occur before the repair or replacement of the power supply device is completed.
  • In that case, the server that has run short of power must stop the entire system or stop some of the operating server blades. Thus, even for a server on which a redundant power supply device can be mounted, the situation in which no redundant power supply device is available to replace a failed power supply device should be considered.
  • In some conventional server systems, a redundant power supply device can be mounted on each server, and the servers are also connected to one another by power cables so that power can be supplied to other servers.
  • a server in which a failure has occurred in a power supply apparatus in a situation where there is no redundant power supply apparatus can avoid power shortage due to power supply from other servers. Thereby, higher reliability of each server is realized.
  • each server must be provided with the facilities necessary for connecting with the power cable.
  • Each server must be equipped with a function for responding to requests from other servers connected by a power cable. For this reason, in this conventional server system, both the manufacturing cost and installation cost of the server itself are greatly increased. When modifying an existing server system, the cost of the modification is high.
  • an object of the present invention is to easily find a power supply device that can be used as an alternative when a failure occurs in the power supply device of the device.
  • A system to which the present invention is applied includes: a plurality of computers, each equipped with a power supply device; a storage unit that stores state information representing the availability (empty state) of a spare power supply device installed in one or more of the plurality of computers; and a specifying unit that, when a failure occurs in the power supply device of one of the plurality of computers, refers to the state information stored in the storage unit and specifies another computer, different from the computer in which the failure has occurred, that has a power supply device able to replace the failed power supply device.
  • FIG. 1 is a diagram illustrating a configuration example of a computer system according to the present embodiment.
  • the computer system has a configuration in which a plurality of blade servers 1 are connected to a network 2.
  • a terminal device (for example, a console) 3 used by an operator or a worker is connected to the network 2.
  • FIG. 1 shows three blade servers 1-1 to 1-3, but the number of blade servers 1 connected to the network 2 is not limited to three.
  • The “1” of “Blade Server 1” and the number “1” following the hyphen of reference numeral 1-1 in FIG. 1 represent the number assigned to the blade server 1 as identification information (ID: IDentifier).
  • “2” of “Blade Server 2” and the number “2” following the hyphen of reference numeral 1-2 represent the numbers assigned to the blade server 1 as IDs.
  • Each blade server 1 includes a plurality of server blades 11 (11-1 to 11-10), a plurality of power supply devices 12 (12-1 to 12-3), and a management blade 13.
  • The three power supply devices 12 (12-1 to 12-3) are mounted to realize, for example, a 2 + 1 redundant configuration.
  • Each blade server 1 can obtain the necessary power by operating two power supply devices 12. The remaining power supply device 12 is therefore a spare power supply device (redundant power supply device), that is, a maintenance substitute part that takes over when a failure occurs in one of the other two power supply devices 12.
  • the number of server blades 11 mounted on the blade server 1 is not limited to ten. Further, the number of power supply devices 12 that can be mounted and the redundant configuration are not limited to 3, 2 + 1, respectively.
  • FIG. 2 is a diagram illustrating a more detailed configuration of the blade server.
  • Each power supply device 12 includes a control device 12a that operates / stops the own power supply device 12, and each server blade 11 also includes a control device 11a that operates / stops the own server blade.
  • the control device 12 a of each power supply device 12 is connected to the management blade 13 by a bus 21, and the control device 11 a of each server blade 11 is connected to the management blade 13 by a bus 22.
  • Power is supplied from any one of the power supply devices 12 to the control device 12a of each power supply device 12 and the control device 11a of each server blade 11.
  • each power supply device 12 and each server blade 11 perform operation / stop switching in accordance with an instruction from the management blade 13.
  • The management blade 13 manages the operation of the entire blade server 1. As shown in FIG. 2, it includes an arithmetic device (for example, a CPU (Central Processing Unit)) 13a, a storage device (for example, a hard disk drive) 13b, and an interface (denoted as “I / F” in the figure) 13c.
  • the storage device 13b is, for example, a holding unit that holds programs executed by the arithmetic device 13a and various data.
  • The arithmetic device 13a performs control for managing the entire blade server 1 by reading the program stored in the storage device 13b into, for example, a memory mounted in the arithmetic device 13a and executing it.
  • the interface 13c provides the computing device 13a with an environment where communication with each power supply device 12 via the bus 21 and communication with each server blade 11 via the bus 22 can be performed.
  • The control device 12a of each power supply device 12 detects a failure that has occurred in its power supply device 12 and notifies the management blade 13 of the detected failure. If a substitutable redundant power supply device 12 exists, the management blade 13 responds to the notification by instructing the control device 12a of the redundant power supply device 12 to operate and instructing the control device 12a of the power supply device 12 in which the failure was detected to stop. In this way, the management blade 13 substitutes the redundant power supply device 12 for the failed power supply device 12.
  • If no substitutable redundant power supply device 12 exists, the management blade 13 likewise instructs the control device 12a of the power supply device 12 in which the failure was detected to stop. Because this leaves the blade server 1 short of power, the management blade 13 determines which server blades 11 to stop and instructs the control devices 11a of those server blades 11 to stop. In this way, only the server blades 11 that the operating power supply devices 12 can support are kept running.
  • The management blade 13 notifies the terminal device 3 of a failure that has occurred in a power supply device 12, including whether the redundant power supply device 12 could be substituted for it. The worker can therefore take the necessary action by monitoring each blade server 1 with the terminal device 3.
  • In FIG. 1, two power supply devices 12 and five server blades 11-6 to 11-10 of the blade server 1-2 are marked with a cross (x).
  • A cross on a power supply device 12 indicates that a failure has occurred in it, and a cross on a server blade 11 indicates that the server blade has been stopped because of the failure in the power supply device 12.
  • The management blade 13 operates only the five server blades 11-1 to 11-5 and stops the other five server blades 11-6 to 11-10.
  • the blade server 1-2 enters a system stop state in which all the server blades 11 and the management blade 13 are stopped.
  • the management blade 13 responds to a power shortage state in which only one power supply device 12 operates by stopping a part of the server blade 11.
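The response to a power shortage described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the server blades to stop are chosen, so stopping from the end of the list, the `ServerBlade` type, the function name, and every wattage figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerBlade:
    blade_id: str
    power_w: float  # power drawn while operating (hypothetical figure)


def blades_to_stop(blades, operating_psus, psu_capacity_w):
    """Return the IDs of blades to stop so the load fits the available power.

    Assumption: blades are stopped from the end of the list until the
    remaining load fits; the source does not specify the selection rule.
    """
    budget = operating_psus * psu_capacity_w
    running = list(blades)
    stopped = []
    while running and sum(b.power_w for b in running) > budget:
        stopped.append(running.pop().blade_id)
    return stopped


# Ten blades normally carried by two power supplies (illustrative numbers).
blades = [ServerBlade(f"11-{i}", 100.0) for i in range(1, 11)]

# One of the two required power supplies has failed and no spare is available.
stopped = blades_to_stop(blades, operating_psus=1, psu_capacity_w=500.0)
```

With these illustrative numbers, the five blades 11-6 to 11-10 are stopped, which mirrors the situation drawn for the blade server 1-2 in FIG. 1.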
  • the processing capacity of the blade server 1 is reduced.
  • the decrease in the processing capability increases the possibility that the user cannot comfortably use the blade server 1.
  • the power shortage state needs to be quickly resolved.
  • quick and more reliable cancellation of the power shortage state is realized as follows.
  • Each of the three blade servers 1 connected to the network 2 can be equipped with one or more spare power supply devices (redundant power supply devices) 12. Therefore, the computer system can have redundant power supply devices 12 for at least the number of blade servers 1.
  • The redundant power supply device 12 mounted on each blade server 1 of the computer system is handled as a shared maintenance substitute part. Accordingly, in the present embodiment, when a failure occurs in a power supply device 12 of a blade server 1 that has no substitutable redundant power supply device 12, a redundant power supply device 12 that can be mounted on that blade server 1 is extracted (specified) from another blade server 1. The worker is then notified, for example, of the extracted redundant power supply device 12 and of the blade server 1 on which it is mounted.
  • A blade server 1 has no substitutable redundant power supply device 12 when no redundant power supply device 12 is installed (all installed power supply devices 12 are operating) or when the redundant power supply device 12 has already failed.
  • Because the redundant power supply device 12 mounted on each blade server 1 is used as a shared maintenance substitute part, a failure in a power supply device 12 can be handled even when the computer system has no unused spare power supply device 12. Even in that situation, if any blade server 1 in the computer system has a substitutable power supply device 12, the power shortage of the blade server 1 that lacks one can be resolved. The power shortage state can therefore be resolved more reliably. Because the worker is notified whether a substitutable power supply device 12 exists and, if so, which blade server 1 it is mounted on, the power shortage state can also be resolved quickly.
  • the extraction of the replaceable power supply device 12 in the entire computer system and the notification of the extraction result are performed by the maintenance substitute part management device according to the present embodiment.
  • the maintenance substitute part management apparatus is mounted on one of the blade servers 1.
  • the maintenance substitute part management apparatus according to the present embodiment is mounted on the blade server 1-1.
  • the maintenance substitute part management device can be mounted on any computer (data processing device) capable of communicating with each blade server 1.
  • FIG. 3 is a diagram illustrating a functional configuration of the maintenance substitute part management apparatus according to the present embodiment.
  • The maintenance substitute part management apparatus according to the present embodiment is mounted on the management blade 13 and includes a failure trap receiving unit 31, a substitute part extracting unit 32, a data holding unit 33, and a data output unit 34.
  • the data holding unit 33 corresponds to the storage device 13b shown in FIG.
  • The failure trap receiving unit 31, the substitute part extracting unit 32, and the data output unit 34 are realized when the arithmetic device 13a executes the maintenance substitute part management program (hereinafter referred to as the “component management program”) stored in the storage device 13b and controls the interface 13c.
  • When a blade server 1 other than the blade server 1-1 falls into a power shortage state because of a failure in a power supply device 12, its management blade 13 generates a message to that effect and sends it to the blade server 1-1.
  • The failure trap receiving unit 31 receives and processes the message, and thereby notifies the substitute part extracting unit 32 of the blade server 1 that has run short of power.
  • SNMP (Simple Network Management Protocol) is a protocol for monitoring and controlling, via the network, devices connected to a network, such as computers, routers, and terminal devices.
  • the notification of the power shortage state can be performed using “SNMP trap”.
  • the SNMP trap is a function that, when a preset abnormal value is detected, notifies that fact from the SNMP agent to the SNMP manager.
  • An object ID (OID) is used to transmit the notification content including the detected abnormal value type and the request content.
  • The SNMP manager corresponds to the management blade 13 of the blade server 1-1, and the SNMP agent corresponds to the management blade 13 of each blade server 1 other than the blade server 1-1.
  • The term “SNMP trap” is also used to refer to a message transmitted as such a notification.
  • When the failure trap receiving unit 31 notifies the substitute part extracting unit 32 of a blade server 1 in a power shortage state, the substitute part extracting unit 32 queries each of the other blade servers 1 for the number of redundant power supply devices 12, their state, and the average power consumption of the server blades 11. From the replies, it identifies the substitutable redundant power supply devices 12 in the entire computer system and checks the state of the power supply devices 12 in each blade server 1 on which such a device is mounted. When there are a plurality of substitutable redundant power supply devices 12, it extracts the one considered optimal.
  • The data output unit 34 transmits data representing the redundant power supply device 12 extracted by the substitute part extracting unit 32 and the blade server 1 on which that device is mounted, so that the extraction result is output via the terminal device 3.
  • the substitute part extraction unit 32 refers to the management server table 33a stored in the data holding unit 33 and extracts the redundant power supply device 12.
  • the data holding unit 33 stores a component management table 33b, and the component management table 33b is referred to as necessary.
  • the extraction method will be specifically described.
  • FIG. 4 is a diagram illustrating a configuration example of the management server table.
  • The management server table 33a represents the priority order among the blade servers 1 used when extracting a substitutable redundant power supply device 12. It is used to specify the blade server 1 from which a redundant power supply device 12 is removed for use in another blade server 1.
  • The management server table 33a stores an ID, an IP (Internet Protocol) address, and a priority for each blade server 1.
  • FIG. 4 shows an example of the contents of the management server table when there are four blade servers 1 in the computer system.
  • the numbers 1 to 3 representing the priority order indicate that the higher the number, the lower the priority order.
  • When a redundant power supply device 12 is extracted, the blade servers 1 with priority 1 are considered first.
  • This priority is set according to the operation rate guaranteed for the blade server 1. For example, when three classes of operation rate are guaranteed, namely operation rate < 99%, 99% ≤ operation rate < 99.99%, and 99.99% ≤ operation rate, a blade server 1 in the class with the lowest guaranteed operation rate is assigned priority 1, a blade server 1 in the next class is assigned priority 2, and a blade server 1 in the class with the highest guaranteed operation rate is assigned priority 3.
  • The redundant power supply device 12 is extracted preferentially from the blade server 1 with the higher priority (the smaller priority number). Accordingly, the higher the operation rate guaranteed for a blade server 1, the less likely its redundant power supply device 12 is to be removed. This limits the impact if a power supply device 12 later fails in the blade server 1 from which the redundant power supply device 12 was removed.
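The mapping from guaranteed operation rate to priority can be written as a small function. This is a sketch of the three-class example only; the function name is hypothetical, and treating exactly 99% and exactly 99.99% as belonging to the higher class is an assumption about where the class boundaries fall.

```python
def priority_for_operation_rate(rate_percent: float) -> int:
    """Map a guaranteed operation rate (in percent) to a priority.

    Priority 1 is consulted first when a redundant power supply is
    extracted, so it goes to the lowest guaranteed operation rate.
    Boundary handling (99 and 99.99 fall in the higher class) is assumed.
    """
    if rate_percent < 99.0:
        return 1
    if rate_percent < 99.99:
        return 2
    return 3


# Example: three blade servers with different guarantees (made-up values).
priorities = {rate: priority_for_operation_rate(rate) for rate in (98.5, 99.9, 99.999)}
```

Keeping the rule in one function makes it easy to regenerate the priority column of the management server table if the guaranteed rates change.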
  • The management blade 13 manages each component using the component management table 33b shown in FIG.
  • the component management table 33b will be specifically described with reference to FIG.
  • the parts management table 33b stores each data of the part ID, type, state, operating time, and power consumption for each maintenance part.
  • the parts management table 33b is a table stored in the storage device 13b.
  • the extracted redundant power supply 12 may be the redundant power supply 12 mounted on the blade server 1-1. Therefore, the component management table 33b is data necessary for the maintenance substitute component management device to extract the redundant power supply device 12.
  • “Server blade”, “power supply”, and “redundant power supply” are data representing the type of a maintenance part. “Drive” and “standby” are data representing the state of a maintenance part; other states include “stop” and “failure”. Maintenance parts in the “standby” and “stop” states can both be operated, but “standby” means the part is stopped because it does not need to operate, whereas “stop” means the part is stopped even though it needs to operate.
  • A substitutable redundant power supply device 12 is a power supply device 12 whose type is “redundant power supply” and whose state is “standby”. When a redundant power supply device 12 is put into operation, its state is updated from “standby” to “drive” and its type is updated from “redundant power supply” to “power supply”.
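One way to picture a row of the component management table and the substitutability rule is the sketch below. The field names, the `MaintenancePart` class, and both helper functions are hypothetical; only the columns (part ID, type, state, operating time, power consumption) and the type and state values come from the description, and "drive" is assumed to be the state name for a part in operation.

```python
from dataclasses import dataclass


@dataclass
class MaintenancePart:
    part_id: str
    kind: str               # "server blade", "power supply", "redundant power supply"
    state: str              # "drive", "standby", "stop", "failure"
    operating_hours: float  # total time the part has actually operated (h)
    power_w: float          # power consumption (W)


def is_substitutable(part: MaintenancePart) -> bool:
    """A spare can stand in for a failed unit only while idle and healthy."""
    return part.kind == "redundant power supply" and part.state == "standby"


def activate(part: MaintenancePart) -> None:
    """Put a spare into operation and update its type and state."""
    if not is_substitutable(part):
        raise ValueError(f"part {part.part_id} cannot be used as a substitute")
    part.kind = "power supply"
    part.state = "drive"


spare = MaintenancePart("12-3", "redundant power supply", "standby", 0.0, 0.0)
was_substitutable = is_substitutable(spare)
activate(spare)
```

After `activate`, the same row no longer counts as a spare, so a later inquiry about redundant power supplies would not report it.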
  • the operation time (h) is the total time that the maintenance parts are actually operated, and is used to specify the timing for performing operation check, adjustment, replacement, or the like.
  • the operation time is measured by the arithmetic device 13a using, for example, a built-in hard timer.
  • the power consumption (W) is notified from the maintenance parts or is an average value thereof.
  • the power consumption value of the server blade 11 is notified from the control device 11a.
  • The average power consumption value returned in response to an inquiry from the maintenance substitute part management apparatus is, for example, the average of the power consumption values of all the server blades 11 in the blade server 1.
  • Using the component management table 33b, the management blade 13 of each blade server 1 can determine whether a substitutable redundant power supply device 12 exists. When no redundant power supply device 12 can replace the failed power supply device 12, the management blade 13 of a blade server 1 other than the blade server 1-1 notifies the management blade 13 of the blade server 1-1 to that effect. When a redundant power supply device 12 can replace the failed power supply device 12, the management blade 13 of each blade server 1 operates the redundant power supply device 12 and stops the failed power supply device 12.
  • The maintenance substitute part management device, that is, the management blade 13 of the blade server 1-1, confirms by inquiry the number of substitutable redundant power supply devices 12 in each blade server 1 other than the blade server 1-1 and the average power consumption value of its server blades 11.
  • a reply to the inquiry can be made by the management blade 13 using the component management table 33b.
  • the maintenance substitute part management device sorts the blade servers 1 in which the redundant power supply units 12 that can be substituted exist in priority order, and identifies the blade server 1 having the highest priority order.
  • When only one blade server 1 has the highest priority, the maintenance substitute part management device extracts the redundant power supply device 12 mounted on that blade server 1 as the substitutable redundant power supply device.
  • When a plurality of blade servers 1 share the highest priority, the maintenance substitute part management device sorts those blade servers 1 by average power consumption value and specifies the blade server 1 with the smallest average power consumption value.
  • When a plurality of blade servers 1 also share the smallest average power consumption value, the maintenance substitute part management apparatus selects one of them. One conceivable selection method is to refer to the drive time of each redundant power supply device 12 and select the blade server 1 on which the redundant power supply device 12 with the shorter drive time is mounted.
  • the maintenance substitute part management device extracts the redundant power supply device 12 mounted on the selected blade server 1 as a replaceable redundant power supply device.
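The narrowing described above (priority first, then average power consumption value, then drive time of the redundant power supply) can be sketched with a single composite sort key. This is an equivalent formulation chosen for brevity, not the stepwise procedure of the source; the `Candidate` type, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    server_id: int              # blade server holding a standby spare
    priority: int               # 1 is consulted first
    average_power_w: float      # average power consumption of its server blades
    spare_drive_hours: float    # drive time of its redundant power supply


def select_donor(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Pick the blade server whose spare should be used, or None if none exists."""
    if not candidates:
        return None
    # Tuples compare element by element, so later fields only break ties.
    return min(
        candidates,
        key=lambda c: (c.priority, c.average_power_w, c.spare_drive_hours),
    )


candidates = [
    Candidate(server_id=2, priority=2, average_power_w=80.0, spare_drive_hours=10.0),
    Candidate(server_id=3, priority=1, average_power_w=120.0, spare_drive_hours=500.0),
    Candidate(server_id=4, priority=1, average_power_w=95.0, spare_drive_hours=900.0),
]
donor = select_donor(candidates)
```

Here server 4 is chosen: servers 3 and 4 share priority 1, and server 4 has the smaller average power consumption value, so drive time is never consulted.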
  • the redundant power supply device 12 removed from the blade server 1 is temporarily used in the newly installed blade server 1 or used until a failure occurs in the newly installed blade server 1.
  • the endurance time (life) of the power supply device 12 tends to be shorter as the supplied power is larger.
  • A redundant power supply device 12 has not necessarily gone completely unused (unoperated). For this reason, assuming that the same number of server blades 11 is mounted on each blade server 1, the time until the redundant power supply device 12 is expected to fail can be considered shorter the larger the average power consumption value of the server blades 11. Therefore, in the present embodiment, the redundant power supply device 12 considered less likely to fail is used as the substitute, so that a specific blade server 1 does not remain in a power shortage state for a long time.
  • The fault trap receiver 31, the alternative component extractor 32, and the data output unit 34 shown in FIG. 3 are realized when the arithmetic device 13a executes the redundant power supply extraction process shown in FIG.
  • This redundant power supply extraction process is a process for dealing with a failure that has occurred in the power supply device 12 of the blade server 1 for which there is no replaceable redundant power supply device 12, and the arithmetic device 13a is a component stored in the storage device 13b. This is realized by executing the management program.
  • the redundant power supply extraction process will be described in detail with reference to FIG.
  • the arithmetic unit 13a monitors reception of an SNMP trap by the interface 13c (S1).
  • When an SNMP trap is received, the arithmetic device 13a next determines whether or not the OID stored in the SNMP trap is the target OID, that is, an OID representing a power shortage state (S2).
  • When the interface 13c receives an SNMP trap in which the target OID is stored, the determination in S2 is Yes and the process proceeds to S3.
  • When the interface 13c receives a message in which the target OID is not stored, the determination in S2 is No and the process returns to S1.
  • the interface 13c waits for reception of the SNMP trap storing the corresponding OID.
  • the fault trap receiver 31 shown in FIG. 3 is realized by the arithmetic device 13a executing each process of S1 and S2.
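The S1 and S2 loop can be sketched without any SNMP library by treating received traps as plain dictionaries. Everything concrete here is an assumption for illustration: the OID string is a placeholder rather than a real assignment, and the `oid` and `sender_id` keys and the function name are hypothetical.

```python
from typing import Iterable, Mapping, Optional

# Placeholder OID standing for "power shortage state"; not a real assignment.
POWER_SHORTAGE_OID = "1.3.6.1.4.1.99999.1"


def wait_for_power_shortage(traps: Iterable[Mapping[str, object]]) -> Optional[object]:
    """Consume traps until one reports a power shortage; return its sender.

    Returns None if the stream ends without a matching trap.
    """
    for trap in traps:                            # S1: a trap has been received
        if trap.get("oid") == POWER_SHORTAGE_OID:  # S2: is it the target OID?
            return trap["sender_id"]               # Yes: continue with S3
        # No: ignore the message and keep waiting (back to S1)
    return None


incoming = [
    {"oid": "1.3.6.1.4.1.99999.2", "sender_id": 3},   # unrelated notification
    {"oid": POWER_SHORTAGE_OID, "sender_id": 2},      # blade server 1-2 is short of power
]
short_of_power = wait_for_power_shortage(incoming)
```

In a real deployment the iterable would be fed by whatever receives traps on the interface; passing it in keeps the filtering logic testable on its own.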
  • This redundant power supply extraction process is realized by the arithmetic device 13a executing the component management program stored in the storage device 13b.
  • In S3, the arithmetic device 13a selects one blade server 1, excluding its own blade server 1-1 and the blade server 1 that transmitted the SNMP trap, and inquires of the selected blade server 1 about the number of redundant power supply devices 12.
  • One blade server 1 is selected with reference to the management server table 33a. The inquiry is made by sending an SNMP message storing the corresponding OID.
  • the arithmetic device 13a determines whether or not the number of redundant power supply devices 12 notified in response to the inquiry is zero. If there is no redundant power supply device 12 in the inquired blade server 1, the determination in S4 is Yes and the process proceeds to S8. When the redundant power supply device 12 is mounted on the blade server 1, the determination in S4 is No and the process proceeds to S5.
  • the arithmetic device 13a further makes an inquiry to the inquired blade server 1 to confirm the state of the redundant power supply device 12.
  • the inquiry is also made by transmitting an SNMP message storing the corresponding OID.
  • the arithmetic device 13a after the inquiry waits to receive a reply, and determines whether or not the state notified by the reply represents a usable state (denoted as “ok” in FIG. 8). If the notified state is “standby”, the determination in S6 is Yes and the process proceeds to S7. When the notified state is “failure” or “operation”, the determination in S6 is No and the process proceeds to S8.
  • the determination of No in S6 means that there is no replaceable redundant power supply device 12 in the blade server 1 that made the inquiry.
  • the arithmetic device 13a further inquires the selected blade server 1 about the average power consumption value (denoted as “average power” in FIG. 8) of the server blade 11. This inquiry is made by transmitting an SNMP message storing the corresponding OID, as with other inquiries. After making the inquiry, it waits to receive a reply, saves the average power consumption value of the server blade 11 notified by the reply in the storage device 13b, and then proceeds to S8.
  • the arithmetic device 13a determines whether or not the selected blade server 1 is the last blade server 1. If there is no other blade server 1 to be inquired, the determination in S8 is Yes and the process proceeds to S9. If there is another blade server 1 to be inquired, the determination in S8 is No and the process returns to S3. In S3, an inquiry is made by selecting another blade server 1 anew.
  • the arithmetic device 13a executes an alternative component determination process for determining the redundant power supply device 12 to be a maintenance alternative component using the result of the inquiry.
  • the arithmetic device 13a performs screen output to the terminal device 3 for notifying the worker of the determined redundant power supply device 12 (S10). Thereafter, the redundant power supply extraction process is terminated.
  • Screen output to the terminal device 3 is performed by transmitting a message storing data of a screen (image) to be output.
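The inquiry loop of S3 to S8 can be sketched as follows. The `query` callable is a hypothetical stand-in for sending an SNMP message with the corresponding OID and waiting for the reply; the item names, the function name, and the fake replies are likewise assumptions made for the example.

```python
from typing import Callable, Dict, Iterable


def collect_candidates(
    server_ids: Iterable[int],
    self_id: int,
    failed_id: int,
    query: Callable[[int, str], object],
) -> Dict[int, float]:
    """Return {server_id: average power} for servers with a standby spare."""
    results: Dict[int, float] = {}
    for sid in server_ids:                               # S3 / S8: next server
        if sid in (self_id, failed_id):
            continue                                     # never asked
        if query(sid, "redundant_count") == 0:           # S4: no spare mounted
            continue
        if query(sid, "redundant_state") != "standby":   # S5-S6: spare unusable
            continue
        results[sid] = float(query(sid, "average_power"))  # S7: save the value
    return results


# Fake replies standing in for the other management blades (made-up data).
replies = {
    2: {"redundant_count": 0},
    3: {"redundant_count": 1, "redundant_state": "standby", "average_power": 120.0},
    4: {"redundant_count": 1, "redundant_state": "failure"},
}
found = collect_candidates(
    [1, 2, 3, 4, 5], self_id=1, failed_id=5,
    query=lambda sid, item: replies[sid][item],
)
```

Only server 3 survives the checks in this example; its saved average power value is what the later substitute part determination (S9) would work from.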
  • FIG. 7 is a flowchart of the substitute part determination process. Next, the substitute part determination process will be described in detail with reference to FIG.
  • The arithmetic device 13a determines whether or not there is a redundant power supply device 12 that can be used for another blade server 1 (S20). If any of the inquired blade servers 1 is equipped with a redundant power supply device 12 in the standby state, the determination in S20 is Yes and the process proceeds to S21. If every inquired blade server 1 either has no redundant power supply device 12, has a failed redundant power supply device 12, or has its redundant power supply device 12 in operation, the determination in S20 is No and the process moves to S23.
  • the arithmetic device 13a determines that there is no redundant power supply device 12 that can be a maintenance substitute component. After making such a determination, the substitute part determination process ends. Thereby, when the above-described S10 is executed, the terminal device 3 outputs a screen indicating that fact.
  • The arithmetic device 13a extracts the blade servers 1 on which a usable redundant power supply device 12 has been confirmed and sorts them in priority order with reference to the management server table 33a. It then determines whether exactly one blade server 1 has the highest priority among the sorted blade servers 1. If so, the determination in S21 is Yes, the process proceeds to S23, and the redundant power supply device 12 mounted on that blade server 1 is determined as the maintenance substitute part. If a plurality of blade servers 1 share the highest priority, the determination in S21 is No and the process proceeds to S25.
  • In the example management server table, exactly one blade server 1 has the highest priority in any of the following cases: only one extracted blade server 1 has priority 1; no extracted blade server 1 has priority 1 and only one has priority 2; or no extracted blade server 1 has priority 1 or 2 and only one has priority 3.
  • The arithmetic device 13a sorts the blade servers 1 that share the highest priority by average power value and selects the blade server 1 with the smallest average power value. Thereafter, the process proceeds to S23, and the redundant power supply device 12 mounted on the selected blade server 1 is determined as the maintenance substitute part.
  • In this way, the blade servers 1 equipped with a redundant power supply device 12 that can serve as a maintenance substitute part are narrowed down first by the priority order of the blade servers 1 and then by the average power value of the server blades 11. Accordingly, the present embodiment preferentially selects a redundant power supply device 12 that is considered less likely to fail, while suppressing the possibility of selecting a redundant power supply device 12 mounted on a blade server 1 with a high guaranteed operation rate.
  • When the average power consumption value of the server blades 11 is the same for a plurality of blade servers 1, one blade server 1 may be selected arbitrarily from among them. To select a redundant power supply device 12 that is considered less likely to fail, the operation time of each redundant power supply device 12 may be referred to and the blade server 1 equipped with the redundant power supply device 12 with the shortest operation time may be selected. To suppress a decrease in the operation rate, the number of times a power supply device 12 has failed while no substitute redundant power supply device 12 was available may be counted, and the blade server 1 with the smallest count may be selected.
  • FIG. 8 is a flowchart showing a recovery procedure performed by a worker when a redundant power supply installed in another blade server is determined as a substitute part for maintenance.
  • Next, the recovery procedure performed by the worker is described in detail.
  • The “system stop due to hardware failure” shown in FIG. 8 means that the entire system of the blade server 1, or some of its server blades 11, has stopped because a power supply device 12 failed while no replaceable redundant power supply device 12 was available.
  • The “maintenance part” and the “failed part” correspond to the power supply device 12, and the “maintenance substitute part” corresponds to the redundant power supply device 12.
  • When the worker recognizes an abnormality in any blade server 1 using the terminal device 3 or the like, the worker identifies the failed maintenance part in that blade server 1 (S100).
  • The abnormality assumed here is a power shortage state that arises because the failed maintenance part is a power supply device 12 and no replaceable redundant power supply device 12 is mounted on the blade server 1.
  • The maintenance substitute part management device determines the redundant power supply device 12 to be used as the maintenance substitute part and displays the result on the terminal device 3. Based on this, the worker removes the redundant power supply device 12 from the blade server 1 presented by the maintenance substitute part management device and mounts it on the blade server 1 in which the abnormality occurred (S200). This completes the recovery of the blade server 1 in which the abnormality occurred. If that blade server 1 has no free place to mount the redundant power supply device 12, the worker replaces the failed power supply device 12 with the redundant power supply device 12.
  • The occurrence of the above abnormality means that the computer system does not have enough power supply devices 12, because not every blade server 1 is equipped with a replaceable redundant power supply device 12. The worker or the person in charge therefore orders a power supply device 12 from the supplier (S300) and receives the power supply device 12 that the supplier delivers (S310). The worker then either mounts the delivered power supply device 12 as a new redundant power supply device 12 on the blade server 1 from which the redundant power supply device 12 was removed, or uses it to replace the redundant power supply device 12 mounted on the blade server 1 in which the abnormality occurred (S320). In the latter case, the redundant power supply device 12 removed from the blade server 1 in which the abnormality occurred may be mounted on its original blade server 1 again.
  • Because the redundant power supply devices 12 are handled as shared maintenance substitute parts across the entire computer system, recovery that has to wait for an ordered power supply device 12 can be minimized. Recovery is therefore faster, the time spent in the power shortage state is minimized, and the resulting reduction in the operation rate is minimized as well.
  • When a redundant power supply device 12 mounted on one blade server 1 is moved to another blade server 1, the component management tables 33b of both blade servers 1 need to be updated.
  • The management blade 13 of each blade server 1 recognizes the removal of a component connected to the bus 21 or 22 and the connection of a new component to the bus 21 or 22, and updates the component management table 33b. As a result, even when a redundant power supply device 12 is moved between blade servers 1, the component management table 33b of each blade server 1 is updated to reflect the move.
  • With this update, the operation time of the redundant power supply device 12 is set to 0. Because the operation time should be managed accurately in preparation for failures, it is desirable to notify the blade server 1 that receives the redundant power supply device 12 of its operation time up to that point. For this purpose, the maintenance substitute part management device may be provided with a function of notifying the operation time.
  • In the above description, a redundant power supply device 12 is selected as the power supply device 12 to be taken from another blade server 1. When no replaceable redundant power supply device 12 exists, however, a power supply device 12 that is in operation may also be targeted, because the operation rate to be guaranteed may differ from one blade server 1 to another. The power supply device 12 to be mounted on the blade server 1 in the power shortage state may therefore be taken from a blade server 1 whose guaranteed operation rate is lower than that of the blade server 1 in the power shortage state. In other words, a power supply device 12 operating in another blade server 1 is regarded as a spare power supply device 12 depending on the situation.
  • The maintenance substitute part management device is realized by being mounted on the management blade 13 of one blade server 1-1.
  • The maintenance substitute part management device can also be mounted on a server blade 11. Because a management blade 13 may fail, it is desirable that the program that realizes the maintenance substitute part management device (the parts management program) be executable by the management blade 13 of any blade server 1.
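The substitute part determination process of FIG. 7 described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the `BladeServerStatus` record, its field names, and the convention that priority 1 is the highest priority are hypothetical, and the inquiry results are assumed to have been collected already.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BladeServerStatus:
    """Hypothetical inquiry result for one blade server."""
    server_id: int
    priority: int                 # assumption: 1 is the highest priority
    avg_power_w: float            # average power value of the server blades
    standby_redundant_psu: bool   # True if a redundant PSU is mounted and in standby


def choose_donor(statuses: List[BladeServerStatus]) -> Optional[BladeServerStatus]:
    """Pick the blade server whose redundant PSU becomes the substitute part."""
    # Keep only servers with a usable (standby) redundant PSU.
    candidates = [s for s in statuses if s.standby_redundant_psu]
    if not candidates:
        return None  # no redundant PSU can serve as a substitute part
    # Narrow down by priority first, then by smallest average power value.
    return min(candidates, key=lambda s: (s.priority, s.avg_power_w))


if __name__ == "__main__":
    servers = [
        BladeServerStatus(1, 1, 900.0, False),
        BladeServerStatus(2, 2, 400.0, True),
        BladeServerStatus(3, 2, 350.0, True),
    ]
    donor = choose_donor(servers)
    print(donor.server_id if donor else "no substitute available")
```

`min` with a tuple key compares priority first and average power second. When both are equal it returns the first candidate, which corresponds to the arbitrary choice that the description allows.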

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Power Engineering (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Emergency Management (AREA)
  • Power Sources (AREA)

Abstract

One system to which the present invention is applied comprises a plurality of computers, each equipped with a power supply device. The system comprises a storage unit that stores status information indicating the availability of a spare power supply device mounted on one or more of the plurality of computers, and a specification unit that, when a failure occurs in the power supply device of any one of the plurality of computers, refers to the status information stored in the storage unit and specifies a computer that is different from the computer in which the failure occurred and that is equipped with a power supply device capable of substituting for the failed power supply device. A worker who responds to the failure is informed of the different computer specified by the specification unit.

Description

Computer system, management apparatus, management method, and program
The present invention relates to a technique for dealing with a failure of a power supply device.
Electrical products are required to be highly reliable. A server, which is a computer connected to a network, must be available to many people through that network whenever they need it. For this reason, servers are required to be particularly reliable.
A server is equipped with one or more power supply devices. A power supply device is a maintenance part in which a failure can occur. When an installed power supply device fails, power from that device stops and the server runs short of power. A power supply failure is therefore very likely to stop the server.
In a server (blade server) equipped with a plurality of server blades, each of which can function as a server, the server blades can be operated individually. When a power supply device fails, a blade server can therefore cope with the power shortage by stopping some of the operating server blades to reduce power consumption.
Both a server stop caused by a power shortage and a reduction in the number of operating server blades hinder comfortable use by users, so a power shortage state must be resolved quickly. For this reason, many servers today not only support hot replacement, in which a power supply device is replaced while the server itself remains powered on, but are also equipped with a spare power supply device (maintenance substitute part). To distinguish it from a power supply device that is operating or is to be operated, the spare power supply device is hereinafter referred to as a “redundant power supply device”.
When a power supply device fails in a server equipped with a redundant power supply device, the server operates the redundant power supply device in place of the failed one, which avoids the power shortage that would otherwise follow from stopping the failed power supply device. Installing a redundant power supply device therefore improves the reliability of the server and enables a higher operation rate. The operation rate is the operating time during which the server actually runs within a target period, divided by that target period and multiplied by 100 (= operating time × 100 / target time).
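The operation rate formula above can be written as a small Python function. This is only an illustrative sketch: the function name, the parameter names, and the use of hours as the unit are assumptions, since the description fixes only the ratio itself.

```python
def operation_rate(operating_hours: float, target_hours: float) -> float:
    """Operation rate in percent: operating time x 100 / target time."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    return operating_hours * 100 / target_hours


if __name__ == "__main__":
    # Hypothetical example: a server that ran 720 of 744 hours in a month.
    print(f"{operation_rate(720, 744):.2f} %")
```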
In a server that has put its redundant power supply device into operation, the failed power supply device is repaired or replaced. However, another power supply device may fail before that repair or replacement is finished. If no replaceable redundant power supply device exists at that point, the server that runs short of power either stops as a whole system or stops some of its operating server blades. Even a server that can carry a redundant power supply device should therefore take into account the situation in which no redundant power supply device is available to replace a failed power supply device.
One conventional server (computer) system with a plurality of servers allows redundant power supply devices to be mounted and also connects each server to other servers via power cables, so that a server can supply power to, and receive power from, other servers. In this conventional server system, a server whose power supply device fails when it has no redundant power supply device can avoid a power shortage by receiving power from other servers. This gives each server higher reliability.
However, connecting servers with power cables for mutual power supply requires each server to have the equipment needed for the cable connection. Each server must also have a function for responding to requests from the other servers connected by power cable. As a result, both the manufacturing cost and the installation cost of the servers increase greatly in this conventional server system, and modifying an existing server system is expensive.
Moreover, even in the conventional server system, none of the servers connected by power cables may have a replaceable redundant power supply device. This means that a server can still run short of power because of a power supply failure. The conventional server system should therefore also take into account the situation in which no replaceable redundant power supply device exists.
A worker must respond to a failure that occurs in a power supply device when no replaceable redundant power supply device exists. The worker must promptly mount a substitute power supply device on the server whose power supply device failed and resolve the power shortage state. However, a substitute power supply device is not always on hand. If failures occur in succession in a plurality of power supply devices, including redundant ones, or if procurement of power supply devices is delayed for some reason, the computer system is likely to have no substitute power supply device. In dealing with power supply failures, it is therefore desirable to consider the situation in which no substitute power supply device has been prepared, that is, the situation in which every operable power supply device is already mounted on some server.
JP 2009-169874 A
JP 2008-83841 A
In one aspect, an object of the present invention is to make it easy to find a power supply device that can be used as a substitute when a failure occurs in the power supply device of an apparatus.
One system to which the present invention is applied includes a plurality of computers, each equipped with a power supply device. The system includes a storage unit that stores state information indicating the availability of a spare power supply device mounted on one or more of the plurality of computers, and a specifying unit that, when a failure occurs in the power supply device of any of the plurality of computers, refers to the state information stored in the storage unit and specifies another computer, different from the computer in which the failure occurred, that is equipped with a power supply device capable of substituting for the failed power supply device.
In one system to which the present invention is applied, a substitute power supply device can be found easily when a failure occurs in the power supply device of an apparatus.
A diagram illustrating a configuration example of the computer system according to the present embodiment.
A diagram illustrating a more detailed configuration of a blade server.
A diagram illustrating the functional configuration of the maintenance substitute part management device according to the present embodiment.
A diagram illustrating a configuration example of the management server table.
A diagram illustrating a configuration example of the component management table.
A flowchart of the redundant power supply extraction process.
A flowchart of the substitute part determination process.
A flowchart showing the recovery procedure performed by a worker when a redundant power supply device mounted on another blade server is determined as the maintenance substitute part.
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram illustrating a configuration example of a computer system according to the present embodiment. As shown in FIG. 1, the computer system has a plurality of blade servers 1 connected to a network 2. A terminal device 3 (for example, a console) used by an operator or a worker is also connected to the network 2.
FIG. 1 shows three blade servers 1-1 to 1-3, but the number of blade servers 1 connected to the network 2 is not limited to three. In FIG. 1, the “1” in “Blade Server 1” and the “1” following the hyphen in reference numeral 1-1 both represent the number assigned to that blade server 1 as identification information (ID: IDentifier). Similarly, the “2” in “Blade Server 2” and the “2” following the hyphen in reference numeral 1-2 both represent the number assigned to that blade server 1 as its ID.
The three blade servers 1-1 to 1-3 shown in FIG. 1 are computers according to the present embodiment. As shown in FIG. 1, each blade server 1 includes a plurality of server blades 11 (11-1 to 11-10), a plurality of power supply devices 12 (12-1 to 12-3), and a management blade 13.
The three power supply devices 12 (12-1 to 12-3) are mounted to realize, for example, a 2+1 redundant configuration. Each blade server 1 can obtain the power it needs by operating two power supply devices 12. One power supply device 12 is therefore a spare (redundant power supply device), that is, a maintenance substitute part that takes over when one of the other two power supply devices 12 fails. The number of server blades 11 mounted on a blade server 1 is not limited to ten, and the number of mountable power supply devices 12 and the redundant configuration are not limited to three and 2+1, respectively.
FIG. 2 is a diagram illustrating a more detailed configuration of the blade server.
Each power supply device 12 includes a control device 12a that starts and stops that power supply device 12, and each server blade 11 likewise includes a control device 11a that starts and stops that server blade 11. The control device 12a of each power supply device 12 is connected to the management blade 13 by a bus 21, and the control device 11a of each server blade 11 is connected to the management blade 13 by a bus 22. Power is supplied to the control devices 12a and 11a from one of the power supply devices 12 unless the power supply from all power supply devices 12 has stopped. Each power supply device 12 and each server blade 11 can therefore switch between operating and stopped in accordance with instructions from the management blade 13.
The management blade 13 manages the operation of the entire blade server 1. As shown in FIG. 2, it includes an arithmetic device 13a (for example, a CPU (Central Processing Unit)), a storage device 13b, and an interface 13c (denoted “I/F” in the figure).
The storage device 13b is a holding unit that holds, for example, programs executed by the arithmetic device 13a and various data. The arithmetic device 13a performs control for managing the entire blade server 1 by reading a program stored in the storage device 13b into, for example, its own on-board memory and executing it. The interface 13c enables the arithmetic device 13a to communicate with each power supply device 12 via the bus 21 and with each server blade 11 via the bus 22.
The control device 12a of each power supply device 12 detects a failure in its own power supply device 12 and notifies the management blade 13 of the detected failure. On receiving the notification, if a replaceable redundant power supply device 12 exists, the management blade 13 instructs the control device 12a of that redundant power supply device 12 to start operating and instructs the control device 12a of the failed power supply device 12 to stop. In this way, the management blade 13 has the redundant power supply device 12 take over from the power supply device 12 in which the failure was detected.
If no replaceable redundant power supply device 12 exists, the management blade 13 instructs the control device 12a of the power supply device 12 that detected the failure to stop. Because the failure leaves the blade server 1 short of power, the management blade 13 determines which server blades 11 to stop and instructs the control devices 11a of those server blades 11 to stop. As a result, only as many server blades 11 as the operating power supply devices 12 can power keep running.
The management blade 13 notifies the terminal device 3 of a failure in a power supply device 12, including whether a redundant power supply device 12 was able to take over from that power supply device 12. The worker monitors each blade server 1 through the terminal device 3 and takes the necessary action.
In FIG. 1, two power supply devices 12 of the blade server 1-2 and five server blades 11-6 to 11-10 are marked with an ×. An × on a power supply device 12 indicates that a failure has occurred, and an × on a server blade 11 indicates that it has been stopped because of the failures in the two power supply devices 12. As shown in FIG. 1, when only one power supply device 12 remains operable, the management blade 13 keeps only the five server blades 11-1 to 11-5 running and stops the other five server blades 11-6 to 11-10. When no operable power supply device 12 remains, the blade server 1-2 enters a system stop state in which all server blades 11 and the management blade 13 are stopped.
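The failure handling performed by the management blade 13 can be sketched in Python as follows. This is a hedged illustration, not the embodiment's firmware: the function name, the state strings, and the figure of five server blades per operating power supply device (inferred from the FIG. 1 example of ten blades on two units and five blades on one) are assumptions, as is the policy of stopping the highest-numbered server blades first.

```python
from typing import Dict, List, Tuple


def handle_psu_failure(
    psu_states: Dict[str, str],
    failed_psu: str,
    blade_ids: List[str],
    blades_per_psu: int = 5,  # assumption taken from the FIG. 1 example
) -> Tuple[Dict[str, str], List[str]]:
    """Return (new PSU states, server blades left running) after a PSU failure."""
    states = dict(psu_states)        # copy so the caller's table is untouched
    states[failed_psu] = "failed"    # the failed unit is instructed to stop

    # If a standby (redundant) unit exists, start it in place of the failed one.
    for psu, state in states.items():
        if state == "standby":
            states[psu] = "active"
            break

    # Keep only as many server blades running as the active units can power.
    active = sum(1 for s in states.values() if s == "active")
    capacity = active * blades_per_psu
    return states, list(blade_ids[:capacity])


if __name__ == "__main__":
    psus = {"12-1": "active", "12-2": "active", "12-3": "standby"}
    blades = [f"11-{i}" for i in range(1, 11)]
    psus, running = handle_psu_failure(psus, "12-1", blades)  # redundant unit takes over
    psus, running = handle_psu_failure(psus, "12-2", blades)  # no redundant unit left
    print(running)
```

With no operating unit left, the returned list of running server blades is empty, which corresponds to the system stop state.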
In this way, the management blade 13 responds to a power shortage state in which only one power supply device 12 is operating by stopping some of the server blades 11. However, stopping some of the server blades 11 lowers the processing capacity of the blade server 1, which makes it more likely that users can no longer use the blade server 1 comfortably. The power shortage state therefore needs to be resolved quickly. The present embodiment resolves the power shortage state quickly and more reliably as follows.
 ネットワーク2に接続された3台のブレードサーバ1はそれぞれ予備の電源装置(冗長電源装置)12を1つ以上、搭載可能となっている。そのため、コンピュータシステムは、少なくともブレードサーバ1の数分、冗長電源装置12を有することができる。本実施形態では、このことに着目し、コンピュータシステムの各ブレードサーバ1に搭載された冗長電源装置12を、共有の保守代替部品として扱う。それにより、本実施形態では、代替可能な冗長電源装置12が存在しないブレードサーバ1で電源装置12に障害が発生した場合、そのブレードサーバ1に搭載可能な冗長電源装置12を他のブレードサーバ1から抽出(特定)する。抽出した冗長電源装置12は、例えばその冗長電源装置12を搭載したブレードサーバ1を作業員に提示することで通知する。ブレードサーバ1に代替可能な冗長電源装置12が存在しない状況とは、冗長電源装置12が搭載されていない(搭載された電源装置12を全て稼動させている)、或いは冗長電源装置12が既に故障している状況である。 Each of the three blade servers 1 connected to the network 2 can be equipped with one or more spare power supply devices (redundant power supply devices) 12. Therefore, the computer system can have redundant power supply devices 12 for at least the number of blade servers 1. In the present embodiment, paying attention to this, the redundant power supply device 12 mounted on each blade server 1 of the computer system is handled as a shared maintenance substitute part. Accordingly, in the present embodiment, when a failure occurs in the power supply device 12 in the blade server 1 in which the replaceable redundant power supply device 12 does not exist, the redundant power supply device 12 that can be mounted on the blade server 1 is replaced with another blade server 1. Extract (specify) from For example, the extracted redundant power supply 12 notifies the worker of the blade server 1 on which the redundant power supply 12 is mounted. The situation where there is no redundant power supply 12 that can replace the blade server 1 means that the redundant power supply 12 is not installed (all installed power supplies 12 are operating) or the redundant power supply 12 has already failed. Is the situation.
 コンピュータシステムの各ブレードサーバ1に搭載された冗長電源装置12を、共有の保守代替部品とする場合、コンピュータシステムに使用されていない予備の電源装置12が存在していない状況でも、電源装置12に発生した障害に対応できるようになる。その状況でも、コンピュータシステムを構成する何れのブレードサーバ1に代替可能な電源装置12が存在していれば、代替可能な冗長電源装置12が存在しないブレードサーバ1の電力不足状態を解消することができる。このことから、電力不足状態の解消はより確実に行えるようになる。代替可能な電源装置12の存在の有無、その電源装置12が存在している場合はその電源装置12が搭載されたブレードサーバ1を通知するため、作業員は、電力不足状態の解消を常に迅速に行うことができる。 When the redundant power supply 12 mounted on each blade server 1 of the computer system is used as a shared maintenance replacement part, even if there is no spare power supply 12 that is not used in the computer system, the power supply 12 It becomes possible to respond to the failure that occurred. Even in this situation, if there is a power supply 12 that can be replaced by any blade server 1 that constitutes the computer system, the power shortage state of the blade server 1 that does not have the redundant power supply 12 that can be replaced can be resolved. it can. For this reason, the power shortage state can be resolved more reliably. The presence / absence of the replaceable power supply device 12 and the presence of the power supply device 12 are notified to the blade server 1 on which the power supply device 12 is mounted. Can be done.
The extraction of a replaceable power supply device 12 from the entire computer system and the notification of the extraction result are performed by the maintenance substitute part management device according to the present embodiment. In the present embodiment, the maintenance substitute part management device is mounted on one of the blade servers 1; here it is assumed to be mounted on the blade server 1-1. The maintenance substitute part management device can be mounted on any computer (data processing device) capable of communicating with each blade server 1.
FIG. 3 is a diagram illustrating the functional configuration of the maintenance substitute part management device according to the present embodiment.
As shown in FIG. 3, the maintenance substitute part management device according to the present embodiment is mounted on the management blade 13 and includes a failure trap receiving unit 31, a substitute part extraction unit 32, a data holding unit 33, and a data output unit 34.
The data holding unit 33 corresponds to the storage device 13b shown in FIG. 2. The failure trap receiving unit 31, the substitute part extraction unit 32, and the data output unit 34 are each realized by the arithmetic device 13a executing a program for managing maintenance substitute parts (hereinafter the "part management program") stored in the storage device 13b and thereby controlling the interface 13c.
When a blade server 1 other than the blade server 1-1 enters a power shortage state because of a failure in a power supply device 12, the management blade 13 mounted on that blade server 1 generates a message reporting the state and sends it to the blade server 1-1. The failure trap receiving unit 31 receives and processes the message, and thereby notifies the substitute part extraction unit 32 of the blade server 1 that has entered the power shortage state.
SNMP (Simple Network Management Protocol) can be used to send and receive such messages. SNMP is a protocol for monitoring and controlling, over a network, communication equipment connected to the network, such as computers, routers, and terminal devices. The power shortage state can be reported with an "SNMP trap". An SNMP trap is a function by which an SNMP agent informs an SNMP manager that a preset abnormal value has been detected. Object IDs (OIDs) are used to convey the content of notifications, including the type of abnormal value detected, and the content of requests. The management blade 13 of the blade server 1-1 corresponds to the SNMP manager, and the management blade 13 of each blade server 1 other than the blade server 1-1 corresponds to an SNMP agent. Here, the term "SNMP trap" is also used to mean the message transmitted by this function.
When the failure trap receiving unit 31 notifies it of a blade server 1 in a power shortage state, the substitute part extraction unit 32 queries the other blade servers 1 for the number of redundant power supply devices 12, their state, and the average power consumption value of the server blades 11. The substitute part extraction unit 32 thereby identifies the redundant power supply devices 12 that are replaceable across the entire computer system and checks the state of the power supply devices 12 in each blade server 1 on which an identified redundant power supply device 12 is mounted. When this check finds a plurality of replaceable redundant power supply devices 12, the substitute part extraction unit 32 extracts the one considered most suitable. The data output unit 34 outputs the extraction result via the terminal device 3 by transmitting data representing the redundant power supply device 12 extracted by the substitute part extraction unit 32 and the blade server 1 on which it is mounted.
The substitute part extraction unit 32 extracts the redundant power supply device 12 with reference to the management server table 33a stored in the data holding unit 33. The data holding unit 33 also stores a part management table 33b, which is referred to as necessary. The extraction method is described in detail below.
FIG. 4 is a diagram illustrating a configuration example of the management server table.
The management server table 33a represents the priority order among the blade servers 1 for extracting a replaceable redundant power supply device 12, and is used to identify the blade server 1 from which a redundant power supply device 12 is to be removed for use in another blade server 1. As shown in FIG. 4, the management server table 33a stores, for each blade server 1, its ID, IP (Internet Protocol) address, and priority. FIG. 4 shows an example of the contents of the management server table when the computer system includes four blade servers 1.
The numbers 1 to 3 representing the priority indicate a lower priority as the number increases. Accordingly, a blade server 1 with priority 1 is the first choice as the blade server 1 from which a redundant power supply device 12 is extracted.
The priority is set according to the operating rate guaranteed for each blade server 1. For example, when three classes of operating rate are guaranteed, namely operating rate ≤ 99%, 99% < operating rate < 99.99%, and 99.99% ≤ operating rate, priority 1 is assigned to the blade servers 1 with the lowest guaranteed operating rate, priority 2 to the blade servers 1 with the next higher operating rate, and priority 3 to the blade servers 1 with the highest operating rate.
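The mapping from guaranteed operating rate to priority in the example above can be sketched as follows. This is only an illustration: the function name `priority_for_operating_rate` and the representation of the operating rate as a percentage are assumptions, and the embodiment does not prescribe any particular implementation.

```python
def priority_for_operating_rate(rate_percent: float) -> int:
    """Return the extraction priority for a guaranteed operating rate.

    Hypothetical helper. The three bands follow the example in the text:
    priority 1 means the redundant power supply is taken first.
    """
    if rate_percent <= 99.0:
        return 1  # lowest guarantee: first choice as a donor
    if rate_percent < 99.99:
        return 2
    return 3      # highest guarantee: last choice as a donor


if __name__ == "__main__":
    for rate in (98.5, 99.5, 99.99):
        print(rate, priority_for_operating_rate(rate))
```

A blade server with a 99.5% guarantee, for instance, falls in the middle band and is considered only after every priority-1 blade server.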
In the present embodiment, redundant power supply devices 12 are extracted preferentially from blade servers 1 with higher priority. As a result, the higher the operating rate guaranteed for a blade server 1, the less likely its redundant power supply device 12 is to be removed. This limits the impact of a failure of a power supply device 12 in a blade server 1 from which the redundant power supply device 12 has been removed.
The management blade 13 manages each part using the part management table 33b shown in FIG. 5. The part management table 33b is described in detail here with reference to FIG. 5.
As shown in FIG. 5, the part management table 33b stores, for each maintenance part, its part ID, type, state, operating time, and power consumption.
The part management table 33b is stored in the storage device 13b. The redundant power supply device 12 to be extracted may be one mounted on the blade server 1-1 itself. The part management table 33b is therefore data that the maintenance substitute part management device needs in order to extract a redundant power supply device 12.
In FIG. 5, "server blade", "power supply", and "redundant power supply" denote types of maintenance part, and "operating" and "standby" denote states of a maintenance part. The other possible states are "stopped" and "failed". Both "standby" and "stopped" are states of a maintenance part that is able to operate, but "standby" is a stop in a situation where the part does not need to operate, whereas "stopped" is a stop in a situation where the part needs to operate. A replaceable redundant power supply device 12 is a power supply device 12 whose type is "redundant power supply" and whose state is "standby". When a redundant power supply device 12 is put into operation, its state is updated from "standby" to "operating" and its type is updated from "redundant power supply" to "power supply".
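One possible in-memory representation of a row of the part management table, together with the replaceability check and the update made when a redundant power supply is put into operation, is sketched below. The class name `PartRecord`, its field names, and the English state strings are assumptions introduced for illustration; the text specifies only which items are stored and how type and state change.

```python
from dataclasses import dataclass


@dataclass
class PartRecord:
    # Illustrative field names for the items listed in the text:
    # part ID, type, state, operating time (h), power consumption (W).
    part_id: str
    kind: str    # "server blade", "power supply" or "redundant power supply"
    state: str   # "operating", "standby", "stopped" or "failed"
    operating_hours: float = 0.0
    power_watts: float = 0.0


def is_replaceable(part: PartRecord) -> bool:
    """A replaceable unit is a redundant power supply that is on standby."""
    return part.kind == "redundant power supply" and part.state == "standby"


def activate(part: PartRecord) -> None:
    """Put a standby redundant power supply into operation and update its row."""
    if not is_replaceable(part):
        raise ValueError(f"{part.part_id} is not a standby redundant power supply")
    part.state = "operating"
    part.kind = "power supply"


if __name__ == "__main__":
    spare = PartRecord("psu-3", "redundant power supply", "standby")
    activate(spare)
    print(spare)
```

After `activate` runs, the same record no longer counts as replaceable, which matches the rule that only a standby redundant power supply can be offered to another blade server.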
The operating time (h) is the total time for which the maintenance part has actually been operated, and is used to determine when to perform operation checks, adjustment, replacement, and the like. The arithmetic device 13a measures the operating time using, for example, an on-board hardware timer. The power consumption (W) is the value reported by the maintenance part, or its average. The power consumption value of a server blade 11 is reported by the control device 11a. The average power consumption value returned in response to a query from the maintenance substitute part management device is, for example, the average of the power consumption values of all the server blades 11.
The management blade 13 of each blade server 1 can determine from the part management table 33b whether a replaceable redundant power supply device 12 exists. When no redundant power supply device 12 is available to take the place of a failed power supply device 12, the management blade 13 of each blade server 1 other than the blade server 1-1 notifies the management blade 13 of the blade server 1-1 to that effect. When such a redundant power supply device 12 is available, the management blade 13 of the blade server 1 puts the redundant power supply device 12 into operation and stops the failed power supply device 12.
By querying, the maintenance substitute part management device, that is, the management blade 13 of the blade server 1-1, checks the number of replaceable redundant power supply devices 12 in each blade server 1 other than the blade server 1-1 and the average power consumption value of the server blades 11. Each management blade 13 can answer the query using its part management table 33b.
Next, the maintenance substitute part management device sorts the blade servers 1 that have a replaceable redundant power supply device 12 by priority and identifies the blade server 1 with the highest priority. If only one blade server 1 has the highest priority, the maintenance substitute part management device extracts the redundant power supply device 12 mounted on that blade server 1 as the replaceable redundant power supply device.
If, on the other hand, a plurality of blade servers 1 share the highest priority, the maintenance substitute part management device sorts those blade servers 1 by average power consumption value and identifies the blade server 1 with the smallest value. If a plurality of blade servers 1 share the smallest average power consumption value, the maintenance substitute part management device selects one of them. One conceivable selection method is to refer to the operating time of the redundant power supply devices 12 and select the blade server 1 whose redundant power supply device 12 has the shorter operating time. The maintenance substitute part management device extracts the redundant power supply device 12 mounted on the selected blade server 1 as the replaceable redundant power supply device.
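The narrowing described above (priority first, then average power consumption, then the operating time of the redundant power supply) can be sketched as a single lexicographic comparison. The names `Candidate` and `select_donor` and the field layout are assumptions made for illustration; the operating-time tie-break is the one the text mentions as a conceivable method, not a required one.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A blade server that has a standby redundant power supply (illustrative)."""
    server_id: str
    priority: int       # from the management server table; 1 is taken first
    avg_power_w: float  # average power consumption of the server blades
    psu_hours: float    # operating time of the redundant power supply


def select_donor(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the blade server whose redundant power supply should be removed.

    Taking the minimum of the tuple is equivalent to filtering by best priority,
    then by smallest average power, then by shortest operating time.
    """
    if not candidates:
        return None  # no replaceable redundant power supply anywhere
    return min(candidates, key=lambda c: (c.priority, c.avg_power_w, c.psu_hours))


if __name__ == "__main__":
    pool = [
        Candidate("bs-2", 2, 300.0, 100.0),
        Candidate("bs-3", 1, 450.0, 50.0),
        Candidate("bs-4", 1, 450.0, 20.0),
        Candidate("bs-5", 1, 500.0, 5.0),
    ]
    print(select_donor(pool))
```

Returning `None` for an empty candidate list corresponds to the case in which no blade server has a replaceable redundant power supply device and the worker must be told so.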
A redundant power supply device 12 removed from a blade server 1 is used in the blade server 1 on which it is newly mounted either temporarily or until a failure occurs there. The endurance time (life) of a power supply device 12 tends to become shorter as the power it supplies increases. Moreover, a redundant power supply device 12 has not necessarily gone entirely unused. Assuming that every blade server 1 carries the same number of server blades 11, the time until a failure is expected to occur in a redundant power supply device 12 can therefore be regarded as shorter the larger the average power consumption value of the server blades 11. For this reason, the present embodiment uses as the substitute the redundant power supply device 12 considered less likely to fail, so that no particular blade server 1 remains in a power shortage state for a long time.
The failure trap receiving unit 31, the substitute part extraction unit 32, and the data output unit 34 shown in FIG. 3 are realized by the arithmetic device 13a executing the redundant power supply extraction process shown in FIG. 6. The redundant power supply extraction process handles a failure that has occurred in a power supply device 12 of a blade server 1 having no replaceable redundant power supply device 12, and is realized by the arithmetic device 13a executing the part management program stored in the storage device 13b. The redundant power supply extraction process is described in detail below with reference to FIG. 6.
First, the arithmetic device 13a monitors the interface 13c for reception of an SNMP trap (S1). When the interface 13c receives an SNMP trap, the arithmetic device 13a determines whether the OID stored in the SNMP trap is the target OID, that is, the OID representing a power shortage state (S2). If the interface 13c has received an SNMP trap storing the target OID, the determination in S2 is Yes and the process proceeds to S3. If the interface 13c has received a message that does not store the target OID, the determination in S2 is No and the process returns to S1 to wait for the interface 13c to receive an SNMP trap storing the target OID. The failure trap receiving unit 31 shown in FIG. 3 is realized by the arithmetic device 13a executing S1 and S2.
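The filtering performed in S1 and S2 can be sketched as follows, with a received trap modeled as a plain record rather than decoded by a real SNMP library. The `Trap` class, the function name, and in particular the OID value are placeholders assumed for illustration; the text does not give the actual OID used for the power shortage state.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Placeholder value for illustration only; the actual OID is not given in the text.
POWER_SHORTAGE_OID = "1.3.6.1.4.1.99999.1.1"


@dataclass(frozen=True)
class Trap:
    source_ip: str  # address of the blade server that sent the trap
    oid: str        # OID carried in the trap


def power_shortage_sources(traps: Iterable[Trap]) -> Iterator[str]:
    """Yield the sender of every trap that carries the target OID."""
    for trap in traps:                       # S1: a trap has been received
        if trap.oid == POWER_SHORTAGE_OID:   # S2: is it the target OID?
            yield trap.source_ip             # Yes: hand over to S3 and later steps
        # No: ignore the message and keep waiting for the next trap


if __name__ == "__main__":
    received = [
        Trap("192.0.2.12", "1.3.6.1.4.1.99999.9.9"),
        Trap("192.0.2.13", POWER_SHORTAGE_OID),
    ]
    for sender in power_shortage_sources(received):
        print("power shortage reported by", sender)
```

Only the sender of a matching trap is passed on, since that blade server is the one to be excluded from the queries that follow.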
In S3, the arithmetic device 13a selects one blade server 1 other than its own blade server 1-1 and the blade server 1 that sent the SNMP trap, and queries the selected blade server 1 for its number of redundant power supply devices 12. The blade server 1 is selected with reference to the management server table 33a. The query is made by transmitting an SNMP message storing the corresponding OID.
Next, the arithmetic device 13a determines whether the number of redundant power supply devices 12 reported in the reply (response) to the query is 0 (S4). If the queried blade server 1 has no redundant power supply device 12, the determination in S4 is Yes and the process proceeds to S8. If a redundant power supply device 12 is mounted on that blade server 1, the determination in S4 is No and the process proceeds to S5.
In S5, the arithmetic device 13a further queries the same blade server 1 to check the state of the redundant power supply device 12. This query is also made by transmitting an SNMP message storing the corresponding OID. After the query, the arithmetic device 13a waits for the reply and determines whether the reported state is a usable state (denoted "ok" in FIG. 6) (S6). If the reported state is "standby", the determination in S6 is Yes and the process proceeds to S7. If the reported state is "failed" or "operating", the determination in S6 is No and the process proceeds to S8. A No determination in S6 means that the queried blade server 1 has no replaceable redundant power supply device 12.
In S7, the arithmetic device 13a further queries the selected blade server 1 for the average power consumption value of the server blades 11 (denoted "average power" in FIG. 6). Like the other queries, this query is made by transmitting an SNMP message storing the corresponding OID. After the query, the arithmetic device 13a waits for the reply, saves the average power consumption value of the server blades 11 reported in the reply in the storage device 13b, and then proceeds to S8.
In S8, the arithmetic device 13a determines whether the selected blade server 1 is the last blade server 1. If no blade server 1 remains to be queried, the determination in S8 is Yes and the process proceeds to S9. If a blade server 1 remains to be queried, the determination in S8 is No and the process returns to S3, where another blade server 1 is selected and queried.
In S9, the arithmetic device 13a uses the query results to execute a substitute part determination process that determines the redundant power supply device 12 to be used as the maintenance substitute part. Next, the arithmetic device 13a outputs a screen to the terminal device 3 to notify the worker of the determined redundant power supply device 12 (S10). The redundant power supply extraction process then ends. The screen is output to the terminal device 3 by transmitting a message storing the data of the screen (image) to be output.
In this way, in the present embodiment, each blade server 1 (its management blade 13) is checked for a redundant power supply device 12 usable by another blade server 1, and the necessary information (the average power consumption value of the server blades 11) is collected. The substitute part determination process is performed with reference to these check results and the collected information.
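The per-server check and information collection of S3 to S8 can be sketched as a loop over the blade servers listed in the management server table. The `query` callable stands in for an SNMP request and reply and is an assumption, as are the item names `"redundant_count"`, `"redundant_state"`, and `"average_power"`; a real implementation would use the OIDs defined for the blade servers.

```python
from typing import Any, Callable, Iterable, List, Tuple

# query(server_id, item) is a hypothetical stand-in for one SNMP request/reply.
Query = Callable[[str, str], Any]


def collect_candidates(servers: Iterable[str], manager: str, failed: str,
                       query: Query) -> List[Tuple[str, float]]:
    """Return (server_id, average power) for servers with a standby redundant PSU."""
    found: List[Tuple[str, float]] = []
    for server in servers:                      # S3 and S8: visit each server once
        if server in (manager, failed):
            continue                            # skip own server and the trap sender
        if query(server, "redundant_count") == 0:           # S4: none mounted
            continue
        if query(server, "redundant_state") != "standby":   # S5, S6: not usable
            continue
        average = float(query(server, "average_power"))     # S7: collect the value
        found.append((server, average))
    return found


if __name__ == "__main__":
    replies = {
        "bs-2": {"redundant_count": 1, "redundant_state": "standby",
                 "average_power": 320.0},
        "bs-3": {"redundant_count": 0},
        "bs-4": {"redundant_count": 1, "redundant_state": "failed"},
    }
    result = collect_candidates(
        ["bs-1", "bs-2", "bs-3", "bs-4", "bs-5"], "bs-1", "bs-5",
        lambda server, item: replies[server][item])
    print(result)
```

The average power is requested only from servers that passed both earlier checks, mirroring the order of the queries in S3, S5, and S7.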
FIG. 7 is a flowchart of the substitute part determination process. The substitute part determination process is described in detail below with reference to FIG. 7.
First, the arithmetic device 13a determines whether there is a redundant power supply device 12 usable by another blade server 1 (S20). If any of the queried blade servers 1 carries a redundant power supply device 12 in the standby state, the determination in S20 is Yes and the process proceeds to S21. If every queried blade server 1 either carries no redundant power supply device 12, has a failed redundant power supply device 12, or has its redundant power supply device 12 in operation, the determination in S20 is No and the process proceeds to S23.
In that case, in S23, the arithmetic device 13a determines that no redundant power supply device 12 can serve as a maintenance substitute part, and the substitute part determination process ends. When S10 above is then executed, the terminal device 3 outputs a screen to that effect.
In S21, the arithmetic device 13a extracts the blade servers 1 confirmed to have a usable redundant power supply device 12 and sorts them by priority with reference to the management server table 33a. Next, the arithmetic device 13a determines whether exactly one of the sorted blade servers 1 has the highest priority. If so, the determination in S21 is Yes and the process proceeds to S23, where the redundant power supply device 12 mounted on that blade server 1 is determined as the maintenance substitute part. If a plurality of blade servers 1 share the highest priority, the determination in S21 is No and the process proceeds to S25. In the example of the management server table shown in FIG. 4, exactly one blade server 1 has the highest priority in any of the following cases: there is only one blade server 1 with priority 1; there is no blade server 1 with priority 1 and only one with priority 2; or only blade servers 1 with priority 3 exist and there is only one of them.
In S25, the arithmetic device 13a sorts the blade servers 1 with the highest priority by average power value. Next, the arithmetic device 13a selects, from the sorted blade servers 1, the blade server 1 with the smallest average power value. The process then proceeds to S23, where the redundant power supply device 12 mounted on the selected blade server 1 is determined as the maintenance substitute part.
In this way, the present embodiment narrows down the blade servers 1 carrying a redundant power supply device 12 to be used as the maintenance substitute part first by the priority of the blade server 1 and then by the average power value of the server blades 11. The redundant power supply device 12 considered less likely to fail is thereby selected preferentially, while the likelihood of selecting a redundant power supply device 12 mounted on a blade server 1 with a high guaranteed operating rate is kept low.
The average power consumption value of the server blades 11 may be the same for a plurality of blade servers 1. In that case, one blade server 1 may be selected arbitrarily from among them. If a redundant power supply device 12 considered less likely to fail is to be selected, the operating times of the redundant power supply devices 12 may be referred to and the blade server 1 carrying the redundant power supply device 12 with the shortest operating time selected. Alternatively, to limit the drop in operating rate, the number of times a power supply device 12 has failed while no replaceable redundant power supply device 12 was available may be counted, and the blade server 1 with the smallest count selected.
FIG. 8 is a flowchart showing the recovery procedure followed by the worker when a redundant power supply device mounted on another blade server has been determined as the maintenance substitute part. The recovery procedure is described in detail here with reference to FIG. 8.
"System stop due to hardware failure" in FIG. 8 indicates that the system of the blade server 1 as a whole, or some of its server blades 11, has stopped because a power supply device 12 failed when no replaceable redundant power supply device was available. The "maintenance part" and the "failed part" correspond to the power supply device 12, and the "maintenance substitute part" corresponds to the redundant power supply device 12.
When the worker recognizes, using the terminal device 3 or the like, an abnormality in one of the blade servers 1, the worker identifies the failed maintenance part in that blade server 1 (S100). The abnormality assumed here is one caused by a power shortage state in which the failed maintenance part is a power supply device 12 and the blade server 1 carries no replaceable redundant power supply device 12. When such an abnormality occurs, the maintenance substitute part management device determines the redundant power supply device 12 to be used as the maintenance substitute part and displays the result on the terminal device 3. The worker therefore removes the redundant power supply device 12 from the blade server 1 presented by the maintenance substitute part management device and mounts it on the blade server 1 in which the abnormality occurred (S200). This completes the recovery of that blade server 1. If the blade server 1 has no free place to mount the redundant power supply device 12, the worker replaces the failed power supply device 12 with the redundant power supply device 12.
 The occurrence of this abnormality means that the computer system lacks power supply devices 12 that it needs, because not every blade server 1 is equipped with a substitutable redundant power supply device 12. The worker or a person in charge therefore orders a power supply device 12 from a supplier (S300) and receives the power supply device 12 that the supplier delivers (S310). The worker then either mounts the delivered power supply device 12, as a redundant power supply device 12, on the blade server 1 from which the redundant power supply device 12 was removed, or swaps it with the redundant power supply device 12 now mounted on the blade server 1 in which the abnormality occurred (S320). In the latter case, the redundant power supply device 12 removed from the blade server 1 in which the abnormality occurred can simply be mounted again on its original blade server 1.
 A certain amount of time passes between ordering a power supply device 12 and its actual delivery. Recovering a blade server 1 in an abnormal (power shortage) state by waiting for an ordered power supply device 12 should therefore be avoided in order to keep the operating rate high. When the redundant power supply devices 12 are treated as maintenance substitute parts shared across the whole computer system, however, the cases in which recovery has to rely on a delivered power supply device 12 can be kept to a minimum. Recovery is therefore faster, the time spent in the power shortage state is minimized, and as a result the drop in operating rate caused by the power shortage state is also minimized.
 When a redundant power supply device 12 mounted on one blade server 1 is moved to another blade server 1, the component management tables 33b of both blade servers 1 need to be updated. The management blade 13 of each blade server 1 recognizes the removal of a component connected to the bus 21 or 22 and the connection of a new component to the bus 21 or 22, and updates the component management table 33b. Thus, even when a redundant power supply device 12 is moved between blade servers 1, each blade server 1 updates its component management table 33b in accordance with the move.
 In the blade server 1 on which the redundant power supply device 12 is newly mounted, this update sets the operating time of the redundant power supply device 12 to 0. The operating time should be managed accurately, not least in order to prepare for failures. It is therefore desirable to notify the blade server 1 to which the redundant power supply device 12 is moved of the operating time accumulated so far. For this purpose, the maintenance substitute part management device may be given a function for notifying that operating time.
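 The table update and the operating-time notification described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in the publication: the names `PartRecord`, `PartsTable`, `move_psu`, the slot keys, and the use of hours as the unit are all hypothetical, and the component management table 33b is modeled as a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PartRecord:
    """One row of a hypothetical component management table (33b)."""
    part_id: str
    redundant: bool
    operating_hours: float = 0.0


@dataclass
class PartsTable:
    """Per-blade-server table, keyed by slot name (assumed layout)."""
    server: str
    slots: Dict[str, PartRecord] = field(default_factory=dict)

    def on_removed(self, slot: str) -> PartRecord:
        # The management blade notices the part leaving the bus.
        return self.slots.pop(slot)

    def on_inserted(self, slot: str, part_id: str, redundant: bool,
                    carried_hours: Optional[float] = None) -> None:
        # Without a notification the new entry starts at 0 hours;
        # with one, the hours accumulated elsewhere are preserved.
        hours = 0.0 if carried_hours is None else carried_hours
        self.slots[slot] = PartRecord(part_id, redundant, hours)


def move_psu(src: PartsTable, src_slot: str, dst: PartsTable, dst_slot: str,
             notify_hours: bool = True) -> None:
    """Move a power supply between servers and update both tables."""
    record = src.on_removed(src_slot)
    carried = record.operating_hours if notify_hours else None
    # Assumption: the moved unit goes into service at the destination,
    # so it is no longer recorded as redundant there.
    dst.on_inserted(dst_slot, record.part_id, redundant=False,
                    carried_hours=carried)
```

 With `notify_hours=False` the destination entry starts at 0, which corresponds to the default behavior the text describes; with `notify_hours=True` the accumulated time follows the part.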
 In this embodiment, among the power supply devices 12 mounted on other blade servers 1, only the redundant power supply devices 12 are candidates for selection. When no substitutable redundant power supply device 12 exists, however, power supply devices 12 that are in operation may also be made candidates, because the operating rate to be guaranteed can differ between blade servers 1. The power supply device 12 to be mounted on the blade server 1 in the power shortage state may thus be selected from a blade server 1 whose guaranteed operating rate is lower than that of the blade server 1 in the power shortage state. This means that, depending on the situation, a power supply device 12 operating in another blade server 1 is regarded as a spare power supply device 12.
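 The fallback described in this paragraph can be sketched as follows. The names `ServerStatus` and `choose_donor`, the representation of the guaranteed operating rate as a number between 0 and 1, and the rule of preferring the server with the lowest guaranteed rate are assumptions made for illustration; the text only requires that the donor's guaranteed operating rate be lower than that of the server in the power shortage state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServerStatus:
    name: str
    guaranteed_rate: float  # operating rate to be guaranteed (assumed 0..1)
    spare_psus: int         # redundant units not needed for operation
    operating_psus: int     # units currently supplying power


def choose_donor(failed: ServerStatus,
                 others: List[ServerStatus]) -> Optional[ServerStatus]:
    """Pick the server from which a power supply should be taken."""
    # First choice: any other server that holds a redundant power supply.
    with_spare = [s for s in others if s.spare_psus > 0]
    if with_spare:
        return with_spare[0]
    # Fallback: treat an operating unit as a spare, but only on servers
    # whose guaranteed operating rate is lower than the failed server's.
    lower = [s for s in others
             if s.operating_psus > 0
             and s.guaranteed_rate < failed.guaranteed_rate]
    if not lower:
        return None
    # Assumption: prefer the server with the lowest guaranteed rate.
    return min(lower, key=lambda s: s.guaranteed_rate)
```

 Returning `None` covers the case in which no server qualifies, where recovery would have to wait for an ordered power supply device.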
 In this embodiment, the maintenance substitute part management device is implemented on the management blade 13 of a single blade server 1-1, but it can also be implemented on a server blade 11. Because a management blade 13 may itself fail, it is desirable that the program implementing the maintenance substitute part management device (the component management program) be executable by the management blade 13 of any blade server 1.

Claims (8)

  1.  A computer system including a plurality of computers each equipped with a power supply device, the computer system comprising:
     a storage unit that stores state information indicating the availability of spare power supply devices mounted on one or more of the plurality of computers; and
     a specifying unit that, when a failure occurs in the power supply device of any one of the plurality of computers, refers to the state information stored in the storage unit and specifies another computer, different from the computer in which the failure occurred, that is equipped with a power supply device capable of substituting for the failed power supply device.
  2.  The computer system according to claim 1, further comprising
     a notification unit that gives notification of the other computer specified by the specifying unit.
  3.  The computer system according to claim 1 or 2, wherein
     the specifying unit further refers to at least one of information indicating a priority order among the computers for use of the spare power supply device as a substitute and information indicating the power consumption of each computer, to specify the other computer equipped with a power supply device capable of substituting for the failed power supply device.
  4.  The computer system according to claim 3, wherein
     the specifying unit, when referring to both the information indicating the priority order and the information indicating the power consumption, extracts candidate other computers based on the information indicating the priority order and, when two or more other computers are extracted as candidates, selects one of them using the information indicating the power consumption.
  5.  A maintenance substitute part management device capable of communicating with a plurality of computers each equipped with a power supply device, the maintenance substitute part management device comprising:
     a storage unit that stores state information indicating the availability of spare power supply devices mounted on one or more of the plurality of computers; and
     a specifying unit that, when a failure occurs in the power supply device of any one of the plurality of computers, refers to the state information stored in the storage unit and specifies another computer, different from the computer in which the failure occurred, that is equipped with a power supply device capable of substituting for the failed power supply device.
  6.  A maintenance substitute part management method executed by a data processing device capable of communicating with a plurality of computers each equipped with a power supply device, the method comprising:
     when a failure occurs in the power supply device of a computer, among the plurality of computers, that has no spare power supply device, checking for spare power supply devices present in the other computers; and
     selecting, from among the spare power supply devices identified by the checking, the spare power supply device to be used as the maintenance substitute part for the failed power supply device.
  7.  The maintenance substitute part management method according to claim 6, wherein
     the data processing device gives notification of the selected spare power supply device.
  8.  A program that causes a data processing device capable of communicating with a plurality of computers each equipped with a power supply device to execute a process comprising:
     when a failure occurs in the power supply device of a computer, among the plurality of computers, that has no spare power supply device, checking the state of spare power supply devices present in the other computers; and
     selecting, based on the result of checking the state of the spare power supply devices, the spare power supply device to be used in place of the failed power supply device.
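
 As an informal illustration of the selection recited in claims 1, 3 and 4, the following sketch filters on the stored state information, narrows the candidates by priority, and breaks ties by power consumption. It is not part of the claims. The names `Candidate` and `specify_donor`, the convention that a smaller priority value marks the preferred donor, the choice of the lowest power consumption as the tie-break, and watts as the unit are all assumptions; the claims do not fix any of them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    has_free_spare: bool  # state information: a spare unit is available
    priority: int         # assumed: smaller value = preferred donor
    power_watts: float    # assumed unit for power consumption


def specify_donor(failed_name: str,
                  computers: List[Candidate]) -> Optional[Candidate]:
    """Specify the other computer whose spare power supply is used."""
    # Only computers other than the failed one, with a free spare, qualify.
    eligible = [c for c in computers
                if c.name != failed_name and c.has_free_spare]
    if not eligible:
        return None
    # Extract candidates by priority first.
    best = min(c.priority for c in eligible)
    shortlisted = [c for c in eligible if c.priority == best]
    if len(shortlisted) == 1:
        return shortlisted[0]
    # Two or more candidates remain: decide by power consumption
    # (assumption: the computer drawing the least power is chosen).
    return min(shortlisted, key=lambda c: c.power_watts)
```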
PCT/JP2011/073148 2011-10-06 2011-10-06 Computer system, management device, management method, and program WO2013051145A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2011/073148 WO2013051145A1 (en) 2011-10-06 2011-10-06 Computer system, management device, management method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2011/073148 WO2013051145A1 (en) 2011-10-06 2011-10-06 Computer system, management device, management method, and program

Publications (1)

Publication Number Publication Date
WO2013051145A1 true WO2013051145A1 (en) 2013-04-11

Family

ID=48043336

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2011/073148 WO2013051145A1 (en) 2011-10-06 2011-10-06 Computer system, management device, management method, and program

Country Status (1)

Country Link
WO (1) WO2013051145A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001142579A (en) * 1999-11-16 2001-05-25 Fujitsu Ltd Power source controller, controller, information processor provided with it and recording medium
JP2005025745A (en) * 2003-07-02 2005-01-27 Hewlett-Packard Development Co Lp Apparatus and method for real-time power distribution management
JP2008108101A (en) * 2006-10-26 2008-05-08 Nec Computertechno Ltd Power supply control system and method, electronic apparatus, and program
JP2009169874A (en) * 2008-01-21 2009-07-30 Hitachi Ltd Blade server system provided with power supply path between systems

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5991442B2 (en) * 2013-10-09 2016-09-14 富士通株式会社 Information processing apparatus, management apparatus, and component management method
WO2015104841A1 (en) * 2014-01-10 2015-07-16 株式会社 日立製作所 Redundant system and method for managing redundant system
JPWO2015104841A1 (en) * 2014-01-10 2017-03-23 株式会社日立製作所 MULTISYSTEM SYSTEM AND MULTISYSTEM SYSTEM MANAGEMENT METHOD
AU2014376751B2 (en) * 2014-01-10 2017-07-27 Hitachi, Ltd. Redundant system and method for managing redundant system
US10055004B2 (en) 2014-01-10 2018-08-21 Hitachi, Ltd. Redundant system and redundant system management method

Similar Documents

Publication Publication Date Title
US20090243846A1 (en) Electronic apparatus system having a plurality of rack-mounted electronic apparatuses, and method for identifying electronic apparatus in electronic apparatus system
CN103607297A (en) Fault processing method of computer cluster system
JP4695705B2 (en) Cluster system and node switching method
JP5858144B2 (en) Information processing system, failure detection method, and information processing apparatus
CN107153595B (en) Fault detection method and system for distributed database system
JP6007522B2 (en) Cluster system
CN105227385A (en) A kind of method and system of troubleshooting
US9231779B2 (en) Redundant automation system
CN110417600A (en) Node switching method, device and the computer storage medium of distributed system
CN103441987A (en) Method and device for managing dual-computer firewall system
US8510402B2 (en) Management of redundant addresses in standby systems
CN105812161A (en) Controller fault backup method and system
US20130205162A1 (en) Redundant computer control method and device
JP6253956B2 (en) Network management server and recovery method
US8150958B2 (en) Methods, systems and computer program products for disseminating status information to users of computer resources
JP5056504B2 (en) Control apparatus, information processing system, control method for information processing system, and control program for information processing system
WO2013051145A1 (en) Computer system, management device, management method, and program
CN104243304A (en) Data processing method, device and system of locally-connected topological structure
JP2010244463A (en) Event detection control method and system
JP4806382B2 (en) Redundant system
JP2012059193A (en) Monitoring control system, monitoring control method used therefor, and monitoring control method
JP2013025765A (en) Master/slave system, control device, master/slave switching method and master/slave switching program
JP2009026182A (en) Program execution system and execution device
JP2016009413A (en) Network monitoring system and network monitoring method
KR20140140719A (en) Apparatus and system for synchronizing virtual machine and method for handling fault using the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11873663

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11873663

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP