WO2015037103A1 - Server system, computer system, server system management method, and computer-readable storage medium - Google Patents
- Publication number
- WO2015037103A1 (PCT/JP2013/074725)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- server
- spare
- active
- information
- active server
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Failover techniques
- G06F11/2033—Failover techniques switching over of hardware resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Failover techniques
- G06F11/2025—Failover techniques using centralised failover control functionality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
- G06F11/0754—Error or fault detection not based on redundancy by exceeding limits
- G06F11/076—Error or fault detection not based on redundancy by exceeding limits by exceeding a count or rate limit, e.g. word- or bit count limit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2035—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2048—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share neither address space nor persistent storage
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/805—Real-time
Definitions
- The present invention relates to a server system, a computer system, a server system management method, and a computer-readable storage medium. For example, it relates to a technique for recovering a server from a failure by switching servers when a failure occurs in a server of a computer system.
- When a failure occurs, the boot disk (logical unit) used by the failed active server is started on a spare server that is not in operation, and the business of the active server is taken over by the spare server.
- H/W: Hardware
- When the method disclosed in Patent Document 1 is used, it is not possible to transfer from a physical computer to a virtual computer. For this reason, when the active server is operating its system on a physical computer, the business cannot be taken over even if the method of Patent Document 1 is used.
- When the business is taken over while the H/W configurations differ, the following problems occur. First, when the I/O configurations differ, the I/O boot order changes before and after the takeover, so the logical unit cannot be read correctly and the OS cannot be started. Further, even when the OS does start, I/O recognition work is required on the OS according to the number of I/Os. Furthermore, when the number of CPU sockets or the number of cores increases, software license restrictions may apply.
- H/W concealment processing for matching the configurations cannot be performed mechanically. Further, if takeover is performed by simply matching the physical configurations, H/W resources cannot be utilized to the maximum and the takeover may be inefficient. Concealing modules on a spare server whose configuration differs from that of the active server, instead of taking over to a spare server that already has the same configuration, wastes H/W resources compared with taking over to a server with a matching configuration. In addition, when the active servers that one spare server can take over are concentrated on it, taking over one active server by matching configurations may leave another active server without a takeover destination.
- the present invention has been made in view of such a situation, and provides a technique for realizing takeover that efficiently uses H / W resources in consideration of the state of other active servers.
- The physical configuration recognized by software is made the same for the portions where the H/W configuration needs to match between the active server and the spare server, and the business is then taken over.
- The server system includes at least one active server in operation that processes a business, at least one spare server that takes over the business of a failed active server when the active server fails, and a local management computer that monitors the active server and the spare server and controls server switching.
- This local management computer includes a memory that stores at least hardware configuration matching policy information indicating the hardware configuration conditions under which a server can take over, and a processor that executes a process for assigning a spare server as the takeover destination of the active server's business. The processor of the local management computer acquires hardware configuration information from the active servers and the spare servers and, based on the acquired hardware configuration information, obtains the hardware configuration for each combination of active server and spare server.
- the embodiment of the present invention may be implemented by software running on a general-purpose computer, or may be implemented by dedicated hardware or a combination of software and hardware.
- In the following, each piece of information of the present invention is described in "table" format; however, the information need not be expressed as a table and may be expressed in other data structures such as a list, a DB, or a queue. Therefore, "table", "list", "DB", "queue", etc. may be referred to simply as "information" to indicate independence from the data structure.
- Each process in the embodiment of the present invention is described with the various control units as the subject (operating entity). However, the operations of the various control units can also be implemented as programs, and the determined processing is performed by a processor executing these programs using a memory and a communication port (communication control device). For this reason, the description may also be made with the processor as the subject.
- a part or all of the program may be realized by dedicated hardware, or may be modularized.
- Various programs may be installed by a program distribution server or a storage medium.
- FIG. 1 is a block diagram showing the overall configuration of a computer system according to an embodiment of the present invention.
- The computer system 1 includes an active server 100 that is currently in operation, a spare server 101 that takes over the business of the active server when the active server fails, and an SVP 102 (Service Processor) that monitors the active server 100 and the spare server 101.
- The computer system 1 also includes a management computer (global management computer) having a management program 103 for monitoring the active server 100, the spare server 101, and the SVP 102 in the computer system 1.
- the active server 100, the spare server 101, and the SVP 102 constitute, for example, one blade server stored in one chassis.
- the management program 103 of the management computer monitors the operations of the active server 100, the spare server 101, and the SVP 102 over a plurality of blade servers.
- the active server 100 and the spare server 101 include a BMC 110 (Baseboard Management Controller), CPU sockets 120 and 121, CPU cores 130 and 131, a DIMM (Dual Inline Memory Module) 140, and an I / O slot 150.
- the BMC 110 includes a CPU concealment control unit 111, a DIMM concealment control unit 112, and an I / O concealment control unit 113. Each control unit may be configured by a program as described above.
- The SVP 102 includes the N+M control unit 160, which is a control unit for taking over business; an H/W configuration table 165 (see FIG. 4) holding H/W configuration information; an H/W configuration matching policy storage unit (storage area) 166 (see FIG. 5) that defines the H/W (hardware) configuration matching policy; a configuration matching information table 167 (see FIG. 7), which is the spare-server setting information for matching the configurations of the active server and the spare server; and an assignment change policy storage unit (storage area) 168 for storing the assignment change policy.
- N in the N+M control unit 160 indicates the number of active servers, and M indicates the number of spare servers. The N+M control unit may be referred to simply as the control unit.
- The N+M control unit 160 includes a Conf acquisition control unit 161 that acquires the setting information (the BMC information of each server) necessary for takeover, a takeover control unit 162 that controls server switching, an H/W configuration acquisition control unit 163 that acquires the H/W configuration of each server, and an H/W configuration match control unit 164 that can conceal server H/W.
- Each control unit may be configured by a program as described above.
- the active server 100 is a server on which an OS is started and a business is operating.
- the spare server 101 is a standby server for taking over work in response to a failure of the active server.
- the spare server 101 may have a different H / W configuration from the active server, and is not necessarily in operation.
- The modules of the active server and the spare server in this embodiment are the CPUs 120 and 121, the DIMM 140, and the I/O slot 150. However, as long as the BMC 110 has concealment control for a module and the module can be concealed, the type of module does not matter.
- the H / W configuration acquisition control unit 163 of the N + M control unit 160 of the SVP 102 acquires H / W configuration information from the active server 100, the spare server 101, and the management program 103, and creates an H / W configuration table 165.
- the acquired H / W configuration information is H / W information necessary for determining and executing H / W degeneration (concealment), PCI slot blocking, and the like.
- A server system in which the active server 100, the spare server 101, and the SVP 102 are mounted has been described. However, it is also possible to execute takeover processing with an active server 100 and a spare server 101 in another system unit (a blade server in another chassis) through the management program 103.
- The H/W configuration match control unit 164 reads the H/W configuration table 165 and the H/W configuration matching policy 166 indicating the configuration match criteria for the active server 100 and the spare server 101. Then, the H/W configuration match control unit 164 creates the configuration matching information table 167, which is the setting information for matching the configurations of the active server 100 and the spare server 101, based on the read information.
- the H / W configuration match control unit 163 transmits the configuration match information in the configuration match information table 167 to the BMC 110 of the spare server 101.
- The BMC 110 performs concealment control based on the received configuration match information, matching the H/W configurations of the active server 100 and the spare server 101. The business takeover processing is then executed in a state where the H/W configurations of the active server 100 and the spare server 101 match. However, takeover may be possible even if the H/W configurations do not match; for example, a change in the number of CPU cores can be handled by the OS.
- Which parts do not need configuration matching is defined by the H/W configuration matching policy 166.
- FIG. 2 is a diagram for explaining the procedure of the takeover destination server determination process executed when the takeover function is valid.
- The takeover destination server determination process (spare server allocation process) is executed before a failure occurs in the active server. This is in order to avoid a situation in which, after a failure, the spare server assignment cannot be determined correctly and the business cannot be taken over.
- The SVP 102 executes the H/W information acquisition process 200 for the active server 100 and the spare server 101 using its H/W configuration acquisition control unit.
- The active server 100 and the spare server 101 do not necessarily have to be housed in the same chassis (the managing entity need not be the same SVP); the H/W information of an active server 100 and a spare server 101 in a different chassis may be acquired through the management program 103. For this reason, either the active server 100 or the spare server 101 may reside in another chassis at takeover.
- The SVP 102 executes the H/W configuration table creation process 201 using the H/W configuration acquisition control unit, based on the acquired H/W configuration information.
- The SVP 102 reads the H/W configuration table 165 and the H/W configuration match policy 166 using the H/W configuration match control unit 164, and creates the information necessary for concealing modules of the spare server (configuration matching information) so as to match the H/W configurations of the active server 100 and the spare server 101 (configuration matching process 202; see FIG. 6 for details).
- A module is an H/W part of a computer, such as a CPU, DIMM, or I/O; it is not limited to these, but it must be concealable. More specifically, referring to the H/W configuration matching policy 166 (FIG. 5) and paying attention to the H/W parts whose configurations must match, the configurations of the spare server and the active server in FIG. 4 are compared to determine whether they match.
- The SVP 102 reads the server status information of the active server 100 and the spare server 101 and the configuration match information 167, and creates information that serves as a criterion for determining which spare server 101 is allocated to which active server 100 (allocation determination information table update process 203; see FIG. 8 for details).
- The server status information acquired from each server is additional information necessary for determining allocation, such as the number of DIMM ECC errors and the CPU operating rate. Note that when information on another chassis is necessary, the information may be acquired from the management program 103.
- the SVP 102 executes an allocation table initialization process 204 (corresponding to the process of S1206 in FIG. 12).
- the allocation table (see FIG. 13 for the table at the time of initialization) is information indicating to which spare server the active server is to be taken over in advance. In the case of the first allocation, this allocation table initialization process 204 is executed.
- The SVP 102 reads the allocation determination information table (see FIG. 9) and the allocation change policy (see FIG. 10 or 11), and determines to which spare server each active server is allocated (allocation determination process 205; corresponding to the processing of S1207 in FIG. 12).
- the SVP 102 enters a failure notification reception standby state.
- Thereafter, the SVP 102 executes the allocation determination information table update process 203 as needed (periodically), and further executes the allocation determination process 205 after updating the table, thereby determining the allocation in accordance with the current state of the servers (see FIG. 14 or 15 for the updated table).
- The frequency of the update depends on the weight of the processing, but it is performed, for example, once every predetermined time (for example, one hour). Updating the allocation determination information table periodically in this way efficiently avoids the risk of failure, because a server with an increasing number of ECC errors (memory errors) can be preferentially assigned a spare server. In other words, as the computer system 1 operates, the number of ECC errors and the like may change compared with the initial assignment (at the time of the allocation table initialization process), so the update process needs to be executed.
- FIG. 3 is a diagram for explaining a procedure for switching from the active server to the spare server when a failure occurs.
- When an H/W failure 300 occurs in the active server 100, the active server 100 notifies the SVP 102 of the failure log. Upon receiving this notification, the SVP 102 notifies the management program 103 of the failure.
- The management program 103 has the allocation information (allocation table information) of the active server 100 and the spare server 101, and sends a server switching request (N+M switching request) to the SVP 102 that is monitoring the target active server 100 and spare server 101.
- Since the management program 103 has the allocation table (any of FIGS. 13 to 15), it can identify the target active server 100 and spare server 101.
- the SVP 102 manages the allocation table, but the allocation table itself may be held by either the SVP 102 or the management program 103. If allocation information is transmitted from the management program 103 to the SVP 102, the management program 103 can also manage the allocation table.
- The SVP 102 receives the N+M switching request from the management program 103, refers to the allocation table, and transmits the configuration match information to the BMC of the takeover-destination spare server (transmission 301).
- the configuration match information is BMC setting information, and indicates whether or not a specific module is hidden.
- The configuration match information transmitted is only the information relating to the target active server and spare server. For example, when the active server in which the failure occurred is server 3 and the spare server assigned as the takeover destination is server 1, only the information regarding the configuration match between server 1 and server 3 (see FIG. 7) is transmitted.
- The BMC 110 of the spare server 101 executes the H/W configuration concealment process 302 based on the received configuration match information. By executing the concealment process 302, the H/W configuration of the spare server 101 can be matched with that of the active server 100, enabling takeover.
- the spare server 101 notifies the SVP 102 that the concealment process has ended after the H / W concealment process based on the configuration match information.
- the SVP 102 that has received the notification executes an N + M switching process 303 that is a conventional switching process.
- In the configuration matching information, whether or not to conceal each CPU socket, CPU core, DIMM, and I/O slot is described. For example, when concealment of CPU socket 1 is specified, the CPU concealment control unit of the BMC of the target spare server conceals CPU socket 1.
- FIG. 4 is a diagram illustrating a configuration example of the H / W configuration table 165 indicating the H / W configuration of each server.
- the H / W configuration table 165 includes, as configuration items, a server name 400, a server application 401, a module name 402, and mounting information 403 indicating whether each module is mounted or not mounted.
- the server name 400 indicates an identifier unique to the server and may be any unique identifier.
- The usage 401 indicates whether the server is an active server or a spare server.
- The module name 402 indicates a concealable H/W part constituting the computer system, and includes additional information about the module.
- The mounting information 403 indicates whether each module is mounted. It also includes module attribute information used as material for determining configuration match, for example the CPU frequency and the memory capacity.
- FIG. 5 is a diagram illustrating a configuration example of the H / W configuration matching policy table 166 indicating policies for matching H / W configurations.
- the H / W configuration matching policy table 166 includes the active server name 500, the module name 501, and the policy 502 as configuration items.
- the active server name 500 is an identifier unique to the server and is the same as the server name 400 in FIG.
- the module name 501 is information indicating the H / W part that is a policy setting target.
- the policy 502 is information that defines how to match the H / W configuration at the time of takeover.
- In the policy 502, "no concealment" is specified when the module configurations need not be matched, and "configuration match" is specified when they must be matched. As a guideline for policy definition, set "configuration match" for parts such as I/O slots, where takeover cannot proceed normally if the configurations differ, and set "no concealment" for parts such as CPUs, where takeover proceeds normally even if the configuration changes before and after taking over the business.
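For illustration, a policy table in the spirit of FIG. 5 could be held as a simple mapping; the dictionary layout, server names, and the default used for unspecified modules are assumptions for this sketch, not the patent's actual data format.

```python
# Illustrative sketch of the H/W configuration matching policy 502 (FIG. 5).
# One entry per (active server, module): "configuration match" for parts whose
# mismatch would break takeover (e.g. I/O slots, whose boot order must not
# change), "no concealment" for parts the OS tolerates changing (e.g. CPUs).
hw_config_match_policy = {
    ("server 3", "I/O slot 0"): "configuration match",
    ("server 3", "CPU socket 1"): "no concealment",
}

def policy_for(active_server, module, policy=hw_config_match_policy):
    # Assumption: default to "configuration match" (the safe choice)
    # when no policy entry exists for the module.
    return policy.get((active_server, module), "configuration match")
```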
- FIG. 6 is a flowchart for explaining processing (H / W configuration matching processing) for determining a concealment module for matching H / W configurations.
- the H / W configuration match control unit (which may be referred to as an H / W configuration match control program) 164 refers to the H / W configuration table of FIG. 4 and selects one module from the spare server (S601).
- The H/W configuration match control unit 164 determines whether the module selected in S601 is mounted on the active server (S602). If the selected module is mounted, the process proceeds to S606; if not, the process proceeds to S603.
- the H / W configuration match control unit 164 determines whether the H / W match policy in FIG. 5 is a configuration match (S603). When the policy is “configuration match”, the H / W configuration match control unit 164 writes the concealment setting of the selected module in the configuration match information (S604). If the H / W configuration match policy is not configuration match, the H / W configuration match control unit 164 writes the “no concealment” setting (S605).
- If the selected module is also mounted on the active server, the H/W configuration match control unit 164 writes the "no concealment" setting, because the configurations match (S606).
- Next, the H/W configuration match control unit 164 determines whether there is another module in the spare server entry of the H/W configuration table 165 (S607). The processing of S601 to S606 is repeated until all modules mounted on the target spare server have been processed, and then the processing moves to S608.
- In S608, the H/W configuration match control unit 164 selects a module that has no setting in the concealment information of the configuration matching information table 167, that is, a module for which neither "no concealment" nor "concealment" is set. Since the processing of S602 to S607 covers only modules mounted on the spare server, a module mounted only on the active server may remain without a setting.
- the H / W configuration match control unit 164 determines whether the H / W configuration match policy 166 of the module selected in S608 is “no concealment” (S609).
- If the policy is "no concealment", the H/W configuration match control unit 164 writes "no concealment" as the concealment information of the target module (S610).
- Otherwise, the H/W configuration match control unit 164 writes the "cannot take over" setting as the concealment information (S611). This is because the spare server lacks the module, so the configurations cannot be matched.
- The H/W configuration match control unit 164 repeats the processing of S608 to S612 until settings have been made for all modules whose concealment information is not yet set.
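The flow of S601 to S612 above can be sketched as follows; the table layouts, function name, and string constants are illustrative assumptions rather than the patent's actual data format.

```python
# Sketch of the H/W configuration matching process (FIG. 6, S601-S612).
def build_config_match_info(active_modules, spare_modules, policy):
    """active_modules / spare_modules: dict of module name -> mounting info.
    policy: dict of module name -> "configuration match" or "no concealment".
    Returns the concealment settings for one (active, spare) pair."""
    match_info = {}

    # S601-S607: iterate over the modules mounted on the spare server.
    for module in spare_modules:
        if module in active_modules:                 # S602: also on the active server
            match_info[module] = "no concealment"    # S606: configurations match
        elif policy.get(module) == "configuration match":
            match_info[module] = "conceal"           # S604: hide the extra module
        else:
            match_info[module] = "no concealment"    # S605: policy allows mismatch

    # S608-S612: modules mounted only on the active server remain unset.
    for module in active_modules:
        if module not in match_info:
            if policy.get(module) == "no concealment":
                match_info[module] = "no concealment"    # S610
            else:
                match_info[module] = "cannot take over"  # S611: spare lacks module
    return match_info
```

Under this sketch, a spare server carrying an extra CPU socket under a "configuration match" policy would get that socket concealed, while an I/O slot present only on the active server yields "cannot take over".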
- FIG. 7 is a diagram illustrating a configuration example of the configuration matching information table 167 indicating concealment settings for matching H / W configurations.
- the configuration match information table 167 is information created as a result of the H / W configuration match process shown in FIG.
- the configuration match information table 167 manages information to be transmitted to the spare server BMC.
- the BMC that has received the information executes a process of concealing the module based on the concealment information.
- The configuration match information table 167 includes active/spare 700, module name 701, and concealment information 702 to 705 as configuration items.
- The active/spare 700 is information indicating a combination of the identifiers of the active server and the spare server; the same identifiers as the server name 400 (see FIG. 4) and the active server name 500 (see FIG. 5) are used.
- the concealment information 702 to 705 is information indicating the setting of each module. “No concealment” in the concealment information means a setting for not concealing the spare server module. “Hidden” means a setting for hiding the module of the spare server. Further, “impossible to take over” means that the configuration of the active server and the spare server does not match, and therefore cannot be taken over.
- FIG. 8 is a flowchart for explaining a process (assignment determination information table update process) for obtaining information for determining which spare server the active server is assigned to.
- the SVP 102 first selects one combination of the active server and the spare server from which the allocation determination information is acquired (S800), and reads the H / W configuration match information table 167 (see FIG. 7) (S801). Note that the order of the process of S800 and the process of S801 may be reversed.
- the SVP 102 determines whether there is “cannot be taken over” in the concealment information of the H / W configuration match information table 167 for the selected combination of the active server and the spare server (S802). If there is no “cannot be taken over” (No in S802), the process proceeds to S803. If there is “cannot be taken over” (Yes in S802), the process proceeds to S805.
- If takeover is possible, the SVP 102 writes "permitted" in the takeover permission/inhibition column 901 of the allocation determination information table (FIG. 9) (S803), then totals the number of modules to be concealed and writes the value in the concealed module number field 904 of the allocation determination information table (S804).
- For the CPU cores, the value (number of concealed cores / total number of cores) is used. This is because concealing only a single core and concealing an entire CPU should not be counted equally. For the other modules, the plain count of modules to be concealed is used.
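As a rough sketch of this counting rule (the function name and data layout are assumptions for illustration, not taken from the patent):

```python
def concealment_count(concealed_modules, concealed_cores, total_cores):
    """Total of modules to conceal: whole modules (DIMMs, I/O slots,
    CPU sockets) each count as 1, while CPU cores contribute
    concealed_cores / total_cores, so that hiding one core is not
    weighted the same as hiding an entire CPU."""
    count = len(concealed_modules)
    if total_cores > 0:
        count += concealed_cores / total_cores
    return count
```

Under this weighting, concealing one DIMM plus one of four cores yields 1.25 rather than 2.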
- The SVP 102 writes "No", meaning that takeover is not permitted, in the takeover permission/inhibition column 901 of the allocation determination information table (S805).
- The SVP 102 calculates the average CPU frequency increase rate 905, the CPU core number increase rate 906, and the memory capacity increase rate 907 obtained when the spare server takes over from the active server (S806).
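Each of these increase rates can be read as the ratio of the spare server's resource to the active server's. A minimal sketch, assuming the rates are plain ratios (the patent does not give the exact formula):

```python
def increase_rate(active_value, spare_value):
    # Ratio of the spare server's resource to the active server's,
    # e.g. a 16-core spare taking over from an 8-core active server
    # gives a core number increase rate of 2.0.
    if active_value == 0:
        raise ValueError("active server resource must be non-zero")
    return spare_value / active_value
```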
- The SVP 102 calculates the configuration match rate for the combination of the active server and the spare server (S807).
- The modules considered when calculating the configuration match rate are the CPU sockets, the DIMMs, and the I/O slots. Taking FIG. 4 as an example, there are nine modules: CPU sockets 0 to 3, DIMMs 0 and 1, and I/O slots 0 to 2. CPU cores are not counted in the number of modules, because they are considered to be included in the CPU sockets.
- The configuration matching condition is that each module's installed/not-installed state is the same on the active server and the spare server, and that the frequency and the capacity are the same.
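The match rate can then be computed as the percentage of module positions satisfying this condition. A minimal sketch, assuming each module position maps to a spec value (None for "not installed") and that the rate is truncated to an integer percentage, which reproduces the 66 of the FIG. 13 example (6 of 9 positions matching):

```python
def config_match_rate(active_modules, spare_modules):
    """Truncated percentage of module positions whose spec (None for
    'not installed', otherwise a frequency/capacity value) is identical
    on the active server and the spare server."""
    positions = active_modules.keys()
    matched = sum(
        1 for p in positions if active_modules[p] == spare_modules.get(p)
    )
    return int(100 * matched / len(positions))
```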
- The SVP 102 determines whether there is another combination of an active server and a spare server in the H/W configuration match information table 167 (S808). If there is another combination (Yes in S808), the processes of S802 to S807 are repeated until no combination remains. If there is no remaining combination (No in S808), the process proceeds to S809.
- The SVP 102 acquires information on the number of ECC errors of each active server (S809), and then acquires information on the CPU operating rate of each active server (S810). These pieces of information are obtained by the SVP 102 requesting each active server to transmit its current ECC error count and CPU operating rate.
- FIG. 9 is a diagram illustrating a configuration example of an assignment determination information table for managing information for determining which spare server the active server is assigned to. The allocation of the spare server to be taken over for the active server is determined using this allocation determination information table and an allocation change policy described later.
- The allocation determination information table includes, as configuration items, a takeover permission/inhibition 901, a possible total 902 indicating the total of takeover "permitted" entries, a takeover destination candidate number 903, a concealment module number 904, an average CPU frequency increase rate 905, a CPU core number increase rate 906, a memory capacity increase rate 907, an H/W configuration match rate 908, an ECC error count 909, and an operating rate 910.
- the takeover permission / inhibition 901 is information indicating whether it is possible to take over in the H / W configuration.
- the possible total 902 is information indicating the number of active servers that can be taken over based on the spare server.
- the takeover destination candidate number 903 is information indicating how many spare servers can be taken over based on the active server.
- the concealment module number 904 is information indicating the number of modules to be concealed when taking over.
- the average CPU frequency increase rate 905 is information indicating how much the CPU frequency increases after takeover.
- the CPU core number increase rate 906 is information indicating how much the CPU core number increases after the takeover.
- the memory capacity increase rate 907 is information indicating how much the memory capacity increases after the takeover.
- the H / W configuration match rate 908 is information indicating how much the configurations of the active server and the spare server match when not concealed.
- The ECC error count 909 is information indicating the number of DIMM ECC errors, and is used as an index for judging the possibility of a DIMM failure. If the number of ECC errors is large, it is highly likely that a business takeover will be carried out.
- The CPU operating rate 910 is information indicating the usage rate of the CPU. When the CPU usage rate is high, it is used as an index for taking over to a spare server with higher performance.
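The fields 901 to 910 described above can be summarized as one record per active/spare combination. A minimal sketch of such a record (the class and field names are assumptions for illustration; the patent only defines the numbered columns):

```python
from dataclasses import dataclass

@dataclass
class AllocationRow:
    takeover_ok: bool         # 901: takeover permission/inhibition
    possible_total: int       # 902: active servers this spare can absorb
    candidate_count: int      # 903: spares this active server can use
    concealed_modules: float  # 904: number of modules to conceal
    cpu_freq_increase: float  # 905: average CPU frequency increase rate
    core_increase: float      # 906: CPU core number increase rate
    memory_increase: float    # 907: memory capacity increase rate
    hw_match_rate: int        # 908: H/W configuration match rate
    ecc_errors: int           # 909: ECC error count
    cpu_utilization: float    # 910: CPU operating rate
```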
- <Assignment change policy> FIGS. 10 and 11 are diagrams showing examples of the allocation change policy table 168, which indicates policies for the active server allocation method. The policy in FIG. 10 and the policy in FIG. 11 show different policy examples.
- the allocation change policy 168 has a priority 1000, a policy 1001, and policy contents 1002 to 1005 or 1100 and 1101 as configuration items.
- the priority 1000 is information indicating the execution priority of the policy 1001.
- the policy 1001 is information indicating allocation criteria for the active server and the spare server.
- the policy content 1002 indicates the policy content of priority 1 in the policy table of FIG.
- the policy content 1003 indicates the policy content of priority 2 in the policy table of FIG.
- Policy content 1004 indicates the policy content of priority 3 in the policy table of FIG.
- Policy content 1005 indicates the policy content of priority 4 in the policy table of FIG.
- the policy content 1100 indicates the policy content of priority 1 in the policy table of FIG.
- the policy content 1101 indicates the policy content of priority 2 in the policy table of FIG.
- FIG. 12 is a flowchart for explaining an allocation determination process for determining to which spare server the active server is allocated.
- the SVP 102 reads the assignment determination information table (FIG. 9) (S1201), and determines whether there is an assignment table (for example, FIGS. 13 to 15) that has already been created (S1202). If there is no allocation table (No in S1202), the process proceeds to S1205. If there is an allocation table (Yes in S1202), the process proceeds to S1203.
- The SVP 102 creates an allocation table (S1205; corresponding to the allocation table initialization process 204 (see FIG. 2)).
- the SVP 102 assigns the active server to the spare server having the highest configuration matching rate (S1206).
- When the configuration match rates are equal, the active server is assigned to the assignable spare server with the smallest identifier. However, the identifier does not necessarily have to be the smallest; it is sufficient that there is some criterion for choosing a server when the rates are equal.
- S1206 is a process executed only during the first allocation process.
- The SVP 102 refers to the allocation determination information table (FIG. 9) and allocates each active server whose takeover destination candidate number is "1" to the spare server that can take it over (S1203). Since a server with one candidate has only one takeover destination, its allocation destination is determined uniquely.
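This forced-assignment step can be sketched as follows (a minimal illustration assuming a mapping from each active server to its list of takeover candidates; the names are hypothetical):

```python
def allocate_forced(candidates):
    # candidates: active server -> list of spare servers it can fail
    # over to.  An active server with exactly one candidate has its
    # takeover destination determined uniquely (step S1203).
    return {
        active: spares[0]
        for active, spares in candidates.items()
        if len(spares) == 1
    }
```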
- The SVP 102 sets the candidate number X to 2 in order to next allocate the servers whose candidate number is 2 (S1204).
- The SVP 102 determines whether X is larger than the number of spare servers (S1207). If X is larger than the number of spare servers (Yes in S1207), the process ends. If X is equal to or less than the number of spare servers (No in S1207), the process proceeds to S1208.
- The SVP 102 determines whether there is an active server whose takeover destination candidate number equals X (S1208). If there is such a server (Yes in S1208), the process proceeds to S1209. If there is none (No in S1208), the process proceeds to S1215.
- When the allocation cannot be determined by the conditions of the allocation change policy, the SVP 102 allocates the active server to the assignable spare server with the smallest identifier (spare server number (spare server name)) (S1214). The process then proceeds to S1215, 1 is added to the candidate number X, and the processes from S1207 onward are repeated. In the processing of S1214, the identifier does not necessarily have to be the smallest, but the condition that the active server can always be allocated to some spare server must be satisfied.
- The SVP 102 determines whether all of the active servers whose candidate number is X have been allocated to spare servers that can take them over (S1212). If they have been allocated (Yes in S1212), the process proceeds to S1215, 1 is added to the candidate number X, and the processes from S1207 onward are repeated. If not (No in S1212), the process proceeds to S1213.
- In S1213, the SVP 102 selects the next priority (priority (y + 1), where y is the previous priority), and the processes from S1210 onward are repeated for priority (y + 1). In this way, through the processing of S1210 to S1213, the allocation change policy is read in descending order of priority y, and spare servers are allocated to all active servers. When the candidate number X becomes larger than the number of spare servers (Yes in S1207), the process ends.
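The overall shape of this loop, with policies tried in priority order and the S1214 fallback to the smallest assignable identifier, can be sketched as follows (a simplified illustration; policies are modeled as callables that return a spare server or None when they cannot decide, which is an assumption about the interface, not the patent's wording):

```python
def allocate_with_policies(actives, spares, policies):
    """Try each policy in descending priority; a policy returns the
    chosen spare server or None if it cannot decide.  When no policy
    decides, fall back to the assignable spare with the smallest
    identifier (the S1214 fallback)."""
    assignment = {}
    for active in actives:
        chosen = None
        for policy in policies:  # ordered by priority y, y + 1, ...
            chosen = policy(active, spares, assignment)
            if chosen is not None:
                break
        assignment[active] = chosen if chosen is not None else min(spares)
    return assignment
```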
- FIGS. 13 to 15 are diagrams showing configuration examples of allocation tables for managing information as to which spare server the active server is allocated to.
- the information in the allocation table varies depending on the timing of allocation, the allocation determination information table in FIG. 9, and the status of the allocation change policy 168 in FIGS.
- The allocation table includes, as configuration items, an allocation 1300 indicating the spare server allocated to each active server, an ECC error total 1301, and an allocated server number 1302.
- the ECC error total 1301 is information indicating the total number of ECC errors of the active server assigned to the spare server.
- the number of assigned servers 1302 is information indicating the number of active servers assigned to spare servers.
- FIG. 13 shows the result of the initial allocation process (result of S1206).
- FIG. 14 shows the result of the allocation update process according to the allocation change policy of FIG. 10 (results of S1202 to S1215).
- FIG. 15 shows the result of the allocation update process according to the allocation change policy of FIG. 11 (results of S1202 to S1215).
- FIG. 13 (result of the initial allocation process)
- the active server is allocated to the server having the highest configuration match rate 908 in the allocation determination information table in FIG. Regarding the active server 3, the configuration matching rate of the spare server 1 is 66 and the configuration matching rate of the spare server 2 is 44. For this reason, the takeover destination of the active server 3 is the spare server 1. Similar processing is executed for the active servers 4, 5, 6, and 7 to allocate spare servers.
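This initial allocation can be sketched as follows (a minimal illustration assuming the match rates are given as a mapping keyed by (active, spare) pairs; ties break toward the smaller spare identifier, which is one of the tie-break criteria the text allows):

```python
def initial_allocation(match_rates):
    """match_rates: (active, spare) -> configuration match rate 908.
    Each active server is assigned the spare with the highest rate;
    on a tie the smaller spare identifier wins, because max() keeps
    the first of the ascending-sorted candidates."""
    assignment = {}
    for active in sorted({a for a, _ in match_rates}):
        candidates = sorted(s for a, s in match_rates if a == active)
        assignment[active] = max(candidates, key=lambda s: match_rates[(active, s)])
    return assignment
```

With the FIG. 13 example values (66 versus 44 for the active server 3), this picks the spare server 1.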
- FIG. 14 (result of the allocation update process based on the allocation change policy in FIG. 10): FIG. 14 shows the result when the allocation process is performed using the allocation change policy of FIG. 10.
- the active server 6 is assigned to the spare server 2 because there is only one takeover destination. Thereafter, for the active servers 3, 4, 5, and 7, the policy 1002 having the priority 1 of the allocation change policy in FIG. 10 is executed. Servers with an ECC error number exceeding 50 correspond to the active servers 5 and 7 from the ECC error number 909 in the allocation determination table of FIG. Further, since the number of ECC errors is larger in the active server 7, the allocation process is executed before the active server 5. Regarding (1) of the policy 1002, since the total number of ECC errors 1301 is 0 at the stage of assigning the active server 7 that is the takeover source, (2) is executed. The “possible total” 902 of the takeover destination candidates in FIG. 9 is smaller for the spare server 1, so the active server 7 is assigned to the spare server 1.
- Next, (1) of the policy 1002 is executed for the active server 5. Since the active server 7 has already been allocated, the ECC error total 1301 of the spare server 1 is 80 and that of the spare server 2 is 0. Therefore, the active server 5 is assigned to the spare server 2, which has the smaller ECC error total 1301. Since no remaining server has an ECC error count exceeding 50, the allocation process based on the policy 1002 ends.
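The ECC-driven part of this policy can be sketched as follows (a simplified illustration of (1) of the policy 1002; the (2) tie-break via the possible total 902 described above is omitted, and the threshold of 50 and the data layout are assumptions taken from the worked example):

```python
def allocate_by_ecc(ecc_counts, spare_totals, threshold=50):
    """Active servers whose ECC error count exceeds the threshold are
    handled in descending error order; each goes to the spare whose
    accumulated ECC error total 1301 is currently smallest, and its
    errors are added to that spare's total.  spare_totals is updated
    in place."""
    assignment = {}
    over = [a for a, n in ecc_counts.items() if n > threshold]
    for active in sorted(over, key=ecc_counts.get, reverse=True):
        spare = min(spare_totals, key=spare_totals.get)
        assignment[active] = spare
        spare_totals[spare] += ecc_counts[active]
    return assignment
```

With the example counts (80 for the active server 7, 60 for the active server 5, both totals initially 0) this reproduces the outcome described above: the server with more errors is placed first, and the second goes to the other spare.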
- Next, the policy 1003 with priority 2 is executed. The active server 3 corresponds to a server whose operating rate 910 in FIG. 9 exceeds 90%.
- The active server 3 is assigned to a spare server whose CPU core number increase rate is 2.0 or more. The spare server 2 satisfies this condition, so the active server 3 is assigned to the spare server 2.
- the policy 1004 is executed.
- the target of the allocation process is the active server 4.
- the CPU core number increase rate 906 in FIG. 9 for the active server 4 is the same for the spare servers 1 and 2. For this reason, even if the policy 1004 is used, the assignment cannot be determined. Therefore, the execution of the policy 1004 ends.
- In the policy 1005, the allocation is determined by the number of concealment modules. However, the concealment module number 904 in FIG. 9 for the active server 4 is the same for the spare servers 1 and 2, so the allocation cannot be determined and the policy 1005 ends.
- FIG. 15 shows the result when the assignment process is performed using the assignment change policy of FIG.
- In FIG. 15, the active server 6 is assigned to the spare server 1 because it has only one takeover destination. Thereafter, the policy 1100 with priority 1 of the allocation change policy in FIG. 11 is executed. Since this process is the same as in the case of FIG. 14, its description is omitted.
- Next, the processing targets are the active servers 3 and 4. (1) is executed, and each is allocated to the spare server with the smallest number of allocated servers 1302. Since the number of allocated servers of the spare server 1 is 1 and that of the spare server 2 is 2, the active server 3 is allocated to the spare server 1.
- As described above, in this embodiment, the hardware configurations are compared for each combination of an active server and a spare server, and, with reference to the hardware configuration match policy information, whether to conceal part of the hardware configuration and whether takeover is possible are determined for each combination of an active server and a spare server.
- In addition, a configuration match rate indicating the match ratio of the hardware configurations is calculated for each combination of an active server and a spare server. Then, for each combination, the spare server that is the takeover destination of the active server is allocated based on the hardware configuration concealment information, the takeover availability information, and the configuration match rate information.
- the spare server executes a hardware concealment process based on the hardware concealment information and transmits a concealment process completion notification to the SVP (local management computer). Then, the SVP executes a process of switching the active server in which the failure has occurred to the spare server assigned as the takeover destination.
- Further, the allocation of the takeover destination spare server, once determined, is dynamically changed (updated). For example, the already-executed spare server allocation is changed based on the number of ECC errors of the active server. Alternatively, in addition to the number of ECC errors, the information on the CPU operating rate of the active server and the information on the CPU core number increase rate for each combination of the active server and the spare server are used to dynamically change the spare server allocation. In this case, a policy for the allocation change processing can also be prescribed.
- For example, spare server allocation conditions based on the number of ECC errors may be defined, or spare server allocation conditions based on the number of ECC errors, spare server allocation conditions based on the CPU operating rate, and spare server allocation conditions based on the CPU core number increase rate may be defined together.
- a priority indicating the order of consideration for each condition may be set.
- the SVP, active server, and spare server included in one chassis are regarded as one server system (blade server).
- An actual computer system includes a plurality of such server systems, and a global management computer having a management program is provided. This global management computer manages the communication between the SVPs (local management computers) of the plurality of server systems.
- In the above embodiment, a spare server in the same chassis is considered as a takeover candidate, but a spare server housed in a different chassis (server system) may also be a candidate.
- each SVP acquires the hardware configuration information of the active server and the spare server arranged in different server systems via the global management computer.
- Each SVP allocates a spare server in a server system different from its own server system as a takeover destination of the active server in its own server system.
- In this way, a spare server in another server system can also be allocated, so that hardware resources can be utilized more efficiently.
- the present invention can also be realized by software program codes that implement the functions of the embodiments.
- a storage medium in which the program code is recorded is provided to the system or apparatus, and the computer (or CPU or MPU) of the system or apparatus reads the program code stored in the storage medium.
- the program code itself read from the storage medium realizes the functions of the above-described embodiments, and the program code itself and the storage medium storing the program code constitute the present invention.
- As a storage medium for supplying such program code, for example, a flexible disk, a CD-ROM, a DVD-ROM, a hard disk, an optical disk, a magneto-optical disk, a CD-R, a magnetic tape, a nonvolatile memory card, a ROM, and the like are used.
- Further, based on the instructions of the program code, an OS (operating system) or the like running on the computer may perform part or all of the actual processing, and the functions of the above-described embodiments may be realized by that processing.
- Furthermore, the program code may be stored in storage means such as a hard disk or a memory of the system or apparatus, or in a storage medium such as a CD-RW or CD-R, and the computer (or CPU or MPU) of the system or apparatus may read and execute the stored program code when it is used.
- The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of the product are necessarily shown. In practice, all the components may be connected to each other.
Description
… is executed.
FIG. 1 is a block diagram showing the overall configuration of a computer system according to an embodiment of the present invention. The computer system 1 includes active servers 100 currently operating as servers, spare servers 101 that take over the operation of an active server when that active server fails, an SVP 102 (Service Processor, which can also be called a local management computer) that monitors the active servers 100 and the spare servers 101, and a management computer (global management computer) having a management program 103 for monitoring the active servers 100, the spare servers 101, and the SVP 102 in the computer system 1. The active servers 100, the spare servers 101, and the SVP 102 constitute, for example, one blade server housed in one chassis. The management program 103 of the management computer (global management computer) monitors the operation of the active servers 100, the spare servers 101, and the SVPs 102 across a plurality of blade servers.
FIG. 2 is a diagram for explaining the procedure of the takeover destination server determination process executed when the takeover function is enabled. The takeover destination server determination process (spare server allocation process) is executed before a failure occurs in an active server. This is because the behavior of an active server after a failure cannot be known, so the spare server allocation cannot be judged correctly after the fact; it avoids the situation where an allocation attempted after a failure turns out to be unable to take over.
FIG. 3 is a diagram for explaining the procedure of the process of switching from an active server to a spare server when a failure occurs.
FIG. 4 is a diagram showing a configuration example of the H/W configuration table 165 indicating the H/W configuration of each server. By using this H/W configuration table, it is possible to know which H/W configuration information can be concealed.
FIG. 5 is a diagram showing a configuration example of the H/W configuration match policy table 166 indicating the policy for matching H/W configurations.
FIG. 6 is a flowchart for explaining the process of determining the modules to conceal in order to match the H/W configurations (H/W configuration match process).
FIG. 7 is a diagram showing a configuration example of the configuration match information table 167 indicating the concealment settings for matching the H/W configurations. The configuration match information table 167 is information created as a result of the H/W configuration match process shown in FIG. 6. The configuration match information table 167 also manages the information transmitted to the BMC of the spare server. The BMC that receives this information executes the process of concealing modules based on the concealment information.
FIG. 8 is a flowchart for explaining the process of acquiring information for determining which spare server an active server is allocated to (allocation determination information table update process).
FIG. 9 is a diagram showing a configuration example of the allocation determination information table for managing information used to determine which spare server an active server is allocated to. The allocation of the spare server that should take over for each active server is determined using this allocation determination information table and an allocation change policy described later.
FIGS. 10 and 11 are diagrams showing examples of the allocation change policy table 168 indicating policies for the active server allocation method. The policy of FIG. 10 and the policy of FIG. 11 show different policy examples.
FIG. 12 is a flowchart for explaining the allocation determination process for deciding which spare server an active server is allocated to.
FIGS. 13 to 15 are diagrams showing configuration examples of allocation tables that manage information on which spare server each active server is allocated to. The possible values in the allocation table vary depending on the allocation timing and on the states of the allocation determination information table of FIG. 9 and the allocation change policy 168 of FIGS. 10 and 11.
In the initial allocation of FIG. 13, each active server is allocated to the server with the highest configuration match rate 908 in the allocation determination information table of FIG. 9. For the active server 3, the configuration match rate of the spare server 1 is 66 and that of the spare server 2 is 44. Therefore, the takeover destination of the active server 3 is the spare server 1. Similar processing is executed for the active servers 4, 5, 6, and 7 to allocate spare servers.
FIG. 14 shows the result when the allocation process is performed using the allocation change policy of FIG. 10.
FIG. 15 shows the result when the allocation process is performed using the allocation change policy of FIG. 11.
(i) In the embodiment of the present invention, the hardware configurations are compared for each combination of an active server and a spare server, and, with reference to the hardware configuration match policy information, whether to conceal part of the hardware configuration and whether takeover is possible are determined for each combination. In addition, a configuration match rate indicating the match ratio of the hardware configurations is calculated for each combination of an active server and a spare server. Then, based on the concealment information, the takeover availability information, and the configuration match rate information for each combination, a spare server is allocated as the takeover destination of each active server. In this way, even when the H/W configurations of the active server and the spare server differ, a takeover destination spare server that can take over the business without affecting the OS can be determined. In addition, no I/O recognition work on the OS is required after switching, and there is no constraint from licenses tied to CPU sockets/cores. Furthermore, programs running on the OS are not subject to license constraints. By using H/W resources efficiently in this way, availability can be improved by maintaining the number of servers that can be taken over, and also by deciding server allocation with priority given to servers that are likely to fail.
101 ... Spare server
102 ... SVP (Service Processor)
103 ... Management program
110 ... BMC (Baseboard Management Controller)
111 ... CPU concealment control unit
112 ... DIMM concealment control unit
113 ... I/O concealment control unit
120-121 ... CPU sockets
130-131 ... CPU cores
140 ... DIMM
150 ... I/O slot
160 ... N+M control unit
161 ... Conf acquisition control unit
162 ... Takeover control unit
163 ... H/W configuration acquisition control unit
164 ... H/W configuration match control unit
165 ... H/W configuration table
166 ... H/W configuration match policy
167 ... Configuration match information table
168 ... Allocation change policy
Claims (14)
- At least one active server that is in operation and processes business tasks;
at least one spare server prepared to take over the business of a failed active server when the active server fails; and
a local management computer that monitors the active server and the spare server and controls server switching, wherein
the local management computer comprises:
a processor that executes allocation processing of a spare server as the takeover destination of the business of the active server; and
a memory that stores at least hardware configuration match policy information indicating conditions on the hardware configuration that enable server takeover, and
the processor executes:
processing of acquiring hardware configuration information from each of the active server and the spare server;
processing of comparing the hardware configurations of each combination of the active server and the spare server based on the acquired hardware configuration information, and determining, with reference to the hardware configuration match policy information read from the memory, whether to conceal part of the hardware configuration and whether takeover is possible for each combination of the active server and the spare server;
processing of calculating, for each combination of the active server and the spare server, a configuration match rate indicating the match ratio of the hardware configurations; and
processing of allocating a spare server as the takeover destination of the active server based on the information on hardware configuration concealment, the information on takeover availability, and the information on the configuration match rate for each combination of the active server and the spare server.
A server system characterized by executing the above. - In claim 1,
the processor further executes:
processing of acquiring information on the number of ECC errors of the active server; and
processing of changing the already-executed spare server allocation based on the acquired number of ECC errors and dynamically allocating spare servers to active servers.
A server system characterized by executing the above. - In claim 2,
the processor further executes:
processing of acquiring information on the CPU operating rate of the active server and information on the CPU core number increase rate for each combination of the active server and the spare server; and
processing of dynamically allocating spare servers to active servers using the information on the CPU operating rate and the CPU core number increase rate in addition to the number of ECC errors.
A server system characterized by executing the above. - In claim 2,
the memory further stores allocation change policy information defining at least conditions for spare server allocation based on the number of ECC errors, and
the processor reads the allocation change policy information from the memory and executes the processing of dynamically allocating spare servers to active servers. A server system characterized by the above. - In claim 3,
the memory further stores allocation change policy information defining conditions for spare server allocation based on the number of ECC errors, conditions for spare server allocation based on the CPU operating rate, and conditions for spare server allocation based on the CPU core number increase rate,
a consideration priority is set for each allocation condition in the allocation change policy information, and
the processor reads the allocation change policy information from the memory, examines the allocation change policy information according to the consideration priority, and executes the processing of dynamically allocating spare servers to active servers. A server system characterized by the above. - In claim 1,
the processor further executes:
processing of transmitting, in response to a failure notification for any of the active servers, the information on hardware concealment to the spare server allocated as the takeover destination of the failed active server;
processing of receiving, from the spare server, a completion notification of the hardware concealment processing executed based on the information on hardware concealment; and
processing of switching the failed active server to the spare server allocated as the takeover destination.
A server system characterized by executing the above. - A computer system comprising a plurality of the server systems according to claim 1, and
a global management computer that manages the plurality of server systems, wherein
the global management computer manages communication between the local management computers in the plurality of server systems, thereby enabling each local management computer to acquire the hardware configuration information of the active servers and the spare servers arranged in different server systems, and
the local management computer allocates, as the takeover destination of an active server in its own server system, a spare server in a server system different from its own server system. A computer system characterized by the above. - A method for managing a server system that includes at least one active server that is in operation and processes business tasks, at least one spare server prepared to take over the business of a failed active server when the active server fails, and a local management computer that monitors the active server and the spare server and controls server switching, wherein
the local management computer comprises a processor that executes allocation processing of a spare server as the takeover destination of the business of the active server, and a memory that stores at least hardware configuration match policy information indicating conditions on the hardware configuration that enable server takeover, and
the management method comprises:
a step in which the processor acquires hardware configuration information from each of the active server and the spare server;
a step in which the processor compares the hardware configurations of each combination of the active server and the spare server based on the acquired hardware configuration information, and determines, with reference to the hardware configuration match policy information read from the memory, whether to conceal part of the hardware configuration and whether takeover is possible for each combination of the active server and the spare server;
a step in which the processor calculates, for each combination of the active server and the spare server, a configuration match rate indicating the match ratio of the hardware configurations; and
a step in which the processor allocates a spare server as the takeover destination of the active server based on the information on hardware configuration concealment, the information on takeover availability, and the information on the configuration match rate for each combination of the active server and the spare server.
A server system management method characterized by including the above. - In claim 8, the method further comprising:
a step in which the processor acquires information on the number of ECC errors of the active server; and
a step in which the processor changes the already-executed spare server allocation based on the acquired number of ECC errors and dynamically allocates spare servers to active servers.
A server system management method characterized by including the above. - In claim 9, the method further comprising:
a step in which the processor acquires information on the CPU operating rate of the active server and information on the CPU core number increase rate for each combination of the active server and the spare server; and
a step in which the processor dynamically allocates spare servers to active servers using the information on the CPU operating rate and the CPU core number increase rate in addition to the number of ECC errors.
A server system management method characterized by including the above. - In claim 9,
the memory further stores allocation change policy information defining at least conditions for spare server allocation based on the number of ECC errors, and
in the step of dynamically allocating spare servers to active servers, the processor reads the allocation change policy information from the memory and dynamically allocates spare servers to active servers. A server system management method characterized by the above. - In claim 10,
the memory further stores allocation change policy information defining conditions for spare server allocation based on the number of ECC errors, conditions for spare server allocation based on the CPU operating rate, and conditions for spare server allocation based on the CPU core number increase rate,
a consideration priority is set for each allocation condition in the allocation change policy information, and
in the step of dynamically allocating spare servers to active servers, the processor reads the allocation change policy information from the memory, examines the allocation change policy information according to the consideration priority, and dynamically allocates spare servers to active servers. A server system management method characterized by the above. - In claim 8, the method further comprising:
a step in which the processor, in response to a failure notification for any of the active servers, transmits the information on hardware concealment to the spare server allocated as the takeover destination of the failed active server;
a step in which the spare server allocated as the takeover destination executes hardware concealment processing based on the information on hardware concealment;
a step in which the processor receives a completion notification of the concealment processing from the spare server; and
a step in which the processor switches the failed active server to the spare server allocated as the takeover destination.
A server system management method characterized by including the above. - A computer-readable storage medium storing a program for causing a processor of a local management computer to execute allocation processing of a spare server as the takeover destination of the business of an active server, in a server system that includes at least one active server that is in operation and processes business tasks, at least one spare server prepared to take over the business of a failed active server when the active server fails, and the local management computer that monitors the active server and the spare server and controls server switching, wherein
the program causes the processor to execute:
processing of acquiring hardware configuration information from each of the active server and the spare server;
processing of comparing the hardware configurations of each combination of the active server and the spare server based on the acquired hardware configuration information, and determining, with reference to hardware configuration match policy information read from a memory that stores at least the hardware configuration match policy information indicating conditions on the hardware configuration that enable server takeover, whether to conceal part of the hardware configuration and whether takeover is possible for each combination of the active server and the spare server;
processing of calculating, for each combination of the active server and the spare server, a configuration match rate indicating the match ratio of the hardware configurations; and
processing of allocating a spare server as the takeover destination of the active server based on the information on hardware configuration concealment, the information on takeover availability, and the information on the configuration match rate for each combination of the active server and the spare server.
A computer-readable storage medium characterized by containing program code for executing the above.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/777,883 US9792189B2 (en) | 2013-09-12 | 2013-09-12 | Server system, computer system, method for managing server system, and computer-readable storage medium |
JP2015536369A JP6063576B2 (ja) | 2013-09-12 | 2013-09-12 | サーバシステム、計算機システム、サーバシステムの管理方法、及びコンピュータ読み取り可能な記憶媒体 |
PCT/JP2013/074725 WO2015037103A1 (ja) | 2013-09-12 | 2013-09-12 | サーバシステム、計算機システム、サーバシステムの管理方法、及びコンピュータ読み取り可能な記憶媒体 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2013/074725 WO2015037103A1 (ja) | 2013-09-12 | 2013-09-12 | サーバシステム、計算機システム、サーバシステムの管理方法、及びコンピュータ読み取り可能な記憶媒体 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015037103A1 true WO2015037103A1 (ja) | 2015-03-19 |
Family
ID=52665245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2013/074725 WO2015037103A1 (ja) | 2013-09-12 | 2013-09-12 | サーバシステム、計算機システム、サーバシステムの管理方法、及びコンピュータ読み取り可能な記憶媒体 |
Country Status (3)
Country | Link |
---|---|
US (1) | US9792189B2 (ja) |
JP (1) | JP6063576B2 (ja) |
WO (1) | WO2015037103A1 (ja) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6579995B2 (ja) * | 2016-04-26 | 2019-09-25 | 三菱電機株式会社 | 静観候補特定装置、静観候補特定方法及び静観候補特定プログラム |
CN113032229B (zh) * | 2021-02-24 | 2022-09-20 | 山东英信计算机技术有限公司 | 一种java性能测试方法、系统及介质 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006163963A (ja) * | 2004-12-09 | 2006-06-22 | Hitachi Ltd | ディスク引き継ぎによるフェイルオーバ方法 |
JP2008097276A (ja) * | 2006-10-11 | 2008-04-24 | Hitachi Ltd | 障害回復方法、計算機システム及び管理サーバ |
JP2009140194A (ja) * | 2007-12-06 | 2009-06-25 | Hitachi Ltd | 障害回復環境の設定方法 |
JP5208324B1 (ja) * | 2012-02-20 | 2013-06-12 | 三菱電機株式会社 | 情報システム管理装置及び情報システム管理方法及びプログラム |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4842210B2 (ja) | 2007-05-24 | 2011-12-21 | 株式会社日立製作所 | フェイルオーバ方法、計算機システム、管理サーバ及び予備サーバの設定方法 |
US8121966B2 (en) * | 2008-06-05 | 2012-02-21 | International Business Machines Corporation | Method and system for automated integrated server-network-storage disaster recovery planning |
JP4648447B2 (ja) * | 2008-11-26 | 2011-03-09 | 株式会社日立製作所 | 障害復旧方法、プログラムおよび管理サーバ |
JP4727714B2 (ja) * | 2008-12-05 | 2011-07-20 | 株式会社日立製作所 | サーバのフェイルオーバの制御方法及び装置、並びに計算機システム群 |
US8112657B2 (en) * | 2010-06-14 | 2012-02-07 | At&T Intellectual Property I, L.P. | Method, computer, and computer program product for hardware mapping |
-
2013
- 2013-09-12 JP JP2015536369A patent/JP6063576B2/ja not_active Expired - Fee Related
- 2013-09-12 US US14/777,883 patent/US9792189B2/en active Active
- 2013-09-12 WO PCT/JP2013/074725 patent/WO2015037103A1/ja active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006163963A (ja) * | 2004-12-09 | 2006-06-22 | Hitachi Ltd | ディスク引き継ぎによるフェイルオーバ方法 |
JP2008097276A (ja) * | 2006-10-11 | 2008-04-24 | Hitachi Ltd | 障害回復方法、計算機システム及び管理サーバ |
JP2009140194A (ja) * | 2007-12-06 | 2009-06-25 | Hitachi Ltd | 障害回復環境の設定方法 |
JP5208324B1 (ja) * | 2012-02-20 | 2013-06-12 | 三菱電機株式会社 | 情報システム管理装置及び情報システム管理方法及びプログラム |
Also Published As
Publication number | Publication date |
---|---|
US20160266987A1 (en) | 2016-09-15 |
JP6063576B2 (ja) | 2017-01-18 |
JPWO2015037103A1 (ja) | 2017-03-02 |
US9792189B2 (en) | 2017-10-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11438411B2 (en) | Data storage system with redundant internal networks | |
US11467732B2 (en) | Data storage system with multiple durability levels | |
US11237772B2 (en) | Data storage system with multi-tier control plane | |
US8473692B2 (en) | Operating system image management | |
US10635551B2 (en) | System, and control method and program for input/output requests for storage systems | |
US11375014B1 (en) | Provisioning of clustered containerized applications | |
US9098466B2 (en) | Switching between mirrored volumes | |
US11182096B1 (en) | Data storage system with configurable durability | |
JP5069732B2 (ja) | 計算機装置、計算機システム、アダプタ承継方法 | |
US11888933B2 (en) | Cloud service processing method and device, cloud server, cloud service system and storage medium | |
CN107710160B (zh) | 计算机和存储区域管理方法 | |
JP4920248B2 (ja) | サーバの障害回復方法及びデータベースシステム | |
US9772785B2 (en) | Controlling partner partitions in a clustered storage system | |
US11405455B2 (en) | Elastic scaling in a storage network environment | |
US20180205612A1 (en) | Clustered containerized applications | |
JP6063576B2 (ja) | サーバシステム、計算機システム、サーバシステムの管理方法、及びコンピュータ読み取り可能な記憶媒体 | |
EP3167372B1 (en) | Methods for facilitating high availability storage services and corresponding devices | |
US20230367503A1 (en) | Computer system and storage area allocation control method | |
US20220215001A1 (en) | Replacing dedicated witness node in a stretched cluster with distributed management controllers | |
US11431552B1 (en) | Zero traffic loss in VLT fabric | |
JP2019174875A (ja) | 記憶システム及び記憶制御方法 | |
JP7087719B2 (ja) | コンピュータシステム | |
CN115562562A (zh) | 基于客户端/服务器架构管理计算系统的方法、设备和程序产品 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13893295 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2015536369 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 14777883 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 13893295 Country of ref document: EP Kind code of ref document: A1 |