CN105808391A

CN105808391A - Method and device for hot replacing CPU nodes

Info

Publication number: CN105808391A
Application number: CN201610204324.6A
Authority: CN
Inventors: 周玉龙; 童元满; 李仁刚
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2016-04-05
Filing date: 2016-04-05
Publication date: 2016-07-27

Abstract

The invention provides a method and device for hot replacing CPU nodes. The method includes: determining the target CPU node on a server whose CPU has failed; moving the application programs in the memory managed by the target CPU node into memory managed by a normal CPU node, so that the normal CPU node runs them; stopping the target CPU node and replacing the faulty CPU in it with a new CPU; and adding the target CPU node, with its CPU replaced, back into the system of the server. The device comprises a determining unit, a moving unit, a replacing unit and an adding unit. The method and device avoid interrupting the services running on the server.

Description

Method and device for hot replacing CPU nodes

Technical field

The present invention relates to the field of computer technology, and in particular to a method and device for hot replacing CPU nodes.

Background art

As high-performance computers, servers are widely used in many fields to process all kinds of services. As service volume and service complexity grow, users demand ever higher server performance, and improving the performance of a single CPU node can no longer meet their requirements for computing speed. Server performance is therefore usually improved by increasing the number of CPU nodes in the server, so as to meet users' requirements for processing speed.

In a server with multiple CPU nodes, the CPU nodes can run services simultaneously, which increases the speed at which the server processes services. Because services may be shared across CPU nodes, when one CPU node in the server fails, that node can no longer run services normally, and other CPU nodes may be unable to run services normally either.

At present, when a CPU node in a server fails, every CPU node on the server must be stopped. After the server is shut down, the faulty CPU is replaced with a new CPU, and then every CPU node on the server is restarted so that services can continue.

With this prior-art approach, a failure of one CPU node requires stopping every CPU node on the server in order to replace the faulty node. This interrupts the services running on the server, causes users great inconvenience, and can even have serious consequences in certain specialized fields.

Summary of the invention

Embodiments of the present invention provide a method and device for hot replacing CPU nodes, which can avoid interrupting the services running on the server.

An embodiment of the present invention provides a method for hot replacing a CPU node, applied to a server including at least two CPU nodes, the method comprising:

determining the target CPU node on the server whose CPU has failed;

moving the application programs in the memory managed by the target CPU node into the memory managed by a normal CPU node, so that they are run by the normal CPU node;

stopping the target CPU node, and replacing the faulty CPU in the target CPU node with a new CPU;

adding the target CPU node, after its CPU has been replaced, to the system of the server.

Preferably, before moving the application programs in the memory managed by the target CPU node into the memory managed by the normal CPU node, the method further comprises:

performing a cache coherency write-back on the memory managed by each CPU node on the server.

Preferably, performing the cache coherency write-back on the memory managed by each CPU node on the server comprises:

sending a write-back instruction to each CPU node in the server in the form of an interrupt, wherein each CPU node, after receiving the write-back instruction, performs a cache coherency write-back on the memory it manages.

Preferably, after stopping the target CPU node, the method further comprises:

sending a maintenance instruction to the normal CPU node through the basic input/output system (BIOS), wherein the normal CPU node, after receiving the maintenance instruction, keeps sending NULL packets to each of its corresponding node controllers, so as to keep the links between the CPUs in the normal CPU node and their corresponding node controllers in a normal state.

Preferably, adding the target CPU node, after its CPU has been replaced, to the system of the server comprises:

powering on the target CPU node again after its CPU has been replaced, initializing the link parameters between each CPU in the target CPU node and each corresponding node controller, and sending an add instruction to the master CPU node in the server in the form of an interrupt, wherein the master CPU node, after receiving the add instruction, initializes the target CPU node and adds the initialized target CPU node to the system of the server.

An embodiment of the present invention further provides a device for hot replacing a CPU node, comprising: a determining unit, a moving unit, a replacing unit and an adding unit;

the determining unit is configured to determine the target CPU node on the server whose CPU has failed;

the moving unit is configured to move the application programs in the memory managed by the target CPU node determined by the determining unit into the memory managed by a normal CPU node, so that they are run by the normal CPU node;

the replacing unit is configured to stop the target CPU node after the moving unit has finished moving the application programs, so that the faulty CPU in the target CPU node can be replaced with a new CPU;

the adding unit is configured to add the target CPU node, after its CPU has been replaced, to the system of the server.

Preferably, the device further comprises a write-back unit;

the write-back unit is configured to perform a cache coherency write-back on the memory managed by each CPU node on the server.

Preferably,

the write-back unit is configured to send a write-back instruction to each CPU node of the server in the form of an interrupt, wherein each CPU node, after receiving the write-back instruction, performs a coherency write-back on the memory it manages.

Preferably, the device further comprises a link maintenance unit;

the link maintenance unit is configured to send a maintenance instruction to the normal CPU node through the basic input/output system (BIOS), wherein the normal CPU node, after receiving the maintenance instruction, keeps sending NULL packets to each of its corresponding node controllers, so as to keep the links between the CPUs in the normal CPU node and their corresponding node controllers in a normal state.

Preferably,

the adding unit is configured to power on the target CPU node again after its CPU has been replaced, initialize the link parameters between each CPU in the target CPU node and each corresponding node controller, and send an add instruction to the master CPU node in the server in the form of an interrupt, wherein the master CPU node, after receiving the add instruction, initializes the target CPU node and adds the initialized target CPU node to the system of the server.

Embodiments of the present invention provide a method and device for hot replacing CPU nodes. After a CPU in the target CPU node fails, the target CPU node is stopped so that the faulty CPU can be replaced, while the other CPU nodes keep running normally. Before the target CPU node is stopped, the application programs running on it are moved to other normal CPU nodes, and after the CPU has been replaced the target CPU node is added back into the server system. In this way, the application programs running on the server are not terminated while the faulty CPU node is replaced, so interruption of the services running on the server is avoided.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for hot replacing a CPU node provided by one embodiment of the present invention;

Fig. 2 is a flowchart of a method for hot replacing a CPU node provided by another embodiment of the present invention;

Fig. 3 is a schematic diagram of a server architecture provided by one embodiment of the present invention;

Fig. 4 is a schematic diagram of a device for hot replacing a CPU node provided by one embodiment of the present invention.

Detailed description of the embodiments

To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

As shown in Fig. 1, an embodiment of the present invention provides a method for hot replacing a CPU node, applied to a server including at least two CPU nodes. The method may include the following steps:

Step 101: determine the target CPU node on the server whose CPU has failed;

Step 102: move the application programs in the memory managed by the target CPU node into the memory managed by a normal CPU node, so that they are run by the normal CPU node;

Step 103: stop the target CPU node, and replace the faulty CPU in the target CPU node with a new CPU;

Step 104: add the target CPU node, after its CPU has been replaced, to the system of the server.

An embodiment of the present invention provides a method for hot replacing a CPU node. After a CPU in the target CPU node fails, the target CPU node is stopped so that the faulty CPU can be replaced, while the other CPU nodes keep running normally. Before the target CPU node is stopped, the application programs running on it are moved to other normal CPU nodes, and after the CPU has been replaced the target CPU node is added back into the server system. In this way, the application programs running on the server are not terminated while the faulty CPU node is replaced, so interruption of the services running on the server is avoided.
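The ordering of steps 101 to 104 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the implementation described by the invention: the `Server` and `CpuNode` classes, their fields, and the rule of picking the first healthy node as the destination are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CpuNode:
    name: str
    faulty: bool = False
    running: bool = True
    apps: list = field(default_factory=list)  # applications in this node's managed memory


@dataclass
class Server:
    nodes: dict  # node name -> CpuNode (hypothetical model of the server)

    def determine_target(self):
        # Step 101: find the node whose CPU has failed.
        return next(n for n in self.nodes.values() if n.faulty)

    def migrate_apps(self, target):
        # Step 102: move the target's applications to a normal, running node.
        dest = next(n for n in self.nodes.values() if n.running and not n.faulty)
        dest.apps.extend(target.apps)
        target.apps.clear()
        return dest

    def stop_and_replace(self, target):
        # Step 103: stop only the target node, then swap in a new CPU.
        if target.apps:
            raise RuntimeError("applications must be moved before the node is stopped")
        target.running = False
        target.faulty = False  # stands in for the physical CPU replacement

    def add_back(self, target):
        # Step 104: bring the repaired node back into the system.
        target.running = True


def hot_replace(server):
    target = server.determine_target()
    dest = server.migrate_apps(target)
    server.stop_and_replace(target)
    server.add_back(target)
    return target, dest


server = Server({
    "node0": CpuNode("node0", faulty=True, apps=["db"]),
    "node1": CpuNode("node1", apps=["web"]),
})
target, dest = hot_replace(server)
print(target.name, dest.name, dest.apps)
```

The check in `stop_and_replace` reflects the point made above: the target node is only stopped once its applications have been moved, so no application is terminated during the replacement.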

In an embodiment of the present invention, before the application programs in the memory managed by the target CPU node are moved into the memory managed by a normal CPU node, a cache coherency write-back is first performed on the memory managed by each CPU node on the server, so that the data in the memory managed by each CPU node is saved. The data in the server's memory is then all up to date, cache coherency of the server system is preserved after the target CPU node is stopped, and the application programs moved to the normal CPU node can run normally.

In an embodiment of the present invention, when the cache coherency write-back is performed on the memory managed by each CPU node, a write-back instruction is sent to each CPU node in the form of an interrupt, and each CPU node, after receiving the write-back instruction, performs a cache coherency write-back on the memory it manages. Using an interrupt ensures that the write-back of the memory managed by each CPU node is completed before the target CPU node is removed, so that the data in the server's memory is up to date and the application programs in the target CPU node's memory can run normally after being moved to other normal CPU nodes.

In an embodiment of the present invention, after the target CPU node is stopped, a maintenance instruction may be sent to the normal CPU nodes through the basic input/output system (BIOS). After receiving the maintenance instruction, a normal CPU node keeps sending NULL packets to each of its corresponding node controllers. This ensures that, while the target CPU node is stopped, the links between the normal CPU node and its corresponding node controllers remain normal, so that after the target CPU node is restored there is no need to re-establish those links by rebooting, which improves the efficiency of replacing the CPU node.

In an embodiment of the present invention, after the CPU in the target CPU node has been replaced, the target CPU node is powered on again, the link parameters between the target CPU node and each corresponding node controller are initialized, and an add instruction is sent to the master CPU node of the server in the form of an interrupt. After receiving the add instruction, the master CPU node initializes the target CPU node and adds the initialized target CPU node to the system of the server. This completes the replacement of the target CPU node, which rejoins the server system and runs services.

To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and specific embodiments.

As shown in Fig. 2, an embodiment of the present invention provides a method for hot replacing a CPU node. The method may include the following steps:

Step 201: determine the target CPU node on the server whose CPU has failed.

In an embodiment of the present invention, the server includes multiple CPU nodes, and a corresponding toggle switch is provided for each CPU node. When a CPU in one of the CPU nodes fails, the toggle switch corresponding to that CPU node is flipped, and the server system determines the faulty target CPU node according to which toggle switch was flipped.

For example, as shown in Fig. 3, a server includes 4 CPU nodes, namely Clump0, Clump1, Clump2 and Clump3. Each CPU node includes 4 CPUs, namely CPU0, CPU1, CPU2 and CPU3, where CPU0 is connected to CPU1 and CPU2, and CPU3 is connected to CPU1 and CPU2. Each CPU node corresponds to 2 node controllers, namely CN0 and CN1. One end of CN0 is connected to CPU0 and CPU1 in its CPU node, and the other end is connected to the CN0 of each of the other three CPU nodes. One end of CN1 is connected to CPU2 and CPU3 in its CPU node, and the other end is connected to the CN1 of each of the other three CPU nodes. After one or more CPUs in Clump0 fail, the server administrator flips the toggle switch corresponding to Clump0, and the server system determines from the change in the toggle switch signal that Clump0 is the target CPU node.
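To make the Fig. 3 layout easier to follow, the connections described above can be written down as a set of undirected links, together with a lookup that maps toggle switch states to the target node. This is an illustrative sketch only: the naming scheme `ClumpN.CPUn` / `ClumpN.CNn`, the `build_links` and `target_from_switches` functions, and the boolean switch map are assumptions made here, not part of the original description.

```python
from itertools import combinations

NODES = ["Clump0", "Clump1", "Clump2", "Clump3"]


def build_links():
    """Return the set of undirected links described for Fig. 3."""
    links = set()

    def link(a, b):
        links.add(frozenset((a, b)))

    for node in NODES:
        # Inside a node: CPU0 and CPU3 each connect to CPU1 and CPU2.
        for a, b in [(0, 1), (0, 2), (3, 1), (3, 2)]:
            link(f"{node}.CPU{a}", f"{node}.CPU{b}")
        # CN0 serves CPU0 and CPU1; CN1 serves CPU2 and CPU3.
        for cpu in (0, 1):
            link(f"{node}.CN0", f"{node}.CPU{cpu}")
        for cpu in (2, 3):
            link(f"{node}.CN1", f"{node}.CPU{cpu}")
    # Across nodes: each CN0 connects to the other three CN0s, likewise each CN1.
    for a, b in combinations(NODES, 2):
        link(f"{a}.CN0", f"{b}.CN0")
        link(f"{a}.CN1", f"{b}.CN1")
    return links


def target_from_switches(switches):
    """Return the nodes whose toggle switch has been flipped (hypothetical bool map)."""
    return [node for node in NODES if switches.get(node, False)]


links = build_links()
print(len(links))
print(target_from_switches({"Clump0": True}))
```

Each node contributes 8 internal links and the node controllers add 12 links between nodes, so the model contains 44 links in total.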

Step 202: perform a cache coherency write-back on the memory managed by each CPU node on the server.

In an embodiment of the present invention, after the target CPU node has been determined, the server system sends a write-back instruction to each CPU node in the form of an interrupt. After receiving the write-back instruction, each CPU node performs a cache coherency write-back on the memory it manages, updating the data in the memory managed by each CPU node into the server's memory and the corresponding physical storage locations.

For example, as shown in Fig. 3, after Clump0 has been determined to be the target CPU node, the server system sends a write-back instruction to Clump0, Clump1, Clump2 and Clump3 in the form of an interrupt. After the 4 CPU nodes receive the write-back instruction, each CPU node performs a cache coherency write-back on the memory managed by its CPU0, CPU1, CPU2 and CPU3. The data in the memory managed by all 16 CPUs is thereby updated into the server's memory and the corresponding physical storage locations, ensuring that the data in memory is all up to date.
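The effect of the broadcast write-back in step 202 can be sketched with a toy model in which each CPU holds modified ("dirty") cache lines that have not yet reached memory. The `Cpu` class, the `on_writeback_interrupt` handler name and the dictionary standing in for server memory are hypothetical; real hardware performs this with architecture-specific cache instructions that the description does not name.

```python
class Cpu:
    def __init__(self, name):
        self.name = name
        self.cache = {}      # address -> value held in this CPU's cache
        self.dirty = set()   # addresses modified but not yet written to memory

    def write(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def on_writeback_interrupt(self, memory):
        # Handler for the write-back instruction: flush every dirty line.
        for addr in sorted(self.dirty):
            memory[addr] = self.cache[addr]
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed


def broadcast_writeback(nodes, memory):
    """Deliver the write-back instruction to every CPU of every node."""
    return sum(cpu.on_writeback_interrupt(memory)
               for cpus in nodes.values() for cpu in cpus)


# 4 nodes x 4 CPUs, as in the Fig. 3 example.
nodes = {f"Clump{n}": [Cpu(f"Clump{n}.CPU{c}") for c in range(4)] for n in range(4)}
memory = {0x100: "old"}
nodes["Clump0"][2].write(0x100, "new")   # memory is stale until the write-back

flushed = broadcast_writeback(nodes, memory)
print(flushed, memory[0x100])
```

Until `broadcast_writeback` runs, memory still holds the stale value; afterwards no CPU holds data newer than memory, which is the precondition the description sets for moving applications in the next step.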

Step 203: move the application programs in the memory managed by the target CPU node into the memory managed by a normal CPU node.

In an embodiment of the present invention, after the coherency write-back has been performed on the memory managed by each CPU node, the application programs in the memory managed by the target CPU node are moved into memory managed by other normal CPU nodes in the server, and the normal CPUs continue to run the moved application programs.

For example, as shown in Fig. 3, after the coherency write-back has been performed on the memory managed by the 16 CPUs, the application programs in the memory managed by the 4 CPUs in Clump0 are moved into the memory managed by the 4 CPUs in Clump1, respectively. The 4 CPUs in Clump1 then run the application programs moved from the memory managed by the 4 CPUs in Clump0, while Clump1, Clump2 and Clump3 keep running normally and each continues to run its own original application programs.
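The Clump0-to-Clump1 example above can be sketched as a per-CPU move. The one-to-one pairing of CPU indices (Clump0.CPU0 to Clump1.CPU0, and so on) is one reading of "respectively" in the example and is an assumption here, as are the `migrate` function, the `memory_by_cpu` dictionary and the application names.

```python
def migrate(memory_by_cpu, src_node, dst_node, cpus_per_node=4):
    """Move each application from src_node.CPUi's memory to dst_node.CPUi's memory."""
    for i in range(cpus_per_node):
        src = f"{src_node}.CPU{i}"
        dst = f"{dst_node}.CPU{i}"
        memory_by_cpu[dst].extend(memory_by_cpu[src])
        memory_by_cpu[src] = []
    return memory_by_cpu


# One placeholder application per CPU; "app02" lives in Clump0.CPU2's memory.
memory_by_cpu = {
    f"Clump{n}.CPU{c}": [f"app{n}{c}"] for n in range(4) for c in range(4)
}
migrate(memory_by_cpu, "Clump0", "Clump1")
print(memory_by_cpu["Clump1.CPU2"])
```

After the call, Clump1's CPUs hold both their original applications and those taken over from Clump0, while Clump2 and Clump3 are untouched.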

Step 204: stop the target CPU node, and keep sending NULL packets to each node controller corresponding to the normal CPU nodes.

In an embodiment of the present invention, after the application programs in the memory managed by the target CPU node have been moved into the memory managed by a normal CPU node, the target CPU node is stopped. After the target CPU node has stopped, a maintenance instruction is sent to each normal CPU node in the server through the basic input/output system (BIOS). After each normal CPU node receives the maintenance instruction, the CPUs in that node keep sending NULL packets, which carry no valid data, to the node controllers corresponding to that node, so as to keep the links between the CPUs in the normal CPU node and the corresponding node controllers in a normal state.

For example, as shown in Fig. 3, after the application programs in the memory managed by Clump0 have been moved into the memory managed by Clump1, Clump0 is stopped, and the 4 CPUs included in Clump0 stop working accordingly. After Clump0 has stopped, the BIOS of the server sends a maintenance instruction to Clump1, Clump2 and Clump3. After receiving the maintenance instruction, the CPUs in Clump1, Clump2 and Clump3 keep sending NULL packets whose destination address is Clump0 to their respective CN0 and CN1, which keeps the links between each CPU in Clump1, Clump2 and Clump3 and its corresponding CN0 and CN1 in a normal state.

It should be noted that this step is especially important when the server includes only two CPU nodes.
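The purpose of the NULL packets in step 204 can be illustrated with a toy link model. The sketch assumes that a CPU-to-node-controller link is dropped after a fixed number of silent intervals; the description only states that links may break when no data is transmitted, so the `Link` class, the `idle_limit` value and the packet layout are hypothetical.

```python
class Link:
    """CPU-to-node-controller link that drops after `idle_limit` silent ticks (assumed)."""

    def __init__(self, idle_limit=3):
        self.idle_limit = idle_limit
        self.idle_ticks = 0
        self.up = True

    def send(self, packet):
        if self.up:
            self.idle_ticks = 0  # any packet, even an empty one, counts as traffic

    def tick(self):
        self.idle_ticks += 1
        if self.idle_ticks >= self.idle_limit:
            self.up = False


# A NULL packet addressed to the stopped node; it carries no valid data.
NULL_PACKET = {"dest": "Clump0", "payload": None}


def run(link, ticks, keepalive):
    for _ in range(ticks):
        if keepalive:
            link.send(NULL_PACKET)
        link.tick()
    return link.up


print(run(Link(), ticks=10, keepalive=False))
print(run(Link(), ticks=10, keepalive=True))
```

Without keepalive traffic the modeled link goes down and would have to be re-established, which in the scenario described above would mean restarting the normal nodes; with NULL packets it stays up for the whole replacement window.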

Step 205: replace the faulty CPU in the target CPU node.

In an embodiment of the present invention, after the target CPU node has stopped working, the server administrator removes the faulty CPU from the target CPU node and replaces it with a new CPU.

For example, as shown in Fig. 3, each of the 4 CPU nodes in the server is a single unit. After a CPU in Clump0 fails, the 4 CPUs, CN0 and CN1 included in Clump0 must be removed as a whole, and the entire Clump0 is replaced with a new CPU node. It should be noted that this is only one implementation of the embodiment of the present invention; in an actual implementation, one or more CPUs may be replaced individually without replacing the entire CPU node.

Step 206: add the target CPU node, after its CPU has been replaced, to the system of the server.

In an embodiment of the present invention, after the faulty CPU in the target CPU node has been replaced, the target CPU node is powered on again, and the link parameters between each CPU in the target CPU node and its corresponding node controller are initialized. An add instruction is then sent to the master CPU node of the server in the form of an interrupt. After receiving the add instruction, the master CPU node initializes the target CPU node and adds the initialized target CPU node to the system of the server. The target CPU node resumes running and executes the corresponding application programs, which completes the replacement of the target CPU node.

For example, as shown in Fig. 3, after Clump0 has been replaced as a whole, the new Clump0 is powered on again, and the link parameters between the 4 CPUs in the new Clump0 and CN0 and CN1 in the new Clump0 are initialized, so that CPU0 and CPU1 in the new Clump0 can communicate with CN0 in the new Clump0, and CPU2 and CPU3 in the new Clump0 can communicate with CN1 in the new Clump0. The new Clump0 sends an add instruction to the master CPU node Clump1 in the form of an interrupt. After receiving the add instruction, Clump1 initializes Clump0 and, once initialization is complete, adds Clump0 to the system of the server. Clump0 can then run application programs normally, and the replacement of Clump0 is complete.
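The re-admission sequence of step 206 is essentially a small state machine: power on, initialize the links, raise an add request, and let the master node admit the newcomer. The sketch below models only that ordering. The `State` values, the `MasterNode` and `ReplacedNode` classes and the direct method call standing in for the interrupt are hypothetical simplifications.

```python
from enum import Enum, auto


class State(Enum):
    REPLACED = auto()
    POWERED_ON = auto()
    LINKS_READY = auto()
    ONLINE = auto()


class MasterNode:
    def __init__(self, name):
        self.name = name
        self.system = {name}  # nodes currently part of the server system

    def on_add_interrupt(self, node):
        # The master initializes the requesting node, then admits it.
        if node.state is not State.LINKS_READY:
            raise RuntimeError("links must be initialized before the node is added")
        node.state = State.ONLINE
        self.system.add(node.name)


class ReplacedNode:
    def __init__(self, name):
        self.name = name
        self.state = State.REPLACED
        self.links = {}

    def power_on(self):
        self.state = State.POWERED_ON

    def init_links(self):
        # CPU0/CPU1 talk to CN0, CPU2/CPU3 talk to CN1 (Fig. 3 layout).
        self.links = {"CPU0": "CN0", "CPU1": "CN0", "CPU2": "CN1", "CPU3": "CN1"}
        self.state = State.LINKS_READY

    def request_add(self, master):
        # Stands in for the add instruction sent as an interrupt.
        master.on_add_interrupt(self)


master = MasterNode("Clump1")
node = ReplacedNode("Clump0")
node.power_on()
node.init_links()
node.request_add(master)
print(node.state.name, sorted(master.system))
```

The guard in `on_add_interrupt` mirrors the order given above: the master only admits a node whose CPU-to-node-controller links have already been initialized.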

As shown in Fig. 4, an embodiment of the present invention provides a device for hot replacing a CPU node. The device embodiment may be implemented in software. As shown in Fig. 4, as a device in the logical sense, it is formed by the CPU of the equipment on which it resides reading the corresponding computer program instructions from non-volatile memory into memory and running them. The device for hot replacing a CPU node provided by this embodiment comprises: a determining unit 401, a moving unit 402, a replacing unit 403 and an adding unit 404;

the determining unit 401 is configured to determine the target CPU node on the server whose CPU has failed;

the moving unit 402 is configured to move the application programs in the memory managed by the target CPU node determined by the determining unit 401 into the memory managed by a normal CPU node, so that they are run by the normal CPU node;

the replacing unit 403 is configured to stop the target CPU node after the moving unit 402 has finished moving the application programs, so that the faulty CPU in the target CPU node can be replaced with a new CPU;

the adding unit 404 is configured to add the target CPU node, after its CPU has been replaced via the replacing unit 403, to the system of the server.

An embodiment of the present invention provides a device for hot replacing a CPU node. After the moving unit moves the application programs in the memory managed by the target CPU node into the memory managed by a normal CPU node, the normal CPU node takes over running the moved application programs. The replacing unit stops only the target CPU node; the other normal CPU nodes stay in the running state and run their corresponding application programs. After the faulty CPU in the target CPU node has been replaced, the adding unit adds the target CPU node back into the system of the server. While the target CPU node is being replaced, the application programs originally running on the server are not terminated, so interruption of the services running on the server is avoided.

In an embodiment of the present invention, the device may further include a write-back unit, which is configured to perform a cache coherency write-back on the memory managed by each CPU node on the server. The write-back unit performs the coherency write-back on the memory managed by each CPU node before the moving unit moves the application programs. This ensures that the data in the server's memory is all up to date, and in turn that the moved application programs can be executed normally.

In an embodiment of the present invention, when the device for hot replacing a CPU node includes the write-back unit, the write-back unit sends a write-back instruction to each CPU node in the form of an interrupt. After receiving the write-back instruction, each CPU node suspends the application programs it is running, performs a cache coherency write-back on the memory it manages, and resumes its application programs once the write-back is complete. This ensures that each CPU node performs the cache coherency write-back in time, and avoids the situation in which application programs are moved before a CPU node has completed its write-back and therefore cannot run normally.

In an embodiment of the present invention, the device may further include a link maintenance unit. After the replacing unit stops the target CPU node, the link maintenance unit sends a maintenance instruction to each normal CPU node through the basic input/output system (BIOS). After receiving the maintenance instruction, each normal CPU node keeps sending NULL packets to each of its corresponding node controllers, which keeps the links between each CPU in the normal CPU node and each corresponding node controller in a normal state. After the target CPU node is added back into the server system, there is then no need to restart the normal CPU nodes in order to re-establish the links between their CPUs and the corresponding node controllers, which improves the efficiency of replacing the CPU node.

In an embodiment of the present invention, the adding unit powers on the target CPU node again after its CPU has been replaced, initializes the link parameters between each CPU in the target CPU node and each corresponding node controller, and sends an add instruction to the master CPU node of the server in the form of an interrupt. After receiving the add instruction, the master CPU node initializes the target CPU node and, once initialization is complete, adds the target CPU node to the system of the server. The target CPU node can then run application programs and communicate with the other CPU nodes, which completes the replacement of the target CPU node.

The information exchange between the units of the above device, their execution process and similar details are based on the same concept as the method embodiments of the present invention. For specifics, refer to the description of the method embodiments; they are not repeated here.

The method and device for hot replacing CPU nodes provided by the embodiments of the present invention have at least the following beneficial effects:

1. In the embodiments of the present invention, after a CPU in the target CPU node fails, the target CPU node is stopped so that the faulty CPU can be replaced, while the other CPU nodes keep running normally. Before the target CPU node is stopped, the application programs running on it are moved to other normal CPU nodes, and after the CPU has been replaced the target CPU node is added back into the server system. In this way, the application programs running on the server are not terminated while the faulty CPU node is replaced, so interruption of the services running on the server is avoided.

2. In the embodiments of the present invention, before the application programs in the memory managed by the target CPU node are moved into the memory managed by a normal CPU node, a cache coherency write-back is performed on the memory managed by each CPU node. This ensures that the data in the server's memory is up to date, and avoids the situation in which the moved application programs cannot run normally because the memory managed by the normal CPU node and the memory managed by the target CPU node hold different versions of the same data. The services running on the server are thus not interrupted.

3. In the embodiments of the present invention, after the target CPU node is stopped, each CPU in the normal CPU nodes keeps sending NULL packets to its corresponding node controller. This keeps the links between the CPUs and the node controllers in the normal CPU nodes in a normal state, and avoids the situation in which those links are disconnected because no data is transmitted between the normal CPU nodes and the target CPU node. There is therefore no need to restart the normal CPU nodes in order to rebuild the links between their CPUs and the corresponding node controllers. On the one hand this further ensures that the services running on the server are not interrupted; on the other hand it improves the efficiency of replacing the CPU node.

4. In the embodiments of the present invention, a write-back instruction is sent to each CPU node in the server in the form of an interrupt. After receiving the write-back instruction, each CPU node suspends the processes it is running, performs a cache coherency write-back on the memory it manages according to the instruction, and resumes the suspended processes once the operation is complete. This ensures that the cache coherency write-back is completed in time, and avoids the situation in which application programs cannot run normally because the write-back has not yet been performed when the target CPU node stops. It further ensures that the services running on the server are not interrupted.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element introduced by the phrase "including a" does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks and optical discs.

Finally, it should be noted that the above are merely preferred embodiments of the present invention, intended only to illustrate its technical solutions and not to limit its protection scope. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims

1. A method for hot replacing a CPU node, characterized in that it is applied to a server including at least two CPU nodes, and comprises:

determining the target CPU node on the server whose CPU has failed;

2. The method according to claim 1, characterized in that,

before moving the application programs in the memory managed by the target CPU node into the memory managed by the normal CPU node, the method further comprises:

3. The method according to claim 2, characterized in that,

performing the cache coherency write-back on the memory managed by each CPU node on the server comprises:

4. The method according to claim 1, characterized in that,

after stopping the target CPU node, the method further comprises:

5. The method according to any one of claims 1 to 4, characterized in that,

adding the target CPU node, after its CPU has been replaced, to the system of the server comprises:

6. A device for hot replacing a CPU node, characterized in that it comprises: a determining unit, a moving unit, a replacing unit and an adding unit;

7. The device according to claim 6, characterized in that it further comprises: a write-back unit;

8. The device according to claim 7, characterized in that

9. The device according to claim 6, characterized in that it further comprises: a link maintenance unit;

10. The device according to any one of claims 6 to 9, characterized in that