CN105808391A - Method and device for hot replacing CPU nodes - Google Patents

Method and device for hot replacing CPU nodes

Publication number
CN105808391A
Authority
CN
China
Prior art keywords
cpu
node
cpu node
server
normal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610204324.6A
Other languages
Chinese (zh)
Inventor
周玉龙
童元满
李仁刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201610204324.6A
Publication of CN105808391A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/203 Failover techniques using migration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/2028 Failover techniques eliminating a faulty processor or activating a spare
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2041 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with more than one idle spare processing component

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention provides a method and device for hot replacing CPU nodes. The method includes: determining the target CPU node on a server whose CPU has failed; moving the application programs in the memory managed by the target CPU node into memory managed by a normal CPU node, so that the normal CPU node runs them; stopping the target CPU node and replacing its faulty CPU with a new CPU; and adding the target CPU node, with its CPU replaced, back into the system of the server. The device comprises a determining unit, a moving unit, a replacing unit and an adding unit. The method and device avoid interrupting the services running on the server.

Description

Method and device for hot replacing CPU nodes
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for hot replacing CPU nodes.
Background
Servers, as high-performance computers, are widely used in many fields to process all kinds of services. As service volume and service complexity grow, users demand ever higher server performance, and improving the performance of a single CPU node can no longer satisfy users' requirements for computing speed. Server performance is therefore generally improved by increasing the number of CPU nodes in the server, to meet users' requirements for processing speed.
In a server that includes multiple CPU nodes, the CPU nodes can run services simultaneously, which increases the speed at which the server processes services. Because the CPU nodes may share overlapping services, when one CPU node in the server fails, that node cannot run its services normally, and the other CPU nodes may be unable to run their services normally as well.
At present, when a CPU node in a server fails, every CPU node on the server must be stopped. After the server is shut down, the faulty CPU is replaced with a new CPU, and then every CPU node on the server is restarted and services continue to run.
With this prior-art approach to a faulty CPU node, every CPU node on the server must be stopped in order to replace the faulty node. This interrupts the services running on the server, causes great inconvenience to users, and in some special fields can even have serious consequences.
Summary of the invention
Embodiments of the present invention provide a method and device for hot replacing CPU nodes that can avoid interrupting the services running on a server.
An embodiment of the present invention provides a method for hot replacing a CPU node, applied to a server that includes at least two CPU nodes, the method including:
determining the target CPU node on the server whose CPU has failed;
moving the application programs in the memory managed by the target CPU node into memory managed by a normal CPU node, so that the normal CPU node runs them;
stopping the target CPU node, and replacing the faulty CPU in the target CPU node with a new CPU;
adding the target CPU node, with its CPU replaced, into the system of the server.
Preferably, before the application programs in the memory managed by the target CPU node are moved into memory managed by a normal CPU node, the method further includes:
performing a cache-coherency write-back on the memory managed by each CPU node on the server.
Preferably, performing the cache-coherency write-back on the memory managed by each CPU node on the server includes:
sending a write-back instruction to each CPU node in the server in the form of an interrupt; after receiving the write-back instruction, each CPU node performs a cache-coherency write-back on the memory it manages.
Preferably, after the target CPU node is stopped, the method further includes:
sending a maintenance instruction to the normal CPU node through the basic input/output system (BIOS); after receiving the maintenance instruction, the normal CPU node continuously sends NULL (empty) packets to each node controller corresponding to the normal CPU node, to keep the links between the CPUs in the normal CPU node and their corresponding node controllers in a normal state.
Preferably, adding the target CPU node, with its CPU replaced, into the system of the server includes:
powering the target CPU node back on after its CPU has been replaced, initializing the link parameters between each CPU in the target CPU node and its corresponding node controllers, and sending an add instruction to the primary CPU node in the server in the form of an interrupt; after receiving the add instruction, the primary CPU node initializes the target CPU node and adds the initialized target CPU node into the system of the server.
An embodiment of the present invention further provides a device for hot replacing a CPU node, including: a determining unit, a moving unit, a replacing unit and an adding unit;
the determining unit is configured to determine the target CPU node on the server whose CPU has failed;
the moving unit is configured to move the application programs in the memory managed by the target CPU node determined by the determining unit into memory managed by a normal CPU node, so that the normal CPU node runs them;
the replacing unit is configured to stop the target CPU node after the moving unit has finished moving the application programs, so that the faulty CPU in the target CPU node can be replaced with a new CPU;
the adding unit is configured to add the target CPU node into the system of the server after its CPU has been replaced.
Preferably, the device further includes a write-back unit;
the write-back unit is configured to perform a cache-coherency write-back on the memory managed by each CPU node on the server.
Preferably,
the write-back unit is configured to send a write-back instruction to each CPU node of the server in the form of an interrupt; after receiving the write-back instruction, each CPU node performs a cache-coherency write-back on the memory it manages.
Preferably, the device further includes a link maintenance unit;
the link maintenance unit is configured to send a maintenance instruction to the normal CPU node through the basic input/output system (BIOS); after receiving the maintenance instruction, the normal CPU node continuously sends NULL (empty) packets to each node controller corresponding to the normal CPU node, to keep the links between the CPUs in the normal CPU node and their corresponding node controllers in a normal state.
Preferably,
the adding unit is configured to power the target CPU node back on after its CPU has been replaced, initialize the link parameters between each CPU in the target CPU node and its corresponding node controllers, and send an add instruction to the primary CPU node in the server in the form of an interrupt; after receiving the add instruction, the primary CPU node initializes the target CPU node and adds the initialized target CPU node into the system of the server.
Embodiments of the present invention provide a method and device for hot replacing CPU nodes. After a CPU in the target CPU node fails, the target CPU node is stopped so that the faulty CPU can be replaced while the other CPU nodes keep running normally; before the target CPU node is stopped, the application programs running on it are moved to other normal CPU nodes; and after the CPU has been replaced, the target CPU node is added back into the server system. In this way, none of the application programs running on the server is terminated while the faulty CPU node is replaced, which avoids interrupting the services running on the server.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for hot replacing a CPU node provided by one embodiment of the present invention;
Fig. 2 is a flowchart of a method for hot replacing a CPU node provided by another embodiment of the present invention;
Fig. 3 is a schematic diagram of a server architecture provided by one embodiment of the present invention;
Fig. 4 is a schematic diagram of a device for hot replacing a CPU node provided by one embodiment of the present invention.
Detailed description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a method for hot replacing a CPU node, applied to a server that includes at least two CPU nodes. The method may include the following steps:
Step 101: determine the target CPU node on the server whose CPU has failed;
Step 102: move the application programs in the memory managed by the target CPU node into memory managed by a normal CPU node, so that the normal CPU node runs them;
Step 103: stop the target CPU node, and replace the faulty CPU in the target CPU node with a new CPU;
Step 104: add the target CPU node, with its CPU replaced, into the system of the server.
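The four steps can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `CpuNode`, `NodeState` and `hot_replace` are hypothetical names that do not appear in the patent, the physical CPU swap is modeled by clearing a `faulty` flag, and the first running normal node is simply picked as the destination.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"


@dataclass
class CpuNode:
    # Hypothetical model of one CPU node; not a structure defined by the patent.
    name: str
    faulty: bool = False
    state: NodeState = NodeState.RUNNING
    apps: list = field(default_factory=list)


def hot_replace(nodes):
    """Walk steps 101-104 over the model and return a log of what happened."""
    log = []

    # Step 101: determine the target node, i.e. the one with a failed CPU.
    target = next(n for n in nodes if n.faulty)
    log.append(f"101: target is {target.name}")

    # Step 102: move the target's applications to a normal, running node.
    normal = next(
        n for n in nodes if not n.faulty and n.state is NodeState.RUNNING
    )
    normal.apps.extend(target.apps)
    target.apps.clear()
    log.append(f"102: applications moved to {normal.name}")

    # Step 103: stop only the target node, then swap the faulty CPU.
    target.state = NodeState.STOPPED
    target.faulty = False  # assumption: stands in for the physical replacement
    log.append(f"103: {target.name} stopped and CPU replaced")

    # Step 104: add the repaired node back into the server system.
    target.state = NodeState.RUNNING
    log.append(f"104: {target.name} added back")
    return log
```

Calling `hot_replace` on a list holding a faulty `Clump0` and a healthy `Clump1` leaves every application on `Clump1` and both nodes running; at no point is the healthy node stopped.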
An embodiment of the present invention provides a method for hot replacing a CPU node. After a CPU in the target CPU node fails, the target CPU node is stopped so that the faulty CPU can be replaced while the other CPU nodes keep running normally; before the target CPU node is stopped, the application programs running on it are moved to other normal CPU nodes; and after the CPU has been replaced, the target CPU node is added back into the server system. In this way, none of the application programs running on the server is terminated while the faulty CPU node is replaced, which avoids interrupting the services running on the server.
In one embodiment of the present invention, before the application programs in the memory managed by the target CPU node are moved into memory managed by a normal CPU node, a cache-coherency write-back is first performed on the memory managed by each CPU node on the server, so that the data in the memory managed by each CPU node is saved. All data in server memory is then up to date, which preserves the coherency of the server system's caches after the target CPU node is stopped and allows the application programs moved to the normal CPU node to run normally.
In one embodiment of the present invention, when the cache-coherency write-back is performed on the memory managed by each CPU node, a write-back instruction is sent to each CPU node in the form of an interrupt, and each CPU node, after receiving the write-back instruction, performs a cache-coherency write-back on the memory it manages. Using an interrupt ensures that the write-back of the memory managed by each CPU node completes before the target CPU node is removed, so that the data in server memory is up to date and the application programs in the target CPU node's memory can run normally after being moved to other normal CPU nodes.
In one embodiment of the present invention, after the target CPU node is stopped, a maintenance instruction may be sent to the normal CPU nodes through the basic input/output system (BIOS). After receiving the maintenance instruction, a normal CPU node continuously sends NULL (empty) packets to each of its corresponding node controllers. This keeps the links between the normal CPU node and its corresponding node controllers in a normal state while the target CPU node is out of service, so that after the target CPU node is restored there is no need to re-establish those links by restarting, which improves the efficiency of replacing a CPU node.
In one embodiment of the present invention, after the CPU in the target CPU node has been replaced, the target CPU node is powered back on, the link parameters between the target CPU node and each corresponding node controller are initialized, and an add instruction is sent to the primary CPU node of the server in the form of an interrupt. After receiving the add instruction, the primary CPU node initializes the target CPU node and adds the initialized target CPU node into the system of the server. This completes the replacement of the target CPU node, which rejoins the server system and runs services.
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 2, an embodiment of the present invention provides a method for hot replacing a CPU node. The method may include the following steps:
Step 201: determine the target CPU node on the server whose CPU has failed.
In one embodiment of the present invention, in a server that includes multiple CPU nodes, a corresponding toggle switch is provided for each CPU node. When a CPU in one of the CPU nodes fails, the toggle switch corresponding to that CPU node is flipped, and the server system determines the faulty target CPU node according to which toggle switch was flipped.
For example, as shown in Fig. 3, a server includes 4 CPU nodes, namely Clump0, Clump1, Clump2 and Clump3. Each CPU node includes 4 CPUs, namely CPU0, CPU1, CPU2 and CPU3; CPU0 is connected to CPU1 and CPU2, and CPU3 is connected to CPU1 and CPU2. Each CPU node corresponds to 2 node controllers, CN0 and CN1. One end of CN0 is connected to CPU0 and CPU1 in its CPU node, and the other end is connected to the CN0 of each of the other three CPU nodes. One end of CN1 is connected to CPU2 and CPU3 in its CPU node, and the other end is connected to the CN1 of each of the other three CPU nodes. After one or more CPUs in Clump0 fail, the server administrator flips the toggle switch corresponding to Clump0, and the server system determines from the change in the toggle switch signal that Clump0 is the target CPU node.
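The Fig. 3 wiring described above can be written down as a set of undirected links, which makes the connection rules easy to check. The sketch below is illustrative: `build_topology` and the `Clump0.CPU0` style of endpoint naming are assumptions, while the connection rules themselves are taken from the description.

```python
def build_topology(num_nodes=4):
    """Return the Fig. 3 links as a set of frozensets of two endpoint names."""
    links = set()
    for n in range(num_nodes):
        node = f"Clump{n}"
        # Inside a node: CPU0 and CPU3 each connect to CPU1 and CPU2.
        for a, b in [(0, 1), (0, 2), (3, 1), (3, 2)]:
            links.add(frozenset({f"{node}.CPU{a}", f"{node}.CPU{b}"}))
        # CN0 serves CPU0 and CPU1; CN1 serves CPU2 and CPU3.
        for cpu, controller in [(0, "CN0"), (1, "CN0"), (2, "CN1"), (3, "CN1")]:
            links.add(frozenset({f"{node}.CPU{cpu}", f"{node}.{controller}"}))
    # Between nodes: every CN0 connects to every other CN0, likewise for CN1.
    for controller in ("CN0", "CN1"):
        for i in range(num_nodes):
            for j in range(i + 1, num_nodes):
                links.add(
                    frozenset({f"Clump{i}.{controller}", f"Clump{j}.{controller}"})
                )
    return links
```

For the four-node server this gives 44 links: 8 inside each node and 12 between node controllers of different nodes. Stopping Clump0 takes its 8 internal links and its 6 inter-node links out of service, while the 8 internal links of each remaining node are the ones the later link-maintenance step keeps alive.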
Step 202: perform a cache-coherency write-back on the memory managed by each CPU node on the server.
In one embodiment of the present invention, after the target CPU node is determined, the server system sends a write-back instruction to each CPU node in the form of an interrupt. After receiving the write-back instruction, each CPU node performs a cache-coherency write-back on the memory it manages, updating the data in the memory managed by each CPU node to the server's memory and the corresponding physical storage locations.
For example, as shown in Fig. 3, after Clump0 is determined to be the target CPU node, the server system sends a write-back instruction to Clump0, Clump1, Clump2 and Clump3 in the form of an interrupt. After the 4 CPU nodes receive the write-back instruction, each CPU node performs a cache-coherency write-back on the memory managed by its CPU0, CPU1, CPU2 and CPU3, so that the data in the memory managed by all 16 CPUs is updated to the server's memory and the corresponding physical storage locations, which ensures that all data in memory is up to date.
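The write-back broadcast can be modeled with a dictionary standing in for server memory and a per-CPU dictionary of dirty cache lines. This is a simplified sketch, not the hardware mechanism: the `Cpu` class, `on_writeback_interrupt` and `broadcast_writeback` are hypothetical names, and a direct method call stands in for the interrupt.

```python
class Cpu:
    """Hypothetical CPU model holding cache lines not yet written to memory."""

    def __init__(self, name):
        self.name = name
        self.dirty_lines = {}  # address -> value present only in this cache

    def on_writeback_interrupt(self, memory):
        # Flush every dirty line to memory, then mark the cache clean.
        memory.update(self.dirty_lines)
        self.dirty_lines.clear()


def broadcast_writeback(nodes, memory):
    """Deliver the write-back instruction to every CPU of every node.

    `nodes` maps a node name to its list of Cpu objects. Returns how many
    CPUs handled the instruction.
    """
    handled = 0
    for cpus in nodes.values():
        for cpu in cpus:
            cpu.on_writeback_interrupt(memory)
            handled += 1
    return handled
```

With 4 nodes of 4 CPUs each, the broadcast reaches 16 CPUs, matching the example above; afterwards no CPU holds a dirty line, so memory is the single up-to-date copy before any application is moved.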
Step 203: move the application programs in the memory managed by the target CPU node into memory managed by a normal CPU node.
In one embodiment of the present invention, after the cache-coherency write-back has been performed on the memory managed by each CPU node, the application programs in the memory managed by the target CPU node are moved into memory managed by other normal CPU nodes in the server, and the normal CPUs continue to run the moved application programs.
For example, as shown in Fig. 3, after the cache-coherency write-back has been performed on the memory managed by the 16 CPUs, the application programs in the memory managed by the 4 CPUs of Clump0 are moved into the memory managed by the 4 CPUs of Clump1, and the 4 CPUs of Clump1 run the application programs moved from the memory managed by the 4 CPUs of Clump0. Meanwhile, Clump1, Clump2 and Clump3 keep running normally, each running its own original application programs.
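The CPU-by-CPU move in the example can be sketched as follows. The data layout (`apps_by_node` as a nested dictionary) and the function name `migrate_apps` are assumptions for illustration, and pairing each source CPU with the same-numbered destination CPU is one reading of "respectively" in the example rather than something the patent spells out.

```python
def migrate_apps(apps_by_node, target, destination):
    """Move every application from each CPU of `target` to the same-numbered
    CPU of `destination`. Returns the number of applications moved.

    `apps_by_node` maps node name -> CPU name -> list of application names.
    """
    moved = 0
    for cpu, apps in apps_by_node[target].items():
        # The destination keeps its original applications and gains the moved ones.
        apps_by_node[destination][cpu].extend(apps)
        moved += len(apps)
        apps.clear()
    return moved
```

After the call, the target node's lists are empty, so stopping it terminates nothing, while the destination's original applications are untouched.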
Step 204: stop the target CPU node, and continuously send NULL (empty) packets to the node controllers corresponding to each normal CPU node.
In one embodiment of the present invention, after the application programs in the memory managed by the target CPU node have been moved into the memory managed by a normal CPU node, the target CPU node is stopped. After the target CPU node is out of service, a maintenance instruction is sent to each normal CPU node in the server through the basic input/output system (BIOS). After each normal CPU node receives the maintenance instruction, the CPUs in the normal CPU node continuously send NULL packets, which contain no valid data, to the node controllers corresponding to that node, to keep the links between the CPUs in the normal CPU node and the corresponding node controllers in a normal state.
For example, as shown in Fig. 3, after the application programs in the memory managed by Clump0 have been moved into the memory managed by Clump1, Clump0 is stopped, and the 4 CPUs that Clump0 includes stop working with it. After Clump0 is out of service, the BIOS of the server sends a maintenance instruction to Clump1, Clump2 and Clump3. After receiving the maintenance instruction, the CPUs in Clump1, Clump2 and Clump3 continuously send NULL packets whose destination address is Clump0 to their respective CN0 and CN1, which keeps the links between each CPU in Clump1, Clump2 and Clump3 and its corresponding CN0 and CN1 in a normal state.
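Why the empty packets matter can be shown with a toy link that drops after a period without traffic. The patent states only that a link may be disconnected when no data is transmitted; the tick-based timer, the `idle_limit` parameter and the names `Link`, `NULL_PACKET` and `run` are assumptions made for this sketch.

```python
from dataclasses import dataclass

NULL_PACKET = b""  # a packet that carries no valid data


@dataclass
class Link:
    """Hypothetical CPU-to-node-controller link with an idle timeout."""

    idle_limit: int      # assumption: ticks without traffic before the link drops
    idle_ticks: int = 0
    up: bool = True

    def send(self, payload):
        # Any traffic, including an empty packet, resets the idle counter.
        if self.up:
            self.idle_ticks = 0

    def tick(self):
        # Advance time by one tick and drop the link if it idled too long.
        if not self.up:
            return
        self.idle_ticks += 1
        if self.idle_ticks >= self.idle_limit:
            self.up = False


def run(link, ticks, keepalive):
    """Simulate `ticks` time steps; send a NULL packet each step if `keepalive`."""
    for _ in range(ticks):
        if keepalive:
            link.send(NULL_PACKET)
        link.tick()
    return link.up
```

In this model a link with `idle_limit=3` stays up for any number of ticks when `keepalive` is true and is down after three ticks when it is false, which mirrors the choice between keeping the links alive and having to restart the normal nodes to rebuild them.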
It should be noted that this step is especially important when the server includes only two CPU nodes.
Step 205: replace the faulty CPU in the target CPU node.
In one embodiment of the present invention, after the target CPU node has stopped working, the server administrator removes the faulty CPU from the target CPU node and replaces it with a new CPU.
For example, as shown in Fig. 3, each of the 4 CPU nodes in this server is an integral unit. After a CPU in Clump0 fails, the 4 CPUs, CN0 and CN1 that Clump0 includes must be removed as a whole, and the whole of Clump0 is replaced with a new CPU node. It should be noted that this is one implementation of the embodiment of the present invention; in a specific implementation, one or more CPUs may be replaced individually without replacing the entire CPU node.
Step 206: add the target CPU node, with its CPU replaced, into the system of the server.
In one embodiment of the present invention, after the faulty CPU in the target CPU node has been replaced, the target CPU node is powered on again, and the link parameters between each CPU in the target CPU node and its corresponding node controller are initialized. An add instruction is sent to the primary CPU node of the server in the form of an interrupt. After receiving the add instruction, the primary CPU node initializes the target CPU node and adds the initialized target CPU node into the system of the server. The target CPU node then resumes operation and runs the corresponding application programs, which completes the replacement of the target CPU node.
For example, as shown in Fig. 3, after Clump0 has been replaced as a whole, the new Clump0 is powered on, and the link parameters between the 4 CPUs in the new Clump0 and its CN0 and CN1 are initialized, so that CPU0 and CPU1 in the new Clump0 can communicate with its CN0, and CPU2 and CPU3 can communicate with its CN1. The new Clump0 sends an add instruction to the primary CPU node, Clump1, in the form of an interrupt. After Clump1 receives the add instruction, it initializes Clump0; once initialization is complete, Clump0 is added into the system of the server and can run application programs normally, which completes the replacement of Clump0.
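The rejoin sequence in this example has a fixed order: power on, initialize the CPU-to-controller links, raise the add interrupt, and only then join the system. The sketch below records that order in a log. `Server`, `handle_add_interrupt`, `rejoin` and `LINK_MAP` are hypothetical names, and a plain method call stands in for the interrupt.

```python
# CPU-to-node-controller pairing inside one node, as described for Fig. 3.
LINK_MAP = {"CPU0": "CN0", "CPU1": "CN0", "CPU2": "CN1", "CPU3": "CN1"}


class Server:
    """Hypothetical model of the server system as seen by its primary node."""

    def __init__(self, primary, members):
        self.primary = primary
        self.members = set(members)
        self.log = []

    def handle_add_interrupt(self, node):
        # The primary node initializes the newcomer before admitting it.
        self.log.append(f"{self.primary} initializes {node}")
        self.members.add(node)
        self.log.append(f"{node} added to system")


def rejoin(server, node):
    """Run the rejoin sequence for a node whose CPU has been replaced."""
    server.log.append(f"{node} powered on")
    for cpu, controller in LINK_MAP.items():
        server.log.append(f"{node}: link {cpu}-{controller} initialized")
    # Assumption: the interrupt is modeled as a direct call on the server.
    server.handle_add_interrupt(node)
```

For a server whose primary node is Clump1 and whose running members are Clump1, Clump2 and Clump3, calling `rejoin(server, "Clump0")` produces seven log entries and leaves all four nodes in `server.members`.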
As shown in Fig. 4, an embodiment of the present invention provides a device for hot replacing a CPU node. The device embodiment may be implemented in software. As shown in Fig. 4, as a device in the logical sense, it is formed when the CPU of the equipment on which it resides reads the corresponding computer program instructions from non-volatile memory into memory and runs them. The device for hot replacing a CPU node provided by this embodiment includes: a determining unit 401, a moving unit 402, a replacing unit 403 and an adding unit 404;
the determining unit 401 is configured to determine the target CPU node on the server whose CPU has failed;
the moving unit 402 is configured to move the application programs in the memory managed by the target CPU node determined by the determining unit 401 into memory managed by a normal CPU node, so that the normal CPU node runs them;
the replacing unit 403 is configured to stop the target CPU node after the moving unit 402 has finished moving the application programs, so that the faulty CPU in the target CPU node can be replaced with a new CPU;
the adding unit 404 is configured to add the target CPU node into the system of the server after its CPU has been replaced.
An embodiment of the present invention provides a device for hot replacing a CPU node. After the moving unit has moved the application programs in the memory managed by the target CPU node into the memory managed by a normal CPU node, the normal CPU node takes over running the moved application programs. The replacing unit stops only the target CPU node; the other normal CPU nodes remain running and keep executing their application programs. After the faulty CPU in the target CPU node has been replaced, the adding unit adds the target CPU node back into the system of the server. While the target CPU node is being replaced, none of the application programs originally running on the server is terminated, which avoids interrupting the services running on the server.
In one embodiment of the present invention, the device may further include a write-back unit, which is configured to perform a cache-coherency write-back on the memory managed by each CPU node on the server. The write-back unit performs this write-back before the moving unit moves the application programs, which ensures that all data in server memory is up to date and, in turn, that the moved application programs can execute normally.
In one embodiment of the present invention, when the device includes a write-back unit, the write-back unit sends a write-back instruction to each CPU node in the form of an interrupt. After receiving the write-back instruction, each CPU node suspends the application programs it is running, performs a cache-coherency write-back on the memory it manages, and resumes its application programs once the write-back is complete. This ensures that each CPU node performs the write-back in time, and avoids the situation in which application programs are moved before some CPU node has completed its write-back and therefore cannot run normally.
In one embodiment of the present invention, the device may further include a link maintenance unit. After the replacing unit stops the target CPU node, the link maintenance unit sends a maintenance instruction to each normal CPU node through the basic input/output system (BIOS). After receiving the maintenance instruction, each normal CPU node continuously sends NULL (empty) packets to each of its corresponding node controllers. This keeps the links between each CPU in the normal CPU nodes and the corresponding node controllers in a normal state, so that after the target CPU node is added back into the server system there is no need to restart the normal CPU nodes to re-establish those links, which improves the efficiency of replacing a CPU node.
In one embodiment of the present invention, the adding unit powers the target CPU node back on after its CPU has been replaced, initializes the link parameters between each CPU in the target CPU node and its corresponding node controllers, and sends an add instruction to the primary CPU node of the server in the form of an interrupt. After receiving the add instruction, the primary CPU node initializes the target CPU node and, once initialization is complete, adds it into the system of the server. The target CPU node can then run application programs and communicate with the other CPU nodes, which completes the replacement of the target CPU node.
Because the information exchange between the units of the above device and their execution processes are based on the same concept as the method embodiments of the present invention, refer to the description of the method embodiments for details, which are not repeated here.
The method and device for hot-replacing a CPU node provided by the embodiments of the invention have at least the following advantages:
1. In the embodiments of the invention, after a CPU in the target CPU node fails, the target CPU node is stopped so that the failed CPU can be replaced, while the other CPU nodes keep running normally. Before the target CPU node is stopped, the applications running on it are migrated to other normal CPU nodes, and after the CPU has been replaced the target CPU node is added back to the server system. In this way, replacing a faulty CPU node does not terminate the applications running on the server, which avoids interrupting the services running on the server.
2. In the embodiments of the invention, before the applications in the memory managed by the target CPU node are migrated to memory managed by a normal CPU node, a cache coherency write-back is performed on the memory managed by each CPU node. This ensures that the data in the server's memory is up to date. It avoids the situation where a migrated application cannot run normally because the memory managed by the normal CPU node and the memory managed by the target CPU node hold different values for the same data, so the services running on the server are not interrupted.
3. In the embodiments of the invention, after the target CPU node is stopped, each CPU in the normal CPU nodes continues to send empty packets to its corresponding node controllers. This keeps the links between the CPUs in the normal CPU nodes and the node controllers alive, and avoids the situation where those links are disconnected because no data is being transmitted between the normal CPU nodes and the target CPU node. The links therefore do not have to be rebuilt by restarting the normal CPU nodes. On the one hand, this further ensures that the services running on the server are not interrupted; on the other hand, it improves the efficiency of CPU node replacement.
4. In the embodiments of the invention, a write-back instruction is sent, in the form of an interrupt, to each CPU node in the server. After receiving the write-back instruction, each CPU node suspends the process it is running, performs a cache coherency write-back on the memory it manages according to the instruction, and resumes the suspended process once the operation is complete. This ensures that the cache coherency write-back is completed in time and avoids the situation where the target CPU node has already stopped before the write-back has been performed, which would prevent applications from running normally. This further ensures that the services running on the server are not interrupted.
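The ordering that advantages 1, 2, and 4 rely on (write back every node's cache first, then migrate, then stop the target node) can be sketched with a toy model. Everything here is an illustrative assumption: the `Node` class, the dictionary-based memory and cache, and the function `prepare_hot_replace` are hypothetical names, and a real system performs these steps in hardware and firmware rather than in application code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    memory: dict = field(default_factory=dict)  # app name -> state in managed memory
    cache: dict = field(default_factory=dict)   # (owner node, app name) -> newer state
    running: bool = True

    def write_back(self, nodes: dict) -> None:
        # Flush every dirty cache line to the memory of the node that owns it.
        for (owner, key), value in self.cache.items():
            nodes[owner].memory[key] = value
        self.cache.clear()


def prepare_hot_replace(nodes: dict, target_name: str, dest_name: str) -> None:
    # Step 1: every CPU node performs the cache coherency write-back,
    # so managed memory holds the latest data before anything moves.
    for node in nodes.values():
        node.write_back(nodes)
    # Step 2: migrate the applications from the target node's managed
    # memory into memory managed by a normal node.
    target, dest = nodes[target_name], nodes[dest_name]
    dest.memory.update(target.memory)
    target.memory.clear()
    # Step 3: only now stop the target node so its CPU can be replaced.
    target.running = False
```

If node `n1` caches a newer value for an application whose memory is managed by the faulty node `n0`, running the steps in this order means the migrated copy on `n1` carries the newer value. Skipping step 1 would migrate the stale value instead, which is the failure mode advantage 2 describes.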
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element qualified by the word "including" does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be understood that the above are only preferred embodiments of the invention. They serve only to illustrate the technical solution of the invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention falls within the scope of protection of the invention.

Claims (10)

1. A method for hot-replacing a CPU node, characterized in that it is applied to a server comprising at least two CPU nodes and comprises:
determining a target CPU node on the server in which a CPU has failed;
migrating the applications in the memory managed by the target CPU node to memory managed by a normal CPU node, so that they are run by the normal CPU node;
stopping the target CPU node, and replacing the faulty CPU in the target CPU node with a new CPU;
adding the target CPU node, after its CPU has been replaced, to the system of the server.
2. The method according to claim 1, characterized in that
before the migrating of the applications in the memory managed by the target CPU node to memory managed by a normal CPU node, the method further comprises:
performing a cache coherency write-back on the memory managed by each CPU node on the server.
3. The method according to claim 2, characterized in that
the performing of a cache coherency write-back on the memory managed by each CPU node on the server comprises:
sending a write-back instruction, in the form of an interrupt, to each CPU node in the server, wherein each CPU node, after receiving the write-back instruction, performs a cache coherency write-back on the memory it manages.
4. The method according to claim 1, characterized in that
after the stopping of the target CPU node, the method further comprises:
sending a maintenance instruction to the normal CPU node through the basic input/output system (BIOS), wherein the normal CPU node, after receiving the maintenance instruction, continues to send NULL (empty) packets to each node controller corresponding to the normal CPU node, so as to keep alive the links between the CPUs in the normal CPU node and each corresponding node controller.
5. The method according to any one of claims 1 to 4, characterized in that
the adding of the target CPU node, after its CPU has been replaced, to the system of the server comprises:
powering the target CPU node back on after its CPU has been replaced, initializing the link parameters between each CPU in the target CPU node and each corresponding node controller, and sending an add instruction, in the form of an interrupt, to the master CPU node in the server, wherein the master CPU node, after receiving the add instruction, initializes the target CPU node whose CPU has been replaced and adds the initialized target CPU node to the system of the server.
6. A device for hot-replacing a CPU node, characterized in that it comprises: a determining unit, a migration unit, a replacement unit, and an adding unit;
the determining unit is configured to determine a target CPU node on the server in which a CPU has failed;
the migration unit is configured to migrate the applications in the memory managed by the target CPU node determined by the determining unit to memory managed by a normal CPU node, so that they are run by the normal CPU node;
the replacement unit is configured to stop the target CPU node after the migration unit has finished migrating the applications, so that the faulty CPU in the target CPU node can be replaced with a new CPU;
the adding unit is configured to add the target CPU node, after the replacement unit has completed the CPU replacement, to the system of the server.
7. The device according to claim 6, characterized in that it further comprises: a write-back unit;
the write-back unit is configured to perform a cache coherency write-back on the memory managed by each CPU node on the server.
8. The device according to claim 7, characterized in that
the write-back unit is configured to send a write-back instruction, in the form of an interrupt, to each CPU node of the server, wherein each CPU node, after receiving the write-back instruction, performs a coherency write-back on the memory it manages.
9. The device according to claim 6, characterized in that it further comprises: a link maintenance unit;
the link maintenance unit is configured to send a maintenance instruction to the normal CPU node through the basic input/output system (BIOS), wherein the normal CPU node, after receiving the maintenance instruction, continues to send NULL (empty) packets to each node controller corresponding to the normal CPU node, so as to keep alive the links between the CPUs in the normal CPU node and each corresponding node controller.
10. The device according to any one of claims 6 to 9, characterized in that
the adding unit is configured to power the target CPU node back on after its CPU has been replaced, initialize the link parameters between each CPU in the target CPU node and each corresponding node controller, and send an add instruction, in the form of an interrupt, to the master CPU node in the server, wherein the master CPU node, after receiving the add instruction, initializes the target CPU node whose CPU has been replaced and adds the initialized target CPU node to the system of the server.
CN201610204324.6A 2016-04-05 2016-04-05 Method and device for hot replacing CPU nodes Pending CN105808391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610204324.6A CN105808391A (en) 2016-04-05 2016-04-05 Method and device for hot replacing CPU nodes

Publications (1)

Publication Number Publication Date
CN105808391A true CN105808391A (en) 2016-07-27

Family

ID=56460413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610204324.6A Pending CN105808391A (en) 2016-04-05 2016-04-05 Method and device for hot replacing CPU nodes

Country Status (1)

Country Link
CN (1) CN105808391A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101211282A (en) * 2006-12-28 2008-07-02 国际商业机器公司 Method of executing invalidation transfer operation for failure node in computer system
CN102047643A (en) * 2008-04-02 2011-05-04 国际商业机器公司 Method for enabling faster recovery of client applications in the event of server failure
CN101662645A (en) * 2009-09-17 2010-03-03 中兴通讯股份有限公司 Backup method for media processing unit, multipoint control unit and video communication system
CN101714109A (en) * 2009-11-24 2010-05-26 杭州华三通信技术有限公司 Method and device for controlling mainboard of double CPU system
JP2014146254A (en) * 2013-01-30 2014-08-14 Fujitsu Ltd Information processing device and control method of information processing device
CN103354503A (en) * 2013-05-23 2013-10-16 浙江闪龙科技有限公司 Cloud storage system capable of automatically detecting and replacing failure nodes and method thereof
CN104461792A (en) * 2014-12-03 2015-03-25 浪潮集团有限公司 HA method for clearing single-point failure of NAMENODE of HADOOP distributed file system
CN104812097A (en) * 2015-05-21 2015-07-29 北京深思数盾科技有限公司 Bluetooth equipment and communication method thereof
CN105406980A (en) * 2015-10-19 2016-03-16 浪潮(北京)电子信息产业有限公司 Multi-node backup method and multi-node backup device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897175A (en) * 2017-02-19 2017-06-27 郑州云海信息技术有限公司 Heat replaces the method and device of NC nodes
CN107301104A (en) * 2017-07-17 2017-10-27 郑州云海信息技术有限公司 A kind of device replacing options and device
CN108153648A (en) * 2017-12-27 2018-06-12 西安奇维科技有限公司 A kind of method for the more redundant computers for realizing flexible dispatching
CN115037674A (en) * 2022-05-16 2022-09-09 郑州小鸟信息科技有限公司 Single-machine and multi-equipment redundancy backup method for central control system
CN115037674B (en) * 2022-05-16 2023-08-22 郑州小鸟信息科技有限公司 Single-machine and multi-equipment redundancy backup method for central control system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160727