CN116610430A

CN116610430A - Method for realizing live operation and maintenance of a processor, and server system

Info

Publication number: CN116610430A
Application number: CN202310675990.8A
Authority: CN
Inventors: 黄凯
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2023-06-08
Filing date: 2023-06-08
Publication date: 2023-08-18

Abstract

The invention belongs to the technical field of servers and provides a method for realizing live operation and maintenance of a processor, and a server system. The method comprises the following steps: after the processor generates an unrepairable error and its computing tasks have been offloaded from the processor to the DPU, the processor generates a system management interrupt; after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt, it triggers the FPGA to send an offline request to the processor that generated the fault information; after receiving the response, the FPGA triggers a reset of the processor and simultaneously pulls the power supply signal of the processor low; after the power supply signal of the processor has been pulled low, the BIOS sends a system control interrupt to the operating system and stops the processor's operations on external I/O ports; after the operating system recognizes the system control interrupt sent by the BIOS, it stops interacting with the processor, which completes the disconnection of the processor's hardware link. The processor can thus be replaced without the service noticing, and continuous operation of the service is ensured.

Description

Method for realizing live operation and maintenance of a processor, and server system

Technical Field

The invention relates to the technical field of servers, and in particular to a method for realizing live operation and maintenance of a processor, and a server system.

Background

Customers are considering designs in which several computing nodes are placed in one chassis so that all nodes can share a DPU. DPUs are currently expensive, and because the network computing power of such a smart network card has grown far faster than the computing power of processors, each computing node does not need a DPU of its own. Sharing one DPU among several computing nodes greatly reduces the DPU hardware cost apportioned to each node or processor.

The DPU must be powered as soon as the server is connected to the AC power line, even before the server is started, and it stays powered when the whole machine is shut down. This makes it convenient to run customer business and management functions on the DPU; the DPU can perform network offloading for cloud users and can allocate and manage cloud disks and computing resources. A multi-socket or multi-node server must support independent hot maintenance, so key components on a single node, in particular the processor, need to be maintainable while powered, keeping the business running continuously, and need to be replaceable without the service noticing.

Once multiple processors share a DPU, computing power has to migrate between the DPU and the processors. The technical problem to be solved by the present application is how to design a server that, while ensuring smooth migration, can be serviced while powered, so that a faulty processor can be replaced without interrupting the customer's service.

Disclosure of Invention

The invention addresses the following problem: once multiple processors share a DPU, computing power has to migrate between the DPU and the processors, and the server must be designed so that, while smooth migration is ensured, it can be serviced while powered and a faulty processor can be replaced without interrupting the customer's service.

In a first aspect, the present invention provides a method for implementing live operation and maintenance of a processor, including the following steps:

after the processor generates an unrepairable error and unloads a computing task from the processor to the DPU, the processor generates a system management interrupt;

after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt, triggering the FPGA to send an offline request to the processor that generated the fault information;

after receiving the processor's response to the offline request, the FPGA triggers a reset of the processor and simultaneously pulls the power supply signal of the processor low;

after the power supply signal of the processor has been pulled low, the BIOS sends a system control interrupt to the operating system and stops the processor's operations on external I/O ports;

after the operating system recognizes, through its own driver, the system control interrupt sent by the BIOS, it stops all information interaction and business communication with the processor, completing the disconnection of the processor's hardware link.

When an unrepairable error occurs in the processor, the BIOS collects the information in the processor's internal registers, which record different faults in different formats. After the BIOS has collected the fault information and read the system management interrupt sent by the processor, it sends the information to the baseboard management controller through an IPMI command. The baseboard management controller stores the fault in a log and at the same time triggers the FPGA to take the processor offline.
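The five steps of the first aspect form a strict sequence in which each action waits for the previous one. The following Python sketch models that ordering as a small state machine. It is only an illustration under stated assumptions: the stage names and the `OfflineSequence` class are hypothetical and do not come from the application, and no real firmware interface is modeled.

```python
from enum import Enum, auto


class Stage(Enum):
    """Hypothetical stage names for the offline flow described above."""
    RUNNING = auto()
    TASKS_OFFLOADED = auto()      # computing tasks moved from the processor to the DPU
    SMI_RAISED = auto()           # processor generated a system management interrupt
    OFFLINE_REQUESTED = auto()    # BMC triggered the FPGA to send the offline request
    RESET_AND_POWER_LOW = auto()  # FPGA reset the processor and pulled its power signal low
    SCI_SENT = auto()             # BIOS sent a system control interrupt to the OS
    LINK_DOWN = auto()            # OS stopped all interaction; hardware link disconnected


# Enum members iterate in definition order, which is the required order here.
ORDER = list(Stage)


class OfflineSequence:
    def __init__(self):
        self.stage = Stage.RUNNING
        self.log = []

    def advance(self, target):
        """Enter `target` only if it is the direct successor of the current stage."""
        position = ORDER.index(self.stage)
        if position + 1 >= len(ORDER) or ORDER[position + 1] is not target:
            raise RuntimeError(f"cannot enter {target.name} from {self.stage.name}")
        self.stage = target
        self.log.append(target.name)


def run_offline(seq):
    """Walk the whole sequence in order and return the final stage."""
    for stage in ORDER[1:]:
        seq.advance(stage)
    return seq.stage
```

Rejecting out-of-order transitions mirrors the wording of the steps, each of which begins with "after": for example, the BIOS may not send the system control interrupt before the power supply signal has been pulled low.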

As a further limitation of the present invention, after the processor generates an unrepairable error and the computing task is offloaded from the processor to the DPU, the step of generating a system management interrupt by the processor includes:

after the processor generates an unrepairable error and unloads the computing task from the processor to the DPU, the business layer software informs the processor of the completion of computing migration through an operating system, and the processor generates a system management interrupt.

When the server is running services and the processor has to be replaced because of an unrepairable error, the network, I/O, and storage computing tasks running on the processor are handed over to an application program on the DPU so that the services are not interrupted; the DPU, together with that application program, can complete these tasks efficiently. In other words, the data transmission, analysis, and processing work performed by the virtual machines and software on the server host associated with the processor is transferred to the DPU, and the computing power previously consumed on the host processor is taken over by the DPU. That is, when the processor encounters an unrepairable or fatal error and must go offline, it first works with the business layer software running on the server operating system to offload the running services, especially the network and storage computing tasks, onto the DPU.

As a further limitation of the technical solution of the present invention, the step of triggering the FPGA to send a request for offline to the processor generating the fault information after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt includes:

the BIOS collects information of the internal registers of the processor;

after the BIOS collects the fault information of the processor and reads the system management interrupt sent by the processor, the fault information is sent to the baseboard management controller through an IPMI command;

after the baseboard management controller acquires the fault information of the processor and acquires the system management interrupt from the BIOS, it records the acquired fault information in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.

As a further limitation of the technical solution of the present invention, the step of the baseboard management controller acquiring fault information of the processor and recording the acquired fault information to the log after acquiring system management interrupt from the BIOS, and triggering the FPGA to send an offline request to the processor generating the fault information includes:

the baseboard management controller actively and periodically polls interrupt register information in the FPGA;

the baseboard management controller judges whether the processor has a fatal fault according to the acquired interrupt register information;

If yes, the baseboard management controller acquires the system management interrupt sent by the processor from the BIOS, and acquires the information of the internal register of the processor through the PECI; wherein, the internal register of the processor has different record formats for different faults;

the baseboard management controller records the acquired fault information in a log and triggers the FPGA to send an offline request to the processor that generated the fault information;

if not, after the baseboard management controller acquires the system management interrupt from the BIOS, it records the fault information acquired from the BIOS in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.

The baseboard management controller actively and periodically polls the interrupt register information in the FPGA. When it reads that the processor has a fatal fault and has acquired the system management interrupt sent by the processor from the BIOS, it acquires the information in the processor's internal registers through the PECI bus, records the fault information in a log, and at the same time triggers the FPGA to take the processor offline.
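The branch described above (fatal fault: read the registers over PECI; otherwise: use the fault information reported by the BIOS) can be sketched as follows. This is a minimal illustration, not the applicant's firmware: the `Bmc` class, the one-bit-per-processor register layout, and the `read_peci` callback are assumptions introduced here.

```python
class Bmc:
    """Hypothetical model of one BMC polling pass over the FPGA interrupt register."""

    def __init__(self, read_peci):
        self.read_peci = read_peci   # assumed callback: cpu_id -> register dump
        self.log = []                # fault log kept by the BMC
        self.offline_requests = []   # processors the FPGA was asked to take offline

    def poll(self, cpu_id, interrupt_register, smi_seen, bios_report=None):
        # Nothing happens until the SMI has been obtained from the BIOS.
        if not smi_seen:
            return False
        # Assumed layout: bit n of the register flags a fatal fault on processor n.
        if interrupt_register & (1 << cpu_id):
            source, info = "PECI", self.read_peci(cpu_id)
        else:
            source, info = "BIOS/IPMI", bios_report
        self.log.append((cpu_id, source, info))
        self.offline_requests.append(cpu_id)
        return True
```

In both branches the outcome is the same (log the fault, then request the offline action); only the source of the fault information differs, which is why the two paths share the final two lines.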

As a further limitation of the technical solution of the present invention, after the processor generates the uncorrectable error and completes the migration of the computing power from the processor to the DPU, the step of generating the system management interrupt by the processor further includes:

When a fatal fault occurs in the processor, the fatal fault enable pin of the processor, which is connected to the FPGA, is pulled low; after the FPGA recognizes that the fatal fault enable pin has been pulled low, it modifies the information in its interrupt register.

In other words, the fatal fault enable pin is wired to the FPGA, and the FPGA records the fatal fault in its interrupt register once it sees the pin go low.
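A minimal sketch of this latching behaviour is given below. The register layout (one sticky bit per processor) and the class name are assumptions made for illustration; the application only states that the FPGA modifies the interrupt register when the active-low pin is pulled low.

```python
class FpgaInterruptRegister:
    """Hypothetical interrupt register: bit n is set when processor n reports a fatal fault."""

    def __init__(self):
        self.value = 0

    def sample(self, cpu_index, pin_level):
        # The fatal fault enable pin is active low: level 0 signals a fatal fault.
        if pin_level == 0:
            self.value |= 1 << cpu_index

    def fatal(self, cpu_index):
        return bool(self.value & (1 << cpu_index))

    def clear(self, cpu_index):
        # Assumed to be done once the faulty processor has been dealt with.
        self.value &= ~(1 << cpu_index)
```

Making the bit sticky (it stays set even if the pin later returns high) is a design assumption; it lets a periodically polling reader such as the baseboard management controller observe a fault that occurred between two polls.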

As a further limitation of the technical solution of the present invention, the method further comprises:

the BIOS sends a system control interrupt to the operating system and simultaneously sends the system control interrupt to the baseboard management controller;

after receiving the system control interrupt, the baseboard management controller sends the FPGA an instruction indicating that the disconnection of the processor's hardware link is complete;

the FPGA lights the offline status indicator on the server node board; the lit indicator shows that the processor on that node can be removed or replaced while powered.

Operation and maintenance personnel can then remove the faulty processor from the powered node. The other nodes keep working normally, the key services that were running on the faulty processor have already been offloaded to the DPU, and the services are not interrupted by the processor fault while the operator replaces the processor under power.

As a further limitation of the technical solution of the present invention, the method further comprises:

after the replacement processor has been installed in place of the processor removed under power, BIOS code starts executing and guides the processor into the operating system, causing the processor to generate a system management interrupt;

after receiving the system management interrupt, the BIOS identifies the processor that needs to be brought online and reports the result to the baseboard management controller through an IPMI command;

the baseboard management controller triggers the FPGA to send an online request to the replacement processor;

after receiving the processor's response to the online request, the FPGA asserts the power supply signal of the replacement processor, so that the power supplies of all modules of the replacement processor reach the power supply completion state;

after the FPGA reads the processor's power supply completion information, it sends a reset command to the replacement processor to perform a hardware-level soft reset, which resets the processor;

after the processor completes the soft reset, it re-detects and identifies the memory and starts communicating with it;

after start-up is finished, the BIOS sends a system control interrupt to the operating system, completing the online action of the processor.

As a further limitation of the technical solution of the present invention, the BIOS sends a system control interrupt to the operating system, and after completing the step of the processor on-line action, the method further includes:

After the business layer software on the operating system receives the system control interrupt, the computing power task unloaded into the DPU is reloaded to the newly replaced online processor.
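The online flow, including the final reload of tasks from the DPU, is again a chain in which each action depends on the result of the previous one. The sketch below expresses that chain as a table of (step, required flag, resulting flag). The flag names and the `bring_online` function are hypothetical and serve only to illustrate the dependency order stated above.

```python
# (step description, flag that must already be set, flag the step sets)
STEPS = [
    ("BIOS identifies the processor and informs the BMC via IPMI", "present", "identified"),
    ("BMC triggers the FPGA to send the online request", "identified", "requested"),
    ("FPGA asserts the power supply signal; all modules powered", "requested", "power_done"),
    ("FPGA sends the reset command; processor soft reset", "power_done", "reset_done"),
    ("Processor re-detects memory and starts communicating", "reset_done", "memory_ready"),
    ("BIOS sends the system control interrupt to the OS", "memory_ready", "online"),
    ("Business layer software reloads tasks from the DPU", "online", "tasks_reloaded"),
]


def bring_online(present):
    """Run the steps in order, stopping at the first unmet precondition."""
    flags = {"present": present}
    done = []
    for description, needs, sets in STEPS:
        if not flags.get(needs):
            break
        flags[sets] = True
        done.append(description)
    return flags, done
```

For example, `bring_online(False)` performs no step at all, reflecting that nothing can start before the replacement processor is physically in place.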

In a second aspect, the present invention further provides a server system for implementing live operation and maintenance of a processor, including an expansion board, a management board, and a plurality of nodes;

the expansion board is provided with a DPU, and each node is provided with a processor and an operating system; each processor is connected with a power supply;

the processor of each node is connected to the DPU, so that when a processor generates an unrepairable error the service layer of the operating system can offload the computing tasks from the failed processor to the DPU;

the management board is provided with a baseboard management controller; each processor contains a register for storing the fault information of the processor; the baseboard management controller on the management board is connected to the internal register of each processor;

an interrupt register is arranged in the FPGA, and the baseboard management controller on the management board is connected to the interrupt register in the FPGA; the processor of each node has a fatal fault enable pin, connected to the FPGA, that is used to determine whether the processor has a fatal fault; when the processor has a fatal fault, the fatal fault enable pin is pulled low, and after the FPGA recognizes the fatal fault it modifies the information in the interrupt register;

live operation and maintenance of the processor is realized by the server system executing the method according to the first aspect.

As a further limitation of the technical solution of the invention, the management board is also provided with an IO port expansion selector, which is connected to both the baseboard management controller and the FPGA on the management board;

the baseboard management controller and the FPGA are each connected to the processors through the IO port expansion selector.

The FPGA has a limited number of external interfaces, and the system can take only one processor offline or online at a time. If a second processor fails while a failed processor is still being replaced, the replacement of the first failed processor (its complete offline and online sequence) must be finished before the second one is replaced. The processor signals are therefore connected first to the IO port expansion selector and then to the FPGA or the BMC.
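The one-processor-at-a-time rule amounts to serializing replacements. A minimal sketch, assuming a simple first-come-first-served queue (the application does not specify an ordering policy, and the class and method names are hypothetical):

```python
from collections import deque


class ReplacementScheduler:
    """Hypothetical serializer: only one processor is routed through the selector at a time."""

    def __init__(self):
        self.active = None      # processor currently being taken offline or online
        self.pending = deque()  # failed processors waiting for their turn

    def report_fault(self, cpu_id):
        """Register a failed processor and return the one currently being serviced."""
        if self.active is None:
            self.active = cpu_id
        elif cpu_id != self.active and cpu_id not in self.pending:
            self.pending.append(cpu_id)
        return self.active

    def replacement_online(self, cpu_id):
        """Call when the replacement has finished coming online; returns the next processor."""
        if cpu_id != self.active:
            raise ValueError(f"processor {cpu_id} is not the one being replaced")
        self.active = self.pending.popleft() if self.pending else None
        return self.active
```

A second fault reported during an ongoing replacement is only queued; it becomes active once `replacement_online` is called for the first processor.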

As a further limitation of the technical scheme of the invention, the processor is provided with a PECI bus pin, a processor in-place pin, a processor power supply completion pin, a processor reset pin and a processor on-line and off-line demand pin;

the PECI bus pin of the processor is connected to the baseboard management controller through the IO port expansion selector; when the FPGA detects that the processor has a fatal fault, the baseboard management controller obtains the fatal fault indication from the interrupt register of the FPGA and then reads the fault information from the internal register of the processor through the PECI bus pin;

the processor in-place pin is connected to the FPGA through the IO port expansion selector and is used to identify and determine whether the processor is in place;

the processor power supply completion pin is connected to the FPGA through the IO port expansion selector; the FPGA enables the power supplies of the processor's different modules, and once it recognizes that all power supplies are valid and stable, it informs the processor through this pin that power supply is complete;

the processor reset pin is connected to the FPGA through the IO port expansion selector and is used by the FPGA to reset the processor;

the processor on/off-line demand pin is connected to the FPGA through the IO port expansion selector; it carries the offline demand when the processor encounters a fault and must be taken offline, and the online demand when a replacement processor is to be brought online.
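The pin routing just described can be summarized as a small lookup table. The signal names below are hypothetical labels chosen for this sketch; only the endpoints (BMC or FPGA, always via the IO port expansion selector) are taken from the text.

```python
# Hypothetical signal names; endpoints follow the pin description above.
ROUTING = {
    "PECI": "BMC",
    "CPU_IN_PLACE": "FPGA",
    "CPU_POWER_DONE": "FPGA",
    "CPU_RESET": "FPGA",
    "CPU_ON_OFF_LINE_DEMAND": "FPGA",
}


def path(signal):
    """Connectivity only (not signal direction): every pin passes through the selector."""
    return ["CPU", "IO_PORT_EXPANSION_SELECTOR", ROUTING[signal]]


def signals_for(endpoint):
    """All processor signals that terminate at the given management-board device."""
    return sorted(name for name, dst in ROUTING.items() if dst == endpoint)
```

For instance, `path("PECI")` ends at the BMC, while the four remaining signals end at the FPGA.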

The fatal fault enable pin of the processor is connected to the FPGA through an isolation control circuit and then the IO port expansion selector;

The isolation control circuit comprises a MOS transistor whose drain is connected to the fatal fault enable pin of the processor; that pin is also connected to a power supply through a pull-up resistor. The source of the MOS transistor is connected to the IO port expansion selector, and its gate is connected to the FPGA.

The external fatal-fault isolation logic works as follows. Because the fatal fault enable pin is active low, it is pulled up to a 1.0 V supply through a resistor. The pin is connected through a selection switch to the IO port expansion selector and from there to the FPGA: the drain of the selection switch is connected to the fatal fault enable pin, its source is connected to the FPGA through the IO port expansion selector, and its control terminal (gate) is connected to the FPGA. The FPGA thus controls the external isolation circuit of the pin, closing and opening it according to the replacement state of the processor. The working logic of the FPGA is as follows:

When the processor is working normally and there is no fatal fault, the pin states are as follows:

the fatal fault enable pin is high and the isolation device (here, the selection switch) is closed. When a fatal fault occurs the pin goes low, which means the processor must be taken offline, so the isolation device outside the fatal fault enable pin must then be opened.

When the processor encounters an unrepairable error or a fatal fault and must be replaced, the pin states are as follows:

before the processor is replaced, if the replacement is required because of a fatal fault, the fatal fault enable pin is low, the processor must be taken offline, and the isolation device is opened;

if the replacement is required because of another unrepairable error, the fatal fault enable pin is high, the processor must likewise be taken offline, and the isolation device is opened.

After the processor has been replaced and brought online, the isolation device is closed and the fatal fault enable pin is high. Until the processor has completed the online action, the system ignores the state of its fatal fault enable pin. Note that before the faulty processor is replaced, the software application layer services must be migrated first, and only then is the hardware-level offline operation performed.
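The pin and switch states listed above can be collected into a small truth table. The sketch below is an interpretation under two assumptions that the text does not state explicitly: the FPGA can observe the pin only while the switch is closed, and a fatal fault is first seen through the closed switch before the FPGA opens it. Function and scenario names are hypothetical.

```python
def isolation_switch_closed(cpu_online, offline_required):
    """FPGA gate control: closed only for an online processor with no pending offline."""
    return cpu_online and not offline_required


def fpga_sees_fatal(pin_level, switch_closed):
    """The pin is active low; an open switch hides it from the FPGA (assumption)."""
    return switch_closed and pin_level == 0


SCENARIOS = {
    # name: (pin_level, cpu_online, offline_required)
    "normal operation": (1, True, False),
    "fatal fault just asserted": (0, True, False),
    "fatal fault, offline in progress": (0, True, True),
    "other unrepairable error, offline in progress": (1, True, True),
    "replacement installed, not yet online": (0, False, False),
    "replacement online": (1, True, False),
}


def evaluate(name):
    """Return (switch_closed, fatal_visible_to_fpga) for a named scenario."""
    pin, online, offline_required = SCENARIOS[name]
    closed = isolation_switch_closed(online, offline_required)
    return closed, fpga_sees_fatal(pin, closed)
```

The row "replacement installed, not yet online" shows the purpose of the isolation: even if the pin of a processor that is not yet online reads low, the open switch keeps that level from reaching the FPGA.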

Until the replacement processor is online, abnormal errors or faults triggered by that processor must not cause the whole system to shut down. The fatal fault signal and its logic circuit are therefore isolated until the processor being replaced has come online, so that a fatal fault raised by that processor cannot shut the system down. After the replacement processor has completed its reset, the isolation circuit of the fatal fault enable pin must be enabled: the FPGA drives the control terminal of the fatal-fault isolation circuit, that is, the gate of the selection switch, to close the circuit, so that the fatal fault enable pin of the replaced processor is connected through the isolation circuit again. Once this is done, the firmware sends the system control interrupt to the operating system again, which completes the online action of the processor.

As can be seen from the above technical solution, the invention has the following advantages: through coordinated control of the processor, the BIOS, and the operating system, together with the FPGA, a multi-socket or multi-node server sharing a DPU can have a single node maintained while powered, in particular with the processor replaced without the service noticing, so that continuous operation of the service is ensured.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

It can be seen that the present invention has outstanding substantial features and significant advances over the prior art, as well as its practical advantages.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of a method of one embodiment of the invention.

FIG. 2 is a schematic flow chart of fault determination for a method of one embodiment of the invention.

FIG. 3 is a schematic flow chart of the fatal fault determination in a method of one embodiment of the invention.

FIG. 4 is a schematic flow chart of a live installation of a processor in a method of one embodiment of the invention.

FIG. 5 is a control block diagram of the fatal-fault isolation circuit in an embodiment of the invention.

FIG. 6 is a schematic rear view of a server system provided by the present invention.

FIG. 7 is a schematic front view of a server system provided by the present invention.

FIG. 8 is a schematic top view of a server system provided by the present invention.

FIG. 9 is a schematic diagram of system connection of a heat dissipation design according to an embodiment of the invention.

FIG. 10 is a schematic diagram of the specific signals connecting the nodes and the management board in the system.

Detailed Description

To realize the multi-node shared DPU design, the X16 PCIe lanes of the DPU come from four processors on four nodes, each processor contributing one X4 PCIe link. The FPGA on the management board, the firmware, and the operating system cooperate so that several nodes share one DPU, and so that when the processor on a node goes offline or behaves abnormally, its computing power can be moved to the DPU in time and the processor can be replaced quickly without shutdown and without service interruption.

In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. The described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.

As shown in FIG. 1, an embodiment of the present invention provides a method for implementing live operation and maintenance of a processor, including the following steps:

Step 1: after the processor generates an unrepairable error and the computing tasks have been offloaded from the processor to the DPU, the processor generates a system management interrupt;

Step 2: after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt, it triggers the FPGA to send an offline request to the processor that generated the fault information;

Step 3: after receiving the processor's response to the offline request, the FPGA triggers a reset of the processor and simultaneously pulls the power supply signal of the processor low;

Step 4: after the power supply signal of the processor has been pulled low, the BIOS sends a system control interrupt to the operating system and stops the processor's operations on external I/O ports;

Step 5: after the operating system recognizes, through its own driver, the system control interrupt sent by the BIOS, it stops all information interaction and business communication with the processor, completing the disconnection of the processor's hardware link.

It should be noted that, after the processor generates an unrepairable error and the computing tasks have been offloaded from the processor to the DPU, the business layer software informs the processor through the operating system that the migration is complete, and the processor then generates a system management interrupt. When the server is running services and the processor has to be replaced because of an unrepairable error, the network, I/O, and storage computing tasks running on the processor are handed over to an application program on the DPU so that the services are not interrupted; the DPU, together with that application program, can complete these tasks efficiently. In other words, the data transmission, analysis, and processing work performed by the virtual machines and software on the server host associated with the processor is transferred to the DPU, and the computing power previously consumed on the host processor is taken over by the DPU. That is, when the processor encounters an unrepairable or fatal error and must go offline, it first works with the business layer software running on the server operating system to offload the running services, especially the network and storage computing tasks, onto the DPU. The specific process of offloading computing tasks is not the inventive point of the present application; existing offloading steps are used, so a detailed description is omitted.

In some embodiments, as shown in FIG. 2, the step of triggering the FPGA to send an offline request to the processor that generated the fault information, after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt, includes:

Step 21: the BIOS collects information of the internal registers of the processor;

Step 22: after the BIOS collects the fault information of the processor and reads the system management interrupt sent by the processor, the fault information is sent to the baseboard management controller through an IPMI command;

Step 23: after the baseboard management controller acquires the fault information of the processor and acquires the system management interrupt from the BIOS, it records the acquired fault information in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.

It should be further noted that, when a fatal fault occurs in the processor, the fatal fault enable pin of the processor, which is connected to the FPGA, is pulled low; after the FPGA recognizes that the fatal fault enable pin has been pulled low, it modifies the information in its interrupt register. As shown in FIG. 3, the step in which the baseboard management controller, after acquiring the fault information of the processor and acquiring the system management interrupt from the BIOS, records the acquired fault information in a log and triggers the FPGA to send an offline request to the processor that generated the fault information includes:

Step 231: the baseboard management controller actively and periodically polls the interrupt register information in the FPGA;

Step 232: the baseboard management controller determines, from the acquired interrupt register information, whether the processor has a fatal fault;

if yes, go to step 233; if not, go to step 235;

Step 233: after the baseboard management controller acquires from the BIOS the system management interrupt sent by the processor, it acquires the information in the internal registers of the processor through PECI, where the internal registers of the processor record different faults in different formats;

Step 234: the baseboard management controller records the acquired fault information in a log and triggers the FPGA to send an offline request to the processor that generated the fault information;

Step 235: after the baseboard management controller acquires the system management interrupt from the BIOS, it records the fault information acquired from the BIOS in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.

In some embodiments, the method further comprises:

step 6: the BIOS sends a system control interrupt to the operating system and simultaneously sends the system control interrupt to the baseboard management controller; after receiving the system control interrupt, the baseboard management controller sends an instruction for representing the interrupt operation for completing the hardware link of the processor to the FPGA; the FPGA lights the down state of the server node board card, and the down state light is lighted to indicate that the processor on the node can be removed or replaced in a live mode.

The operation and maintenance personnel can then remove the faulty processor from the powered node. The other nodes continue to work normally, the critical services that were running on the faulty processor have already been offloaded to the DPU, and the service is not interrupted by the processor fault; the operator replaces the processor while the system remains powered. As shown in fig. 4, in some embodiments, the method further comprises:

step 7: after the replacement processor is installed in place of the processor removed under power, BIOS code begins executing and boots the processor into the operating system, causing the processor to generate a system management interrupt;

step 8: after receiving the system management interrupt, the BIOS determines which processor needs to be brought online and reports the result to the baseboard management controller through an IPMI command;

In this step, the BIOS processes and analyzes the system management interrupt after receiving it and determines which processor needs to be brought online. The BIOS performs checks to find which processor is not online, then informs the baseboard management controller through an IPMI command, and the baseboard management controller triggers the FPGA to send an online request to the replaced processor. It should be noted that the replaced processor has only just been installed in the processor slot: the FPGA recognizes that the processor is in place through the processor in-place pin, and the BIOS can recognize which processors of the complete system are currently online and which are offline. The newly installed processor is still offline at this point because the series of online actions has not yet been completed.

Step 9: the baseboard management controller triggers the FPGA to send an online request to the replaced processor;

step 10: after receiving the processor's response to the online request, the FPGA asserts the power supply signals of the replaced processor, so that the power supply of all modules of the replaced processor reaches the power-complete state;

the power supply signals of the processor refer to all of its power supply signals. The voltages required by the different modules of the processor are output by separately designed external voltage regulators (VRs). Before the processor is brought online, the power input to the processor is held low: the FPGA pulls down the enable pin for the processor power supply, while the corresponding VR keeps working. After the processor is replaced, the FPGA pulls the enable pin high to complete the power supply of the processor, and the subsequent online operations then proceed.

Step 11: after the FPGA reads the information that the power supply of the processor is complete, it sends a reset command to the replaced processor to perform a hardware-level soft reset of the processor;

when the power supply of each module of the processor is ready, the FPGA reads the power-complete information of the processor, sends a reset command to the IO port expansion selector, and then performs the hardware-level soft reset on the replaced processor by asserting the processor reset; the reset operation completes after a delay of 30 ms.

Step 12: after the processor completes the soft reset, re-detecting and identifying the memory and starting communication with the memory;

after the processor completes the soft reset, the registers inside the processor can be validated.

Step 13: after startup is complete, the BIOS sends a system control interrupt to the operating system, completing the online action of the processor.

Step 14: after the business layer software on the operating system receives the system control interrupt, the computing tasks that were offloaded to the DPU are reloaded onto the newly installed online processor.
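The ordering of steps 9 to 14 can be sketched as a short Python routine. This is a minimal illustration under stated assumptions: the function name `bring_processor_online`, the callbacks `present` and `power_good`, and the trace strings are hypothetical; only the step order and the 30 ms reset delay come from the description.

```python
import time
from typing import Callable, List

RESET_DELAY_S = 0.030  # the description cites a 30 ms delay before the reset completes


def bring_processor_online(
    cpu: int,
    present: Callable[[int], bool],     # hypothetical: FPGA reads the processor in-place pin
    power_good: Callable[[int], bool],  # hypothetical: FPGA reads the power-complete pin
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return the ordered actions of steps 9-14, or raise if a precondition fails."""
    trace: List[str] = []
    if not present(cpu):
        raise RuntimeError(f"cpu{cpu} is not in place")
    trace.append("bmc: trigger online request")      # step 9
    trace.append("fpga: assert power enable")        # step 10
    if not power_good(cpu):
        raise RuntimeError(f"cpu{cpu} power supply is not complete")
    trace.append("fpga: soft reset")                 # step 11
    sleep(RESET_DELAY_S)                             # wait for the reset to complete
    trace.append("cpu: re-detect memory")            # step 12
    trace.append("bios: SCI to operating system")    # step 13
    trace.append("os: reload tasks from DPU")        # step 14
    return trace
```

Passing `sleep` as a parameter is a design choice of the sketch so that the sequence can be exercised without real hardware timing.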

The embodiment of the application also provides a server system for realizing the electrified operation and maintenance of the processor, which comprises an expansion board, a management board and a plurality of nodes;

the server system performs the method described in the above embodiments to implement the processor live operation.

In some embodiments, the management board is further provided with an IO port expansion selector, and the IO port expansion selector is respectively connected with the baseboard management controller and the FPGA on the management board;

The server system provided in this embodiment has 4 nodes with 1 processor each, 4 processors in total, so every signal received from the processors comes as a group of 4. The FPGA has a limited number of external interfaces, and the complete system can operate on only 1 processor at a time during processor online and offline operations: if a second processor needs replacing while one replacement is in progress, the replacement of the first faulty processor (i.e., its offline and online operations) must be completed before the replacement of the other begins. The signals of the 4 processors are therefore all connected to the IO port expansion selector first and then to the FPGA or the BMC.
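The one-processor-at-a-time constraint can be illustrated with a small Python sketch. The class `HotSwapArbiter` and its methods are hypothetical names for this example; the embodiment enforces the constraint through the IO port expansion selector in hardware and does not describe a software arbiter.

```python
from typing import Optional


class HotSwapArbiter:
    """Allows only one processor to be in an offline/online cycle at a time."""

    def __init__(self, num_processors: int = 4) -> None:
        self.num_processors = num_processors
        # Index of the processor currently routed through the selector, if any.
        self.active: Optional[int] = None

    def begin(self, cpu: int) -> bool:
        """Try to start a replacement; returns False while another one is in progress."""
        if not 0 <= cpu < self.num_processors:
            raise ValueError("no such processor")
        if self.active is not None:
            return False
        self.active = cpu
        return True

    def finish(self, cpu: int) -> None:
        """Mark the offline and online operations of the active processor as complete."""
        if self.active != cpu:
            raise RuntimeError("processor is not the active one")
        self.active = None
```

A second request is simply refused here until the first finishes; queueing refused requests would be an equally plausible policy that the description leaves open.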

The processor is provided with a PECI bus pin, a processor in-place pin, a processor power supply completion pin, a processor reset pin and a processor online/offline request pin;

The logic design for the fatal fault is as follows. The fatal fault enable pin of the processor is pulled up to the 1.0 V supply voltage through a resistor, because the pin is active low. The pin is connected through a selection switch to an isolation circuit (i.e., an isolation device, in this case a MOS transistor) and then to the IO port expansion selector: the drain of the MOS transistor is connected to the fatal fault enable pin, the source of the MOS transistor is connected to the FPGA through the IO port expansion selector, and the control terminal (gate) of the MOS transistor is connected to the FPGA. The FPGA switches the connection between the pin and the external isolation circuit on and off according to the replacement state of the processor. The working logic of the FPGA is as follows:

Until the processor is online, abnormal errors or faults triggered by that processor must not cause the whole system to perform a system shutdown. Likewise, the fatal fault signal and its corresponding logic circuit must be isolated, so that a fatal fault raised by the processor being replaced cannot shut the system down before that processor is online. After the replacement processor finishes the reset operation, the isolation circuit corresponding to the fatal fault enable pin must be enabled: the FPGA sends a signal to the control terminal (gate) of the MOS transistor so that the isolation circuit closes and the fatal fault enable pin of the replaced processor is connected through it. After this operation is finished, the firmware resends the system control interrupt to the operating system, which completes the online action of the processor.
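The isolation behaviour described above can be modelled with a few lines of Python. This is a sketch under assumptions: the class `FatalFaultPath` and its method names are hypothetical, and the exact moment at which the FPGA opens the gate during the offline flow is not stated in the description, so `on_offline` is an assumed trigger.

```python
from dataclasses import dataclass


@dataclass
class FatalFaultPath:
    """Active-low fatal fault pin seen by the FPGA through the MOS isolation switch."""

    gate_on: bool = False  # MOS gate driven by the FPGA; False means the pin is isolated

    def on_reset_complete(self) -> None:
        # Per the description, the path is enabled once the replaced processor finishes reset.
        self.gate_on = True

    def on_offline(self) -> None:
        # Assumption: the FPGA isolates the pin when the processor is taken offline.
        self.gate_on = False

    def fpga_sees_fault(self, pin_is_low: bool) -> bool:
        # A low pin only reaches the FPGA while the isolation switch conducts.
        return self.gate_on and pin_is_low
```

With the gate off, a low level on the pin of a processor that is being replaced never reaches the FPGA, which is the property that keeps such a fault from shutting the system down.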

The following is a design for the management interconnection and upgrade of different computing nodes in high-end servers and multi-node servers, taking 4 computing nodes as an example.

The following multi-node server design based on a domestic processor supports a DPU shared by multiple nodes. The 4 nodes are located at the rear window of the server and are node 0, node 1, node 2 and node 3, as shown in fig. 6. The front window of the server holds the expansion board, which is used to extend storage and PCIE devices as well as the DPU, as shown in fig. 7.

In the embodiment of the invention, there is one DPU, one fan wall, one management board and 4 nodes. The fan wall dissipates heat for the whole system. The management board monitors and manages each node in the system, controls the fan speed, controls the node power supply, and controls live maintenance of the nodes. Each node carries a single processor, a baseboard management controller and a power supply; the baseboard management controller on the node manages the key devices on the node and collects sensor information and the like, as shown in fig. 8.

In the power management part, the baseboard management controller on the management board is connected to the power supply on each node through PMBUS and can obtain information such as temperature, input/output voltage, input/output current and input/output power consumption. It stores the information of the 4 PSUs, and the baseboard management controllers of the 4 nodes read these values through PMBUS, so that the power supplies can be controlled effectively and redundantly. As shown in fig. 9, for fan management in the heat dissipation design, the baseboard management controller on each node obtains the heat dissipation sensor information of that node. The baseboard management controller on the management board is connected to the baseboard management controller on each node through I2C, collects the BMC status and sensor information of each node, evaluates them together, and performs linear, intelligent, partitioned regulation, automatically controlling the fans through PWM according to the fan regulation strategy. When the baseboard management controller on the management board is abnormal or hangs, the FPGA takes over fan speed control, but it then runs the fans at a fixed speed; to keep the system operating normally, this speed is generally set high, for example full speed or 85% of full speed. The processes of live removal and live installation of the processor are as described in the above embodiments.
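The fan takeover rule described above can be expressed as a small Python function. The function name `fan_duty`, its parameters and the clamping to the 0 to 100 range are assumptions made for this sketch; the 85% fallback is one of the two example values given in the description, the other being full speed.

```python
FALLBACK_DUTY = 85  # percent; the description gives 85 % or full speed as examples


def fan_duty(bmc_alive: bool, bmc_requested_duty: int, fallback: int = FALLBACK_DUTY) -> int:
    """Return the PWM duty cycle (percent) that drives the fan wall."""
    if bmc_alive:
        # Normal case: the management-board BMC sets the duty from its regulation strategy.
        return max(0, min(100, bmc_requested_duty))
    # BMC abnormal or hung: the FPGA takes over at a fixed, deliberately high speed.
    return fallback
```

For example, `fan_duty(False, 40)` returns the fixed fallback rather than the last value requested by the BMC, mirroring the fixed-speed takeover in the text.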

Fig. 10 is a schematic diagram of the specific signals connecting the nodes and the management board in the system.

1. PECI_0-3: connected from the processor to the BMC through the IO port selector; when the FPGA detects that a processor has a fatal fault, the BMC can obtain the fatal fault information from the interrupt register of the FPGA, and the BMC then obtains the fault information from the internal registers of the processor through PECI;

2. Processor in-place_0-3: connected from the processor to the FPGA through the IO port selector, and used to identify whether the processor is in place;

3. Processor power supply complete_0-3: connected between the processor and the FPGA through the IO port selector; the FPGA enables the power supplies of the different modules of the processor and, after recognizing that all power supplies are valid and stable, notifies the processor through this pin that the power supply is complete;

4. Processor reset_0-3: connected from the processor to the FPGA through the IO port selector, and used by the FPGA to reset the processor;

5. Processor online/offline request_0-3: when a processor encounters a fault and must be taken offline, or when a replaced processor is to be brought online, the processor sends the online/offline request to the IO port selector, which is connected to the FPGA;

6. Fatal fault enable_0-3: when the processor has a fatal fault, this signal is pulled low; the pin is connected to the IO port selector and then to the FPGA, which uses it to determine that the processor has a fatal fault. A fatal fault is one kind of the unrepairable errors that the processor may encounter, but it has a separate pin with an external logic circuit design; the pin is a general-purpose input/output (IO) pin, which allows it to be enabled externally, after which the processor triggers the corresponding checking and protection strategies.

In addition, high-end multi-way servers (for example 4-way, 8-way, 16-way or 32-way) and multi-node servers (for example 2, 4 or 8 nodes) can be combined with this design to replace key components such as processors and PCIE devices transparently while the system is powered on, which ensures continuous operation of the service and improves the customer experience.

Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Various equivalent modifications and substitutions may be made to the embodiments of the present invention by those skilled in the art without departing from the spirit and scope of the present invention, and it is intended that all such modifications and substitutions fall within the scope of the present invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for implementing live operation of a processor, comprising the steps of:

2. The method for implementing live operation and maintenance of a processor according to claim 1, wherein the step of triggering the FPGA to send an offline request to the processor generating the fault information after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt comprises:

the BIOS collects information of the internal registers of the processor;

3. The method for implementing live operation and maintenance of a processor according to claim 2, wherein the step of the baseboard management controller acquiring fault information of the processor, recording the acquired fault information to a log after acquiring a system management interrupt from the BIOS, and triggering the FPGA to send an offline request to the processor generating the fault information comprises:

4. The method of claim 1, wherein the step in which the processor generates a system management interrupt, after the processor generates the unrepairable error and the computing task has been offloaded from the processor to the DPU, further comprises:

when a fatal fault occurs in the processor, the fatal fault enable pin of the processor connected to the FPGA is pulled low;

after the FPGA detects that the fatal fault enable pin has been pulled low, the information of the interrupt register is modified in the FPGA.

5. The method for implementing live operation and maintenance of a processor according to claim 1, further comprising:

6. The method for implementing live operation and maintenance of a processor according to claim 5, further comprising:

7. The method of claim 6, wherein, after the step in which the BIOS sends a system control interrupt to the operating system to complete the online action of the processor, the method further comprises:

8. The server system for realizing the electrified operation and maintenance of the processor is characterized by comprising an expansion board, a management board and a plurality of nodes;

The server system performs the method of any of claims 1-7 to implement processor live operation.

9. The server system for realizing the electrified operation and maintenance of the processor according to claim 8, wherein the management board is further provided with an IO port expansion selector, and the IO port expansion selector is respectively connected with the baseboard management controller and the FPGA on the management board;

10. The server system for realizing the live operation and maintenance of the processor according to claim 9, wherein the fatal fault enable pin of the processor is connected with the FPGA through the isolation control circuit and the IO port expansion selector in sequence;