CN116610430A - Method and server system for live operation and maintenance of a processor - Google Patents

Info

Publication number
CN116610430A
Authority
CN
China
Prior art keywords
processor
fpga
interrupt
fault
management controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310675990.8A
Other languages
Chinese (zh)
Inventor
黄凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310675990.8A
Publication of CN116610430A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4812 Task transfer initiation or dispatching by interrupt, e.g. masked
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44594 Unloading
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention belongs to the technical field of servers and provides a method for live operation and maintenance of a processor, and a server system. The method comprises the following steps: after the processor encounters an uncorrectable error and its computing tasks have been offloaded from the processor to the DPU, the processor generates a system management interrupt; after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt, it triggers the FPGA to send an offline request to the processor that generated the fault information; after receiving the response, the FPGA triggers a reset of the processor and simultaneously pulls the processor's power supply signal low; after the power supply signal of the processor is pulled low, the BIOS sends a system control interrupt to the operating system and stops the processor's operations on external I/O ports; after the operating system recognizes the system control interrupt sent by the BIOS, it stops interacting with the processor, completing the disconnection of the processor's hardware link. The processor can thus be replaced without the service noticing, and continuous operation of the service is ensured.

Description

Method and server system for live operation and maintenance of a processor
Technical Field
The invention relates to the technical field of servers, in particular to a method for live (powered-on) operation and maintenance of a processor, and a server system.
Background
Customers are considering designs in which several computing nodes are placed in one chassis so that all nodes can share a DPU. DPUs are currently expensive, and because the network computing power that this intelligent network card adds is far higher than the computing power of the processor, each computing node does not need its own DPU. Having several computing nodes share one DPU greatly reduces the DPU hardware cost apportioned to each node or processor.
The DPU must be powered as soon as the server is connected to AC power, even before the server is switched on, and it must stay powered when the whole machine is shut down. This lets customer business and management functions run on the DPU: the DPU can perform network offloading for cloud users and allocate and manage cloud disks and computing resources. A multi-socket or multi-node server must support independent hot maintenance, so key components on a single node, in particular the processor, must be maintainable while powered. To keep the business running continuously, such components, in particular the processor, must be replaceable without the service noticing.
Once several processors share a DPU, computing workloads need to migrate between the DPU and the processors. The technical problem addressed by this application is how to design a server that, while ensuring smooth migration, can be serviced while powered on, so that a faulty processor can be replaced without interrupting the customer's service.
Disclosure of Invention
Once several processors share a DPU, computing workloads need to migrate between the DPU and the processors. The invention addresses the problem of how to design a server that, while ensuring smooth migration, can be serviced while powered on, so that a faulty processor can be replaced without interrupting the customer's service.
In a first aspect, the present invention provides a method for live operation and maintenance of a processor, including the following steps:
after the processor encounters an uncorrectable error and its computing tasks have been offloaded from the processor to the DPU, the processor generates a system management interrupt;
after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt, it triggers the FPGA to send an offline request to the processor that generated the fault information;
after receiving the processor's response to the offline request, the FPGA triggers a reset of the processor and simultaneously pulls the processor's power supply signal low;
after the power supply signal of the processor is pulled low, the BIOS sends a system control interrupt to the operating system and stops the processor's operations on external I/O ports;
after the operating system recognizes, through its own driver, the system control interrupt sent by the BIOS, it stops all information exchange and business communication with the processor, completing the disconnection of the processor's hardware link.
When an uncorrectable error occurs in the processor, the BIOS collects the contents of the processor's internal registers, which record different faults in different formats. After the BIOS has collected the fault information and read the system management interrupt sent by the processor, it sends the information to the baseboard management controller through an IPMI command. The baseboard management controller stores the fault in a log and at the same time triggers the FPGA to take the processor offline.
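The five offline steps above can be read as a strictly ordered hand-off between the processor, the BMC, the FPGA, the BIOS, and the operating system. The following is a minimal sketch of that ordering only; the stage names, the OfflineFlow class, and the run_offline helper are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    RUNNING = auto()
    SMI_RAISED = auto()          # step 1: tasks offloaded, SMI generated
    OFFLINE_REQUESTED = auto()   # step 2: BMC triggered the FPGA request
    POWER_DOWN = auto()          # step 3: reset triggered, power signal low
    SCI_SENT = auto()            # step 4: BIOS sent SCI, I/O port access stopped
    LINK_DISCONNECTED = auto()   # step 5: OS stopped talking to the processor


# Stage order as listed in the method; the names are illustrative only.
ORDER = [Stage.RUNNING, Stage.SMI_RAISED, Stage.OFFLINE_REQUESTED,
         Stage.POWER_DOWN, Stage.SCI_SENT, Stage.LINK_DISCONNECTED]


@dataclass
class OfflineFlow:
    stage: Stage = Stage.RUNNING
    log: list = field(default_factory=list)

    def advance(self, actor: str, event: str) -> Stage:
        """Move to the next stage and record which component acted."""
        idx = ORDER.index(self.stage)
        if idx == len(ORDER) - 1:
            raise RuntimeError("offline flow already complete")
        self.stage = ORDER[idx + 1]
        self.log.append((actor, event, self.stage.name))
        return self.stage


def run_offline(flow: OfflineFlow) -> OfflineFlow:
    """Replay the five steps in the order the method lists them."""
    flow.advance("processor", "tasks offloaded to DPU; SMI generated")
    flow.advance("BMC", "fault identified; FPGA sends offline request")
    flow.advance("FPGA", "response received; reset triggered, power signal low")
    flow.advance("BIOS", "SCI sent to OS; external I/O port access stopped")
    flow.advance("OS", "SCI recognized; interaction with processor stopped")
    return flow
```

Calling run_offline(OfflineFlow()) leaves the flow in the LINK_DISCONNECTED stage with one log entry per step, which makes the required ordering easy to check in a simulation.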
As a further limitation of the technical solution of the present invention, the step in which the processor generates a system management interrupt after it encounters an uncorrectable error and its computing tasks have been offloaded from the processor to the DPU includes:
after the processor encounters an uncorrectable error and its computing tasks have been offloaded from the processor to the DPU, the business layer software informs the processor, through the operating system, that the computing migration is complete, and the processor generates a system management interrupt.
When the server is running services and a processor must be replaced because of an uncorrectable error, the network, I/O, and storage computing tasks running on the processor are handed over to an application on the DPU so that the service is not interrupted. With that application the DPU can also complete these tasks efficiently: the data transmission, analysis, and processing work that the virtual machines on the server host perform together with software is transferred to the DPU, and the DPU takes over the computing power that was being consumed on the server host and the processor. In other words, when the processor encounters an uncorrectable or fatal error and has to go offline, the business layer software running on the server operating system first does the corresponding processing and offloads the running services, especially network and storage workloads, onto the DPU.
As a further limitation of the technical solution of the present invention, the step in which the baseboard management controller, after identifying the fault information of the processor and acquiring the system management interrupt, triggers the FPGA to send an offline request to the processor that generated the fault information includes:
the BIOS collects the contents of the processor's internal registers;
after the BIOS has collected the fault information of the processor and read the system management interrupt sent by the processor, it sends the fault information to the baseboard management controller through an IPMI command;
after the baseboard management controller has acquired the fault information of the processor and the system management interrupt from the BIOS, it records the fault information in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.
As a further limitation of the technical solution of the present invention, the step in which the baseboard management controller, after acquiring the fault information of the processor and the system management interrupt from the BIOS, records the fault information in a log and triggers the FPGA to send an offline request to the processor that generated the fault information includes:
the baseboard management controller actively and periodically polls the interrupt register in the FPGA;
the baseboard management controller judges from the interrupt register contents whether the processor has a fatal fault;
if so, the baseboard management controller acquires from the BIOS the system management interrupt sent by the processor and reads the processor's internal registers through PECI, the internal registers recording different faults in different formats;
the baseboard management controller records the acquired fault information in a log and triggers the FPGA to send an offline request to the processor that generated the fault information;
if not, after the baseboard management controller acquires the system management interrupt from the BIOS, it records the fault information obtained from the BIOS in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.
The baseboard management controller actively and periodically polls the interrupt register in the FPGA. When it reads that the processor has a fatal fault and has acquired from the BIOS the system management interrupt sent by the processor, it reads the processor's internal registers over the PECI bus, records the fault information in a log, and at the same time triggers the FPGA to take the processor offline.
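The two branches of the polling step differ only in where the fault details come from: over PECI for a fatal fault, or from the BIOS (delivered by IPMI) otherwise. A minimal sketch of one polling cycle follows. The FATAL_BIT position, the handle_poll function, and the rule that nothing happens until the system management interrupt has been acquired are assumptions made for illustration.

```python
FATAL_BIT = 0x01  # hypothetical bit in the FPGA interrupt register


def handle_poll(interrupt_reg, smi_received, read_peci, bios_fault_info):
    """Decide what the BMC does in one polling cycle.

    interrupt_reg   -- value read from the FPGA interrupt register
    smi_received    -- True once the SMI has been acquired from the BIOS
    read_peci       -- callable that returns processor register contents
    bios_fault_info -- fault information the BIOS delivered via IPMI

    Returns (log_entry, trigger_offline).
    """
    if not smi_received:
        # Assumption: without the SMI the BMC takes no action yet.
        return None, False
    if interrupt_reg & FATAL_BIT:
        # Fatal fault: read the processor's internal registers over PECI.
        entry = {"source": "PECI", "info": read_peci()}
    else:
        # Other uncorrectable error: use what the BIOS reported.
        entry = {"source": "BIOS/IPMI", "info": bios_fault_info}
    # In both branches the fault is logged and the FPGA is triggered.
    return entry, True
```

In a real BMC this function would be called from a periodic timer; here the register value and the PECI reader are passed in so the decision can be exercised without hardware.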
As a further limitation of the technical solution of the present invention, the step in which the processor generates a system management interrupt after it encounters an uncorrectable error and the computing power has been migrated from the processor to the DPU further includes:
when a fatal fault occurs in the processor, the fatal fault enable pin of the processor, which is connected to the FPGA, is pulled low; after the FPGA recognizes that the fatal fault enable pin has been pulled low, it modifies the interrupt register inside the FPGA.
That is, the fatal fault enable pin is wired to the FPGA and is pulled low when the processor has a fatal fault; once the FPGA recognizes the fatal fault, it updates its interrupt register.
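The pin-to-register behaviour can be sketched as a small latch: an active-low pin sample sets a bit that the BMC later reads by polling. The one-bit-per-processor layout, the latching behaviour, and the clear method below are assumptions for illustration; the patent only states that the FPGA modifies its interrupt register.

```python
class FpgaInterruptRegister:
    """Toy model of the FPGA interrupt register (layout is hypothetical)."""

    def __init__(self, n_processors: int):
        self.n = n_processors
        self.value = 0

    def sample(self, pin_levels):
        """Latch a bit for every fatal fault enable pin that reads low (0)."""
        if len(pin_levels) != self.n:
            raise ValueError("one pin level per processor is required")
        for index, level in enumerate(pin_levels):
            if level == 0:              # the pin is active low
                self.value |= 1 << index
        return self.value

    def clear(self, index: int) -> int:
        """Clear one processor's bit, e.g. after it has been replaced."""
        self.value &= ~(1 << index)
        return self.value
```

Latching means a fault stays visible to the polling BMC even if the pin level changes before the next poll.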
As a further limitation of the technical solution of the present invention, the method further comprises:
the BIOS sends the system control interrupt to the baseboard management controller at the same time as it sends it to the operating system;
after receiving the system control interrupt, the baseboard management controller sends the FPGA an instruction indicating that the disconnection of the processor's hardware link is complete;
the FPGA lights the offline status indicator of the server node board; a lit offline status indicator shows that the processor on that node can be removed or replaced while powered.
Operation and maintenance personnel can then remove the faulty processor from the powered node. The other nodes keep working normally, and the critical services that were running on the faulty processor have already been offloaded to the DPU, so the service is not interrupted by the processor fault while the operator replaces the processor under power.
As a further limitation of the technical solution of the present invention, the method further comprises:
after the replacement processor has been installed in the powered node, the BIOS code starts executing and boots the processor into the operating system, causing the processor to generate a system management interrupt;
after receiving the system management interrupt, the BIOS identifies the processor that needs to be brought online and reports the result to the baseboard management controller through an IPMI command;
the baseboard management controller triggers the FPGA to send an online request to the replacement processor;
after receiving the processor's response to the online request, the FPGA asserts the power supply signal of the replacement processor so that the power supplies of all of its modules reach the power-good state;
after the FPGA reads the power-good information of the processor, it sends a reset command to the replacement processor to perform a hardware-level soft reset, so that the processor is reset;
after the processor completes the soft reset, it re-detects and identifies the memory and starts communicating with it;
after startup is complete, the BIOS sends a system control interrupt to the operating system, completing the processor's online action.
As a further limitation of the technical solution of the present invention, after the step in which the BIOS sends a system control interrupt to the operating system to complete the processor's online action, the method further includes:
after the business layer software on the operating system receives the system control interrupt, the computing tasks that were offloaded to the DPU are reloaded onto the newly replaced, online processor.
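The online sequence above has one clear ordering constraint: the FPGA sends the reset command only after it has read power-good. The sketch below replays the steps and stops early if power never becomes good. The ReplacementCpu class, the bring_online function, and the rails_stable parameter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReplacementCpu:
    installed: bool = True
    power_good: bool = False
    reset_done: bool = False
    memory_ready: bool = False
    online: bool = False
    events: list = field(default_factory=list)


def bring_online(cpu: ReplacementCpu, rails_stable: bool = True) -> bool:
    """Walk the online steps in order; stop if power never becomes good."""
    if not cpu.installed:
        return False
    cpu.events.append("BIOS: SMI received, processor identified, BMC told via IPMI")
    cpu.events.append("BMC: FPGA triggered to send online request")
    cpu.events.append("FPGA: response received, power supply signal asserted")
    cpu.power_good = rails_stable
    if not cpu.power_good:
        return False  # the reset command is only sent after power-good
    cpu.events.append("FPGA: power-good read, reset command sent")
    cpu.reset_done = True
    cpu.events.append("CPU: memory re-detected, communication started")
    cpu.memory_ready = True
    cpu.events.append("BIOS: SCI sent to operating system")
    cpu.online = True
    cpu.events.append("business layer: tasks reloaded from DPU")
    return True
```

With rails_stable=False the function returns False after three events and leaves reset_done unset, which mirrors the requirement that reset follows power-good.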
In a second aspect, the present invention further provides a server system for live operation and maintenance of a processor, including an expansion board, a management board, and a plurality of nodes;
the expansion board is provided with a DPU, and each node is provided with a processor and an operating system; each processor is connected to a power supply;
the processor of each node is connected to the DPU, so that when a processor encounters an uncorrectable error the business layer of the operating system offloads the computing tasks from the failed processor to the DPU;
the management board is provided with a baseboard management controller; each processor contains registers for storing the fault information of the processor; the baseboard management controller on the management board is connected to the internal registers of the processors;
an interrupt register is arranged in the FPGA, and the baseboard management controller on the management board is connected to the interrupt register in the FPGA; the processor of each node has a fatal fault enable pin connected to the FPGA for detecting fatal faults of the processor; when the processor has a fatal fault, the fatal fault enable pin connected to the FPGA is pulled low, and after the FPGA recognizes the fatal fault it modifies the interrupt register;
live operation and maintenance of the processor is realized by the server system executing the method according to the first aspect.
As a further limitation of the technical solution of the present invention, the management board is also provided with an IO port expansion selector, which is connected to both the baseboard management controller and the FPGA on the management board;
the baseboard management controller and the FPGA are each connected to the processors through the IO port expansion selector.
Because the FPGA has a limited number of external interfaces, and because the system can take only one processor offline or online at a time, the processor signals are connected first to the IO port expansion selector and then to the FPGA or the BMC. If a second processor fails while a failed processor is being replaced, the replacement of the first (that is, its offline and online sequence) is completed before the second processor is replaced.
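The one-processor-at-a-time rule amounts to a first-come, first-served queue in front of the IO port expansion selector. The ReplacementScheduler class below is a hypothetical sketch of that policy, not a description of the selector hardware.

```python
from collections import deque


class ReplacementScheduler:
    """Allow one processor replacement at a time; queue later faults."""

    def __init__(self):
        self.active = None        # processor currently routed by the selector
        self.waiting = deque()    # processors that failed in the meantime

    def report_fault(self, cpu_id):
        """Register a failed processor; return the one being serviced now."""
        if self.active is None:
            self.active = cpu_id
        elif cpu_id != self.active and cpu_id not in self.waiting:
            self.waiting.append(cpu_id)
        return self.active

    def replacement_done(self):
        """Call after the offline and online sequence of the active one."""
        self.active = self.waiting.popleft() if self.waiting else None
        return self.active
```

For example, if processor 2 fails and processor 0 fails during its replacement, report_fault(0) still returns 2, and replacement_done() then hands the selector to processor 0.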
As a further limitation of the technical solution of the present invention, the processor is provided with a PECI bus pin, a processor presence pin, a processor power-good pin, a processor reset pin, and a processor online/offline request pin;
the PECI bus pin of the processor is connected to the baseboard management controller through the IO port expansion selector; when the FPGA detects that the processor has a fatal fault, the baseboard management controller acquires the fatal fault information from the interrupt register of the FPGA and then reads the fault information from the processor's internal registers through the PECI bus pin;
the processor presence pin is connected to the FPGA through the IO port expansion selector and is used to detect whether the processor is in place;
the processor power-good pin is connected to the FPGA through the IO port expansion selector; the FPGA enables the power supplies of the processor's modules and, once it recognizes that all supplies are valid and stable, informs the processor through this pin that power-up is complete;
the processor reset pin is connected to the FPGA through the IO port expansion selector and is used by the FPGA to reset the processor;
the processor online/offline request pin is connected to the FPGA through the IO port expansion selector; it carries the offline request when the processor encounters a fault and must be taken offline, and the online request when a replacement processor is to be brought online.
The fatal fault enable pin of the processor is connected to the FPGA through an isolation control circuit and then the IO port expansion selector;
the isolation control circuit comprises a MOS transistor whose drain is connected to the fatal fault enable pin of the processor, the fatal fault enable pin also being connected to a power supply through a pull-up resistor; the source of the MOS transistor is connected to the IO port expansion selector, and its gate is connected to the FPGA.
The external fatal-fault isolation logic works as follows. The fatal fault enable pin of the processor is pulled up to the 1.0 V supply through a resistor, because the pin is active low. The pin is connected through a selection switch to the IO port expansion selector and then to the FPGA: the drain of the switch is connected to the fatal fault enable pin, its source is connected to the FPGA through the IO port expansion selector, and its control terminal (gate) is connected to the FPGA. The FPGA uses the gate to open and close the external isolation circuit according to the replacement state of the processor. The working logic of the FPGA is as follows:
When the processor is working normally and no fatal fault exists, the pin state is as follows:
the fatal fault enable pin is high and the isolation device (here, the selection switch) is closed. When the fatal fault enable pin goes low, the processor must be taken offline, so the isolation device outside the fatal fault enable pin must be opened.
When the processor encounters an uncorrectable error or a fatal fault and must be replaced, the pin state is as follows:
before the processor is replaced, if the replacement is needed because of a fatal fault, the fatal fault enable pin is low, the processor must be taken offline, and the isolation device is opened;
if the replacement is needed because of another uncorrectable error, the fatal fault enable pin is high, the processor must likewise be taken offline, and the isolation device is opened.
After the processor is replaced, the isolation device is closed, the processor is online, and the fatal fault enable pin is high. Until the processor has completed its online action, the system ignores the state of its fatal fault enable pin. Note that before a faulty processor is replaced, the software application layer services must first be migrated, and only then is the hardware-level offline operation performed.
Until the processor is online, abnormal errors or faults triggered by it must not cause the whole system to shut down. For the same reason the fatal fault signal and its logic circuit are isolated, so that a fatal fault raised by the processor being replaced cannot shut the system down before that processor is online. After the replacement processor completes its reset, the isolation circuit for the fatal fault enable pin must be enabled: the FPGA drives the control terminal of the fatal fault isolation circuit, that is, the gate of the selection switch, to close the circuit and connect the fatal fault enable pin of the replacement processor. When this is done, the firmware again sends a system control interrupt to the operating system, which completes the processor's online action.
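The isolation logic reduces to two rules: the switch is closed only while the processor is in normal online operation, and the FPGA can observe a fatal fault only through a closed switch. A minimal sketch follows; the CpuState names and both helper functions are hypothetical, and the three-state split is an assumption drawn from the cases listed above.

```python
from enum import Enum


class CpuState(Enum):
    NORMAL = "normal"                    # online, switch closed
    GOING_OFFLINE = "going_offline"      # fault handled, switch opened
    AWAITING_ONLINE = "awaiting_online"  # replaced but not yet online


def switch_closed(state: CpuState) -> bool:
    """The FPGA keeps the isolation switch closed only in normal operation."""
    return state is CpuState.NORMAL


def fpga_sees_fatal(pin_level: int, state: CpuState) -> bool:
    """The active-low pin is visible to the FPGA only through a closed switch."""
    return switch_closed(state) and pin_level == 0
```

Under these rules a low pin during normal operation is reported, while the same level on a processor that is going offline or has not finished coming online is ignored and cannot trigger a system shutdown.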
From the above technical solution, the invention has the following advantages: for a multi-socket or multi-node server sharing a DPU, coordinated control of the processor, the BIOS, and the operating system, together with the FPGA, enables live maintenance of a single node, in particular replacement of the processor without the service noticing, and ensures continuous operation of the service.
In addition, the invention has a reliable design principle, a simple structure, and wide application prospects.
It can be seen that, compared with the prior art, the present invention has prominent substantive features and represents notable progress, and the benefits of its implementation are also evident.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a method of one embodiment of the invention.
FIG. 2 is a schematic flow chart of fault determination for a method of one embodiment of the invention.
Fig. 3 is a schematic flow chart of a fatal fault judgment of a method of one embodiment of the present invention.
FIG. 4 is a schematic flow chart of a live installation of a processor in a method of one embodiment of the invention.
Fig. 5 is a control block diagram of the fatal-fault isolation circuit in an embodiment of the invention.
Fig. 6 is a schematic rear view of a server system provided by the present invention.
Fig. 7 is a schematic front view of a server system provided by the present invention.
Fig. 8 is a schematic top view of a server system provided by the present invention.
Fig. 9 is a schematic diagram of system connection of a heat dissipation design according to an embodiment of the invention.
Fig. 10 is a schematic diagram of specific signals of connection between nodes and a management board in the system.
Detailed Description
To let several nodes share one DPU, the DPU's X16 PCIe lanes come from the 4 processors on 4 nodes, each processor contributing one X4 PCIe link. Together with the FPGA on the management board, the firmware, and the operating system, this allows several nodes to share one DPU, so that when the processor on a node goes offline or behaves abnormally its computing workload can be moved to the DPU in time and the processor can be replaced quickly without shutdown or service interruption.
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. The described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art from the embodiments of the present invention without inventive effort fall within the scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides a method for live operation and maintenance of a processor, including the following steps:
step 1: after the processor encounters an uncorrectable error and its computing tasks have been offloaded from the processor to the DPU, the processor generates a system management interrupt;
step 2: after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt, it triggers the FPGA to send an offline request to the processor that generated the fault information;
step 3: after receiving the processor's response to the offline request, the FPGA triggers a reset of the processor and simultaneously pulls the processor's power supply signal low;
step 4: after the power supply signal of the processor is pulled low, the BIOS sends a system control interrupt to the operating system and stops the processor's operations on external I/O ports;
step 5: after the operating system recognizes, through its own driver, the system control interrupt sent by the BIOS, it stops all information exchange and business communication with the processor, completing the disconnection of the processor's hardware link.
When an uncorrectable error occurs in the processor, the BIOS collects the contents of the processor's internal registers, which record different faults in different formats. After the BIOS has collected the fault information and read the system management interrupt sent by the processor, it sends the information to the baseboard management controller through an IPMI command. The baseboard management controller stores the fault in a log and at the same time triggers the FPGA to take the processor offline.
It should be noted that, after the processor generates an unrepairable error and the computing task is unloaded from the processor to the DPU, the service layer software informs the processor that the computing migration is completed through the operating system, and the processor generates a system management interrupt. When the server runs the service and the processor is required to be replaced by an unrepairable error, in order to avoid the influence of service interruption, network computing power, I/O computing power and storage computing power tasks which are performed on the processor are released to an application program on the DPU, and the DPU can also efficiently complete the tasks by combining the application program, namely, all virtual machines are established on a server host matched with the processor to combine software to perform data transmission, analysis and processing work, the operation is transferred to the DPU to run, and computing power which is consumed on the server host and is stored on the processor is taken over to the DPU; that is, when the processor encounters an unrepairable error or a fatal error and needs to do offline operation, the processor first combines the business layer software running on the server operating system to do corresponding processing, and the running business, especially the network computing power and the storage computing power, is unloaded onto the DPU. The specific process of unloading the specific computing task is not the innovative point of the application, and the existing computing task unloading steps are adopted, so that the detailed description is omitted.
In some embodiments, as shown in fig. 2, the step in which the baseboard management controller, after identifying the fault information of the processor and acquiring the system management interrupt, triggers the FPGA to send an offline request to the processor that generated the fault information includes:
step 21: the BIOS collects the information in the internal registers of the processor;
step 22: after the BIOS has collected the fault information of the processor and read the system management interrupt sent by the processor, it sends the fault information to the baseboard management controller through an IPMI command;
step 23: after the baseboard management controller has acquired the fault information of the processor and acquired the system management interrupt from the BIOS, it records the acquired fault information in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.
It should be further noted that, when a fatal fault occurs in the processor, the fatal fault enable pin of the processor, which is connected to the FPGA, is pulled low; after the FPGA recognizes that the fatal fault enable pin has been pulled low, it modifies the information in its interrupt register. As shown in fig. 3, the step in which the baseboard management controller, after acquiring the fault information of the processor and acquiring the system management interrupt from the BIOS, records the acquired fault information in a log and triggers the FPGA to send an offline request to the processor that generated the fault information includes:
step 231: the baseboard management controller actively and periodically polls the interrupt register information in the FPGA;
step 232: the baseboard management controller judges, according to the acquired interrupt register information, whether a fatal fault has occurred in the processor;
if yes, go to step 233; if not, go to step 235;
step 233: after acquiring from the BIOS the system management interrupt sent by the processor, the baseboard management controller acquires the information in the internal registers of the processor through PECI, wherein the internal registers of the processor record different faults in different formats;
step 234: the baseboard management controller records the acquired fault information in a log and triggers the FPGA to send an offline request to the processor that generated the fault information;
step 235: after acquiring the system management interrupt from the BIOS, the baseboard management controller records the fault information acquired from the BIOS in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.
The baseboard management controller actively and periodically polls the interrupt register in the FPGA. When it reads that a fatal fault has occurred in the processor and has acquired, from the BIOS, the system management interrupt sent by the processor, it acquires the contents of the processor's internal registers over the PECI bus, records the fault information in a log, and at the same time triggers the FPGA to take the processor offline.
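The branch in steps 231 to 235 can be sketched as a single decision function. This is a minimal illustration under stated assumptions: the bit position `FATAL_FAULT_BIT`, the function name `handle_fault`, and the dictionary log format are hypothetical, and `read_peci` stands in for whatever PECI access the baseboard management controller actually uses.

```python
FATAL_FAULT_BIT = 0x01  # assumed bit in the FPGA interrupt register


def handle_fault(interrupt_reg, smi_from_bios, bios_fault_info, read_peci):
    """Return (log_entry, send_offline_request) for one polling pass.

    interrupt_reg   -- value polled from the FPGA interrupt register
    smi_from_bios   -- True once the BIOS has forwarded the SMI
    bios_fault_info -- fault record the BIOS sent over IPMI
    read_peci       -- callable that reads the CPU registers over PECI
    """
    if not smi_from_bios:
        # Both branches wait for the SMI before acting.
        return None, False
    if interrupt_reg & FATAL_FAULT_BIT:
        # Steps 233-234: fatal fault, read the registers directly.
        entry = {"source": "PECI", "info": read_peci()}
    else:
        # Step 235: other unrepairable error, use the BIOS record.
        entry = {"source": "BIOS", "info": bios_fault_info}
    return entry, True
```

In both branches the fault is logged and the offline request is triggered; the only difference is where the fault record comes from.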
In some embodiments, the method further comprises:
step 6: while sending the system control interrupt to the operating system, the BIOS also sends it to the baseboard management controller; after receiving the system control interrupt, the baseboard management controller sends the FPGA an instruction indicating that the disconnection of the processor's hardware link is complete; the FPGA lights the offline status indicator on the server node board, and the lit indicator shows that the processor on that node can be removed or replaced live.
At this point, operation and maintenance personnel can remove the faulty processor from the powered node. The other nodes continue to work normally, the critical services that were running on the faulty processor have been offloaded to the DPU, service is not interrupted by the processor fault, and the operator replaces the processor while the system remains powered. As shown in fig. 4, in some embodiments, the method further comprises:
step 7: after the processor that was removed or replaced live has been installed, execution of the BIOS code starts and the processor is guided into the operating system, so that the processor generates a system management interrupt;
step 8: after receiving the system management interrupt, the BIOS determines which processor needs to be brought online and reports the result to the baseboard management controller through an IPMI command;
In this step, after receiving the system management interrupt the BIOS processes and analyzes it to determine which processor needs to be brought online. The BIOS runs checks to find which processor is not online and then informs the baseboard management controller through an IPMI command, and the baseboard management controller triggers the FPGA to send an online request to the replaced processor. It should be noted that, once the replacement processor has been installed in the processor socket, the FPGA recognizes through the processor present pin that the processor is in place. The BIOS can identify which processors in the whole system are currently online and which are not; the newly installed processor is not yet online because its series of online actions has not been completed.
step 9: the baseboard management controller triggers the FPGA to send an online request to the replaced processor;
step 10: after receiving the processor's response to the online request, the FPGA asserts the power supply signals of the replaced processor so that the power supplies of all its modules reach the power supply completion state;
The power supply signals of the processor are all of its power supply signals. The voltages required by the different modules of the processor are output by separately designed external voltage regulators (VRs). Before the processor is replaced, the power input to the processor is pulled low: the FPGA pulls the enable pin of the processor power supply low, while the corresponding voltage regulators keep working. After the processor has been replaced, the FPGA pulls the enable pin high again to complete power delivery to the processor, after which the subsequent online operations are carried out.
step 11: after the FPGA reads the power supply completion information of the processor, it sends a reset command to the replaced processor to perform a hardware-level soft reset of the processor;
When the power supply of each module of the processor is ready, the FPGA reads the power supply completion information of the processor and sends a reset command to the IO port expansion selector, which applies the hardware-level soft reset to the replaced processor; the reset is asserted and, after a delay of 30 ms, the reset operation is complete.
step 12: after the processor completes the soft reset, it re-detects and identifies the memory and starts communicating with the memory;
After the processor completes the soft reset, the registers inside the processor can take effect.
step 13: after startup is complete, the BIOS sends a system control interrupt to the operating system, completing the online action of the processor.
step 14: after the service-layer software on the operating system receives the system control interrupt, the computing tasks that were offloaded to the DPU are reloaded onto the newly replaced, online processor.
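The hardware-facing part of the online flow (steps 9 to 14) can be sketched as follows. This is a simplified model, not firmware: the `ReplacedProcessor` class and the `bring_online` function are hypothetical names, the power supplies are assumed to become valid immediately, and the 30 ms delay is the only timing value taken from the description.

```python
import time
from dataclasses import dataclass

RESET_DELAY_S = 0.030  # the 30 ms reset delay stated in the description


@dataclass
class ReplacedProcessor:
    present: bool               # level of the processor present pin
    power_enabled: bool = False
    reset_done: bool = False
    online: bool = False


def bring_online(cpu, sleep=time.sleep):
    """Run steps 9-14 in order and return a log of the actions taken."""
    if not cpu.present:
        raise RuntimeError("no processor detected on the present pin")
    log = ["step 9: FPGA sends online request"]
    cpu.power_enabled = True    # step 10: FPGA asserts the power enable
    log.append("step 10: power supplies valid")
    sleep(RESET_DELAY_S)        # step 11: wait out the stated reset delay
    cpu.reset_done = True
    log.append("step 11: soft reset complete")
    log.append("step 12: memory re-detected")
    cpu.online = True           # step 13: BIOS sends the SCI to the OS
    log.append("step 13: SCI sent, processor online")
    log.append("step 14: tasks reloaded from the DPU")
    return log
```

Passing `sleep` as a parameter keeps the delay replaceable, for example with `lambda s: None` in a simulation.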
An embodiment of the application also provides a server system for realizing live operation and maintenance of a processor, comprising an expansion board, a management board, and a plurality of nodes;
the expansion board is provided with a DPU, and each node is provided with a processor and an operating system; each processor is connected to a power supply;
the processor of each node is connected to the DPU, so that when an unrepairable error occurs in a processor, the service layer of the operating system offloads the computing tasks from the failed processor to the DPU;
the management board is provided with a baseboard management controller; each processor contains registers for storing its fault information; the baseboard management controller on the management board is connected to the internal registers of the processor;
an interrupt register is arranged in the FPGA, and the baseboard management controller on the management board is connected to the interrupt register in the FPGA; the processor of each node has a fatal fault enable pin connected to the FPGA for determining fatal faults of the processor; when a fatal fault occurs in the processor, the fatal fault enable pin is pulled low, and after the FPGA recognizes the fatal fault it modifies the information in the interrupt register;
the server system performs the method described in the above embodiments to implement live operation and maintenance of the processor.
In some embodiments, the management board is further provided with an IO port expansion selector, which is connected to both the baseboard management controller and the FPGA on the management board;
the baseboard management controller and the FPGA are each connected to the processors through the IO port expansion selector.
The server system provided in this embodiment has 4 nodes with 1 processor each, 4 processors in total, so every signal received from the processors comes as 4 groups. Because the FPGA has a limited number of external interfaces, and because the whole system can operate on only 1 processor at a time when performing processor online and offline operations, the replacement of one faulty processor (that is, its offline and online operations) must be completed before the replacement of another processor begins. The signals of the 4 processors are therefore all connected first to the IO port expansion selector and then to the FPGA or the BMC.
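The one-processor-at-a-time constraint can be sketched as a small guard object. The class and method names below are hypothetical; the sketch only shows that a second replacement is refused until the first has finished.

```python
class IoPortSelector:
    """Routes one processor's signal group to the FPGA/BMC at a time."""

    def __init__(self, node_count=4):
        self.node_count = node_count
        self.active_node = None  # node whose replacement is in progress

    def begin_replacement(self, node):
        """Select a node; refuse if another replacement is still running."""
        if not 0 <= node < self.node_count:
            raise ValueError(f"no such node: {node}")
        if self.active_node is not None:
            raise RuntimeError(
                f"replacement on node {self.active_node} still in progress"
            )
        self.active_node = node

    def finish_replacement(self, node):
        """Release the selector once offline and online are both complete."""
        if self.active_node != node:
            raise RuntimeError(f"node {node} is not the active node")
        self.active_node = None
```

For example, after `begin_replacement(1)` a call to `begin_replacement(2)` raises until `finish_replacement(1)` has been called.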
The processor is provided with a PECI bus pin, a processor present pin, a processor power supply completion pin, a processor reset pin, and a processor online/offline request pin;
the PECI bus pin of the processor is connected to the baseboard management controller through the IO port expansion selector; when the FPGA detects that a fatal fault has occurred in the processor, the baseboard management controller acquires the fatal fault information from the interrupt register of the FPGA and then acquires the fault information from the internal registers of the processor through the PECI bus pin;
the processor present pin is connected to the FPGA through the IO port expansion selector and is used to detect whether the processor is in place;
the processor power supply completion pin is connected to the FPGA through the IO port expansion selector; the FPGA enables the power supplies of the different modules of the processor and, after recognizing that all power supplies are valid and stable, informs the processor through this pin that power delivery is complete;
the processor reset pin is connected to the FPGA through the IO port expansion selector and is used by the FPGA to reset the processor;
the processor online/offline request pin is connected to the FPGA through the IO port expansion selector; when the processor encounters a fault and must go offline, or when a replaced processor is to be brought online, the processor sends the online or offline request through this pin.
The fatal fault logic is designed as follows. The fatal fault enable pin of the processor is pulled up to a 1.0 V supply through a resistor, because the pin is active low. The pin is connected to the IO port expansion selector through an isolation circuit that acts as a selection switch (the isolation device, in this case a MOS transistor): the drain of the MOS transistor is connected to the fatal fault enable pin, the source is connected to the FPGA through the IO port expansion selector, and the control terminal (gate) is connected to the FPGA. The FPGA turns the connection between the pin and the external isolation circuit on and off according to the replacement status of the processor. The working logic of the FPGA is as follows:
When the processor works normally and no fatal fault exists, the state of the pin is as follows:
the fatal fault enable pin is high and the isolation device (here, the selection switch) is closed; when the fatal fault enable pin goes low, the processor must be taken offline, so the isolation device outside the fatal fault enable pin must then be opened.
When the processor encounters an unrepairable error or a fatal fault and must be replaced, the state of the fatal fault enable pin is as follows:
before the processor is replaced, if the replacement is due to a fatal fault, the fatal fault enable pin is low, the processor must be taken offline, and the isolation device is opened;
if the replacement is due to another unrepairable error, the fatal fault enable pin is high, the processor must still be taken offline, and the isolation device is opened.
After the processor is replaced and brought online, the isolation device is closed and the fatal fault enable pin is high; until the processor has completed the online action, the system ignores the state of its fatal fault enable pin. It should be noted that, before a faulty processor is replaced, the application-layer services must be migrated first, and only then is the hardware-level offline operation performed.
Before the replacement processor is online, abnormal errors or faults triggered by that processor must not cause the whole system to shut down. Likewise, the fatal fault signal and its logic circuit must be isolated, so that until the replacement processor is online a fatal fault raised by it cannot shut the system down. After the replacement processor completes the reset operation, the isolation circuit for its fatal fault enable pin must be enabled: the FPGA sends a signal to the control terminal (gate) of the MOS transistor so that the isolation circuit closes and the fatal fault enable pin of the replaced processor is connected through it. After this operation, the firmware resends the system control interrupt to the operating system, which completes the online action of the processor.
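The isolation logic described above reduces to a small truth table. The sketch below is one reading of it; the phase labels and function names are hypothetical, and the model assumes the FPGA can observe the active-low pin only while the isolation device is closed.

```python
PHASES = ("normal", "offline", "replaced_not_online", "online")


def isolator_closed(phase):
    """Whether the FPGA drives the MOS gate so the isolation circuit conducts.

    The device is closed in normal operation and again once the replaced
    processor is online; it is open while the processor is offline or not
    yet brought online.
    """
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    return phase in ("normal", "online")


def fpga_sees_fatal(pin_level_high, phase):
    """The pin is active low and is visible only through a closed isolator."""
    return isolator_closed(phase) and not pin_level_high


if __name__ == "__main__":
    for phase in PHASES:
        for level in (True, False):
            print(phase, "high" if level else "low", fpga_sees_fatal(level, phase))
```

With the device open, a low level on the pin of a processor that is still being replaced cannot reach the FPGA, which matches the requirement that such a processor must not shut the system down.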
The following describes a design for the management interconnection and upgrading of the compute nodes in high-end servers and multi-node servers, taking 4 compute nodes as an example.
The multi-node server design described below, based on a domestic processor, supports a DPU shared by multiple nodes. The 4 nodes, node 0, node 1, node 2, and node 3, are located at the rear window of the server, as shown in fig. 6. The front window of the server holds the expansion board, which is used to expand storage and PCIe devices and carries the DPU, as shown in fig. 7.
In the embodiment of the invention, there is only one DPU, one fan wall, one management board, and 4 nodes. The fan wall cools the whole system. The management board monitors and manages each node in the system, controls fan speed, controls node power, and controls live maintenance of the nodes. Each node carries a single processor, a baseboard controller, and a power supply; the baseboard controller on the node manages the key devices on the node and collects sensor information and the like, as shown in fig. 8.
In the power management part, the baseboard controller on the management board is connected to the power supply on each node through PMBus and can obtain information such as temperature, input and output voltage, input and output current, and input and output power. It stores the information of the 4 PSUs, and the baseboard controllers of the 4 nodes read these values through PMBus, so that the power supplies can be managed effectively and redundantly. As shown in fig. 9, for fan management in the heat dissipation design, the baseboard controller on each node obtains the thermal sensor information of its node. The baseboard controller on the management board is connected to the baseboard controller on each node through I2C, collects the BMC status and sensor information of each node, evaluates them together, and applies linear, zone-based regulation, automatically controlling the fans through PWM according to the fan regulation strategy. When the baseboard controller on the management board is abnormal or hangs, the FPGA takes over fan speed control, but then runs the fans at a fixed speed; to keep the system operating normally this speed is generally set high, for example full speed or 85% of full speed. The processes for removing and installing a processor live are as described in the above embodiments.
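The fan fallback behaviour can be sketched as follows. The 85% fallback value comes from the description; the temperature thresholds, the minimum duty, and the function names are hypothetical placeholders for the fan regulation strategy, which the description does not specify.

```python
FALLBACK_DUTY = 85.0  # percent; the description gives 85% or full speed as examples


def linear_duty(temp_c, t_low=30.0, t_high=80.0, min_duty=20.0):
    """Map a temperature to a PWM duty (percent) on a straight line.

    t_low, t_high and min_duty are assumed values for illustration only.
    """
    if temp_c <= t_low:
        return min_duty
    if temp_c >= t_high:
        return 100.0
    span = (temp_c - t_low) / (t_high - t_low)
    return min_duty + span * (100.0 - min_duty)


def fan_duty(bmc_healthy, temp_c):
    """BMC regulates linearly; the FPGA falls back to a fixed high duty."""
    if bmc_healthy:
        return linear_duty(temp_c)
    return FALLBACK_DUTY
```

The fixed fallback trades noise and power for safety: without sensor-driven regulation, a high constant speed is the conservative choice.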
Fig. 10 is a schematic diagram of the specific signals connecting the nodes and the management board in the system.
1. PECI_0-3: connected to the BMC through the IO port selector; when the FPGA detects that a fatal fault has occurred in the processor, the BMC acquires the fatal fault information from the interrupt register of the FPGA and then acquires the fault information from the internal registers of the processor through PECI;
2. Processor present_0-3: connected to the FPGA through the IO port selector and used to detect whether the processor is in place;
3. Processor power supply completion_0-3: connected to the FPGA through the IO port selector; the FPGA enables the power supplies of the different modules of the processor and, after recognizing that all power supplies are valid and stable, informs the processor through this pin that power delivery is complete;
4. Processor reset_0-3: connected to the FPGA through the IO port selector and used by the FPGA to reset the processor;
5. Processor online/offline request_0-3: when the processor encounters a fault and must go offline, or when a replaced processor is to be brought online, the processor sends the online or offline request to the IO port selector, which is connected to the FPGA;
6. Fatal fault enable_0-3: when a fatal fault occurs in the processor, this signal is pulled low; the pin is connected to the IO port selector and then to the FPGA, which uses it to determine whether the processor has a fatal fault. A fatal fault is one kind of unrepairable error, but unlike general unrepairable errors it has a dedicated pin with an external logic circuit. The pin is a general-purpose input/output (IO) pin, so it can also be asserted externally, after which the processor triggers the corresponding checking and protection strategies.
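The six signal groups listed above can be summarized as a routing table. The identifiers in this sketch are hypothetical renderings of the signal names; only the endpoints (BMC for PECI, FPGA for the rest) and the shared IO port selector are taken from the description.

```python
# Endpoint on the management board for each per-node signal group.
SIGNAL_ENDPOINT = {
    "PECI": "BMC",
    "PROCESSOR_PRESENT": "FPGA",
    "POWER_SUPPLY_COMPLETE": "FPGA",
    "RESET": "FPGA",
    "ONLINE_OFFLINE_REQUEST": "FPGA",
    "FATAL_FAULT_ENABLE": "FPGA",
}


def route(signal, node, node_count=4):
    """Return the path of one signal: node pin, IO port selector, endpoint."""
    if not 0 <= node < node_count:
        raise ValueError(f"no such node: {node}")
    endpoint = SIGNAL_ENDPOINT[signal]  # KeyError for an unknown signal
    return (f"{signal}_{node}", "IO_PORT_SELECTOR", endpoint)
```

For instance, `route("PECI", 2)` yields the path for node 2's PECI signal, which ends at the BMC rather than the FPGA.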
In addition, high-end multi-socket servers, such as 4-socket, 8-socket, 16-socket, and 32-socket servers, and multi-node servers, such as 2-node, 4-node, and 8-node servers, can adopt this design to replace key components such as processors and PCIe devices transparently while powered on, which keeps services running continuously and improves the customer experience.
Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Those skilled in the art may make various equivalent modifications and substitutions to the embodiments of the present invention without departing from its spirit and scope, and all such modifications and substitutions are intended to fall within the scope of the present invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for implementing live operation and maintenance of a processor, comprising the following steps:
after an unrepairable error occurs in the processor and the computing tasks are offloaded from the processor to the DPU, the processor generates a system management interrupt;
after the baseboard management controller identifies the fault information of the processor and acquires the system management interrupt, it triggers the FPGA to send an offline request to the processor that generated the fault information;
after receiving the processor's response to the offline request, the FPGA triggers a reset of the processor and simultaneously pulls the processor's power supply signals low;
after the power supply signals of the processor are pulled low, the BIOS sends a system control interrupt to the operating system and stops the processor's operations on external I/O ports;
after the operating system recognizes, through its own driver, the system control interrupt sent by the BIOS, it stops all information exchange and service communication between the operating system and the processor, completing the disconnection of the processor's hardware link.
2. The method for implementing live operation and maintenance of a processor according to claim 1, wherein the step in which the baseboard management controller, after identifying the fault information of the processor and acquiring the system management interrupt, triggers the FPGA to send an offline request to the processor that generated the fault information comprises:
the BIOS collects the information in the internal registers of the processor;
after the BIOS has collected the fault information of the processor and read the system management interrupt sent by the processor, it sends the fault information to the baseboard management controller through an IPMI command;
after the baseboard management controller has acquired the fault information of the processor and acquired the system management interrupt from the BIOS, it records the acquired fault information in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.
3. The method for implementing live operation and maintenance of a processor according to claim 2, wherein the step in which the baseboard management controller, after acquiring the fault information of the processor and acquiring the system management interrupt from the BIOS, records the acquired fault information in a log and triggers the FPGA to send an offline request to the processor that generated the fault information comprises:
the baseboard management controller actively and periodically polls the interrupt register information in the FPGA;
the baseboard management controller judges, according to the acquired interrupt register information, whether a fatal fault has occurred in the processor;
if yes, after acquiring from the BIOS the system management interrupt sent by the processor, the baseboard management controller acquires the information in the internal registers of the processor through PECI, wherein the internal registers of the processor record different faults in different formats;
the baseboard management controller records the acquired fault information in a log and triggers the FPGA to send an offline request to the processor that generated the fault information;
if not, after acquiring the system management interrupt from the BIOS, the baseboard management controller records the fault information acquired from the BIOS in a log and at the same time triggers the FPGA to send an offline request to the processor that generated the fault information.
4. The method of claim 1, wherein, after the step in which the processor generates a system management interrupt after an unrepairable error occurs in the processor and the migration of computing tasks from the processor to the DPU is completed, the method further comprises:
when a fatal fault occurs in the processor, the fatal fault enable pin of the processor, which is connected to the FPGA, is pulled low;
after the FPGA recognizes that the fatal fault enable pin has been pulled low, it modifies the information in its interrupt register.
5. The method for implementing live operation and maintenance of a processor according to claim 1, further comprising:
while sending the system control interrupt to the operating system, the BIOS also sends it to the baseboard management controller;
after receiving the system control interrupt, the baseboard management controller sends the FPGA an instruction indicating that the disconnection of the processor's hardware link is complete;
the FPGA lights the offline status indicator on the server node board, and the lit indicator shows that the processor on that node can be removed or replaced live.
6. The method for implementing live operation and maintenance of a processor according to claim 5, further comprising:
after the processor that was removed or replaced live has been installed, execution of the BIOS code starts and the processor is guided into the operating system, so that the processor generates a system management interrupt;
after receiving the system management interrupt, the BIOS determines which processor needs to be brought online and reports the result to the baseboard management controller through an IPMI command;
the baseboard management controller triggers the FPGA to send an online request to the replaced processor;
after receiving the processor's response to the online request, the FPGA asserts the power supply signals of the replaced processor so that the power supplies of all its modules reach the power supply completion state;
after the FPGA reads the power supply completion information of the processor, it sends a reset command to the replaced processor to perform a hardware-level soft reset of the processor;
after the processor completes the soft reset, it re-detects and identifies the memory and starts communicating with the memory;
after startup is complete, the BIOS sends a system control interrupt to the operating system, completing the online action of the processor.
7. The method of claim 6, wherein, after the step in which the BIOS sends a system control interrupt to the operating system to complete the online action of the processor, the method further comprises:
after the service-layer software on the operating system receives the system control interrupt, the computing tasks that were offloaded to the DPU are reloaded onto the newly replaced, online processor.
8. A server system for realizing live operation and maintenance of a processor, characterized by comprising an expansion board, a management board, and a plurality of nodes;
the expansion board is provided with a DPU, and each node is provided with a processor and an operating system; each processor is connected to a power supply;
the processor of each node is connected to the DPU, so that when an unrepairable error occurs in a processor, the service layer of the operating system offloads the computing tasks from the failed processor to the DPU;
the management board is provided with a baseboard management controller; each processor contains registers for storing its fault information; the baseboard management controller on the management board is connected to the internal registers of the processor;
an interrupt register is arranged in the FPGA, and the baseboard management controller on the management board is connected to the interrupt register in the FPGA; the processor of each node has a fatal fault enable pin connected to the FPGA for determining fatal faults of the processor; when a fatal fault occurs in the processor, the fatal fault enable pin is pulled low, and after the FPGA recognizes the fatal fault it modifies the information in the interrupt register;
the server system performs the method of any of claims 1-7 to implement live operation and maintenance of the processor.
9. The server system for realizing live operation and maintenance of a processor according to claim 8, wherein the management board is further provided with an IO port expansion selector, and the IO port expansion selector is connected to both the baseboard management controller and the FPGA on the management board;
the baseboard management controller and the FPGA are each connected to the processor through the IO port expansion selector.
10. The server system for realizing live operation and maintenance of a processor according to claim 9, wherein the fatal fault pin of the processor is connected to the FPGA through an isolation control circuit and the IO port expansion selector in sequence;
the isolation control circuit comprises a MOS transistor, wherein the drain of the MOS transistor is connected to the fatal fault pin of the processor, and the fatal fault pin of the processor is also connected to a power supply through a pull-up resistor; the source of the MOS transistor is connected to the IO port expansion selector, and the gate of the MOS transistor is connected to the FPGA.
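The behaviour of the isolation control circuit can be approximated with a small logic-level sketch. Several assumptions are made that the claim does not state: the MOS transistor is treated as an ideal N-channel switch that conducts when the FPGA drives the gate high, the body diode is ignored, and the IO port expansion selector input idles high when the switch is open. The function name `expander_input` is hypothetical.

```python
def expander_input(fault_active: bool, gate_high: bool,
                   idle_level: int = 1) -> int:
    """Logic level seen at the IO port expansion selector.

    fault_active: True when the processor pulls its fatal fault pin low.
    gate_high:    True when the FPGA turns the (assumed N-channel) switch on.
    idle_level:   assumed level on the selector side while isolated.
    """
    # The pull-up resistor holds the pin high unless the processor pulls it low.
    pin_level = 0 if fault_active else 1
    # Ideal switch: pass the pin level when on, otherwise isolate the selector.
    return pin_level if gate_high else idle_level
```

With the gate low, a fault on the pin is not visible downstream, which is the isolation that lets a processor be removed without disturbing the FPGA input in this simplified model.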
CN202310675990.8A 2023-06-08 2023-06-08 Method for realizing electrified operation and maintenance of processor and server system Pending CN116610430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310675990.8A CN116610430A (en) 2023-06-08 2023-06-08 Method for realizing electrified operation and maintenance of processor and server system

Publications (1)

Publication Number Publication Date
CN116610430A true CN116610430A (en) 2023-08-18

Family

ID=87683492

Country Status (1)

Country Link
CN (1) CN116610430A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117971759A (en) * 2024-01-09 2024-05-03 启朔(深圳)科技有限公司 Array server network architecture, data interaction method, medium and device

Similar Documents

Publication Publication Date Title
JP6530774B2 (en) Hardware failure recovery system
US7930388B2 (en) Blade server management system
CN100517246C (en) Computer remote control method and system
US11687391B2 (en) Serializing machine check exceptions for predictive failure analysis
CN1947096B (en) Dynamic migration of virtual machine computer programs
EP3427151B1 (en) Memory backup management in computing systems
US9122652B2 (en) Cascading failover of blade servers in a data center
CN1770707B (en) Apparatus and method for quorum-based power-down of unresponsive servers in a computer cluster
US8990632B2 (en) System for monitoring state information in a multiplex system
US20090083467A1 (en) Method and System for Handling Interrupts Within Computer System During Hardware Resource Migration
CN110109782B (en) Method, device and system for replacing fault PCIe (peripheral component interconnect express) equipment
US20140304532A1 (en) Server systems having segregated power circuits for high availability applications
CN116610430A (en) Method for realizing electrified operation and maintenance of processor and server system
US20150046748A1 (en) Information processing device and virtual machine control method
CN112882901A (en) Intelligent health state monitor of distributed processing system
US20140201566A1 (en) Automatic computer storage medium diagnostics
US20200314172A1 (en) Server system and management method thereto
CN105068763A (en) Virtual machine fault-tolerant system and method for storage faults
CN107026759A (en) The firmware and its development approach of a kind of remote management BBU modules based on BMC
CN111984471B (en) Cabinet power BMC redundancy management system and method
CN117453036A (en) Method, system and device for adjusting power consumption of equipment in server
JP2008152552A (en) Computer system and failure information management method
CN107423113B (en) Method for managing virtual equipment, out-of-band management equipment and standby virtual equipment
CN110647435A (en) Server, hard disk remote control method and control assembly
CN210324184U (en) Multi-node computer monitoring device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination