CN113608603A

CN113608603A - Method, system, equipment and storage medium for repairing PCIe fault equipment

Info

Publication number: CN113608603A
Application number: CN202110740556.4A
Authority: CN
Inventors: 王培培
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-11-05

Abstract

The invention provides a method, a system, a device and a storage medium for repairing a faulty PCIe device. The method comprises: obtaining fault information of a first PCIe device; if a PCIe fault is determined from the quantity of fault information of the first PCIe device, stopping the service on the first PCIe device and turning off the power output of the corresponding power supply adjusting module; and controlling the power supply adjusting module to power the first PCIe device on again and restarting the service. While the service of the first PCIe device is stopped and its power output is turned off, the service of the second PCIe device is not stopped and the output of the second PCIe device's power supply adjusting module is not turned off; the second PCIe device is a device that has not been determined to have a PCIe fault. Based on the method, the invention also provides a system, a device and a storage medium for repairing a faulty PCIe device.

Description

Method, system, equipment and storage medium for repairing PCIe fault equipment

Technical Field

The invention belongs to the technical field of server fault repair, and particularly relates to a method, a system, a device and a storage medium for repairing a faulty PCIe device.

Background

PCIe devices are indispensable components of a server. A server's performance, computing capability and functions all depend on them: they provide computation (such as GPUs and FPGAs), storage (such as SAS HBAs and NVMe SSDs), networking (NICs) and so on. To keep the system stable, the mainboard BMC continuously monitors PCIe fault information, raises an alarm promptly when an error occurs, and records an error log, so as to prevent system downtime, restarts and similar problems caused by PCIe faults. PCIe devices fail for many reasons. A fault that software can repair recovers by itself without human intervention and does not affect the operation of the server system; a hardware fault typically requires maintenance personnel to reboot the system or replace the PCIe device. A faulty PCIe device produces a good deal of alarm information. If this information is used effectively, the PCIe device can be repaired by restarting it before it causes downtime.

The server has an automatic PCIe error reporting mechanism. When a PCIe error is reported in the system, a Correctable Error (CE) can be repaired automatically; the BIOS reports the result to the BMC and a log is recorded. When an Uncorrectable Error (UCE) is triggered, the problem is serious: the machine goes down or restarts, logs are recorded by the BMC and the OS, the device can no longer run normally, and operation and maintenance personnel must intervene by restarting the whole server or replacing the PCIe device. In other words, when a faulty PCIe device reports an error, log records are generated, operation and maintenance personnel locate the problem, and the server is restarted or the PCIe device is replaced. The system therefore cannot be repaired while it is running, and the services running on the server are interrupted.

Disclosure of Invention

To solve the above technical problem, the invention provides a method, a system, a device and a storage medium for repairing a faulty PCIe device, which repair a PCIe device fault by restarting the single PCIe device without affecting the operation of other server components.

In order to achieve the purpose, the invention adopts the following technical scheme:

a method of repairing a PCIe failed device, comprising the steps of:

acquiring fault information of first PCIe equipment;

if a PCIe fault is determined from the quantity of fault information of the first PCIe device, stopping the service of the first PCIe device and turning off the power supply output of the power supply adjusting module corresponding to the first PCIe device;

and controlling the power supply adjusting module to power on the first PCIe device again, and starting the service on the first PCIe device.

Further, the step of stopping the service of the first PCIe device and turning off the power output of the corresponding power supply adjusting module, if a PCIe fault is determined from the quantity of fault information of the first PCIe device, further includes:

stopping the first PCIe equipment service, and stopping the power supply output of the power supply adjusting module corresponding to the first PCIe equipment, without stopping the second PCIe equipment service, and also without stopping the power supply output of the power supply adjusting module corresponding to the second PCIe equipment; the second PCIe device is a device which is not determined to be PCIe faulty.

Further, before the fault information of the PCIe device is received, the method further includes setting the configuration space register corresponding to the PCIe device to a mode that supports surprise hot plug.

Further, the fault information includes the device code of the first PCIe device extracted by the hardware error checking tool, together with the number of uncorrectable errors in the PCIe device register and the device code of the first PCIe device to which that number corresponds.

Further, the process of determining a PCIe fault from the quantity of fault information of the first PCIe device is: if the quantity of fault information in the current period is more than 2 times the quantity in the previous period, a PCIe fault is determined.

Further, the process of stopping the service of the first PCIe device and turning off the power output of the power supply adjusting module corresponding to the first PCIe device is as follows:

modifying a VPP register corresponding to the CPLD, and stopping the service of the first PCIe equipment;

and controlling the PCIe equipment to reset, and closing the power supply output of the power supply adjusting module corresponding to the first PCIe equipment.

The invention also provides a system for repairing a faulty PCIe device, which comprises a plurality of PCIe devices, including a first PCIe device and a second PCIe device, and further comprises a basic input/output system (BIOS), a baseboard management controller (BMC), a CPLD and a power supply adjusting module;

the basic input/output system is used for sending the acquired fault information of the first PCIe device to the baseboard management controller;

the baseboard management controller is used for, if a PCIe fault is determined from the quantity of fault information, stopping the service of the first PCIe device through the CPLD and turning off the power supply output of the power supply adjusting module corresponding to the first PCIe device;

the CPLD is connected with a plurality of PCIe devices, and the power supply adjusting module is connected with the PCIe devices.

Furthermore, when the server starts, the basic input/output system automatically sets the register corresponding to the PCIe device to a mode that supports surprise hot plug.

The invention also proposes a device comprising:

a memory for storing a computer program;

a processor for implementing the method steps when executing the computer program.

The invention also proposes a readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the method steps.

The effect provided in the summary of the invention is only the effect of the embodiment, not all the effects of the invention, and one of the above technical solutions has the following advantages or beneficial effects:

The invention provides a method, a system, a device and a storage medium for repairing a faulty PCIe device. The method comprises: obtaining fault information of a first PCIe device; if a PCIe fault is determined from the quantity of fault information of the first PCIe device, stopping the service of the first PCIe device and turning off the power output of the power supply adjusting module corresponding to the first PCIe device; and controlling the power supply adjusting module to power the first PCIe device on again and starting the service on the first PCIe device. While the service of the first PCIe device is stopped and the power output of its power supply adjusting module is turned off, the service of the second PCIe device is not stopped and the power output of the power supply adjusting module corresponding to the second PCIe device is not turned off; the second PCIe device is a device that has not been determined to have a PCIe fault. The method repairs a PCIe device fault by restarting the single PCIe device: after a fault of a server PCIe device is detected, the faulty device is powered off and on again and reconnected to the CPU, without affecting the operation of any part of the server system other than that PCIe device. This effectively avoids the labor and equipment costs incurred when a faulty PCIe device reports an error, services stop running, and the server must be restarted manually or the PCIe device replaced.

Based on a method for repairing PCIe fault equipment, the invention also provides a system, equipment and a storage medium for repairing the PCIe fault equipment. The technical effects of the above method are also achieved, and are not described herein.

Drawings

Fig. 1 is a flowchart of a method for repairing a PCIe failure device according to embodiment 1 of the present invention;

Fig. 2 is a connection diagram of a system for repairing a PCIe failure device according to embodiment 2 of the present invention.

Detailed Description

In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.

Example 1

Embodiment 1 of the invention provides a method for repairing a faulty PCIe device. The method repairs a PCIe device fault by restarting the single PCIe device: after a fault of a server PCIe device is detected, the faulty device is powered off and on again and reconnected to the CPU, without affecting the operation of any part of the server system other than that PCIe device. This effectively avoids the labor and equipment costs incurred when a faulty PCIe device reports an error, services stop running, and the server must be restarted manually or the PCIe device replaced.

Fig. 1 is a flowchart of a method for repairing a PCIe failure device according to embodiment 1 of the present invention.

In step S101, the CPU root port register is set to support the surprise hot-plug mode. When the server starts, the BIOS automatically modifies the CPU root port registers corresponding to the PCIe device, setting the Slot Capabilities register, the Link Control register and so on to a mode that supports surprise hot plug of the PCIe device. In surprise hot-plug mode, even if the system detects that a PCIe device has disappeared, it does not go down or report an abnormal error.
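As a minimal sketch of the register setting in step S101, the following Python fragment sets the two hot-plug bits of a Slot Capabilities value. The bit positions (Hot-Plug Surprise at bit 5, Hot-Plug Capable at bit 6) follow the PCIe Base Specification; the function names are hypothetical, and how a BIOS actually writes the root port registers is platform-specific and not described in the source.

```python
# Bit positions in the PCIe Slot Capabilities register (PCIe Base Specification).
HOT_PLUG_SURPRISE = 1 << 5  # the device may be removed without prior notice
HOT_PLUG_CAPABLE = 1 << 6   # the slot supports hot-plug operations


def enable_surprise_hot_plug(slot_capabilities: int) -> int:
    """Return the 32-bit register value with both hot-plug bits set.

    Hypothetical helper: it only computes the new value; writing it to
    hardware is left to platform firmware.
    """
    return (slot_capabilities | HOT_PLUG_SURPRISE | HOT_PLUG_CAPABLE) & 0xFFFFFFFF


def supports_surprise_hot_plug(slot_capabilities: int) -> bool:
    """True only if the slot is hot-plug capable and tolerates surprise removal."""
    required = HOT_PLUG_SURPRISE | HOT_PLUG_CAPABLE
    return (slot_capabilities & required) == required
```

All other bits of the register value are left unchanged, so the helper can be applied to whatever value the platform reports.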

In step S102, fault information of the first PCIe device is obtained. The fault information includes the device code of the first PCIe device extracted by the hardware error checking tool, together with the number of uncorrectable errors in the PCIe device register and the device code of the first PCIe device to which that number corresponds.

In the invention, the BIOS captures the MCE log in the system, extracts from it the B/D/F number of the corresponding PCIe device, and sends the B/D/F number to the BMC. The BIOS also captures the number of errors in the AER register of the slot corresponding to the PCIe device and sends the corresponding B/D/F number to the BMC.

Wherein, BIOS: Basic Input/Output System;

MCE log: the Machine Check Exception log, a tool for checking hardware errors, particularly memory and CPU errors, on a Linux system;

B/D/F: the bus/device/function number, i.e. the code that identifies a PCIe device;

AER: Advanced Error Reporting.
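To illustrate the B/D/F code that the BIOS passes to the BMC, here is a small sketch that packs and unpacks it using the conventional PCI layout (8-bit bus, 5-bit device, 3-bit function). The `Bdf` type and the helper names are hypothetical; the source does not say in what format the B/D/F number is actually transmitted.

```python
from typing import NamedTuple


class Bdf(NamedTuple):
    """Bus/device/function identifier of a PCIe function (hypothetical type)."""
    bus: int
    device: int
    function: int


def pack_bdf(bdf: Bdf) -> int:
    """Pack into 16 bits: bus in [15:8], device in [7:3], function in [2:0]."""
    if not (0 <= bdf.bus <= 0xFF and 0 <= bdf.device <= 0x1F and 0 <= bdf.function <= 0x7):
        raise ValueError(f"B/D/F out of range: {bdf}")
    return (bdf.bus << 8) | (bdf.device << 3) | bdf.function


def unpack_bdf(value: int) -> Bdf:
    """Inverse of pack_bdf."""
    return Bdf((value >> 8) & 0xFF, (value >> 3) & 0x1F, value & 0x7)


def format_bdf(bdf: Bdf) -> str:
    """Render in the familiar bus:device.function hexadecimal notation."""
    return f"{bdf.bus:02x}:{bdf.device:02x}.{bdf.function:x}"
```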

In step S103, whether a PCIe fault has occurred is determined according to the quantity of fault information of the first PCIe device.

In the invention, after the BMC receives the error information, it on the one hand displays the information through the BMC Web UI to inform maintenance personnel, and on the other hand counts the number of errors reported by each PCIe device per hour. If the number of errors reported by a PCIe device in one period is 2 to 3 times the number reported in the previous period, a PCIe fault is determined and needs to be repaired.
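The period-over-period comparison can be sketched as follows. This is an illustration under stated assumptions, not the BMC's actual algorithm: the class and method names are hypothetical, the ratio defaults to 2 (the lower bound mentioned in the text), and a device with no errors in the previous period is not flagged, because the source does not say how a missing baseline is handled.

```python
from collections import Counter
from typing import List


class ErrorRateMonitor:
    """Counts error reports per device and compares consecutive periods."""

    def __init__(self, ratio: float = 2.0) -> None:
        self.ratio = ratio          # assumed threshold multiplier
        self.previous = Counter()   # counts from the period that just ended
        self.current = Counter()    # counts accumulated in the running period

    def record(self, bdf: str, count: int = 1) -> None:
        """Add error reports for one device in the running period."""
        self.current[bdf] += count

    def close_period(self) -> List[str]:
        """End the running period and return the devices judged faulty.

        A device is flagged when its count exceeds ratio times its count in
        the previous period. Devices without a previous count are skipped
        (an assumption; the source leaves this case open).
        """
        faulty = [
            bdf
            for bdf, n in self.current.items()
            if self.previous[bdf] > 0 and n > self.ratio * self.previous[bdf]
        ]
        self.previous, self.current = self.current, Counter()
        return sorted(faulty)
```

With a one-hour period, `close_period` would be called once per hour, and each returned B/D/F string would be handed to the repair step.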

In step S104, the first PCIe device service is stopped, and the power output of the power conditioning module corresponding to the first PCIe device is turned off.

The BMC modifies the VPP register of the CPLD through I2C to inform the CPU root port that the corresponding first PCIe device is being removed, which stops the service on the first PCIe device. The BMC then modifies a CPLD register through I2C to assert the reset signal of the first PCIe device. Next, the BMC uses I2C to make the power supply chip (VR) of the first PCIe device turn off its output, so that the first PCIe device is powered off. After a period of time, the BMC turns the VR output back on to power the device up, releases the reset, and at the same time informs the CPU that a PCIe device has been added. The CPU reconnects to the first PCIe device, which completes the fault repair, and the service runs again.

Wherein, VPP: the CPU's Virtual Pin Port, an interface through which the CPU transmits PCIe hot-plug signals over the SMBus protocol.

In this step, when the first PCIe device service is stopped and the power supply output of the power supply adjustment module corresponding to the first PCIe device is turned off, the second PCIe device service is not stopped and the power supply output of the power supply adjustment module corresponding to the second PCIe device is also not turned off; the second PCIe device is a device that is not determined to be a PCIe failure.
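The ordering of the operations in step S104 can be sketched as below. Everything here is hypothetical scaffolding: `RecordingBus` stands in for the BMC's I2C master and only logs writes, the target and action names are invented labels rather than real register addresses, and the 1-second off time is an assumption (the text says only "after a period of time").

```python
import time
from typing import Callable, List, Tuple


class RecordingBus:
    """Stand-in for the BMC's I2C master; it only logs the requested writes."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, int, str]] = []

    def write(self, target: str, slot: int, action: str) -> None:
        self.log.append((target, slot, action))


def power_cycle_slot(
    bus: RecordingBus,
    slot: int,
    off_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Power-cycle one PCIe slot without touching any other slot."""
    bus.write("cpld_vpp", slot, "notify_removal")    # CPU root port stops using the device
    bus.write("cpld", slot, "assert_reset")          # hold the device in reset
    bus.write("vr", slot, "output_off")              # cut power to this slot only
    sleep(off_seconds)                               # assumed delay before power-up
    bus.write("vr", slot, "output_on")               # restore power
    bus.write("cpld", slot, "release_reset")         # let the device start again
    bus.write("cpld_vpp", slot, "notify_insertion")  # CPU reconnects to the device
```

Because every write carries the slot number of the faulty device, the regulators and reset lines of the other slots are never addressed, which mirrors the requirement that the second PCIe device keeps running.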

The invention provides a method for repairing a faulty PCIe device in which the BIOS sets the Slot Capabilities register in the CPU root port corresponding to the PCIe device to support the surprise hot-plug mode. After receiving fault information of a PCIe device, the BMC runs an algorithm to judge whether the PCIe device needs to be restarted. If so, the BMC notifies the CPLD through I2C to reset the PCIe device, and then controls the VR power supply chip of the faulty PCIe device through I2C to turn off its output. After a period of time, the BMC turns the VR on again through I2C so that it supplies power normally, and then modifies the CPLD internal register to release the reset of the corresponding PCIe slot. When the CPU detects that the PCIe device is present, it initiates link training again, so that the PCIe device resumes operation. While the faulty PCIe device is being repaired, the other, non-faulty PCIe devices keep running normally without being powered down.

Example 2

Based on the method for repairing a PCIe faulty device proposed in embodiment 1 of the present invention, embodiment 2 of the present invention proposes a system for repairing a PCIe faulty device, and fig. 2 shows a schematic diagram of the system for repairing a PCIe faulty device in embodiment 2 of the present invention.

The system comprises a plurality of PCIe devices, including a first PCIe device and a second PCIe device, and further comprises a basic input/output system (BIOS), a baseboard management controller (BMC), a CPLD and a power supply adjusting module;

the baseboard management controller is used for, if a PCIe fault is determined from the quantity of fault information, stopping the service of the first PCIe device through the CPLD and turning off the power supply output of the power supply adjusting module corresponding to the first PCIe device; while the service of the first PCIe device is stopped and the power supply output of its power supply adjusting module is turned off, the service of the second PCIe device is not stopped and the power supply output of the power supply adjusting module corresponding to the second PCIe device is not turned off; the second PCIe device is a device that has not been determined to have a PCIe fault.

The CPLD is connected with the PCIe devices, and the power supply adjusting module is connected with the PCIe devices.

The system also comprises a server operating system and a server CPU, wherein the CPU is connected with the basic input/output system; the CPU is also connected to the CPLD.

When the server starts, the basic input/output system also automatically sets the register corresponding to the PCIe device to a mode that supports surprise hot plug.

In the system for repairing a faulty PCIe device provided by the invention, the BIOS sets the Slot Capabilities register in the CPU root port corresponding to the PCIe device to support the surprise hot-plug mode. After receiving fault information of a PCIe device, the BMC runs an algorithm to judge whether the PCIe device needs to be restarted. If so, the BMC notifies the CPLD through I2C to reset the PCIe device, and then controls the VR power supply chip of the faulty PCIe device through I2C to turn off its output. After a period of time, the BMC turns the VR on again through I2C so that it supplies power normally, and then modifies the CPLD internal register to release the reset of the corresponding PCIe slot. When the CPU detects that the PCIe device is present, it initiates link training again, so that the PCIe device resumes operation. While the faulty PCIe device is being repaired, the other, non-faulty PCIe devices keep running normally without being powered down.

Example 3

The invention also proposes a device comprising:

a memory for storing a computer program;

a processor for implementing, when executing the computer program, the steps of the method for repairing a PCIe faulty device described in embodiment 1, wherein:

BIOS: Basic Input/Output System;

B/D/F: the bus/device/function number, i.e. the code that identifies a PCIe device;

AER: Advanced Error Reporting.

It should be noted that the technical solution of the present invention also provides an electronic device, including: a communication interface capable of exchanging information with other devices such as network devices; and a processor connected to the communication interface to exchange information with other devices and configured, when running a computer program, to execute the method for repairing a PCIe faulty device provided by one or more of the technical solutions above, the computer program being stored in the memory. In practice, of course, the components of the electronic device are coupled together by a bus system. It is understood that the bus system enables connection and communication between these components. In addition to a data bus, the bus system comprises a power bus, a control bus and a status signal bus. The memory in the embodiments of the present application stores various types of data to support the operation of the electronic device. Examples of such data include any computer program that runs on the electronic device. It will be appreciated that the memory can be volatile memory or nonvolatile memory, or can include both. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory can be a Random Access Memory (RAM), which acts as an external cache.
By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory described in the embodiments of this application is intended to comprise, without being limited to, these and any other suitable types of memory. The method disclosed in the embodiments of the present application may be applied to a processor, or may be implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general purpose processor, a Digital Signal Processor (DSP), or another programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
The software modules may be located in a storage medium located in a memory, where the processor reads the program from the memory and in combination with its hardware performs the steps of the method described above. When the processor executes the program, the corresponding flow in each method of the embodiments of the present application is realized, and for the sake of brevity, a detailed description is omitted here.

The invention also proposes a readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the method for repairing a PCIe faulty device described in embodiment 1, wherein:

BIOS: Basic Input/Output System;

B/D/F: the bus/device/function number, i.e. the code that identifies a PCIe device;

AER: Advanced Error Reporting.

Embodiments of the present application further provide a storage medium, that is, a computer storage medium, specifically, a computer-readable storage medium, for example, a memory storing a computer program, where the computer program is executable by a processor to perform the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.

Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application, or the portions thereof that contribute over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes a removable storage device, a ROM, a RAM, a magnetic or optical disk, or any other medium capable of storing program code.

For a description of a device and a relevant part in a storage medium for repairing a PCIe faulty device provided in the embodiment of the present application, reference may be made to detailed descriptions of a corresponding part in a method for repairing a PCIe faulty device provided in embodiment 1 of the present application, and details are not described here again.

It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. In addition, parts of the above technical solutions provided in the embodiments of the present application that are consistent with the implementation principles of the corresponding technical solutions in the prior art are not described in detail, so as to avoid redundant description.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, the scope of the present invention is not limited thereto. Various modifications and alterations will occur to those skilled in the art based on the foregoing description, and it is neither necessary nor possible to enumerate all embodiments here. Any modifications or changes that a person skilled in the art can make on the basis of the technical solution of the invention without creative effort still fall within the protection scope of the invention.

Claims

1. A method of repairing a PCIe failed device, comprising the steps of:

acquiring fault information of first PCIe equipment;

2. The method of claim 1, wherein if the PCIe failure is determined according to the number of pieces of failure information of the first PCIe device, stopping the service of the first PCIe device, and turning off the power output of the power regulation module corresponding to the first PCIe device further comprises:

3. The method of claim 1, wherein before the failure information of the PCIe device is received, the method further comprises setting a configuration space register corresponding to the PCIe device to a mode that supports surprise hot plug.

4. The method of claim 1, wherein the failure information comprises the device code of the first PCIe device extracted by the hardware error checking tool, together with the number of uncorrectable errors in the PCIe device register and the device code of the first PCIe device to which that number corresponds.

5. The method according to claim 1, wherein the process of determining a PCIe failure from the quantity of failure information of the first PCIe device is: if the quantity of failure information in the current period is more than 2 times the quantity in the previous period, a PCIe failure is determined.

6. The method of claim 2, wherein the process of stopping the first PCIe device traffic and turning off the power output of the power regulation module corresponding to the first PCIe device is:

7. A system for repairing a PCIe fault device, comprising a plurality of PCIe devices, wherein the PCIe devices comprise a first PCIe device and a second PCIe device, characterized by further comprising a basic input/output system, a baseboard management controller, a CPLD and a power supply regulating module;

8. The system for repairing a PCIe fault device of claim 7, wherein the basic input/output system further automatically sets a register corresponding to the PCIe device to a mode that supports surprise hot plug when the server is powered on.

9. An apparatus, comprising:

a memory for storing a computer program;

a processor for implementing the method steps of any one of claims 1 to 6 when executing the computer program.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the method steps of any one of claims 1 to 6.