CN114138606A - PCIE error information processing method and device and computer equipment - Google Patents

PCIE error information processing method and device and computer equipment

Info

Publication number
CN114138606A
Authority
CN
China
Prior art keywords
pcie
information
error information
error
operating system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111455713.3A
Other languages
Chinese (zh)
Inventor
梁恒勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111455713.3A
Publication of CN114138606A
Current legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3027 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0745 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in an input/output transactions management context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026 PCI express

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a method and an apparatus for processing PCIE error information, and to a computer device. A basic input/output system (BIOS) monitors the operation information of a server. When the operation information includes PCIE error information, the BIOS determines the faulty PCIE device corresponding to the PCIE error information, acquires the device information of the faulty PCIE device, and sends the PCIE error information and the device information to an operating system. The operating system receives the PCIE error information and the device information sent by the BIOS and determines the error type of the PCIE error information. When the error type is a non-physical error, the operating system replaces the PCIE error information with correctable error information. This keeps the server running normally, gives the user more time for maintenance, effectively reduces the risk of data loss, and improves the quality of the server.

Description

PCIE error information processing method and device and computer equipment
Technical Field
The present application relates to the field of device operation and maintenance technologies, and in particular, to a method and an apparatus for processing PCIE error information, and a computer device.
Background
With the advent of the big data era, servers are required to process ever larger volumes of data, and new problems have emerged with them. Examples include component faults: Error Correction Code (ECC) errors affect service operation, hard disk damage reduces data transmission efficiency, errors reported by Peripheral Component Interconnect Express (PCI-Express, PCIE) devices affect their use, and errors reported by the Central Processing Unit (CPU) can leave the machine inoperable. The conventional response to these problems is to replace the failed device with a new one, but this may cause irreparable loss of the data and services on the server.
At present, in response to the above server problems, domestic manufacturers have developed mechanisms by which the basic input/output system (BIOS) suppresses the reporting of errors, for example a memory ECC error suppression function, a function that leaves PCIE Uncorrectable Errors (UCE) and Correctable Errors (CE) unrecorded, a function that avoids restarting on a CPU error, and a mechanism that degrades a memory UCE to a CE. However, the function that leaves PCIE UCEs and CEs unrecorded only masks the error information. The error on the server still exists, which affects the normal operation of the server and can cause data loss.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a method, an apparatus, and a computer device for processing PCIE error information. By replacing a PCIE UCE with a PCIE CE, services on the server can continue to run normally when a PCIE UCE occurs, which gives the user more time for maintenance, effectively reduces the risk of data loss, and improves the quality of the server.
In a first aspect, a method for processing PCIE error information is provided, where the method includes:
the basic input/output system monitors the operation information of the server;
when the operation information includes Peripheral Component Interconnect Express (PCIE) error information, determining, according to the PCIE error information, the faulty PCIE device corresponding to the PCIE error information, and acquiring the device information of the faulty PCIE device;
sending the PCIE error information and the device information of the faulty PCIE device to an operating system;
the operating system receives the PCIE error information and the device information of the faulty PCIE device sent by the basic input/output system, and determines the error type of the PCIE error information;
when the error type of the PCIE error information is a non-physical error, replacing the PCIE error information with correctable error information.
In one possible implementation, the method further includes:
after the operating system replaces the PCIE error information with correctable error information, checking whether any non-physical error information remains for the faulty PCIE device;
when no non-physical error information remains for the faulty PCIE device, generating degradation information;
and sending the degradation information to the basic input/output system.
In one possible implementation, the degradation information includes correctable error information, and the method further includes:
the basic input/output system receives the correctable error information sent by the operating system;
determining whether the correctable error information is consistent with the PCIE error information;
and when the correctable error information is inconsistent with the PCIE error information, sending, to the baseboard management controller, notification information indicating that the PCIE error information has been degraded successfully.
In one possible implementation, the method further includes:
the baseboard management controller receives the notification information sent by the basic input/output system;
and parsing the notification information to obtain log information, so as to provide reference data for maintenance.
In one possible implementation, before the basic input/output system monitors the operation information of the server, the method further includes:
when the server is started, the basic input/output system automatically acquires first input/output (IO) addresses of all PCIE devices in the server;
and saving the first IO addresses to a storage unit of the basic input/output system.
In one possible implementation, when the operation information includes PCIE error information, determining the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information, and acquiring the device information of the faulty PCIE device, includes:
when the operation information includes PCIE error information, acquiring a second IO address from the PCIE error information;
comparing the first IO addresses with the second IO address, and determining the faulty PCIE device corresponding to the second IO address;
and acquiring the device information of the faulty PCIE device.
In one possible implementation, the method further includes:
configuring a runtime environment of the GNU Compiler Collection (GCC) for the operating system;
compiling the GCC environment package and upgrading the GCC version;
when the GCC version upgrade is completed, configuring a PCIE uncorrectable error processing environment for the operating system;
after the PCIE uncorrectable error processing environment is configured, configuring for the operating system the dedicated communication signaling protocol required for processing PCIE uncorrectable errors, so that the operating system has the function of converting the error type of error information.
In a second aspect, a device for processing PCIE error information is provided, where the device includes:
a monitoring module, configured for the basic input/output system to monitor the operation information of the server;
a determining module, configured to, when the operation information includes Peripheral Component Interconnect Express (PCIE) error information, determine the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information, and acquire the device information of the faulty PCIE device;
a sending module, configured to send the PCIE error information and the device information of the faulty PCIE device to an operating system;
a judging module, configured for the operating system to receive the PCIE error information and the device information of the faulty PCIE device sent by the basic input/output system, and to determine the error type of the PCIE error information;
and a replacing module, configured to replace the PCIE error information with correctable error information when the error type of the PCIE error information is a non-physical error.
A third aspect provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method for processing PCIE error information in the first aspect or any implementation manner of the first aspect when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for processing PCIE error information in the first aspect or any one of the implementations of the first aspect.
According to the above method, apparatus, and computer device for processing PCIE error information, the basic input/output system monitors the operation information of the server. When the operation information includes PCIE error information, the faulty PCIE device corresponding to the PCIE error information is determined, its device information is acquired, and the PCIE error information and the device information are sent to the operating system. The operating system receives them and determines the error type of the PCIE error information. When the error type is a non-physical error, the PCIE error information is replaced with correctable error information. This keeps the server running normally, gives the user more time for maintenance, effectively reduces the risk of data loss, and improves the quality of the server.
Drawings
Fig. 1 is a schematic flowchart of a method for processing PCIE error information in an embodiment of the present application;
Fig. 2 is a block diagram of an apparatus for processing PCIE error information in an embodiment of the present application;
Fig. 3 is a diagram illustrating the internal structure of a computer device in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
At present, as the country enters the big data era, data management essentially relies on large data processing servers or cloud computing servers. In the current market, demand clearly exceeds supply, and when choosing a supplier, customers consider each manufacturer's reputation, technology, server performance, and similar aspects as the criteria for qualification. As the demands on servers increase, new quality problems arise. The existing mechanisms by which the basic input/output system suppresses errors only hide the generation of error information and of derived information such as logs: the masking is implemented in code, so that no error-related information is produced in the log module and the code of other modules is not affected. In practice, however, the server error still exists. It does not disappear because it is hidden or masked, nor is it repaired or degraded, so the normal operation of the server is affected and unnecessary economic loss results.
In order to solve the problem of the prior art, embodiments of the present application provide a method, an apparatus, a device, and a computer storage medium for processing PCIE error information. First, a method for processing PCIE error information provided in the embodiment of the present application is described below.
Fig. 1 shows a flowchart of a method for processing PCIE error information according to an embodiment of the present application. As shown in fig. 1, the method may include the steps of:
S110, the basic input/output system monitors the operation information of the server.
While the server is running, the basic input/output system automatically captures the operation information of all devices in the server, monitors in real time whether the operation information contains error information, and identifies information that does not conform to normal code logic, so as to avoid abnormal operation of the server.
S120, when the operation information includes Peripheral Component Interconnect Express (PCIE) error information, determining the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information, and acquiring the device information of the faulty PCIE device.
While the basic input/output system monitors the operation information, when an abnormal code is detected, the code of the basic input/output system automatically reads the error report information from the register and obtains the PCIE error information from it.
When the operation information includes PCIE error information, a PCIE device has failed. The faulty PCIE device corresponding to the PCIE error information is then determined, and its device information is acquired.
S130, sending the PCIE error information and the device information of the faulty PCIE device to the operating system.
The PCIE error information and the device information of the faulty PCIE device are sent to the operating system so that the operating system can locate the faulty PCIE device and process its PCIE error information accordingly, so that the PCIE error information does not affect the normal operation of the server.
S140, the operating system receives the PCIE error information and the device information of the faulty PCIE device sent by the basic input/output system, and determines the error type of the PCIE error information.
After receiving the PCIE error information and the device information of the faulty PCIE device sent by the basic input/output system, the operating system obtains the error field of the abnormal code in the PCIE error information and determines the error type of the PCIE error information according to the error field. For example, an error field of Corrected indicates that the error of the faulty PCIE device is a low-level correctable error, and an error field of Detected bus unadrected indicates that the error of the faulty PCIE device is a high-level non-physical error. Distinguishing the error types of PCIE errors allows a practical and effective processing method to be applied to each type.
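The classification in S140 can be illustrated with a minimal sketch. It assumes the error field arrives as a plain string and uses only the two field values quoted above; the names `ErrorType`, `FIELD_TO_TYPE`, and `classify_error_field` are hypothetical and do not appear in the application.

```python
from enum import Enum


class ErrorType(Enum):
    CORRECTABLE = "correctable"
    NON_PHYSICAL = "non-physical (uncorrectable)"
    UNKNOWN = "unknown"


# Field values are the two quoted in the description; the lookup-table
# approach itself is an assumption made for illustration.
FIELD_TO_TYPE = {
    "Corrected": ErrorType.CORRECTABLE,
    "Detected bus unadrected": ErrorType.NON_PHYSICAL,
}


def classify_error_field(error_field: str) -> ErrorType:
    """Map the error field of a PCIE error record to an error type."""
    return FIELD_TO_TYPE.get(error_field.strip(), ErrorType.UNKNOWN)
```

Any field value outside the table is reported as `UNKNOWN` rather than guessed, so that later steps can decide how to treat it.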
S150, when the error type of the PCIE error information is a non-physical error, replacing the PCIE error information with correctable error information.
A non-physical error is an uncorrectable error. When the error type of the PCIE error information is a non-physical error, the PCIE error information would affect the normal operation of the server. The abnormal code in the PCIE error information is therefore replaced with a code that conforms to the normal logic mechanism, so that the PCIE error information is degraded from a non-physical error to correctable error information and the normal operation of the server is ensured.
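A minimal sketch of the replacement in S150 follows. The record layout (`PcieErrorInfo`), the type labels, and the value of `CORRECTABLE_CODE` are assumptions chosen for illustration; the application does not specify them.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PcieErrorInfo:
    io_address: int
    error_type: str  # "non_physical", "physical", or "correctable" (assumed labels)
    error_code: int


# Hypothetical value standing in for "a code that conforms to the normal
# logic mechanism".
CORRECTABLE_CODE = 0x0


def degrade_to_correctable(err: PcieErrorInfo) -> PcieErrorInfo:
    """Return a correctable record for a non-physical error; leave others as-is."""
    if err.error_type != "non_physical":
        return err
    # Keep the address so the faulty device can still be identified.
    return replace(err, error_type="correctable", error_code=CORRECTABLE_CODE)
```

The original record is left untouched and a new one is returned, which keeps the original available for a later comparison with the replaced record.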
In the embodiment of the application, the basic input/output system monitors the operation information of the server. When the operation information includes PCIE error information, the faulty PCIE device corresponding to the PCIE error information is determined, its device information is acquired, and the PCIE error information and the device information are sent to the operating system. The operating system receives them and determines the error type of the PCIE error information. When the error type is a non-physical error, the PCIE error information is replaced with correctable error information, which keeps the server running normally, gives the user more time for maintenance, effectively reduces the risk of data loss, and improves the quality of the server.
In some embodiments, the method further comprises:
after the operating system replaces the PCIE error information with correctable error information, checking whether any non-physical error information remains for the faulty PCIE device;
when no non-physical error information remains for the faulty PCIE device, generating degradation information;
and sending the degradation information to the basic input/output system.
To ensure that all non-physical error information has been replaced with correctable error information, the faulty PCIE device corresponding to the PCIE error information is checked again for remaining non-physical error information. When no non-physical error information remains for the faulty PCIE device and all of it is now correctable error information, degradation information is generated and sent to the basic input/output system, notifying the basic input/output system that all non-physical error information in the server has been replaced with correctable error information.
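The re-check described above might look like the following sketch. It assumes that the errors of a device are available as a list of dictionaries with a `"type"` key; the function name `build_degradation_info` and the layout of the returned dictionary are hypothetical.

```python
from typing import Optional


def build_degradation_info(device_id: str, device_errors: list) -> Optional[dict]:
    """Return degradation info only if no non-physical error remains on the device."""
    remaining = [e for e in device_errors if e["type"] == "non_physical"]
    if remaining:
        return None  # some errors have not been replaced yet
    return {
        "device": device_id,
        "status": "degraded",
        "correctable_errors": [e for e in device_errors if e["type"] == "correctable"],
    }
```

Returning `None` while non-physical errors remain means nothing is sent to the basic input/output system until the replacement is complete.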
In some embodiments, the degradation information includes correctable error information, the method further comprising:
the basic input/output system receives the correctable error information sent by the operating system;
determining whether the correctable error information is consistent with the PCIE error information;
and when the correctable error information is inconsistent with the PCIE error information, sending, to the baseboard management controller, notification information indicating that the PCIE error information has been degraded successfully.
The basic input/output system receives the degradation information sent by the operating system and obtains from it the correctable error information substituted by the operating system. To ensure that the correctable error information conforms to the code logic mechanism, it is compared with the PCIE error information to determine whether the two are identical. If the correctable error information differs from the PCIE error information, the correctable error information conforms to the code logic mechanism, and notification information indicating that the PCIE error information has been degraded successfully is sent to the baseboard management controller so that the user is informed and can carry out maintenance in time.
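The consistency check on the side of the basic input/output system can be sketched as follows. The dictionary keys, the event name, and the function name `verify_degradation` are assumptions; the application only states that the two pieces of information are compared and that a notification is sent when they differ.

```python
from typing import Optional


def verify_degradation(original_error: dict, degradation_info: dict) -> Optional[dict]:
    """Return a notification for the BMC if the replaced record differs from the original."""
    correctable = degradation_info["correctable_error"]
    if correctable == original_error:
        return None  # nothing was replaced, so no success notification is sent
    return {
        "event": "pcie_error_degraded",
        "device": degradation_info["device"],
        "original": original_error,
        "replaced_by": correctable,
    }
```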
In some embodiments, the method further comprises:
the baseboard management controller receives the notification information sent by the basic input/output system;
and parsing the notification information to obtain log information, so as to provide reference data for maintenance.
The baseboard management controller receives the notification information sent by the basic input/output system and parses it into log information that the customer can understand. This provides the user with reference data for maintenance and informs the user that the non-physical error has been degraded to a correctable error and that the server can run normally, which avoids the situation in which the server cannot be operated because of the non-physical error and ensures that the user can maintain the server normally.
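As an illustration of the parsing step, the sketch below assumes that the notification arrives as a JSON string with `device` and `original_code` fields. The application does not define the notification format, so the format, the field names, and `notification_to_log` are all hypothetical.

```python
import json


def notification_to_log(raw: str) -> str:
    """Turn a raw notification into a log line that a maintainer can read."""
    note = json.loads(raw)
    return (
        f"PCIE device {note['device']}: non-physical error "
        f"0x{note['original_code']:04x} degraded to a correctable error; "
        "server keeps running, schedule maintenance."
    )
```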
In some embodiments, before the basic input/output system monitors the operation information of the server, the method further comprises:
when the server is started, the basic input/output system automatically acquires first input/output (IO) addresses of all PCIE devices in the server;
and saving the first IO addresses to a storage unit of the basic input/output system.
From server startup, the basic input/output system automatically captures the first IO addresses of all PCIE devices in the server. The first IO addresses are the IO addresses of the PCIE devices in normal operation; they are stored in a storage unit of the basic input/output system to form an IO address database of the PCIE devices. Each time the server is restarted, or powered on again after being shut down, the basic input/output system re-acquires the PCIE devices in the server and automatically updates the IO address database, so that the IO addresses in the database accurately reflect the current running state of the PCIE devices.
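The IO address database and its refresh on every boot can be sketched as below. This is a simplified in-memory model, not firmware code; the class name `IoAddressDatabase` and the device dictionary layout are assumptions.

```python
class IoAddressDatabase:
    """In-memory stand-in for the storage unit holding the first IO addresses."""

    def __init__(self):
        self._by_address = {}

    def refresh(self, enumerated_devices):
        """Rebuild the database from scratch, as would happen on every boot."""
        self._by_address = {dev["io_address"]: dev for dev in enumerated_devices}

    def lookup(self, io_address):
        """Return the device recorded at this IO address, or None."""
        return self._by_address.get(io_address)
```

Rebuilding the mapping instead of merging into it means that devices removed since the last boot do not linger in the database.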
In some embodiments, when the operation information includes PCIE error information, determining the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information, and acquiring the device information of the faulty PCIE device, includes:
when the operation information includes PCIE error information, acquiring a second IO address from the PCIE error information;
comparing the first IO addresses with the second IO address, and determining the faulty PCIE device corresponding to the second IO address;
and acquiring the device information of the faulty PCIE device.
When a PCIE device fails, the second IO address in the PCIE error information is acquired. The second IO address is the IO address at which operation is abnormal. It is compared with the IO addresses in the IO address database, so that the specific IO address and the device information of the faulty PCIE device can be located precisely, and the second IO address and the device information of the faulty PCIE device are sent to the operating system and the baseboard management controller through a serial port.
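The address comparison can be sketched as a lookup of the second IO address among the stored first IO addresses. The mapping from address to device information and the function name `locate_faulty_device` are assumptions made for illustration.

```python
from typing import Optional


def locate_faulty_device(first_io_addresses: dict, second_io_address: int) -> Optional[dict]:
    """Match the IO address from the error record against the stored addresses."""
    device_info = first_io_addresses.get(second_io_address)
    if device_info is None:
        return None  # the address is not in the database
    # Combine the address with the stored device information for reporting.
    return {"io_address": second_io_address, **device_info}
```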
The operating system identifies the PCIE device under the operating system according to the second IO address and the device information, and checks the code logic in the PCIE error against the predefined code logic. When the PCIE error is found to be a non-physical error, that is, when it falls within the error range of the predefined code logic, the abnormal code in the PCIE error information is replaced with a correctable logic code. When the PCIE error does not fall within the error range of the predefined code logic, the PCIE error is a physical error, and the hardware needs to be replaced or repaired.
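The distinction between physical and non-physical errors can be sketched as a membership test against a predefined set. The code values in `PREDEFINED_NON_PHYSICAL_CODES` are made up for illustration, and `triage` is a hypothetical name; the application does not list the actual error range.

```python
# Hypothetical error codes that the predefined code logic is able to handle.
PREDEFINED_NON_PHYSICAL_CODES = frozenset({0x04, 0x0C, 0x14})


def triage(error_code: int) -> str:
    """Decide between software degradation and hardware maintenance."""
    if error_code in PREDEFINED_NON_PHYSICAL_CODES:
        return "replace_with_correctable"  # non-physical error
    return "hardware_maintenance"  # physical error
```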
The baseboard management controller parses the PCIE error information sent by the basic input/output system into log information that the user can understand and informs the user which PCIE device has a non-physical error, so that the user can subsequently carry out targeted maintenance on that PCIE device, which reduces troubleshooting time and improves maintenance efficiency.
In some embodiments, the method further comprises:
configuring a runtime environment of the GNU Compiler Collection (GCC) for the operating system;
compiling the GCC environment package and upgrading the GCC version;
when the GCC version upgrade is completed, configuring a PCIE uncorrectable error processing environment for the operating system;
after the PCIE uncorrectable error processing environment is configured, configuring for the operating system the dedicated communication signaling protocol required for processing PCIE uncorrectable errors, so that the operating system has the function of converting the error type of error information.
The user installs the operating system on the server in advance as required. At this point the operating system has no mechanism for converting a non-physical error into a correctable error, so it needs to be configured accordingly.
First, a software source is configured according to the operating system. Next, a GCC runtime environment is configured for the operating system; the version needs to be higher than 4.9. The prerequisites are installed, a build directory is created, the language environment used by GCC is configured, the GCC environment package is compiled, the GCC tools are installed, and the GCC version is upgraded. After the GCC version has been upgraded successfully, the kernel of the current operating system is upgraded to a higher version, which configures the PCIE uncorrectable error processing environment for the operating system. A successful kernel upgrade indicates that the PCIE uncorrectable error processing environment has been configured. Finally, the dedicated communication signaling protocol required for processing PCIE uncorrectable errors is configured for the operating system. The operating system then has the function of converting the error type of error information: after receiving the PCIE error information sent by the basic input/output system, it replaces non-physical error information with correctable error information, which completes the degradation of the error type, ensures the normal operation of the server, gives the user more time for maintenance, and reduces the economic loss caused by serious PCIE error information during server operation.
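The version requirement on GCC can be checked before the remaining configuration steps, as in the sketch below. It assumes that the version string comes from a command such as `gcc -dumpversion` and reads the requirement as "4.9 or later", which is an interpretation; `parse_major_minor` and `gcc_is_new_enough` are hypothetical helper names.

```python
import re

# Assumed reading of the requirement: version 4.9 or later.
MIN_GCC = (4, 9)


def parse_major_minor(version_text: str) -> tuple:
    """Extract (major, minor) from strings such as '4.8.5' or '11'."""
    match = re.match(r"\s*(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"not a version string: {version_text!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def gcc_is_new_enough(version_text: str) -> bool:
    """Return True if the given GCC version meets the assumed minimum."""
    return parse_major_minor(version_text) >= MIN_GCC
```

A version string with only a major number, such as `11`, is treated as minor version 0.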
It should be understood that, although the steps in the flowchart of Fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in Fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In some embodiments, as shown in fig. 2, there is provided an apparatus for processing PCIE error information, including: a monitoring module 210, a determining module 220, a sending module 230, a judging module 240, and a replacing module 250, wherein:
the monitoring module 210 is configured to monitor operation information of the server through the basic input/output system (BIOS);
the determining module 220 is configured to, when the operation information includes Peripheral Component Interconnect Express (PCIE) error information, determine the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information, and acquire device information of the faulty PCIE device;
the sending module 230 is configured to send the PCIE error information and the device information of the faulty PCIE device to an operating system;
the judging module 240 is configured to receive, through the operating system, the PCIE error information and the device information of the faulty PCIE device sent by the basic input/output system, and judge the error type of the PCIE error information;
the replacing module 250 is configured to replace the PCIE error information with correctable error information when the error type of the PCIE error information is a non-physical error.
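The judging and replacing steps can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the class and function names are hypothetical, the error type is assumed to be already known as a field of the record (the text does not describe how the operating system decides whether an error is physical), and the incoming error is assumed to be reported as uncorrectable.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Severity(Enum):
    CORRECTABLE = "correctable"
    UNCORRECTABLE = "uncorrectable"


class ErrorKind(Enum):
    PHYSICAL = "physical"
    NON_PHYSICAL = "non_physical"


@dataclass(frozen=True)
class PcieError:
    """Hypothetical record for the error information passed from BIOS to the OS."""
    device: str          # device information of the faulty PCIE device
    severity: Severity   # severity as originally reported
    kind: ErrorKind      # assumed to be supplied by the judging step


def replace_if_non_physical(error):
    """Return correctable error information for non-physical errors.

    Physical errors are returned unchanged, since the text only describes
    replacement for the non-physical case.
    """
    if error.kind is ErrorKind.NON_PHYSICAL:
        return replace(error, severity=Severity.CORRECTABLE)
    return error
```

Returning a new record rather than mutating the original keeps the originally reported error information available for a later comparison.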
In some embodiments, the apparatus further includes a generating module 260, configured to:
after the operating system replaces the PCIE error information with correctable error information, check whether non-physical error information still exists in the faulty PCIE device;
when no non-physical error information exists in the faulty PCIE device, generate degradation information;
and send the degradation information to the basic input/output system.
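A minimal Python sketch of this check is given below. The dictionary layout and the function name are assumptions made for illustration; the text states only that degradation information is generated once no non-physical error information remains on the faulty device, and that it includes the correctable error information.

```python
def make_degradation_info(device, remaining_errors, correctable_error):
    """Build degradation information for a device, or return None.

    remaining_errors is a hypothetical list of dicts with "device" and "kind"
    keys describing errors still recorded after the replacement step.
    """
    still_pending = any(
        err["device"] == device and err["kind"] == "non_physical"
        for err in remaining_errors
    )
    if still_pending:
        # Non-physical error information is still present: nothing to report yet.
        return None
    return {"device": device, "correctable_error": correctable_error}
```

The returned dictionary stands in for the message that would be sent to the basic input/output system.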
In some embodiments, the degradation information includes the correctable error information, and the sending module 230 is further configured to:
receive, at the basic input/output system, the correctable error information sent by the operating system;
judge whether the correctable error information is consistent with the PCIE error information;
and when the correctable error information is inconsistent with the PCIE error information, send notification information indicating that the PCIE error information has been successfully degraded to a baseboard management controller.
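The comparison performed by the basic input/output system can be sketched as below. The function name, the notification fields, and the use of a callback to stand in for the channel to the baseboard management controller are all assumptions; the text does not define the message format or what happens when the two pieces of information are consistent, so this sketch simply sends nothing in that case.

```python
def confirm_degradation(original_error, correctable_error, notify_bmc):
    """Notify the BMC when the received information differs from the original.

    notify_bmc is a hypothetical callable that delivers one notification.
    Returns True when a notification was sent, False otherwise.
    """
    if correctable_error == original_error:
        # Consistent information means the error was not replaced.
        return False
    notify_bmc({
        "event": "pcie_error_degraded",
        "original": original_error,
        "replacement": correctable_error,
    })
    return True
```

Passing the delivery function in as a parameter keeps the comparison logic independent of how the notification is actually transported.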
In some embodiments, the apparatus further includes a parsing module 270, configured to:
receive, at the baseboard management controller, the notification information sent by the basic input/output system;
and parse the notification information to obtain log information that serves as reference data for maintenance.
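To make the parsing step concrete, the sketch below turns a notification into a single log line. The JSON encoding, the field names, and the log layout are hypothetical choices for this example only; the text does not specify the notification format used between the basic input/output system and the baseboard management controller.

```python
import json


def notification_to_log(raw_notification, timestamp):
    """Parse a JSON notification and format it as one log line.

    Assumed fields: "device", "event", "from", "to". The timestamp is passed
    in explicitly so the function has no hidden dependency on the clock.
    """
    message = json.loads(raw_notification)
    return "{} {} {}: {} -> {}".format(
        timestamp,
        message["device"],
        message["event"],
        message["from"],
        message["to"],
    )
```

A maintenance engineer reading such a line can see which device was affected and how its error type was changed.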
In some embodiments, the apparatus further includes a saving module 280, which operates before the basic input/output system monitors the operation information of the server and is configured to:
when the server is started, automatically acquire, through the basic input/output system, first input/output (IO) addresses of all Peripheral Component Interconnect Express (PCIE) devices in the server;
and save the first IO addresses to a storage unit of the basic input/output system.
In some embodiments, the determining module 220 is specifically configured to:
when the operation information includes PCIE error information, acquire a second IO address from the PCIE error information;
compare the first IO addresses with the second IO address, and determine the faulty PCIE device corresponding to the second IO address;
and acquire the device information of the faulty PCIE device.
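The address comparison can be sketched in Python as a table lookup. The function names, the dictionary keys, and the example addresses are hypothetical; the sketch assumes that each device has one IO address recorded at startup and that the error information carries the address of the device that raised it.

```python
def build_address_table(devices):
    """Map each first IO address recorded at startup to its device information."""
    return {dev["io_address"]: dev for dev in devices}


def find_faulty_device(address_table, error_info):
    """Return device information whose saved address matches the second IO address.

    Returns None when the address in the error information matches no saved entry.
    """
    return address_table.get(error_info["io_address"])
```

For example, with two devices saved at startup, an error carrying the second device's address resolves to that device's record:

```python
table = build_address_table([
    {"io_address": 0x2000, "name": "nic0"},
    {"io_address": 0x3000, "name": "nvme0"},
])
faulty = find_faulty_device(table, {"io_address": 0x3000})
```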
In some embodiments, the apparatus further includes a configuration module 280, configured to:
configure a runtime environment of the GNU compiler suite for the operating system;
compile the environment package of the GNU compiler suite and upgrade the version of the GNU compiler suite;
when the version upgrade of the GNU compiler suite is completed, configure a PCIE uncorrectable error processing environment for the operating system;
and after the PCIE uncorrectable error processing environment has been configured, configure for the operating system the communication professional signaling protocol required for processing PCIE uncorrectable errors, so that the operating system is able to convert the error type of error information.
For specific limitations of the apparatus for processing PCIE error information, refer to the limitations on the method for processing PCIE error information above, which are not repeated here. Each module in the apparatus may be implemented wholly or partly by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In some embodiments, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores operation data. The network interface of the computer device communicates with an external terminal through a network connection. When executed by the processor, the computer program implements a method for processing PCIE error information.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies. A particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In some embodiments, there is provided a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
the basic input/output system monitors the operation information of the server;
when the operation information includes Peripheral Component Interconnect Express (PCIE) error information, determining the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information, and acquiring device information of the faulty PCIE device;
sending the PCIE error information and the device information of the faulty PCIE device to an operating system;
the operating system receives the PCIE error information and the device information of the faulty PCIE device sent by the basic input/output system, and judges the error type of the PCIE error information;
when the error type of the PCIE error information is a non-physical error, replacing the PCIE error information with correctable error information.
In some embodiments, the processor, when executing the computer program, further implements the following steps: after the operating system replaces the PCIE error information with correctable error information, checking whether non-physical error information still exists in the faulty PCIE device; when no non-physical error information exists in the faulty PCIE device, generating degradation information; and sending the degradation information to the basic input/output system.
In some embodiments, the degradation information includes the correctable error information, and the processor, when executing the computer program, further implements the following steps: the basic input/output system receives the correctable error information sent by the operating system; judging whether the correctable error information is consistent with the PCIE error information; and when the correctable error information is inconsistent with the PCIE error information, sending notification information indicating that the PCIE error information has been successfully degraded to a baseboard management controller.
In some embodiments, the processor, when executing the computer program, further implements the following steps: the baseboard management controller receives the notification information sent by the basic input/output system; and parsing the notification information to obtain log information that serves as reference data for maintenance.
In some embodiments, the processor, when executing the computer program, further implements the following steps before the basic input/output system monitors the operation information of the server: when the server is started, the basic input/output system automatically acquires first input/output (IO) addresses of all Peripheral Component Interconnect Express (PCIE) devices in the server; and saving the first IO addresses to a storage unit of the basic input/output system.
In some embodiments, when the processor executes the computer program, the step of determining the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information and acquiring the device information of the faulty PCIE device, when the operation information includes PCIE error information, includes: acquiring a second IO address from the PCIE error information; comparing the first IO addresses with the second IO address, and determining the faulty PCIE device corresponding to the second IO address; and acquiring the device information of the faulty PCIE device.
In some embodiments, the processor, when executing the computer program, further implements the following steps: configuring a runtime environment of the GNU compiler suite for the operating system; compiling the environment package of the GNU compiler suite and upgrading the version of the GNU compiler suite; when the version upgrade of the GNU compiler suite is completed, configuring a PCIE uncorrectable error processing environment for the operating system; and after the PCIE uncorrectable error processing environment has been configured, configuring for the operating system the communication professional signaling protocol required for processing PCIE uncorrectable errors, so that the operating system is able to convert the error type of error information.
In some embodiments, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following steps:
the basic input/output system monitors the operation information of the server;
when the operation information includes Peripheral Component Interconnect Express (PCIE) error information, determining the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information, and acquiring device information of the faulty PCIE device;
sending the PCIE error information and the device information of the faulty PCIE device to an operating system;
the operating system receives the PCIE error information and the device information of the faulty PCIE device sent by the basic input/output system, and judges the error type of the PCIE error information;
when the error type of the PCIE error information is a non-physical error, replacing the PCIE error information with correctable error information.
In some embodiments, the computer program, when executed by the processor, further implements the following steps: after the operating system replaces the PCIE error information with correctable error information, checking whether non-physical error information still exists in the faulty PCIE device; when no non-physical error information exists in the faulty PCIE device, generating degradation information; and sending the degradation information to the basic input/output system.
In some embodiments, the degradation information includes the correctable error information, and the computer program, when executed by the processor, further implements the following steps: the basic input/output system receives the correctable error information sent by the operating system; judging whether the correctable error information is consistent with the PCIE error information; and when the correctable error information is inconsistent with the PCIE error information, sending notification information indicating that the PCIE error information has been successfully degraded to a baseboard management controller.
In some embodiments, the computer program, when executed by the processor, further implements the following steps: the baseboard management controller receives the notification information sent by the basic input/output system; and parsing the notification information to obtain log information that serves as reference data for maintenance.
In some embodiments, the computer program, when executed by the processor, further implements the following steps before the basic input/output system monitors the operation information of the server: when the server is started, the basic input/output system automatically acquires first input/output (IO) addresses of all Peripheral Component Interconnect Express (PCIE) devices in the server; and saving the first IO addresses to a storage unit of the basic input/output system.
In some embodiments, when the computer program is executed by the processor, the step of determining the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information and acquiring the device information of the faulty PCIE device, when the operation information includes PCIE error information, includes: acquiring a second IO address from the PCIE error information; comparing the first IO addresses with the second IO address, and determining the faulty PCIE device corresponding to the second IO address; and acquiring the device information of the faulty PCIE device.
In some embodiments, the computer program, when executed by the processor, further implements the following steps: configuring a runtime environment of the GNU compiler suite for the operating system; compiling the environment package of the GNU compiler suite and upgrading the version of the GNU compiler suite; when the version upgrade of the GNU compiler suite is completed, configuring a PCIE uncorrectable error processing environment for the operating system; and after the PCIE uncorrectable error processing environment has been configured, configuring for the operating system the communication professional signaling protocol required for processing PCIE uncorrectable errors, so that the operating system is able to convert the error type of error information.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A method for processing PCIE error information, characterized in that the method comprises the following steps:
the basic input/output system monitors the operation information of the server;
when the operation information includes Peripheral Component Interconnect Express (PCIE) error information, determining the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information, and acquiring device information of the faulty PCIE device;
sending the PCIE error information and the device information of the faulty PCIE device to an operating system;
the operating system receives the PCIE error information and the device information of the faulty PCIE device sent by the basic input/output system, and judges the error type of the PCIE error information;
and when the error type of the PCIE error information is a non-physical error, replacing the PCIE error information with correctable error information.
2. The method of claim 1, further comprising:
after the operating system replaces the PCIE error information with correctable error information, checking whether non-physical error information still exists in the faulty PCIE device;
when no non-physical error information exists in the faulty PCIE device, generating degradation information;
and sending the degradation information to the basic input/output system.
3. The method of claim 2, wherein the degradation information comprises the correctable error information, the method further comprising:
the basic input/output system receives the correctable error information sent by the operating system;
judging whether the correctable error information is consistent with the PCIE error information;
and when the correctable error information is inconsistent with the PCIE error information, sending notification information indicating that the PCIE error information has been successfully degraded to a baseboard management controller.
4. The method of claim 3, further comprising:
the baseboard management controller receives a notification message sent by the basic input and output system;
and analyzing the notification information to obtain log information so as to provide maintenance reference data.
5. The method of claim 1, wherein before the basic input/output system monitors the operation information of the server, the method further comprises:
when the server is started, the basic input/output system automatically acquires first input/output (IO) addresses of all Peripheral Component Interconnect Express (PCIE) devices in the server;
and saving the first IO addresses to a storage unit of the basic input/output system.
6. The method according to claim 5, wherein, when the operation information includes PCIE error information, determining the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information and acquiring the device information of the faulty PCIE device comprises:
when the operation information includes PCIE error information, acquiring a second IO address from the PCIE error information;
comparing the first IO addresses with the second IO address, and determining the faulty PCIE device corresponding to the second IO address;
and acquiring the device information of the faulty PCIE device.
7. The method of claim 1, further comprising:
configuring a running environment of a GNU compiler suite for the operating system;
compiling the environment package of the GNU compiler suite and upgrading the version of the GNU compiler suite;
when the version upgrading of the GNU compiler suite is completed, configuring a PCIE uncorrectable error processing environment for the operating system;
and after the configuration of the processing environment of the PCIE uncorrectable errors is finished, configuring a communication professional signaling protocol required for processing the PCIE uncorrectable errors for the operating system, so that the operating system has a function of converting error types of error information.
8. An apparatus for processing PCIE error information, the apparatus comprising:
a monitoring module, configured to monitor the operation information of the server through the basic input/output system;
a determining module, configured to, when the operation information includes Peripheral Component Interconnect Express (PCIE) error information, determine the faulty PCIE device corresponding to the PCIE error information according to the PCIE error information, and acquire device information of the faulty PCIE device;
a sending module, configured to send the PCIE error information and the device information of the faulty PCIE device to an operating system;
a judging module, configured to receive, through the operating system, the PCIE error information and the device information of the faulty PCIE device sent by the basic input/output system, and judge the error type of the PCIE error information;
and a replacing module, configured to replace the PCIE error information with correctable error information when the error type of the PCIE error information is a non-physical error.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111455713.3A CN114138606A (en) 2021-12-01 2021-12-01 PCIE error information processing method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111455713.3A CN114138606A (en) 2021-12-01 2021-12-01 PCIE error information processing method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN114138606A true CN114138606A (en) 2022-03-04

Family

ID=80386536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111455713.3A Withdrawn CN114138606A (en) 2021-12-01 2021-12-01 PCIE error information processing method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN114138606A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356644A (en) * 2022-03-18 2022-04-15 阿里巴巴(中国)有限公司 PCIE equipment fault processing method and device
TWI800443B (en) * 2022-08-15 2023-04-21 緯穎科技服務股份有限公司 Peripheral component interconnect express device error reporting optimization method and peripheral component interconnect express device error reporting optimization system
US11953975B2 (en) 2022-08-15 2024-04-09 Wiwynn Corporation Peripheral component interconnect express device error reporting optimization method and system capable of filtering error reporting messages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20220304)