CN113076213A

CN113076213A - Method and system for optimizing system management interrupt handling hardware error time

Info

Publication number: CN113076213A
Application number: CN202110338474.7A
Authority: CN
Inventors: 罗鹏芳; 陈思彤; 李道童
Original assignee: Shandong Yingxin Computer Technology Co Ltd
Current assignee: Shandong Yingxin Computer Technology Co Ltd
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2021-07-06
Anticipated expiration: 2041-03-30
Also published as: CN113076213B

Abstract

The invention belongs to the technical field of hardware error handling by system management interrupts, and relates to a method and a system for optimizing the time a system management interrupt takes to handle hardware errors. The method comprises the following steps: S1: creating a system management memory error handling module in the boot stage; S2: when a hardware error occurs at runtime and triggers a system management interrupt, the CPUs enter system management mode and check whether the hardware detection mechanism has been triggered; S3: obtaining the current policy setting of the RAS function and determining whether to execute it; S4: when the basic input/output system (BIOS) has enabled the error handling function, reading the machine check (MC) bank to check the error information, and determining the memory location of the physical address recorded in the MC bank; S5: reading and preprocessing the pre-allocated error information space; S6: sending the error information to the baseboard management controller, after which the CPUs exit system management mode and memory error handling ends. The invention reduces the time a system management interrupt spends handling a hardware fault.

Description

Method and system for optimizing system management interrupt handling hardware error time

Technical Field

The invention belongs to the technical field of hardware error handling by system management interrupts, and particularly relates to a method and a system for optimizing the time a system management interrupt takes to handle hardware errors.

Background

Fault detection and handling on general-purpose servers based on Intel chips supports the firmware-first principle. Once the server has booted into the operating system, a hardware error first triggers a system management interrupt, and the basic input/output system (BIOS) detects the error. The BIOS reads the current fault handling policy from system variables and selects the corresponding handling method. When the fault address recorded by the hardware detection mechanism points to memory, the address is resolved by a memory address analysis algorithm, which needs to read a large number of address-decoding registers from the CPU. When the error is sent to the baseboard management controller over the Intelligent Platform Management Interface (IPMI), memory is first allocated to hold the data to be sent, and the IPMI protocol is loaded before sending so that the IPMI command can be executed. The error information is then sent to the baseboard management controller, the error is cleared, and control returns to the operating system. The baseboard management controller writes the error data received from the BIOS to its log, and the user identifies the alarming component from the logged information.

A system management interrupt requires all CPUs of the server to enter system management mode. The shorter the time between entering and exiting system management mode, the smaller the impact on system performance; ideally the time stays within one hundred milliseconds, so that operating system performance is not affected. In the prior art, much of the data used during fault handling is obtained at interrupt time: a specified variable is found in the variable domain by executing GetVariable, a specified protocol is found in the protocol domain by executing LocateProtocol, register information is read from the CPU, and so on. Data that was already initialized to a fixed value in the boot stage is fetched in the same way, which leaves room for optimization that would reduce the handling time of the system management interrupt. This is a technical problem in the prior art.
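The cost pattern described above can be illustrated with a small counting model. This is only a sketch: `FirmwareServices`, its two methods, and the variable and protocol names are hypothetical stand-ins for the GetVariable and LocateProtocol lookups, not real firmware interfaces.

```python
class FirmwareServices:
    """Hypothetical stand-in for firmware services; counts every slow lookup."""

    def __init__(self):
        self.slow_lookups = 0

    def get_variable(self, name):
        self.slow_lookups += 1  # models a search of the variable domain
        return {"RasPolicy": 0b101}[name]

    def locate_protocol(self, name):
        self.slow_lookups += 1  # models a search of the protocol domain
        return f"protocol:{name}"


def handle_interrupt_uncached(fw):
    """Prior-art style: repeat both lookups on every interrupt."""
    return fw.get_variable("RasPolicy"), fw.locate_protocol("Ipmi")


def make_cached_handler(fw):
    """Perform both lookups once; the handler only returns the saved copy."""
    saved = (fw.get_variable("RasPolicy"), fw.locate_protocol("Ipmi"))

    def handle_interrupt_cached():
        return saved

    return handle_interrupt_cached


uncached_fw = FirmwareServices()
for _ in range(3):
    handle_interrupt_uncached(uncached_fw)

cached_fw = FirmwareServices()
handler = make_cached_handler(cached_fw)
for _ in range(3):
    handler()

print(uncached_fw.slow_lookups, cached_fw.slow_lookups)  # prints "6 2"
```

In this model the number of slow lookups grows with the number of interrupts in the first case and stays constant in the second, which is the room for optimization the paragraph refers to.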

In view of this, the present invention provides a method and a system for optimizing the time a system management interrupt takes to handle hardware errors, so as to overcome the above defects and problems in the prior art.

Disclosure of Invention

In the prior art, a system management interrupt takes a long time to handle a hardware error, which affects system performance. To solve this technical problem, the invention provides a method and a system for optimizing the time a system management interrupt takes to handle hardware errors.

In order to achieve the purpose, the invention provides the following technical scheme:

In a first aspect, the present invention provides a method for optimizing the time a system management interrupt takes to handle hardware errors, comprising the following steps:

S1: in the boot stage, create a system management memory error handling module: allocate the data space of the system management memory; allocate the parameter data structure for memory address analysis; allocate memory for storing error information, initialize it to all zeros, and obtain its address; obtain the address of the Intelligent Platform Management Interface (IPMI) protocol and store it in system management memory;

S2: when a hardware error occurs at runtime with the firmware-first function enabled, a system management interrupt is triggered and all CPUs enter system management mode; the board support package checks whether the hardware detection mechanism has been triggered; if so, proceed to check the policy of the RAS function; if not, exit system management mode;

S3: obtain the current policy setting of the RAS function from system management memory and determine whether the policy is to be executed; if so, continue and execute the policy; otherwise, exit system management mode;

S4: when the BIOS has enabled the error handling function, read the machine check (MC) bank to check the error information; when the MC bank has recorded a physical address, call the analysis algorithm, read the values of the SAD/TAD/RIR decoding registers used by the algorithm from system management memory, and decode those values to obtain the memory location of the physical address;

S5: read the pre-allocated error information space from system management memory and preprocess it;

S6: send the error information to the baseboard management controller; all CPUs exit system management mode and memory error handling ends.
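The control flow of steps S2 to S6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dictionary keys, the `decode` and `send_to_bmc` callables, and the choice to skip decoding when error handling is disabled are all hypothetical.

```python
def handle_memory_error_smi(ctx, mc_bank, decode, send_to_bmc):
    """Return the step at which handling ended, following S2 to S6."""
    # S2: exit at once if the hardware detection mechanism was not triggered.
    if not mc_bank.get("valid", False):
        return "exit at S2"
    # S3: the RAS policy is read from the saved context, not looked up again.
    if not ctx["ras_policy_enabled"]:
        return "exit at S3"
    # S4: decode only if error handling is enabled and an address was recorded
    # (assumption: otherwise the location is simply left unset).
    location = None
    if ctx["error_handling_enabled"] and mc_bank.get("address") is not None:
        location = decode(ctx["decode_registers"], mc_bank["address"])
    # S5: clear the pre-allocated record, then fill it in.
    record = ctx["error_record"]
    record.clear()
    record.update(type=mc_bank["type"], location=location)
    # S6: report to the BMC; the caller then leaves system management mode.
    send_to_bmc(dict(record))
    return "done at S6"


def toy_decode(registers, address):
    """Placeholder decode: not a real SAD/TAD/RIR algorithm."""
    return address % registers["ways"]


sent = []
ctx = {
    "ras_policy_enabled": True,
    "error_handling_enabled": True,
    "decode_registers": {"ways": 4},
    "error_record": {"type": "stale"},
}
result = handle_memory_error_smi(
    ctx, {"valid": True, "address": 10, "type": "correctable"},
    toy_decode, sent.append)
print(result, sent)
```

Every value the handler needs is taken from `ctx`, which stands in for the data prepared in step S1; nothing is searched for while the CPUs are in system management mode.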

Preferably, in step S1, the data space of the system management memory is used to store the policy of the RAS function, which includes whether logging is turned on or off, whether memory data backup handling is supported, and whether the interrupt that notifies the operating system to continue handling is turned on or off. In the allocated parameter data structure for memory address analysis, each CPU has one set of decoder data, which is read from that CPU's decoding registers and filled into system management memory.

The effect of this step is that the fixed data needed during error handling is initialized in advance: in the boot stage, the related data structures are created in system management memory and every field is assigned, in preparation for the subsequent handling of hardware errors.
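A possible shape for the data prepared in step S1 is sketched below. The class names, field names, buffer size, and the `fake_read_registers` helper are hypothetical; the source only states which kinds of data are saved, not their layout.

```python
from dataclasses import dataclass


@dataclass
class RasPolicy:
    logging_enabled: bool
    memory_backup_supported: bool
    notify_os_enabled: bool


@dataclass
class SmmErrorContext:
    policy: RasPolicy
    decode_registers: dict   # CPU index -> saved decoding register values
    error_buffer: bytearray  # pre-allocated, zero-initialized
    ipmi_protocol: object    # saved reference to the IPMI protocol


def create_error_context(policy, read_decode_registers, ipmi_protocol,
                         cpu_count, buffer_size=32):
    """Build the context once during boot (step S1)."""
    return SmmErrorContext(
        policy=policy,
        decode_registers={cpu: read_decode_registers(cpu)
                          for cpu in range(cpu_count)},
        error_buffer=bytearray(buffer_size),  # bytearray(n) is all zeros
        ipmi_protocol=ipmi_protocol,
    )


def fake_read_registers(cpu):
    """Stand-in for reading one CPU's decoding registers."""
    return {"sad": 0x100 + cpu, "tad": 0x200 + cpu, "rir": 0x300 + cpu}


context = create_error_context(
    RasPolicy(True, False, True), fake_read_registers, "ipmi-protocol",
    cpu_count=2)
print(sorted(context.decode_registers), len(context.error_buffer))
```

The register reads happen once per CPU inside `create_error_context`; later steps would only read `context.decode_registers`.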

Preferably, in step S4, the analysis algorithm is an algorithm that translates a system address into a memory address.

The effect of this step is that the memory location of the error is obtained, which makes it straightforward to fill in the error information.
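The idea of translating a system address using saved register values can be shown with a deliberately simplified toy model. The real SAD/TAD/RIR decoding is platform-specific and far more involved; the field names, the role assigned to each register group, and the 64-byte interleave granularity below are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavedDecodeRegisters:
    sad_limits: tuple       # (exclusive upper address bound, socket), ascending
    tad_channel_ways: int   # number of interleaved channels (toy TAD)
    rir_rank_ways: int      # number of interleaved ranks (toy RIR)
    granularity: int = 64   # bytes per interleave unit (assumed)


def decode_address(regs, address):
    """Map a system address to (socket, channel, rank) from saved values only."""
    socket = next((s for limit, s in regs.sad_limits if address < limit), None)
    if socket is None:
        raise ValueError(f"address {address:#x} is outside every SAD range")
    unit = address // regs.granularity
    channel = unit % regs.tad_channel_ways
    rank = (unit // regs.tad_channel_ways) % regs.rir_rank_ways
    return {"socket": socket, "channel": channel, "rank": rank}


regs = SavedDecodeRegisters(
    sad_limits=((0x1000, 0), (0x2000, 1)),
    tad_channel_ways=2,
    rir_rank_ways=2,
)
print(decode_address(regs, 0x10C0))
# prints {'socket': 1, 'channel': 1, 'rank': 1}
```

`decode_address` takes the saved values as an argument and performs no register access of its own, which mirrors reading the decoding register values from system management memory rather than from the CPU.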

Preferably, in step S5, the preprocessing includes clearing the previously recorded error information and filling in the error information according to the data exchange structure defined between in-band and out-of-band, where the error information includes the memory error type, the memory location, and the error level.

The effect of this step is that filling in the error information through the data exchange structure defined between in-band and out-of-band records the memory error type, the memory location, and the error level, which makes the error easy to log and to resolve.
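The clear-then-fill preprocessing can be sketched with a fixed-layout record written into a pre-allocated buffer. The five-byte layout in `RECORD_FORMAT` is hypothetical; the source does not give the actual in-band/out-of-band structure.

```python
import struct

# Hypothetical layout: error type, error level, socket, channel, rank (1 byte each).
RECORD_FORMAT = "<BBBBB"


def preprocess_error_buffer(buffer, error_type, error_level, location):
    """Clear the previous record in place, then fill in the new one."""
    buffer[:] = bytes(len(buffer))  # same length, all zeros, no reallocation
    struct.pack_into(RECORD_FORMAT, buffer, 0, error_type, error_level,
                     location["socket"], location["channel"], location["rank"])
    return buffer


error_buffer = bytearray(b"\xff" * 8)  # leftover data from an earlier error
preprocess_error_buffer(error_buffer, error_type=1, error_level=2,
                        location={"socket": 1, "channel": 1, "rank": 0})
print(error_buffer.hex())  # prints "0102010100000000"
```

Clearing before filling ensures that bytes beyond the new record do not carry over from the previous error.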

Preferably, in step S6, the address of the IPMI protocol is obtained from system management memory, and an IPMI command is called through that protocol to send the error information to the baseboard management controller, which records a memory error log. At the same time, the policy setting of the RAS function that determines whether the operating system is notified of the error information is obtained; when notification of the operating system is turned on, an interrupt is sent to the operating system.
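The reporting step can be sketched as below. `FakeIpmiProtocol`, its `send_command` method, and the `raise_os_interrupt` callback are hypothetical stand-ins; they do not correspond to a real IPMI library or firmware protocol.

```python
class FakeIpmiProtocol:
    """Stand-in for the protocol whose address was saved during boot."""

    def __init__(self):
        self.commands = []

    def send_command(self, payload):
        self.commands.append(bytes(payload))  # the BMC would log this record


def report_error(saved_ipmi, notify_os_enabled, record, raise_os_interrupt):
    """Send the record through the saved protocol, then notify the OS if enabled."""
    saved_ipmi.send_command(record)
    if notify_os_enabled:
        raise_os_interrupt()
    return notify_os_enabled


ipmi = FakeIpmiProtocol()
os_interrupts = []
notified = report_error(ipmi, True, bytearray([1, 2, 1, 1, 0]),
                        lambda: os_interrupts.append("interrupt"))
print(notified, ipmi.commands, os_interrupts)
```

The protocol object is passed in rather than located inside `report_error`, reflecting that its address is taken from system management memory instead of being searched for during the interrupt.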

In a second aspect, the present invention provides a system for optimizing the time a system management interrupt takes to handle hardware errors, comprising:

a module for creating system management memory error handling, which allocates the data space of the system management memory; allocates the parameter data structure for memory address analysis; allocates memory for storing error information, initializes it to all zeros, and obtains its address; and obtains the address of the Intelligent Platform Management Interface (IPMI) protocol and stores it in system management memory;

a hardware detection mechanism trigger module: when a hardware error occurs at runtime with the firmware-first function enabled, a system management interrupt is triggered, all CPUs enter system management mode, and the board support package checks whether the hardware detection mechanism has been triggered;

an error information memory location determination module: when the BIOS has enabled the error handling function, the module reads the machine check (MC) bank to check the error information; when the MC bank has recorded a physical address, it calls the analysis algorithm, reads the values of the SAD/TAD/RIR decoding registers used by the algorithm from system management memory, and decodes those values to obtain the memory location of the physical address;

an error information space preprocessing module, which reads the pre-allocated error information space from system management memory and preprocesses it.

Preferably, in the module for creating system management memory error handling, the data space of the system management memory is used to store the policy of the RAS function, which includes whether logging is turned on or off, whether memory data backup handling is supported, and whether the interrupt that notifies the operating system to continue handling is turned on or off. In the allocated parameter data structure for memory address analysis, each CPU has one set of decoder data, which is read from that CPU's decoding registers and filled into system management memory.

Preferably, in the hardware detection mechanism trigger module, if the hardware detection mechanism has been triggered, the policy of the RAS function is checked next; otherwise, system management mode is exited. The current policy setting of the RAS function is obtained from system management memory and it is determined whether the policy is to be executed; if so, the policy is executed; otherwise, system management mode is exited.

Preferably, in the error information memory location determination module, the analysis algorithm is an algorithm that translates a system address into a memory address.

Preferably, in the error information space preprocessing module, the preprocessing includes clearing the previously recorded error information and filling in the error information according to the data exchange structure defined between in-band and out-of-band, where the error information includes the memory error type, the memory location, and the error level.

Preferably, the address of the IPMI protocol is obtained from system management memory, and an IPMI command is called through that protocol to send the error information to the baseboard management controller, which records a memory error log. At the same time, the policy setting of the RAS function that determines whether the operating system is notified of the error information is obtained; when notification of the operating system is turned on, an interrupt is sent to the operating system. All CPUs then exit system management mode and memory error handling ends.

The advantage of the invention is that the parameters and variables needed in system management mode are initialized and fixed in system management memory during boot, and are read directly from system management memory when used. This optimizes the data handling time of the system management interrupt, greatly reduces its overall handling time, and reduces its impact on the operating system. In addition, the design principle of the invention is reliable, its structure is simple, and its application prospects are very broad.

Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.

Drawings

In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

Fig. 1 is a flowchart of a method for optimizing hardware error time of system management interrupt handling according to embodiment 1 of the present invention.

Fig. 2 is a schematic block diagram of a system for optimizing hardware error time of system management interrupt handling according to embodiment 2 of the present invention.

Reference numerals: 1 - module for creating system management memory error handling; 2 - hardware detection mechanism trigger module; 3 - error information memory location determination module; 4 - error information space preprocessing module.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The following explains key terms appearing in the present invention.

An important application of the system management interrupt is handling hardware faults. When a hardware fault occurs, a system management interrupt is triggered automatically and the system management interrupt handler processes the fault: it identifies the device the fault points to, reports the fault so that the baseboard management controller records it in its log, and reports the error information to the operating system, which continues handling the fault through a machine check interrupt or a non-maskable interrupt. At the same time, the unit where the fault is located is isolated or recovered, so that the faulty component is not used further.

Example 1:

As shown in Fig. 1, the present embodiment provides a method for optimizing the time a system management interrupt takes to handle hardware errors, including the following steps:

S1: in the boot stage, create a system management memory error handling module: allocate the data space of the system management memory; allocate the parameter data structure for memory address analysis; allocate memory for storing error information, initialize it to all zeros, and obtain its address; obtain the address of the Intelligent Platform Management Interface (IPMI) protocol and store it in system management memory;

The data space of the system management memory is used to store the policy of the RAS function, which includes whether logging is turned on or off, whether memory data backup handling is supported, and whether the interrupt that notifies the operating system to continue handling is turned on or off. In the allocated parameter data structure for memory address analysis, each CPU has one set of decoder data, which is read from that CPU's decoding registers and filled into system management memory. In this way, the fixed data needed during error handling is initialized in advance: in the boot stage, the related data structures are created in system management memory and every field is assigned, in preparation for the subsequent handling of hardware errors.

In step S4, the analysis algorithm is an algorithm that translates a system address into a memory address; it yields the memory location of the error, which makes it straightforward to fill in the error information.

In step S5, the preprocessing includes clearing the previously recorded error information and filling in the error information according to the data exchange structure defined between in-band and out-of-band, where the error information includes the memory error type, the memory location, and the error level. Filling in the error information through this structure records the memory error type, the memory location, and the error level, which makes the error easy to log and to resolve.

S6: send the error information to the baseboard management controller; all CPUs exit system management mode and memory error handling ends;

The address of the IPMI protocol is obtained from system management memory, and an IPMI command is called through that protocol to send the error information to the baseboard management controller, which records a memory error log. At the same time, the policy setting of the RAS function that determines whether the operating system is notified of the error information is obtained; when notification of the operating system is turned on, an interrupt is sent to the operating system.

Example 2:

As shown in Fig. 2, the present embodiment provides a system for optimizing the time a system management interrupt takes to handle hardware errors, including:

The module for creating system management memory error handling 1: allocates the data space of the system management memory; allocates the parameter data structure for memory address analysis; allocates memory for storing error information, initializes it to all zeros, and obtains its address; and obtains the address of the Intelligent Platform Management Interface (IPMI) protocol and stores it in system management memory. The data space of the system management memory is used to store the policy of the RAS function, which includes whether logging is turned on or off, whether memory data backup handling is supported, and whether the interrupt that notifies the operating system to continue handling is turned on or off. In the allocated parameter data structure for memory address analysis, each CPU has one set of decoder data, which is read from that CPU's decoding registers and filled into system management memory.

The hardware detection mechanism trigger module 2: when a hardware error occurs at runtime with the firmware-first function enabled, a system management interrupt is triggered, all CPUs enter system management mode, and the board support package checks whether the hardware detection mechanism has been triggered. If it has been triggered, the policy of the RAS function is checked next; otherwise, system management mode is exited. The current policy setting of the RAS function is obtained from system management memory and it is determined whether the policy is to be executed; if so, the policy is executed; otherwise, system management mode is exited.

The error information memory location determination module 3: when the BIOS has enabled the error handling function, the module reads the machine check (MC) bank to check the error information; when the MC bank has recorded a physical address, it calls the algorithm that translates a system address into a memory address, reads the values of the SAD/TAD/RIR decoding registers used by the algorithm from system management memory, and decodes the register information to obtain the memory location of the physical address.

The error information space preprocessing module 4: reads the pre-allocated error information space from system management memory and preprocesses it. The preprocessing includes clearing the previously recorded error information and filling in the error information according to the data exchange structure defined between in-band and out-of-band, where the error information includes the memory error type, the memory location, and the error level. The address of the IPMI protocol is obtained from system management memory, and an IPMI command is called through that protocol to send the error information to the baseboard management controller, which records a memory error log. At the same time, the policy setting of the RAS function that determines whether the operating system is notified of the error information is obtained; when notification of the operating system is turned on, an interrupt is sent to the operating system. All CPUs then exit system management mode and memory error handling ends.

Although the present invention has been described in detail with reference to the drawings and the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention falls within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for optimizing system management interrupt handling hardware error time, comprising the steps of:

2. The method of claim 1, wherein in step S1, the data space of the system management memory is used to store the policy of the RAS function, and the policy of the RAS function includes whether logging is turned on or off, whether memory data backup handling is supported, and whether the interrupt that notifies the operating system to continue handling is turned on or off.

3. The method of claim 2, wherein in step S4, the analysis algorithm is an algorithm that translates a system address into a memory address.

4. The method of claim 3, wherein in step S5, the preprocessing comprises clearing the previously recorded error information and filling in the error information according to the data exchange structure defined between in-band and out-of-band.

5. The method of claim 4, wherein in step S6, the address of the Intelligent Platform Management Interface (IPMI) protocol is obtained from system management memory, an IPMI command is called through that protocol to send the error information to the baseboard management controller, the baseboard management controller records a memory error log, the policy setting of the RAS function that determines whether the operating system is notified of the error information is obtained, and an interrupt is sent to the operating system when notification of the operating system is turned on.

6. A system for optimizing system management interrupt handling hardware error time, comprising:

a module for creating system management memory error handling, which allocates the data space of the system management memory, allocates the parameter data structure for memory address analysis, allocates memory for storing error information, initializes it to all zeros, obtains its address, and obtains the address of the Intelligent Platform Management Interface (IPMI) protocol and stores it in system management memory;

7. The system of claim 6, wherein in the module for creating system management memory error handling, the data space of the system management memory is used to store the policy of the RAS function, and the policy of the RAS function includes whether logging is turned on or off, whether memory data backup handling is supported, and whether the interrupt that notifies the operating system to continue handling is turned on or off.

8. The system of claim 7, wherein in the hardware detection mechanism trigger module, if the hardware detection mechanism has been triggered, the policy of the RAS function is checked next, and otherwise system management mode is exited; the current policy setting of the RAS function is obtained from system management memory and it is determined whether the policy is to be executed; if so, the policy is executed, and otherwise system management mode is exited.

9. The system of claim 8, wherein in the error information memory location determination module, the analysis algorithm is an algorithm that translates a system address into a memory address.

10. The system of claim 9, wherein in the error information space preprocessing module, the preprocessing comprises clearing the previously recorded error information and filling in the error information according to the data exchange structure defined between in-band and out-of-band; the address of the IPMI protocol is obtained from system management memory, an IPMI command is called through that protocol to send the error information to the baseboard management controller, and all CPUs exit system management mode to end memory error handling.