IE85357B1 - System and method for logging recoverable errors - Google Patents
System and method for logging recoverable errors
- Publication number
- IE85357B1
- Authority
- IE
- Ireland
- Prior art keywords
- recoverable
- status register
- chipset
- errors
- logging
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2268—Logging of test results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/362—Software debugging
- G06F11/3648—Software debugging using additional hardware
Description
PATENTS ACT, 1992
System and Method for Logging Recoverable Errors
Dell Products LP
SYSTEM AND METHOD FOR LOGGING RECOVERABLE ERRORS
TECHNICAL FIELD
The present disclosure relates generally to computer systems and
information handling systems, and, more specifically, to a system and method for
logging recoverable errors.
BACKGROUND
As the value and use of information continues to increase, individuals
and businesses seek additional ways to process and store information. One option
available to these users is an information handling system. An information handling
system generally processes, compiles, stores, and/or communicates information or
data for business, personal, or other purposes thereby allowing users to take advantage
of the value of the information. Because technology and information handling needs
and requirements vary between different users or applications, information handling
systems may vary with respect to the type of information handled; the methods for
handling the information; the methods for processing, storing or communicating the
information; the amount of information processed, stored, or communicated; and the
speed and efficiency with which the information is processed, stored, or
communicated. The variations in information handling systems allow for information
handling systems to be general or configured for a specific user or specific use such as
financial transaction processing, airline reservations, enterprise data storage, or global
communications. In addition, information handling systems may include or comprise
a variety of hardware and software components that may be configured to process,
store, and communicate information and may include one or more computer systems,
data storage systems, and networking systems.
Server systems can experience recoverable or correctable errors during
normal system operation. Such recoverable errors might occur, for example, when
memory units coupled to the server system fail. To increase system reliability, server
systems are often designed to capture and log recoverable or correctable errors as they
occur. Because recoverable errors often are warning signals for impending memory
failures, this capture-and-log process gives the server-system user a chance to replace
defective memory units before the entire system crashes. Server systems often route
errors to be logged by generating a System Management Interrupt (SMI) via sideband
signals. The SMI travels through the sideband to the CPU, and the CPU then freezes
ongoing server system processes. These pauses in processing caused by the SMI
enable the Basic Input Output System (BIOS) residing on the server system to log the
recoverable errors as they occur, using an SMI handler. Once the BIOS logs the errors,
the SMIs end, and the server system may resume performing any interrupted
processes. The Baseboard Management Controller (BMC), which manages the
interface between system management software and platform hardware, processes the
error logging commands received from the BIOS and does the actual writing to its
non-volatile memory. Throughout the entire notification process, the operating
system (OS) residing on the server system is unaware of the error and subsequent
logging of that error.
Some server systems, however, do not include sideband signal
capability. All communications must travel through the main transport link. Because
recoverable errors are correctable, the server system does not generate a notification
when recoverable errors occur. These server systems may thus be designed to report
recoverable errors by employing the server system BIOS or the chipset to perform
periodic scans, such as periodic SMIs. Similarly, these server systems may require
the server-system OS to periodically scan the system. For example, the OS might
periodically scan the system and log any recoverable errors that have been detected in
the machine check status register. A typical OS will scan about once every minute.
Using the server-system OS to periodically scan the system has its drawbacks,
however. For example, most hardware errors are system-specific. Typically,
however, an OS lacks any understanding of the specific architecture for the system.
The OS often cannot identify which component is at fault without seeking help from
the system BIOS, thereby tying up both resources. Server system users often require
more specificity than the generic error logging performed by an OS, particularly if the
system at issue is a high-end server system. Moreover, the OS will often log errors in
a machine check status register, which does not store information regarding the error
source and therefore does not permit the system or user to later determine the location
of that error source. Although some OS versions can maintain a log of as many as ten
recoverable errors per scan, typically an OS will disable further logging of
recoverable errors once that limit is reached, thereby preventing the user from looking at
errors over time to determine the source of the problems.
SUMMARY
In accordance with the present disclosure, a method and system for
logging recoverable errors in an information handling system is disclosed. The
system includes a central processing unit, a chipset coupled to the central processing
unit, and at least one chipset memory unit coupled to and associated with the chipset.
The system also includes a Baseboard Management Controller (BMC), and a memory
unit containing a Basic Input Output System (BIOS).
A System Management Interrupt (SMI) is periodically invoked. Error
status registers are scanned to detect whether a recoverable error has occurred. If a
recoverable error is detected, the system logs the recoverable error in a non-volatile
memory unit associated with the BMC. The system logs information that indicates a
source of the recoverable error and that source’s location. If no recoverable errors are
detected, the system transmits a communication indicating that no recoverable errors
have occurred.
The system and method disclosed herein are advantageous because
they allow the information handling system to determine the source of recoverable
errors and location of that source, even if the information handling system lacks the
capability to send signals via a sideband. The BMC or the BIOS, not the OS,
identifies and logs the source of recoverable errors. The system and method disclosed
herein are also advantageous because they may allow the periodicity of the SMI to be
dynamically adjusted based on an event during operation of the information handling
system or a change in operation of the information handling system. The periodic
scan can be faster than the OS recoverable-error scanning rate.
BRIEF DESCRIPTION OF THE DRAWINGS
A more complete understanding of the present embodiments and
advantages thereof may be acquired by referring to the following description taken in
conjunction with the accompanying drawings, in which like reference numbers
indicate like features, and wherein:
Figure 1 is a block diagram of an example architecture for an example
motherboard;
Figure 2 is a flowchart illustrating a sample method for adapting the
frequency at which the system performs a periodic scan; and
Figure 3 is a block diagram of an example architecture for an example
motherboard.
DETAILED DESCRIPTION
For purposes of this disclosure, an information handling system may
include any instrumentality or aggregate of instrumentalities operable to compute,
classify, process, transmit, receive, retrieve, originate, switch, store, display, manifest,
detect, record, reproduce, handle, or utilize any form of information, intelligence, or
data for business, scientific, control, or other purposes. For example, an information
handling system may be a personal computer, a network storage device, or any other
suitable device and may vary in size, shape, performance, functionality, and price.
The information handling system may include random access memory (RAM), one or
more processing resources such as a central processing unit (CPU) or hardware or
software control logic, ROM, and/or other types of nonvolatile memory. Additional
components of the information handling system may include one or more disk drives,
one or more network ports for communication with external devices as well as various
input and output (I/O) devices, such as a keyboard, a mouse, and a video display. The
information handling system may also include one or more buses operable to transmit
communications between the various hardware components.
Figure 1 illustrates an architecture for a motherboard, indicated
generally by the numeral 100, for use in an information handling system such as a
server system. The architecture shown in Figure 1 is for exemplary purposes only and
should be understood as depicting only one of the many possible architectures for
motherboards. As shown in Figure 1, motherboard 100 may include a microprocessor
110. Microprocessor 110 may act as the CPU for the motherboard. Microprocessor
110 may couple to a chip commonly referred to as the “Northbridge,” labeled 130 in Figure
1, via a processor bus 120. Northbridge 130 typically manages communications
between the CPU and other components of the information handling system, such as
memory units. Thus, one or more memory units and a memory controller, indicated
generally by the numeral 140, may couple to Northbridge 130. A chip known as the
“Southbridge,” labeled 150 in Figure 1, may also couple to Northbridge 130.
Southbridge 150 typically implements slower services for the motherboard than
those implemented by Northbridge 130, such as power management and operation of the
Peripheral Component Interface (PCI) bus. Southbridge 150 may couple via a Low
Pin Count (LPC) bus 160 to a memory unit containing a BIOS 170. The BIOS is
sometimes referred to as “firmware.” Northbridge 130 and Southbridge 150 are
sometimes collectively referred to as the “chipset” for motherboard 100. However,
should motherboard 100 include other or additional chips, these components could be
part of the chipset as well.
A BMC 180 also may couple to the LPC bus 160, as indicated at the
bottom of Figure 1. A controller and one or more memory units, indicated generally
by the numeral 190, couple to BMC 180. Memory unit or units 190 may preferably
be non-volatile memory units. BMC 180 may have its own power supply, although
no power supply is indicated in Figure 1. As discussed previously in this disclosure,
BMC 180 will typically manage the interface between system management software
and platform hardware. Different sensors built into the information handling system
may report to BMC 180 on parameters relevant to the status and operability of the
information handling system, such as temperature, cooling fan speeds, and various
voltages. If BMC 180 detects a deviation in any monitored parameter from desired
preset limits, it may send an alert to the user or system administrator. BMC 180 may
thus couple to a number of hardware components and a network, not shown in Figure
1, to monitor these parameters and activate alerts if necessary.
The architecture for motherboard 100 shown in Figure 1 does not
include sideband signal capability between microprocessor 110 and Southbridge 150.
All communications must travel through the main transport link, and an information
handling system incorporating motherboard 100 cannot rely upon sideband signals for
reports of recoverable errors. Moreover, because recoverable errors are correctable,
this information handling system generally will not notify the user that such an error
has occurred unless it periodically polls for errors. Thus, an information handling
system incorporating motherboard 100 might be designed to report recoverable errors
by employing BIOS 170 to perform periodic scans, such as periodic SMIs. Likewise,
an information handling system incorporating motherboard 100 might be designed to
rely on the OS residing on the information handling system to invoke the periodic
scans. These methods, however, are not without their drawbacks, as discussed
previously in this disclosure. For example, the OS typically cannot identify which
component is the source of the recoverable error because OS packages are generic and
do not include maps of the architecture of the particular systems on which they reside.
Moreover, the OS logs recoverable errors in the machine check status register (which
may not be local to the component causing the error) and then clears the machine
check status register.
Instead of relying on the OS or on BIOS 170 alone to manage periodic
scans, information handling systems incorporating motherboard 100 may rely
upon BMC 180 to invoke periodic soft SMIs. That is, once the information handling
system is up and running, BMC 180 may invoke a soft SMI after a predefined period
of time. An interrupt request line 195 between BMC 180 and the chipset on
motherboard 100 can be made available for invoking the soft SMI. General Purpose
Input Output (GPIO) ports, not shown in Figure 1, can be configured to permit
communications between BIOS 170 and BMC 180. When BMC 180 invokes the soft
SMI, BIOS 170 will look for recoverable errors by reading, for example, the status
register of the chipset, the memory status register, and/or the status register of
microprocessor 110. If BIOS 170 finds no errors in the status register(s), BIOS 170
will communicate the lack of errors to BMC 180. If BIOS 170 does find an error,
BIOS 170 will communicate the error to BMC 180 and clear the status register
containing the error. BIOS 170 may also log the error via BMC 180 in memory unit
190, typically in a non-volatile System Event Log. Because BIOS 170 is familiar
with the architecture of motherboard 100, BIOS 170 may identify in the log the
location of the source of the recoverable error.
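The soft-SMI handling described above can be sketched in Python as a simplified model. This is an illustration only, not firmware: the names StatusRegister, Bmc, handle_soft_smi and the RECOVERABLE_ERROR bit are hypothetical and do not appear in this disclosure, and real status-register layouts are specific to each chipset.

```python
from dataclasses import dataclass, field

# Hypothetical flag bit; actual register layouts are chipset-specific.
RECOVERABLE_ERROR = 0x1


@dataclass
class StatusRegister:
    name: str       # which register this is, e.g. "chipset" or "memory"
    location: str   # board location the BIOS associates with this register
    value: int = 0  # raw register contents


@dataclass
class Bmc:
    # Stands in for the non-volatile System Event Log in memory unit 190.
    event_log: list = field(default_factory=list)
    last_report: str = ""

    def report_no_error(self):
        self.last_report = "no-error"

    def log_error(self, source, location):
        self.last_report = "error"
        self.event_log.append({"source": source, "location": location})


def handle_soft_smi(registers, bmc):
    """Model of the BIOS handler run when the BMC invokes a soft SMI."""
    found = False
    for reg in registers:
        if reg.value & RECOVERABLE_ERROR:
            # Report the error with its source and location, then clear
            # the register that contained it.
            bmc.log_error(reg.name, reg.location)
            reg.value &= ~RECOVERABLE_ERROR
            found = True
    if not found:
        # Tell the BMC explicitly that the scan came back clean.
        bmc.report_no_error()
    return found
```

In this model the log entry carries both the source and its location, which is the information the OS-based scan discussed earlier cannot supply.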
The period at which BMC 180 invokes the soft SMI can be preset to
any period desired by the manufacturer or user. For example, as discussed
previously in this disclosure, some OS versions perform periodic scans of a system's
machine check status register once per minute. Thus, the period at which BMC 180
invokes the soft SMI may be set at less than one minute so that BIOS 170 checks the
status registers more frequently than the resident OS performs its scan, thereby
reducing the risk that the OS will clear errors from the machine check status register
before BIOS 170 can detect them. BMC 180 may even invoke the soft SMI
frequently enough to prevent the OS from ever detecting any errors. However, the
period between soft SMIs should be great enough to avoid tying up BIOS 170 and
BMC 180 unnecessarily and thereby degrading system performance.
Alternatively, BMC 180 may adaptively change the frequency of the
soft SMI after learning the error status from BIOS 170. Figure 2 includes a flowchart
illustrating a possible method for adaptively changing the frequency of the soft SMI.
As shown in block 200 of the flowchart, BMC 180 may first invoke a soft SMI.
BIOS 170 may then check the appropriate machine check status register(s), as shown
in block 210 of the flowchart. BIOS 170 will determine whether it has located an
error, as stated in block 220. If BIOS 170 does not detect any errors, BIOS 170 will
send a single-bit communication to BMC 180 indicating no error was detected, as
indicated in block 230. As block 240 of the flowchart shows, BMC 180 can then
decrease the frequency at which it invokes the soft SMI. If, instead, BIOS 170 detects
an error, BIOS 170 will next determine whether the error is recoverable. If BIOS 170
detects one or more recoverable errors, BIOS 170 will communicate that fact to BMC
180, as shown in block 260. BMC 180 can then increase the frequency at which it invokes
the soft SMI, as shown in block 270. If, however, BIOS 170 detects unrecoverable
errors, it will communicate that fact to BMC 180. At that point, the entire system can
be reset, and the frequency of the soft SMI can be reset back to a default setting, for
example, as shown in block 290.
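The decision flow of Figure 2 can be illustrated with a short Python sketch. It expresses the scan frequency as a period in seconds, so a longer period means a lower frequency. The names ScanResult and next_period, the default period, and the step size are all assumptions made for illustration; the disclosure does not specify particular values.

```python
from enum import Enum


class ScanResult(Enum):
    NO_ERROR = "no error"
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


DEFAULT_PERIOD_S = 30.0  # assumed default period between soft SMIs
STEP_S = 5.0             # assumed adjustment step


def next_period(period_s, result):
    """Return the period to wait before the next soft SMI."""
    if result is ScanResult.NO_ERROR:
        # Block 240: no error reported, so scan less often.
        return period_s + STEP_S
    if result is ScanResult.RECOVERABLE:
        # Block 270: recoverable error reported, so scan more often,
        # but never let the period drop below one step.
        return max(STEP_S, period_s - STEP_S)
    # Block 290: unrecoverable error, so fall back to the default setting.
    return DEFAULT_PERIOD_S
```

A BMC-side loop would call next_period after each report from the BIOS and arm its timer with the returned value.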
The generation of soft SMIs can be controlled using a system timer.
The frequency of errors will usually increase or decrease in steps, so no extreme
changes in the frequency of the soft SMI will be necessary to capture the correct error
status for the system. For a system that adaptively changes the frequency of soft
SMIs, however, the user or manufacturer should set predetermined minimum and
maximum values for the frequency at which BMC 180 can invoke any SMIs.
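A stepwise adjustment bounded by a minimum and a maximum can be sketched as follows. The function name adjust_frequency, its parameters, and the choice of "SMIs per minute" as the unit are hypothetical and chosen only for illustration.

```python
def adjust_frequency(current, direction, step, minimum, maximum):
    """Move the soft-SMI frequency one step and clamp it to preset limits.

    current   -- present frequency (for example, SMIs per minute)
    direction -- +1 to scan more often, -1 to scan less often, 0 to hold
    step      -- size of a single adjustment
    minimum, maximum -- limits preset by the user or manufacturer
    """
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    if direction not in (-1, 0, 1):
        raise ValueError("direction must be -1, 0, or +1")
    proposed = current + direction * step
    # Clamp so the BMC never scans more or less often than allowed.
    return max(minimum, min(maximum, proposed))
```

Clamping after each step keeps the scan rate from growing until it degrades system performance, and from shrinking until errors go unnoticed for long stretches.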
Figure 3 illustrates an alternative architecture for a motherboard,
indicated generally by the numeral 300, for use in an information handling system
such as a server system. The architecture depicted in Figure 3 is similar to that
depicted in Figure 1. Thus, like components in both figures are identified by the same
reference characters. In motherboard 300, however, BMC 180 and the chipset, or
even just Northbridge 130, may be coupled via an Inter-Interconnect (I2C) bus 310, as
shown in Figure 3. Motherboard 300 may also be designed to permit the status
register for memory unit 140 to be shadowed or tracked by the chipset. In particular,
motherboard 300 may be designed to permit Northbridge 130 to shadow the status
register for memory unit 140 in its own status register. Thus, BMC 180 may scan
Northbridge 130’s status register via I2C bus 310 and determine if any recoverable
errors for memory unit 140 have occurred. If BMC 180 detects a recoverable
memory error, it may invoke a soft SMI to command BIOS 170 to log the recoverable
error. If, however, BMC 180 does not detect a recoverable memory error, it will not
disturb the operation of BIOS 170. Thus, the load on BIOS 170 may be reduced, as it
is only required to act upon real errors previously detected by BMC 180. In certain
systems, BMC 180 may log recoverable errors. However, for many systems, BIOS
170 may remain the more efficient choice for logging recoverable errors because an
algorithm is already implemented in a typical BIOS to determine the cause of the
error and the location of the component responsible for the error. Thus, if BMC 180
informs BIOS 170 that it has detected an error by generating a soft SMI, BIOS 170
can determine the cause of the error and log that information. The frequency at which
BMC 180 scans Northbridge 130’s machine check status may be predetermined.
Alternatively, the frequency may be adaptively altered, as described previously in this
disclosure. For example, the frequency may be increased if single-bit errors are
detected or decreased if no errors are detected.
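The Figure 3 arrangement, in which the BMC polls the shadowed status register itself and interrupts the BIOS only when an error is present, can be modelled as shown here. The function bmc_poll, its two callable parameters, and the RECOVERABLE_MEMORY_ERROR bit are hypothetical stand-ins for the I2C read and the soft-SMI trigger; they are not taken from this disclosure.

```python
# Hypothetical flag bit in the Northbridge's shadow of the memory status
# register; actual layouts are chipset-specific.
RECOVERABLE_MEMORY_ERROR = 0x1


def bmc_poll(read_shadow_register, invoke_soft_smi):
    """Poll the shadowed register and raise a soft SMI only on an error.

    read_shadow_register -- callable returning the register value
                            (stands in for a read over I2C bus 310)
    invoke_soft_smi      -- callable that asks the BIOS to log the error
    """
    status = read_shadow_register()
    if status & RECOVERABLE_MEMORY_ERROR:
        invoke_soft_smi()
        return True
    # No error: the BIOS is left undisturbed.
    return False
```

Passing the register read and the interrupt trigger as callables keeps the sketch independent of any particular bus driver.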
Although the present disclosure has described a system and method
that may include adaptive changes to the time interval between periodic scans by BIOS
170 and/or BMC 180 in response to detected errors, other factors may be used to
adjust the frequency of those scans. For example, the load experienced by the
component performing the scan, be it BIOS 170 or BMC 180, can influence the
periodicity of the scans. If the component performing the scan is overloaded with other
tasks, for example, the frequency of the scans can be reduced to decrease the load on
that component.
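A load-based adjustment of this kind might look like the sketch below. The function load_adjusted_period, the overload threshold of 0.8, and the back-off factor of 2.0 are assumptions for illustration; the disclosure does not say how load is measured or by how much the scan rate should change.

```python
def load_adjusted_period(base_period_s, load,
                         overload_threshold=0.8, backoff=2.0):
    """Lengthen the scan period when the scanning component is overloaded.

    base_period_s      -- period that would otherwise be used, in seconds
    load               -- utilization of the scanning component, 0.0 to 1.0
    overload_threshold -- assumed utilization above which to back off
    backoff            -- assumed multiplier applied to the period
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0.0 and 1.0")
    if load > overload_threshold:
        # Scan less often to relieve the overloaded component.
        return base_period_s * backoff
    return base_period_s
```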
Claims (1)
CLAIMS
1. A method for logging recoverable errors in an information handling system, comprising the steps of: invoking a System Management Interrupt (SMI) periodically using a baseboard management controller; scanning a status register to detect whether a recoverable error has occurred; logging a recoverable error if a recoverable error is detected, wherein logging a recoverable error includes logging information that indicates a source of the recoverable error and that source’s location in a non-volatile memory unit associated with the baseboard management controller; and transmitting a communication indicating that no recoverable errors have occurred if no recoverable errors are detected.
2. The method for logging recoverable errors of claim 1, wherein the step of scanning a status register to detect whether a recoverable error has occurred includes the step of scanning a status register using a Basic Input Output System (BIOS) stored in a memory unit in the information handling system.
3. The method for logging recoverable errors of claim 1, wherein the step of scanning a status register to detect whether a recoverable error has occurred includes the step of scanning a status register using the baseboard management controller.
4. The method for logging recoverable errors of any one of the preceding claims, wherein the step of scanning a status register to detect whether a recoverable error has occurred includes the step of scanning at least one of a processor status register, a chipset status register and a memory status register associated with a central processing unit.
5. The method for logging recoverable errors of any one of the preceding claims, further comprising: documenting recoverable errors arising from errors during operation of at least one memory unit associated with a chipset in a memory unit status register, and tracking in a chipset status register any recoverable errors documented in the memory unit status register.
6. The method of any one of the preceding claims, further comprising altering how often the SMI is periodically invoked based on an event during operation of the information handling system.
7. The method of claim 6, wherein altering how often the SMI is periodically invoked based on an event during operation of the information handling system comprises altering how often the SMI is periodically invoked based on whether a recoverable error has been detected.
8. The method of any one of claims 1 to 6, further comprising altering how often the SMI is periodically invoked based on a change in operation of the information handling system.
9. The method of claim 8, wherein the step of altering how often the SMI is periodically invoked based on a change in operation of the information handling system comprises altering how often the SMI is periodically invoked based on a change in workload for a Basic Input Output System stored in the information handling system.
10. A method for logging recoverable errors in an information handling system substantially as described with respect to any of the accompanying drawings.
11. A system for logging recoverable errors, comprising: a central processing unit, a chipset coupled to the central processing unit, at least one chipset memory unit coupled to and associated with the chipset, at least one firmware memory unit containing a Basic Input Output System (BIOS), wherein the at least one firmware memory unit is coupled to the at least one chipset; a baseboard management controller (BMC) coupled to the chipset and to the at least one firmware memory unit; and at least one non-volatile memory unit coupled to and associated with the BMC; wherein the BMC is arranged to periodically invoke a system management interrupt (SMI) to scan a status register to detect if a recoverable error has occurred, and to log a recoverable error if a recoverable error has occurred by logging information that indicates a source of the recoverable error and that source’s location in the non-volatile memory, for transmitting a communication indicating that no recoverable errors have occurred if no recoverable errors are detected.
12. The system for logging recoverable errors of claim 11, further comprising an interrupt request line that couples the BMC to the chipset, wherein the BMC can transmit an interrupt through the interrupt request line to the chipset.
13. The system for logging recoverable errors of claim 11 or claim 12, further comprising a memory status register associated with the at least one chipset memory unit, wherein the BIOS may check at least one of the memory status register, the processor status register and the chipset status register to check for recoverable errors.
14. The system for logging recoverable errors according to any one of claims 11 to 13, wherein the at least one chipset memory unit is associated with a memory status register, the system including a chipset status register associated with the chipset, the chipset status register able to track the contents of the memory status register, wherein the interrupt invoked by the BMC can check for recoverable errors in the chipset status register.
15. The system for logging recoverable errors of claim 14, further comprising an Inter-interconnect bus that couples the BMC to the chipset.
16. A system for logging recoverable errors, substantially as shown in or as described with respect to any of the accompanying drawings.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US (United States of America) | 2005-10-14 | | |
US11/250,603 US20070088988A1 (en) | 2005-10-14 | 2005-10-14 | System and method for logging recoverable errors |
Publications (2)
Publication Number | Publication Date |
---|---|
IE20060744A1 (en) | 2007-06-13 |
IE85357B1 (en) | 2009-10-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20070088988A1 (en) | System and method for logging recoverable errors | |
US7702966B2 (en) | Method and apparatus for managing software errors in a computer system | |
US7945841B2 (en) | System and method for continuous logging of correctable errors without rebooting | |
US11526411B2 (en) | System and method for improving detection and capture of a host system catastrophic failure | |
US11132314B2 (en) | System and method to reduce host interrupts for non-critical errors | |
US20080256400A1 (en) | System and Method for Information Handling System Error Handling | |
US20070006048A1 (en) | Method and apparatus for predicting memory failure in a memory system | |
US20080307273A1 (en) | System And Method For Predictive Failure Detection | |
US12013946B2 (en) | Baseboard memory controller (BMC) reliability availability and serviceability (RAS) driver firmware update via basic input/output system (BIOS) update release | |
Radojkovic et al. | Towards resilient EU HPC systems: A blueprint | |
US10635554B2 (en) | System and method for BIOS to ensure UCNA errors are available for correlation | |
US8726102B2 (en) | System and method for handling system failure | |
US11126486B2 (en) | Prediction of power shutdown and outage incidents | |
Kleen | Mcelog: Memory error handling in user space | |
US20230118160A1 (en) | Apparatus, Device, Method, and Computer Program for Monitoring a Processing Device from a Trusted Domain | |
IE85357B1 (en) | System and method for logging recoverable errors | |
US7114095B2 (en) | Apparatus and methods for switching hardware operation configurations | |
US11797368B2 (en) | Attributing errors to input/output peripheral drivers | |
US20240012651A1 (en) | Enhanced service operating system capabilities through embedded controller system health state tracking | |
US11422876B2 (en) | Systems and methods for monitoring and responding to bus bit error ratio events | |
US20240028713A1 (en) | Trust-based workspace instantiation | |
US20240028723A1 (en) | Suspicious workspace instantiation detection | |
US11743106B2 (en) | Rapid appraisal of NIC status for high-availability servers | |
US20240354186A1 (en) | Pcie dpc smi storm prevention system |