CN109117299B - Error detecting device and method for server - Google Patents

Publication number
CN109117299B
Authority
CN
China
Prior art keywords
system management
processing unit
address space
memory
identification code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710487094.3A
Other languages
Chinese (zh)
Other versions
CN109117299A (en)
Inventor
简天朴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mitac Computer Shunde Ltd
Mitac Computing Technology Corp
Original Assignee
Mitac Computer Shunde Ltd
Mitac Computing Technology Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitac Computer Shunde Ltd and Mitac Computing Technology Corp
Priority to CN201710487094.3A
Publication of CN109117299A
Application granted
Publication of CN109117299B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08: Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10: Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008: Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1064: Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices in cache or content addressable memories

Abstract

The invention provides a debugging device and a debugging method for a server. In the debugging method, a processing unit enters a system management mode in response to an interrupt signal. In the system management mode, the processing unit executes basic input/output system (BIOS) code, reads an identification code stored in a second address space of a memory unit of a memory module, executes the debug program in the BIOS code that corresponds to that identification code to generate debug data, and stores the debug data in a third address space of the memory unit. A first address space of the memory unit stores serial presence detect (SPD) data.

Description

Error detecting device and method for server
Technical Field
The present invention relates to a debugging device and a debugging method for a server, and more particularly to a debugging device and a debugging method for a server that includes a serial presence detect (SPD) memory.
Background
In a conventional server, a designer may build debug code into the Basic Input/Output System (BIOS) code and execute it to debug the server when a boot error occurs. In practice, however, a boot error can have many possible causes. When the debugging engineer needs to debug with different debug code, the BIOS code must be redesigned and the redesigned BIOS code written to the server, which is time-consuming.
Another debugging method connects a pin of the processing unit to a display, so that the processing unit can show a numeric code on the display at designated stages of the boot process and the engineer can debug according to that code. However, a numeric code carries little information, so the engineer often cannot accurately infer the cause of the boot error from it. In addition, the code cannot be retrieved again once it has been displayed, which makes debugging even harder.
A further debugging method adds an extended debug port (XDP) to the motherboard of the server. The XDP can communicate with other units in the server, and the engineer can read the states of those units through the XDP. However, adding the XDP increases the production cost of the motherboard, and the other units in the server need additional pins to communicate with the XDP, which raises the overall production cost of the server.
Disclosure of Invention
In view of the above, the present invention provides a debugging device for a server and a debugging method thereof.
Accordingly, the invention provides a debugging device for a server that comprises a memory module, a basic input/output system (BIOS) memory, and a processing unit. The memory module includes a memory unit having a first address space for storing serial presence detect (SPD) data, a second address space for storing an identification code corresponding to a debug program, and a third address space. The BIOS memory stores BIOS code that contains the debug program. The processing unit is coupled to the BIOS memory and enters a system management mode in response to an interrupt signal. In the system management mode, the processing unit executes the debug program in the BIOS code according to the identification code to generate debug data, and stores the debug data in the third address space of the memory unit.
In one embodiment, the memory unit includes a System Management Bus (SMBus) interface or an Inter-Integrated Circuit (I2C) bus interface, and the memory unit outputs the debug data through that interface.
In one embodiment, the debugging device further comprises a system management bus connected to the memory module and to the system management bus interface. In the system management mode, the processing unit turns off a temperature data output function of the memory module, so that the memory module cannot transmit temperature data over the system management bus while the processing unit is in the system management mode.
In one embodiment, while executing an operating system, the processing unit determines the error type of an erroneous parameter and stores the identification code corresponding to that error type in the second address space.
In one embodiment, the processing unit updates the identification code after the debug data has been stored in the third address space.
In one embodiment, a debugging method for a server includes the following steps. A processing unit enters a system management mode in response to an interrupt signal. In the system management mode, the processing unit executes BIOS code, executes the debug program in the BIOS code that corresponds to an identification code stored in a second address space of a memory unit of a memory module, and generates debug data. Still in the system management mode, the processing unit stores the debug data in a third address space of the memory unit. A first address space of the memory unit stores serial presence detect (SPD) data.
In one embodiment, the debugging method further includes: the memory unit outputs the debug data through a System Management Bus (SMBus) interface or an Inter-Integrated Circuit (I2C) bus interface.
In one embodiment, the debugging method further includes: the processing unit turns off a temperature data output function of the memory module in the system management mode, so that the memory module cannot transmit temperature data over a system management bus connected to the memory module and to the system management bus interface while the processing unit operates in the system management mode.
In one embodiment, in the step in which the processing unit executes the debug program according to the identification code, the processing unit, while executing an operating system, determines the error type of an erroneous parameter and stores the identification code corresponding to that error type in the second address space, so that the debug program is executed according to that identification code.
In one embodiment, the debugging method further includes: the processing unit updates the identification code after the debug data has been stored in the third address space.
Compared with the prior art, the debugging device and debugging method of the invention use the memory unit that stores the SPD data to also store the identification code and the debug data. A debugging engineer can store different identification codes in the second address space according to the actual error condition of the server and can obtain the debug data from the memory unit, which makes debugging more convenient and more accurate. Moreover, because the identification code and the debug data are kept in the memory unit that already holds the SPD data, no additional hardware is needed to execute a debug program, which reduces the overall production cost of the server, and the stored identification code and debug data are not lost on shutdown or power removal.
Description of the Drawings
FIG. 1 is a block diagram illustrating an embodiment of a server according to the present invention.
FIG. 2 is a schematic diagram of one embodiment of an address space arrangement of the memory unit 101 of FIG. 1.
Detailed Description
FIG. 1 is a block diagram illustrating an embodiment of a server according to the present invention. Referring to FIG. 1, the server includes at least a memory module 10, a processing unit 11, and a BIOS memory 12. The processing unit 11 is coupled to the memory module 10 and the BIOS memory 12. In one embodiment, the processing unit 11 may be a Central Processing Unit (CPU).
The memory module 10 includes a plurality of memory units. FIG. 1 shows the memory module 10 with three memory units 101, 102, and 103, but the invention is not limited thereto, and the number of memory units may be more or fewer than three. The memory unit 101 in the memory module 10 stores Serial Presence Detect (SPD) data, such as the timing settings and voltage specification parameters of the memory module 10. The memory unit 101 further stores at least one identification code corresponding to a debug program. Different debug programs can implement different functions. For example, a debug program may read a register value of a specific unit in the server, such as a chipset; monitor the temperature of a specific unit in the server; perform a Serial ATA (SATA) test and record the data it generates; record the value shown on the debug LED display during the Power On Self Test (POST) stage; or perform Machine Check Architecture (MCA) detection and error reporting. The debugging engineer of the server can define the program code and store it in the BIOS memory 12 as a debug program that forms part of the BIOS code. In one embodiment, the identification code may be a Globally Unique Identifier (GUID).
FIG. 2 is a schematic diagram of one embodiment of the address space arrangement of the memory unit 101 of FIG. 1. Referring to FIG. 2, the SPD data are stored in a first address space 101A of the memory unit 101, and the identification code is stored in a second address space 101B that is different from the first address space 101A and may or may not be contiguous with it. The memory unit 101 further includes a third address space 101C for storing data, which may or may not be contiguous with the first address space 101A and the second address space 101B. In one embodiment, the first address space 101A covers 384 bytes (byte 0 through byte 383), and the second address space 101B and the third address space 101C together cover 168 bytes (byte 384 through byte 551). The invention is not limited thereto, and the size of each address space can be configured according to actual requirements.
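The byte ranges in this embodiment can be checked with a short sketch. The text gives only the combined size of the second and third address spaces, so the split used here is an assumption: the identification code is taken to be a single 16-byte GUID, which leaves the remaining bytes for debug data. All names are illustrative.

```python
# Sketch of the address-space layout described above (offsets in bytes).
SPD_START, SPD_END = 0, 383          # first address space 101A (from the text)
EXT_START, EXT_END = 384, 551        # 101B and 101C together (from the text)

# Assumption: the identification code is one 16-byte GUID, so 101B is 16 bytes.
ID_SIZE = 16
ID_START, ID_END = EXT_START, EXT_START + ID_SIZE - 1   # second address space 101B
DBG_START, DBG_END = ID_END + 1, EXT_END                # third address space 101C


def size(start: int, end: int) -> int:
    """Number of bytes in an inclusive byte range."""
    return end - start + 1


def region_of(offset: int) -> str:
    """Return which address space a byte offset falls in."""
    if SPD_START <= offset <= SPD_END:
        return "101A"
    if ID_START <= offset <= ID_END:
        return "101B"
    if DBG_START <= offset <= DBG_END:
        return "101C"
    raise ValueError(f"offset {offset} is outside the memory unit")


print(size(SPD_START, SPD_END))   # 384
print(size(EXT_START, EXT_END))   # 168
print(size(DBG_START, DBG_END))   # 152
```

Under this assumed split, 16 bytes of identification code plus 152 bytes of debug data account for the 168 bytes stated in the text.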
The processing unit 11 includes a System Management Interrupt (SMI) pin 111 for receiving an interrupt signal. When the SMI pin 111 receives the interrupt signal, the logic level of the SMI pin 111 is high and the processing unit 11 enters a System Management Mode (SMM). In the system management mode, the processing unit 11 executes the BIOS code in the BIOS memory 12. Through the BIOS code, the processing unit 11 reads the memory unit 101 to obtain the identification code in the second address space 101B and executes the corresponding debug program in the BIOS code. In one embodiment, identification codes and debug programs correspond one to one. For example, the identification code "2" corresponds to a third debug program and the identification code "9" corresponds to a sixth debug program: when the identification code stored in the second address space 101B is "2", the processing unit 11 executes the third debug program, and when it is "9", the processing unit 11 executes the sixth debug program. The processing unit 11 generates debug data while executing the debug program and stores the debug data in the third address space 101C of the memory unit 101. In one embodiment, the memory unit 101 may be an Electrically Erasable Programmable Read-Only Memory (EEPROM). When the processing unit 11 operates in the system management mode, it turns on the write function of the memory unit 101; after the debug data has been written into the third address space 101C, it turns off the write function of the memory unit 101 and leaves the system management mode.
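The sequence in this paragraph (read the identification code, look up the matching debug program, generate debug data, write it with the write function turned on only for the duration of the write) can be modeled in a few lines. This is a simulation in ordinary Python, not firmware: the `SpdEeprom` class, the offsets, and the two debug programs are hypothetical stand-ins, and only the identification codes 2 and 9 come from the example in the text.

```python
from typing import Callable, Dict

ID_OFFSET = 384      # assumed start of the second address space 101B
DATA_OFFSET = 400    # assumed start of the third address space 101C
DATA_SIZE = 152      # assumed size of 101C


class SpdEeprom:
    """Toy model of memory unit 101 with a switchable write function."""

    def __init__(self, size: int = 552) -> None:
        self.cells = bytearray(size)
        self.write_enabled = False

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.cells[offset:offset + length])

    def write(self, offset: int, data: bytes) -> None:
        if not self.write_enabled:
            raise PermissionError("write function is turned off")
        self.cells[offset:offset + len(data)] = data


# Hypothetical debug programs held in the BIOS code, keyed by identification code.
def read_chipset_registers() -> bytes:
    return b"CHIPSET:00FF"


def record_post_code() -> bytes:
    return b"POST:A2"


DEBUG_PROGRAMS: Dict[int, Callable[[], bytes]] = {
    2: read_chipset_registers,
    9: record_post_code,
}


def smi_handler(eeprom: SpdEeprom) -> bytes:
    """What the processing unit does after entering system management mode."""
    ident = eeprom.read(ID_OFFSET, 1)[0]          # read the identification code
    program = DEBUG_PROGRAMS.get(ident)
    if program is None:
        return b""                                # unknown code: nothing to run
    debug_data = program()[:DATA_SIZE]            # run the matching debug program
    eeprom.write_enabled = True                   # turn the write function on
    try:
        eeprom.write(DATA_OFFSET, debug_data)     # store debug data in 101C
    finally:
        eeprom.write_enabled = False              # turn it off before leaving SMM
    return debug_data
```

Storing `9` at `ID_OFFSET` and calling `smi_handler` leaves `b"POST:A2"` at `DATA_OFFSET`. The `try`/`finally` is a design choice of this sketch so that the write function is turned off again even if the write fails.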
Further, the memory module 10 includes a System Management Bus (SMBus) interface. The processing unit 11 and other units in the server also have SMBus interfaces, which are connected to the SMBus interface of the memory module 10 through a system management bus. After the debug data has been stored in the third address space 101C of the memory unit 101, the processing unit 11 and the other units can read the memory unit 101, so that the memory module 10 outputs the debug data in the third address space 101C and transmits it over the system management bus.
For example, the other units of the server may be a Baseboard Management Controller (BMC) and/or a chipset. As shown in FIG. 1, the server further includes a chipset 13 and a BMC 14. The processing unit 11, the chipset 13, and the BMC 14 are connected to the memory module 10 through system management buses 17, 16, and 15, respectively, and can each obtain the debug data stored in the memory unit 101 from the memory module 10 through the corresponding bus. The debugging engineer can therefore obtain the debug data through the processing unit 11 or the other units of the server and then continue debugging.
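From the reader's side (for example a BMC), retrieving the debug data amounts to reading the third address space byte by byte. The sketch below assumes a hypothetical `read_byte(offset)` callable in place of a real SMBus driver, the same assumed offsets as a 16-byte identification code would give, and zero-filled unused bytes. None of these details are specified in the text.

```python
from typing import Callable

DATA_OFFSET = 400   # assumed start of the third address space 101C
DATA_SIZE = 152     # assumed size of 101C


def read_debug_data(read_byte: Callable[[int], int]) -> bytes:
    """Read 101C one byte at a time and drop trailing zero padding."""
    raw = bytes(read_byte(DATA_OFFSET + i) for i in range(DATA_SIZE))
    return raw.rstrip(b"\x00")


# Stand-in for the bus: a byte array that plays the role of memory unit 101.
fake_eeprom = bytearray(552)
fake_eeprom[DATA_OFFSET:DATA_OFFSET + 7] = b"POST:A2"

print(read_debug_data(lambda offset: fake_eeprom[offset]))  # b'POST:A2'
```

On real hardware the lambda would be replaced by whatever byte-read primitive the platform's SMBus or I2C driver provides.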
In addition, external debugging equipment, such as a bus analyzer or an oscilloscope, may be connected to the memory unit 101. A debugging engineer can connect the equipment to the system management bus interface of the memory unit 101 and, after the processing unit 11 has stored the debug data in the third address space 101C, use the equipment to receive the debug data output through that interface and continue debugging according to the debug data.
In one embodiment, the memory module 10 may include an Inter-Integrated Circuit (I2C) bus interface, and the memory unit 101 may output the debug data in the third address space 101C through the I2C bus interface to other units in the server or to debugging equipment that has an I2C bus interface. The details are not repeated here.
In one embodiment, the processing unit 11 may receive the interrupt signal under control of the BIOS code during a boot phase. Taking a Unified Extensible Firmware Interface (UEFI) boot as an example, in the Driver Execution Environment (DXE) phase the SMI pin 111 of the processing unit 11 is initialized and can then be triggered by the BIOS code, so that the processing unit 11 operates in the system management mode, executes the debug program, and stores the debug data in the third address space 101C of the memory unit 101. When the server encounters a boot abnormality and cannot enter the operating system, a debugging engineer can view the debug data on a display unit, or connect a test instrument directly to the output port of the memory unit 101 through an external circuit and read the debug data in the third address space 101C, in order to determine the possible cause of the boot abnormality and carry out the corresponding debugging.
Furthermore, when the server boots without error and enters the operating system, the processing unit 11 may periodically scan a plurality of system parameters in the operating system, such as parameters of the different hardware components in the server. When the scan finds an erroneous parameter, the operating system triggers the processing unit 11 to enter the system management mode. In the system management mode, the processing unit 11 determines the error type of the erroneous parameter and writes the identification code corresponding to that error type into the second address space 101B. When the erroneous parameters correspond to several different error types, the processing unit 11 can write several identification codes into the second address space 101B. In this way, the processing unit 11 can execute the debug program associated with the error type after the scan finds an error and store the debug data in the third address space 101C. The debugging engineer can then obtain from the memory unit 101 the debug data generated after the processing unit 11 found the error.
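The mapping from scanned parameters to identification codes might look like the following sketch. The parameter names, thresholds, and error types are invented for illustration; the text states only that an erroneous parameter has an error type and that each error type has a corresponding identification code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical error types and the identification codes assigned to them.
ID_FOR_ERROR_TYPE: Dict[str, int] = {
    "over_temperature": 2,
    "sata_link_error": 9,
}

# Hypothetical checks: parameter name -> (error type, test that flags an error).
CHECKS: Dict[str, Tuple[str, Callable[[float], bool]]] = {
    "cpu_temp_c": ("over_temperature", lambda v: v > 95),
    "sata_crc_errors": ("sata_link_error", lambda v: v > 0),
}


def codes_for(parameters: Dict[str, float]) -> List[int]:
    """Return the identification codes to store in 101B for one scan."""
    codes: List[int] = []
    for name, value in parameters.items():
        if name not in CHECKS:
            continue                       # parameter is not monitored
        error_type, is_error = CHECKS[name]
        if is_error(value):
            codes.append(ID_FOR_ERROR_TYPE[error_type])
    return codes


print(codes_for({"cpu_temp_c": 99, "sata_crc_errors": 3}))  # [2, 9]
print(codes_for({"cpu_temp_c": 40, "sata_crc_errors": 0}))  # []
```

A scan that yields more than one code corresponds to the case in the text where several identification codes are written into the second address space.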
In one embodiment, after executing a debug program, the processing unit 11 can clear the corresponding identification code from the second address space 101B so that an identification code for another error type can be written there. The next time the processing unit 11 operates in the system management mode, it uses the identification code then stored in the second address space 101B to execute a debug program that has not yet been executed. For example, suppose the second address space 101B first stores a first identification code corresponding to a first error type. After executing the first debug program according to the first identification code, the processing unit 11 clears the first identification code from the second address space 101B and writes a second identification code corresponding to a second error type, so that it can execute the second debug program according to the second identification code the next time it operates in the system management mode.
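The clear-and-refill behavior can be pictured as a one-entry slot backed by a queue of pending codes. The queue is an assumption made for this sketch: the text does not say where a not-yet-stored identification code is held, and the value 0 for "empty" is likewise illustrative.

```python
from collections import deque
from typing import Deque, List, Optional


class IdSlot:
    """Toy model of 101B holding one identification code (0 means empty)."""

    def __init__(self) -> None:
        self.current = 0
        self.pending: Deque[int] = deque()   # assumed to be tracked by firmware

    def report(self, code: int) -> None:
        """Store a code now if the slot is empty, otherwise queue it."""
        if self.current == 0:
            self.current = code
        else:
            self.pending.append(code)

    def run_once(self) -> Optional[int]:
        """One SMM entry: run the stored code, clear it, refill from the queue."""
        if self.current == 0:
            return None
        executed = self.current
        self.current = self.pending.popleft() if self.pending else 0
        return executed


slot = IdSlot()
slot.report(2)
slot.report(9)
history: List[Optional[int]] = [slot.run_once(), slot.run_once(), slot.run_once()]
print(history)  # [2, 9, None]
```

Each call to `run_once` stands for one entry into the system management mode, so the second code is executed only on the following entry, as in the example in the text.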
In one embodiment, the memory module 10 can transmit the temperature data of any of its memory units over the system management bus; that is, the memory unit 101 and the other memory units 102 and 103 in the memory module 10 are connected to and share the same system management bus. To avoid conflicts caused by different memory units transferring data over the system management bus at the same time, the processing unit 11 first turns off the temperature data output function of the memory module 10 in the system management mode. While the processing unit 11 executes the debug program and writes the debug data into the memory unit 101, the memory module 10 therefore cannot transmit temperature data over the system management bus. After the processing unit 11 has executed the debug program and stored the debug data in the third address space 101C, it turns the temperature data output function back on and leaves the system management mode.
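The ordering described here (turn temperature output off, write the debug data, turn temperature output back on) maps naturally onto a guard that restores the setting when the write is finished. The sketch below models it with a Python context manager; the `MemoryModule` class and its methods are hypothetical and only imitate the shared bus.

```python
from contextlib import contextmanager
from typing import Iterator, List


class MemoryModule:
    """Toy model of memory module 10 sharing one system management bus."""

    def __init__(self) -> None:
        self.temperature_output = True
        self.bus_log: List[str] = []

    def send_temperature(self) -> bool:
        """Try to put a temperature reading on the bus."""
        if not self.temperature_output:
            return False
        self.bus_log.append("temperature")
        return True

    def write_debug_data(self, data: bytes) -> None:
        self.bus_log.append(f"debug:{data.decode()}")


@contextmanager
def temperature_output_off(module: MemoryModule) -> Iterator[None]:
    """Turn temperature output off for the duration of the block."""
    module.temperature_output = False
    try:
        yield
    finally:
        module.temperature_output = True   # re-enabled even if the write fails


module = MemoryModule()
with temperature_output_off(module):
    blocked = module.send_temperature()      # refused while the guard is active
    module.write_debug_data(b"POST:A2")
allowed = module.send_temperature()          # works again after the guard exits
print(blocked, allowed, module.bus_log)
```

Inside the guarded block only the debug write reaches the bus, which is the conflict-avoidance property the paragraph describes.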
In summary, according to an embodiment of the debugging device and debugging method of the present invention, the memory unit that stores the SPD data also stores the identification code and the debug data. A debugging engineer can store different identification codes in the second address space according to the actual error condition of the server and can obtain the debug data from the memory unit, which makes debugging more convenient and more accurate. Moreover, because the identification code and the debug data are kept in the memory unit that already holds the SPD data, no additional hardware is needed to execute a debug program, which reduces the overall production cost of the server, and the stored identification code and debug data are not lost on shutdown or power removal.
The embodiments and examples of the present invention have been described in detail with reference to the accompanying drawings, but the scope of the invention is not limited thereto. All equivalent modifications and changes within the scope of the claims of the present invention fall within the scope of the present invention.

Claims (6)

1. A debugging device for a server, comprising:
a memory module comprising a memory unit, the memory unit comprising:
a first address space for storing serial presence detect (SPD) data;
a second address space for storing an identification code corresponding to a debug program; and
a third address space;
a BIOS memory for storing a BIOS code, the BIOS code including the debug program; and
a processing unit coupled to the BIOS memory for operating in a system management mode according to an interrupt signal, wherein in the system management mode, the processing unit executes the debug program in the BIOS code according to the identification code to generate debug data and stores the debug data in the third address space; when executing an operating system, the processing unit stores, in the second address space, the identification code corresponding to an error type of an error parameter; and after the debug data is stored in the third address space, the processing unit updates the identification code according to another identification code corresponding to another error type of the error parameter of the operating system.
2. The device of claim 1, wherein the memory unit comprises a system management bus interface or an Inter-Integrated Circuit (I2C) bus interface, and the memory unit outputs the debug data via the system management bus interface or the I2C bus interface.
3. The device of claim 2, further comprising a system management bus coupled to the memory module and the system management bus interface, wherein in the system management mode, the processing unit turns off a temperature data output function of the memory module, such that the memory module cannot transmit temperature data via the system management bus in the system management mode.
4. A method for debugging a server, comprising:
a processing unit operating in a system management mode according to an interrupt signal;
the processing unit executing a BIOS code in the system management mode, executing a debug program in the BIOS code that corresponds to an identification code stored in a second address space of a memory unit of a memory module, and generating debug data, wherein a first address space of the memory unit stores serial presence detect (SPD) data, and the processing unit, when executing an operating system, stores in the second address space the identification code corresponding to an error type of an error parameter, so as to execute the debug program according to the identification code; and
the processing unit storing the debug data in a third address space of the memory unit in the system management mode, and, after the debug data is stored in the third address space, updating the identification code according to another identification code corresponding to another error type of the error parameter of the operating system.
5. The method of claim 4, further comprising: the memory unit outputting the debug data through a system management bus interface or an Inter-Integrated Circuit (I2C) bus interface.
6. The method of claim 5, further comprising: the processing unit turning off a temperature data output function of the memory module in the system management mode, so that the memory module cannot transmit temperature data through a system management bus connected to the memory module and the system management bus interface when the processing unit operates in the system management mode.
CN201710487094.3A 2017-06-23 2017-06-23 Error detecting device and method for server Active CN109117299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710487094.3A CN109117299B (en) 2017-06-23 2017-06-23 Error detecting device and method for server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710487094.3A CN109117299B (en) 2017-06-23 2017-06-23 Error detecting device and method for server

Publications (2)

Publication Number Publication Date
CN109117299A CN109117299A (en) 2019-01-01
CN109117299B CN109117299B (en) 2022-04-05

Family

ID=64732310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710487094.3A Active CN109117299B (en) 2017-06-23 2017-06-23 Error detecting device and method for server

Country Status (1)

Country Link
CN (1) CN109117299B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI714958B (en) * 2019-01-30 2021-01-01 神雲科技股份有限公司 A method of modifying setup of basic input/output system
CN113687967A (en) * 2020-05-18 2021-11-23 佛山市顺德区顺达电脑厂有限公司 Method for recording startup error information
CN113760612A (en) * 2020-06-05 2021-12-07 佛山市顺德区顺达电脑厂有限公司 Server debugging method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424084A (en) * 2013-08-27 2015-03-18 鸿富锦精密电子(天津)有限公司 System error information detection system and method for server
CN106598790A (en) * 2015-10-16 2017-04-26 中兴通讯股份有限公司 Server hardware failure detection method, apparatus of server, and server
CN106815088A (en) * 2015-11-27 2017-06-09 佛山市顺德区顺达电脑厂有限公司 server and its debugging method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI220471B (en) * 2003-02-20 2004-08-21 Akom Technology Corp Method, controller and apparatus for displaying BIOS debug message
CN100524245C (en) * 2006-12-21 2009-08-05 英业达股份有限公司 Method for monitoring input/output port data
US7613952B2 (en) * 2006-12-29 2009-11-03 Inventec Corporation Method for facilitating BIOS testing
CN101408860A (en) * 2007-10-12 2009-04-15 华硕电脑股份有限公司 Monitoring apparatus and monitoring method
CN104035844A (en) * 2013-03-04 2014-09-10 联想(北京)有限公司 Fault testing method and electronic device
CN106547653B (en) * 2015-09-21 2020-03-13 龙芯中科技术有限公司 Computer system fault state detection method, device and system
CN106502846A (en) * 2016-10-14 2017-03-15 合肥联宝信息技术有限公司 A kind of computer glitch detection method and device

Also Published As

Publication number Publication date
CN109117299A (en) 2019-01-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant