WO2017063505A1 - Server hardware fault detection method, apparatus therefor, and server - Google Patents

Server hardware fault detection method, apparatus therefor, and server

Info

Publication number
WO2017063505A1
Authority
WO
WIPO (PCT)
Prior art keywords
server
hardware
phase
information
basic input
Prior art date
Application number
PCT/CN2016/100618
Other languages
English (en)
French (fr)
Inventor
李存龙
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2017063505A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing

Definitions

  • The present invention relates to the field of computers and communications, and in particular to a server hardware fault detection method, an apparatus therefor, and a server.
  • On current mid-range and high-end servers, the server generally provides a partial "black box" function that records fault information when the operating system crashes. It can record kernel exceptions of the OS (operating system) such as kernel errors, restarts and resets, and abnormal print messages. Some simple hardware errors can also be recorded through the SEL (System Event Log), errors can be collected on site after a fault occurs by out-of-band means (such as a joint test link), or device anomalies can be monitored passively through an in-band exception trigger mechanism, which records only when an abnormal condition triggers its exception recording module. These methods help maintenance personnel determine the cause of a fault to some extent, but they still have the following defects:
  • The above methods record faults only when passively triggered; they lack active detection of the server, especially active screening and monitoring of server hardware faults.
  • When the system starts and runs normally but service quality drops sharply, the system does not trigger fault recording. Fault information is then missed, which makes it difficult for maintenance personnel to trace the fault during maintenance.
  • Because fault information is detected and recorded only when the system crashes or an exception is triggered, the ability to collect and analyze hardware faults while the system (service) is running is seriously insufficient. This weakens the system's early-warning capability and reduces its stability and reliability.
  • The recorded fault information is too simple and scattered, with no accurate, unified record management. The fault information cannot be analyzed in one pass; extensive later analysis, screening, and cross-validation are needed to find the main fault source.
  • Therefore, current server fault recording schemes can detect and record fault information only under specific conditions, and the recorded information is simple and scattered and requires extensive later analysis.
  • The main technical problem to be solved by the present invention is to provide a server hardware fault detection method, an apparatus therefor, and a server, which solve the problem that the prior art cannot detect, record, and store real-time fault information for the hardware in each working phase of the server.
  • the present invention provides a server hardware fault detection method, including:
  • the basic input/output system device of the server detects that the server enters a startup phase
  • the basic input/output system device begins to perform hardware failure detection on the server in each working phase, and the working phase includes the startup phase;
  • the basic input/output system device stores the detected hardware failure information.
  • the startup phase includes an initialization phase
  • the basic input/output system device performs hardware failure detection on the server in the initialization phase, including:
  • The basic input/output system device pre-detects at least one of the CPU, memory, chipset, and power supply of the server according to a hardware detection mechanism provided by the server to acquire current hardware information, then filters the faulty hardware information out of it and analyzes it to obtain the corresponding hardware fault information.
  • the startup phase further includes a device enumeration phase
  • the basic input/output system device performs hardware failure detection on the server in the device enumeration phase, including:
  • the basic input/output system device acquires state information and resource information of each hardware on the server, and identifies fault information of the faulty hardware therefrom.
  • the startup phase is a cold start phase or a hot start phase.
  • the work phase further includes at least one of an operating system pre-boot phase and an operating system service run phase.
  • When the working phase includes an operating system pre-boot phase, performing hardware fault detection on the server in that phase includes:
  • The basic input/output system device pre-detects the out-of-band hardware devices of the server that are about to be booted, acquires their current hardware information, and filters out the fault information of any faulty hardware device.
  • When the working phase includes an operating system service running phase, performing hardware fault detection on the server in that phase includes: the basic input/output system device determines whether a hardware interrupt signal of the server has arrived and, if so, checks the hardware related to the operating system and acquires the fault information of that hardware.
  • Before the basic input/output system device stores the detected fault information, the method further includes allocating, on the serial flash memory of the server, a fault storage area for storing the hardware fault information.
  • the present invention also provides a basic input/output system device, including:
  • the fault information detection triggering module is configured to detect whether the server enters a startup phase
  • the fault information detecting module is configured to, when the fault detecting triggering module detects that the server enters a startup phase, start performing hardware fault detection on the working phase of the server, where the working phase includes the startup phase;
  • The fault information storage module is configured to store the hardware fault information detected by the fault information detecting module.
  • The device further includes a storage setting module configured to allocate, on the serial flash memory of the server, a fault storage area for storing the hardware fault information before the fault information storage module stores the hardware fault information.
  • the present invention also provides a server comprising the basic input/output system device as described above.
  • In the server hardware fault detection method, apparatus, and server provided by the invention, when the basic input/output system (BIOS) device of a server detects that the server has entered a startup phase, it begins fault detection and analysis on the hardware in each working phase of the server to obtain the corresponding hardware fault information.
  • Because the server's own BIOS device is used, all hardware faults that may occur during the entire operating cycle of the server can be detected, which improves the comprehensiveness and accuracy of hardware fault detection and makes unified storage management of the server's hardware fault information easier.
  • Maintenance personnel can therefore obtain accurate hardware fault information when maintaining the server and learn the location of the hardware that needs attention and the cause of the fault, which further improves the stability and reliability of the server.
  • FIG. 1 is a flowchart of a method for detecting a hardware failure of a server according to the present invention
  • FIG. 2 is a flowchart of performing hardware failure detection in a server initialization phase provided by the present invention
  • FIG. 3 is a flowchart of hardware fault detection in the device enumeration phase of the present invention.
  • FIG. 4 is a flowchart of hardware fault detection in the operating system pre-boot phase of the present invention.
  • FIG. 5 is a flowchart of hardware fault detection in the operating system service running phase of the present invention.
  • FIG. 6 is a structural block diagram of a basic input/output system device provided by the present invention.
  • Embodiment 1:
  • Referring to FIG. 1, FIG. 1 is a flowchart of the server hardware fault detection method provided by the present invention. In this method, the basic input/output system device actively performs fault detection on the hardware of the server. "Actively" here means that, according to a detection mechanism preset on the server, the BIOS device performs fault detection on the running hardware of the server as soon as the server starts, or performs hardware fault detection in every working phase of the server. The method includes the following steps:
  • S101: the basic input/output system device of the server detects that the server enters a startup phase.
  • In this embodiment, the startup phase of the server is a cold start phase or a hot start phase.
  • When the startup phase is a cold start phase, the basic input/output system device can detect entry into the startup phase in ways including, but not limited to, the following: checking whether the power button on the server has been pressed, checking whether the power supply circuit of the server is connected to the server's power interface, or checking the status flag of the power supply. If so, the server has entered the startup phase and is running, and step S102 is performed; otherwise, detection continues.
  • When the startup phase is a hot start phase, the basic input/output system device detects whether a reset start signal has been input to the server. If so, the server begins a hot start and step S102 is performed; otherwise, detection continues.
  • The reset start signal may be triggered by hardware, for example by a reset button; by software, for example by code or a tool that sends it to the server on a timer; or by the user, through a command or the "Restart" button.
  • S102: the basic input/output system device begins fault detection on the hardware of the server in each working phase, where the working phases include the startup phase.
  • S103: the basic input/output system device stores the detected hardware fault information.
  • In this embodiment, before step S103, the method further includes allocating, on the serial flash memory of the server, a fault storage area for storing the hardware fault information. The fault information recorded and stored by the basic input/output system device includes: time, the event that occurred, severity, the specific location or fault details, and the recommended handling.
  • After hardware fault information has been detected and stored by the above steps, maintenance personnel who need to maintain the server can obtain the stored information directly through an out-of-band control platform or web user interface connected to the storage area. This makes it easy to trace how a fault developed and to restore or replace the faulty hardware on site (for example, replacing a particular CPU, a particular memory module, or a faulty bus interface card).
  • On mid-range and high-end servers, hot-swap technology (including but not limited to CPU hot swap, memory hot swap, and bus interface hot swap) can keep the system running without interruption, which achieves early detection, early warning, early prevention, and early handling.
  • Even if the server hangs during a cold or hot start and its peripherals are unusable (for example, the network port is unreachable, the screen stays dark, or the keyboard and mouse do not respond), valid fault information can still be obtained.
  • Operations personnel who obtain the hardware fault information through the control platform can, in addition to handling it promptly, dump it to another out-of-band storage device.
  • The startup phase includes an initialization phase.
  • The steps by which the basic input/output system device performs fault detection on the hardware of the server in the initialization phase are shown in FIG. 2 and include:
  • S201: the basic input/output system device initializes the CPU, memory, chipset, and power supply.
  • S202: the basic input/output system device detects and acquires the current hardware information of at least one of the CPU, memory, chipset, and power supply.
  • In this embodiment, the basic input/output system device pre-detects at least one of the CPU, memory, chipset, and power supply of the server according to the hardware detection mechanism provided by the server to acquire current hardware information, then filters the faulty hardware information out of it and analyzes it to obtain the corresponding hardware fault information.
  • Specifically, when the basic input/output system device detects that the server has entered the initialization phase, the BIOS device can actively check the faults and configuration of server hardware such as the CPU, memory, chipset, and power supply by using the BIOS device itself, by actively applying stress, by using the hardware detection mechanisms provided by the CPU and chipset, or by using in-band integrated tools (such as a memory test tool or a system event log test tool), and acquire the corresponding hardware information. It then pre-analyzes, pre-tallies, pre-screens, scans, and measures the hardware based on that information, collects the test results, and filters out the valid fault information (including information that may trigger later system exceptions), which it records in detail and stores. If a system exception occurs in this phase, the server has therefore already acquired and recorded more detailed hardware fault information before the exception.
  • In this phase, the recorded and stored fault information includes but is not limited to: CPU errors and alarms, CBO (Caching Agent) errors and alarms, QPI (Quick Path Interconnect) errors and alarms, IIO (Integrated I/O) port errors and alarms, HA (Home Agent) errors and alarms, IMC (Integrated Memory Controller) errors and alarms, PCU (Power Control Unit) errors and alarms, power and voltage errors and alarms, and memory errors and alarms (including memory module errors and alarms, memory channel errors and alarms, memory population errors and alarms, memory voltage errors and alarms, memory incompatibility errors and alarms, and configuration errors and alarms).
  • The startup phase further includes a device enumeration phase. The flowchart for hardware fault detection in this phase is shown in FIG. 3 and includes the following steps:
  • S301: the basic input/output system device starts device enumeration.
  • S302: the basic input/output system device detects and acquires the current information of the devices.
  • Further, the basic input/output system device acquires the state information and resource information of each piece of hardware on the server and identifies from it the fault information of faulty hardware and software.
  • In this phase, the fault information includes but is not limited to: device access errors (including illegal memory and I/O resource requests), third-party firmware (option ROM) not executed (including insufficient space or incorrect format), and devices disabled because of damage.
  • Specifically, when the server issues probe tasks to PCIE (Peripheral Component Interconnect Express) bus peripherals and calculates resource requirements, the BIOS device, according to the detection mechanism, identifies the third-party firmware (option ROM) identifier defined by the industry specification, the vendor information, and the device class information and capacity, checks hardware status indications (such as link status and bandwidth information), and identifies and stores the fault information of faulty hardware from this information.
  • The working phase further includes at least one of an operating system pre-boot phase and an operating system service running phase. FIG. 4 and FIG. 5 are flowcharts of hardware fault detection in the operating system pre-boot phase and the operating system service running phase, respectively.
  • As shown in FIG. 4, hardware fault detection and analysis in the operating system pre-boot phase includes the following steps:
  • S401: the basic input/output system device pre-detects the out-of-band hardware devices of the server that are about to be booted.
  • S402: the current hardware information of the hardware devices is acquired.
  • S403: the fault information of any faulty hardware device is filtered out of the current hardware information.
  • In this embodiment, the out-of-band hardware devices of the server include but are not limited to: hard disks, server network ports, and device boot attributes. The fault information includes but is not limited to: no bootable device, hard disk (or USB drive) damage (including MBR partition corruption), PXE network boot failure (including port information and the network failing to respond to ping), and an abnormal ME (Management Engine) working state.
  • Preferably, when the basic input/output system device performs fault detection on hard disk partitions in this phase, it actively issues a detection signal to read the master boot record (MBR partition) data of the hard disk (or USB drive), analyzes the boot flag, the end flag, and the error message data area, and determines according to the hardware detection mechanism provided by the server whether the hard disk (or USB drive) is bootable or damaged. It determines the state and working mode of the communication link between the server and the host by issuing a self-test command, checks whether the network is connected through DHCP (Dynamic Host Configuration Protocol) communication, and lists the board's boot devices to check whether a bootable device exists.
  • As shown in FIG. 5, hardware fault detection and analysis in the operating system service running phase includes the following steps:
  • S501: determine whether a hardware interrupt signal of the server has arrived.
  • S502: if so, the basic input/output system device checks the hardware related to the operating system.
  • S503: the fault information of the hardware is acquired.
  • In the above analysis, when it is determined that the hardware interrupt signal has arrived, the basic input/output system device performs fault detection on the hardware related to the running service, analyzes, classifies, and tallies the detected hardware fault information, and then stores it.
  • The fault information detected in this phase includes but is not limited to: CPU errors and alarms, CBO errors and alarms, QPI errors and alarms, VT-D errors and alarms, IIO port errors and alarms, memory errors and alarms, PCIE errors and alarms, PCU errors and alarms, and Ubox (Utility Box) errors and alarms.
  • Preferably, during hardware fault detection in this phase, the BIOS device enables the MCA (Machine Check Architecture) function and the AER (Advanced Error Reporting) enhanced error logging function, turns on the error check bank (Machine Check Error Bank) switch for each component, and hooks in the fault identification and classification function and the error handling hook function of each component.
  • When an MCE (Machine-Check Exception) occurs, the hardware pulls the error status pin low and generates a system management interrupt (SMI).
  • The BIOS device then gains control, reads the error status registers of the CPU and the bridge chip through the hardware fault identification and classification function, obtains the specific contents of the error check bank (Machine Check Error Bank), parses them in detail according to the chip manual, and separates out and interprets the specific hardware error information.
  • Embodiment 2:
  • This embodiment provides a basic input/output system device. The BIOS device can be installed in any server to detect hardware faults in any working phase of the server. As shown in FIG. 6, the basic input/output system device 60 includes:
  • the fault information detection triggering module 61, configured to detect that the server enters a startup phase;
  • the fault information detecting module 62, configured to begin fault detection on the hardware of the server in each working phase, where the working phase includes the startup phase;
  • the fault information storage module 63, configured to store the hardware fault information obtained by detection and analysis.
  • In the startup phase of the server, the fault information detecting module 62 pre-detects at least one of the CPU, memory, chipset, and power supply of the server according to the hardware detection mechanism provided by the server to acquire current hardware information, then filters the faulty hardware information out of it and analyzes it to obtain the corresponding hardware fault information.
  • In the device enumeration phase of the server, the fault information detecting module 62 acquires the state information and resource information of each piece of hardware on the server and identifies from it the fault information of the faulty hardware.
  • In the operating system pre-boot phase of the server, the fault information detecting module 62 pre-detects the out-of-band hardware devices of the server that are about to be booted, acquires their current hardware information, and filters out the fault information of any faulty hardware device.
  • In the operating system service running phase of the server, hardware fault detection by the fault information detecting module 62 includes: the basic input/output system device determines whether a hardware interrupt signal of the server has arrived and, if so, checks the hardware related to the operating system and acquires the fault information of that hardware.
  • The device further includes a storage setting module 64, configured to allocate, on the serial flash memory of the server, a fault storage area for storing the hardware fault information before the fault information storage module stores the fault information.
  • The present invention further provides a server that includes the basic input/output system device described above.
  • The technical solution provided by the invention can be widely applied to computers, network communication devices, and similar equipment. Fault detection by the basic input/output system device on the hardware devices throughout the operating cycle of the server can prevent faults during operation and improves the stability and reliability of the server.
  • The invention is applicable to the fields of computers and communications, and is used to detect, record, and store real-time fault information for the hardware in each working phase of the server.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

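A server hardware fault detection method, an apparatus therefor, and a server. The method includes: a basic input/output system device of a server detects that the server enters a startup phase (S101); the basic input/output system device begins fault detection on the hardware of the server in each working phase, the working phases including the startup phase (S102); and the basic input/output system device stores the detected hardware fault information (S103). By detecting hardware fault information through the basic input/output system device, the method performs fault pre-detection on the server's hardware across the entire operating cycle of the server, so that faults occurring during operation are handled promptly and the stability and reliability of the server are improved. Storing the detected hardware fault information makes faults easier for personnel to handle and provides unified storage management of the hardware fault information.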

Description

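Server hardware fault detection method, apparatus therefor, and server
Technical Field
The present invention relates to the field of computers and communications, and in particular to a server hardware fault detection method, an apparatus therefor, and a server.
Background
On current mid-range and high-end servers, the server generally provides a partial "black box" function that records fault information when the operating system crashes. It can record kernel exceptions of the OS (operating system) such as kernel errors, restarts and resets, and abnormal print messages. Some simple hardware errors can also be recorded through the SEL (System Event Log), errors can be collected on site after a fault occurs by out-of-band means (such as a joint test link), or device anomalies can be monitored passively through an in-band exception trigger mechanism, which records only when an abnormal condition triggers its exception recording module. These methods help maintenance personnel determine the cause of a fault to some extent, but they still have the following defects:
1. The above methods record faults only when passively triggered; they lack active detection of the server, especially active screening and monitoring of server hardware faults. When the system starts and runs normally but service quality drops sharply, the system does not trigger fault recording. Fault information is then missed, which makes it difficult for maintenance personnel to trace the fault during maintenance.
2. Because fault information is detected and recorded only when the system crashes or an exception is triggered, the ability to collect and analyze hardware faults while the system (service) is running is seriously insufficient. This weakens the system's early-warning capability and reduces its stability and reliability.
3. The recorded fault information is too simple and scattered, with no accurate, unified record management. The fault information cannot be analyzed in one pass; extensive later analysis, screening, and cross-validation are needed to find the main fault source.
4. Collecting fault information by out-of-band means is constrained by specialist personnel, the site environment, information security, and similar factors, and the costs of environment deployment, personnel coordination, and environment restoration are high.
Therefore, current server fault recording schemes can detect and record fault information only under specific conditions, and the recorded information is simple and scattered and requires extensive later analysis.
Summary of the Invention
The main technical problem to be solved by the present invention is to provide a server hardware fault detection method, an apparatus therefor, and a server, which solve the problem that the prior art cannot detect, record, and store real-time fault information for the hardware in each working phase of the server.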
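To solve the above technical problem, the present invention provides a server hardware fault detection method, including:
a basic input/output system device of a server detects that the server enters a startup phase;
the basic input/output system device begins hardware fault detection on the server in each working phase, where the working phase includes the startup phase;
the basic input/output system device stores the detected hardware fault information.
In one embodiment of the present invention, the startup phase includes an initialization phase, and hardware fault detection on the server by the basic input/output system device in the initialization phase includes:
the basic input/output system device pre-detects at least one of the CPU, memory, chipset, and power supply of the server according to a hardware detection mechanism provided by the server to acquire current hardware information, then filters the faulty hardware information out of it and analyzes it to obtain the corresponding hardware fault information.
In another embodiment of the present invention, the startup phase further includes a device enumeration phase, and hardware fault detection on the server by the basic input/output system device in the device enumeration phase includes:
the basic input/output system device acquires the state information and resource information of each piece of hardware on the server and identifies from it the fault information of the faulty hardware.
In another embodiment of the present invention, the startup phase is a cold start phase or a hot start phase.
In another embodiment of the present invention, the working phase further includes at least one of an operating system pre-boot phase and an operating system service running phase.
In another embodiment of the present invention, when the working phase includes an operating system pre-boot phase, hardware fault detection on the server by the basic input/output system device in that phase includes:
the basic input/output system device pre-detects the out-of-band hardware devices of the server that are about to be booted;
the current hardware information of the hardware devices is acquired;
the fault information of any faulty hardware device is filtered out of the current hardware information;
when the working phase includes an operating system service running phase, hardware fault detection on the server by the basic input/output system device in that phase includes: the basic input/output system device determines whether a hardware interrupt signal of the server has arrived and, if so, checks the hardware related to the operating system and acquires the fault information of that hardware.
In another embodiment of the present invention, before the basic input/output system device stores the detected fault information, the method further includes allocating, on the serial flash memory of the server, a fault storage area for storing the hardware fault information.
To solve the above technical problem, the present invention further provides a basic input/output system device, including:
a fault information detection triggering module, configured to detect whether the server enters a startup phase;
a fault information detecting module, configured to begin hardware fault detection on the server in each working phase when the fault information detection triggering module detects that the server has entered the startup phase, where the working phase includes the startup phase;
a fault information storage module, configured to store the hardware fault information detected by the fault information detecting module.
In another embodiment of the present invention, the device further includes a storage setting module, configured to allocate, on the serial flash memory of the server, a fault storage area for storing the hardware fault information before the fault information storage module stores the hardware fault information.
To solve the above technical problem, the present invention further provides a server that includes the basic input/output system device described above.
The beneficial effects of the present invention are as follows:
In the server hardware fault detection method, apparatus, and server provided by the invention, when the basic input/output system (Basic Input Output System, BIOS) device of a server detects that the server has entered a startup phase, it begins fault detection and analysis on the hardware in each working phase of the server to obtain the corresponding hardware fault information. Because the server's own BIOS device is used, all hardware faults that may occur during the entire operating cycle of the server can be detected, which improves the comprehensiveness and accuracy of hardware fault detection and makes unified storage management of the server's hardware fault information easier. Maintenance personnel can therefore obtain accurate hardware fault information when maintaining the server and learn the location of the hardware that needs attention and the cause of the fault, which further improves the stability and reliability of the server.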
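Brief Description of the Drawings
FIG. 1 is a flowchart of the server hardware fault detection method provided by the present invention;
FIG. 2 is a flowchart of hardware fault detection in the server initialization phase provided by the present invention;
FIG. 3 is a flowchart of hardware fault detection in the device enumeration phase of the present invention;
FIG. 4 is a flowchart of hardware fault detection in the operating system pre-boot phase of the present invention;
FIG. 5 is a flowchart of hardware fault detection in the operating system service running phase of the present invention;
FIG. 6 is a structural block diagram of the basic input/output system device provided by the present invention.
Detailed Description
The present invention is described in further detail below through specific embodiments with reference to the accompanying drawings.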
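Embodiment 1:
Referring to FIG. 1, FIG. 1 is a flowchart of the server hardware fault detection method provided by the present invention. In this method, the basic input/output system device actively performs fault detection on the hardware of the server. "Actively" here means that, according to a detection mechanism preset on the server, the BIOS device performs fault detection on the running hardware of the server as soon as the server starts, or performs hardware fault detection in every working phase of the server. The method includes the following steps:
S101: the basic input/output system device of the server detects that the server enters a startup phase;
In this embodiment, the startup phase of the server is a cold start phase or a hot start phase. When the startup phase is a cold start phase, the basic input/output system device can detect entry into the startup phase in ways including, but not limited to, the following: checking whether the power button on the server has been pressed, checking whether the power supply circuit of the server is connected to the server's power interface, or checking the status flag of the power supply. If so, the server has entered the startup phase and is running, and step S102 is performed; otherwise, detection continues.
When the startup phase is a hot start phase, the basic input/output system device detects whether a reset start signal has been input to the server. If so, the server begins a hot start and step S102 is performed; otherwise, detection continues. The reset start signal may be triggered by hardware, for example by a reset button; by software, for example by code or a tool that sends it to the server on a timer; or by the user, through a command or the "Restart" button.
S102: the basic input/output system device begins fault detection on the hardware of the server in each working phase, where the working phase includes the startup phase;
S103: the basic input/output system device stores the detected hardware fault information.
In this embodiment, before step S103, the method further includes allocating, on the serial flash memory of the server, a fault storage area for storing the hardware fault information. The fault information recorded and stored by the basic input/output system device includes: time, the event that occurred, severity, the specific location or fault details, and the recommended handling.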
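The record fields and the reserved fault storage area can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `FaultRecord` and `FaultStorageArea`, the JSON-lines encoding, and the in-memory `bytearray` standing in for a region of serial flash are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class FaultRecord:
    """One stored hardware fault entry, using the fields listed in the text."""
    time: str                 # when the fault was detected
    event: str                # the event that occurred
    severity: str             # severity label (hypothetical values, e.g. "error")
    location: str             # specific location or fault details
    recommended_action: str   # recommended handling


class FaultStorageArea:
    """In-memory stand-in for a fault storage area reserved in serial flash."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.buffer = bytearray()

    def append(self, record: FaultRecord) -> bool:
        """Append one record as a JSON line; refuse it if the area is full."""
        line = (json.dumps(asdict(record)) + "\n").encode("utf-8")
        if len(self.buffer) + len(line) > self.capacity:
            return False
        self.buffer += line
        return True

    def read_all(self) -> List[FaultRecord]:
        """Decode every stored record, oldest first."""
        lines = self.buffer.decode("utf-8").splitlines()
        return [FaultRecord(**json.loads(line)) for line in lines]
```

Allocating the area before the first store mirrors the ordering in the text; real firmware would write to flash through its own driver and would likely use a compact binary layout instead of JSON.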
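In this embodiment, after hardware fault information has been detected and stored by the above steps, maintenance personnel who need to maintain the server can obtain the stored information directly through an out-of-band control platform or web user interface connected to the storage area. This makes it easy to trace how a fault developed and to restore or replace the faulty hardware on site (for example, replacing a particular CPU, a particular memory module, or a faulty bus interface card). On mid-range and high-end servers, hot-swap technology (including but not limited to CPU hot swap, memory hot swap, and bus interface hot swap) can keep the system running without interruption, which achieves early detection, early warning, early prevention, and early handling. Even if the server hangs during a cold or hot start and its peripherals are unusable (for example, the network port is unreachable, the screen stays dark, or the keyboard and mouse do not respond), valid fault information can still be obtained.
In this embodiment, operations personnel who obtain the hardware fault information through the control platform can, in addition to handling it promptly, dump it to another out-of-band storage device.
In this embodiment, the startup phase includes an initialization phase. The steps by which the basic input/output system device performs fault detection on the hardware of the server in the initialization phase are shown in FIG. 2 and include:
S201: the basic input/output system device initializes the CPU, memory, chipset, and power supply;
S202: the basic input/output system device detects and acquires the current hardware information of at least one of the CPU, memory, chipset, and power supply;
In this embodiment, the basic input/output system device pre-detects at least one of the CPU, memory, chipset, and power supply of the server according to the hardware detection mechanism provided by the server to acquire current hardware information, then filters the faulty hardware information out of it and analyzes it to obtain the corresponding hardware fault information.
Specifically, in this embodiment, when the basic input/output system device detects that the server has entered the initialization phase, the BIOS device can actively check the faults and configuration of server hardware such as the CPU, memory, chipset, and power supply by using the BIOS device itself, by actively applying stress, by using the hardware detection mechanisms provided by the CPU and chipset, or by using in-band integrated tools (such as a memory test tool or a system event log test tool), and acquire the corresponding hardware information. It then pre-analyzes, pre-tallies, pre-screens, scans, and measures the hardware based on that information, collects the test results, and filters out the valid fault information (including information that may trigger later system exceptions), which it records in detail and stores. If a system exception occurs in this phase, the server has therefore already acquired and recorded more detailed hardware fault information before the exception. In this phase, the recorded and stored fault information includes but is not limited to: CPU errors and alarms, CBO (Caching Agent) errors and alarms, QPI (Quick Path Interconnect) errors and alarms, IIO (Integrated I/O) port errors and alarms, HA (Home Agent) errors and alarms, IMC (Integrated Memory Controller) errors and alarms, PCU (Power Control Unit) errors and alarms, power and voltage errors and alarms, and memory errors and alarms (including memory module errors and alarms, memory channel errors and alarms, memory population errors and alarms, memory voltage errors and alarms, memory incompatibility errors and alarms, and configuration errors and alarms).
In this embodiment, the startup phase further includes a device enumeration phase. The flowchart for hardware fault detection in this phase is shown in FIG. 3 and includes the following steps:
S301: the basic input/output system device starts device enumeration;
S302: the basic input/output system device detects and acquires the current information of the devices;
Further, the basic input/output system device acquires the state information and resource information of each piece of hardware on the server and identifies from it the fault information of faulty hardware and software. In this phase, the fault information includes but is not limited to: device access errors (including illegal memory and I/O resource requests), third-party firmware (option ROM) not executed (including insufficient space or incorrect format), and devices disabled because of damage. Specifically, in this embodiment, when the server issues probe tasks to PCIE (Peripheral Component Interconnect Express) bus peripherals and calculates resource requirements, the BIOS device, according to the detection mechanism, identifies the third-party firmware (option ROM) identifier defined by the industry specification, the vendor information, and the device class information and capacity, checks hardware status indications (such as link status and bandwidth information), and identifies and stores the fault information of faulty hardware from this information.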
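The two option ROM faults named for the enumeration phase (incorrect format and insufficient space) can be illustrated with a short check. The sketch assumes the legacy x86 expansion ROM layout, in which an image starts with the bytes 0x55 0xAA and byte 2 gives the image length in 512-byte blocks; the patent does not say which checks its BIOS performs, and the function name and fault strings are hypothetical.

```python
from typing import Optional

ROM_SIGNATURE = b"\x55\xaa"   # first two bytes of a legacy expansion ROM image
ROM_BLOCK_SIZE = 512          # byte 2 counts the image length in 512-byte blocks


def check_option_rom(image: bytes, free_space: int) -> Optional[str]:
    """Return a fault description, or None if the image could be executed."""
    if len(image) < 3 or image[:2] != ROM_SIGNATURE:
        return "option ROM not executed: bad format"
    declared_size = image[2] * ROM_BLOCK_SIZE
    # A declared length of zero, or one longer than the data read, is malformed.
    if declared_size == 0 or declared_size > len(image):
        return "option ROM not executed: bad format"
    if declared_size > free_space:
        return "option ROM not executed: insufficient space"
    return None
```

A returned string would then be wrapped in a fault record and stored; `None` means enumeration can proceed for that device.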
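In this embodiment, the working phase further includes at least one of an operating system pre-boot phase and an operating system service running phase. FIG. 4 and FIG. 5 are flowcharts of hardware fault detection in the operating system pre-boot phase and the operating system service running phase, respectively;
As shown in FIG. 4, hardware fault detection and analysis in the operating system pre-boot phase includes the following steps:
S401: the basic input/output system device pre-detects the out-of-band hardware devices of the server that are about to be booted;
S402: the current hardware information of the hardware devices is acquired;
S403: the fault information of any faulty hardware device is filtered out of the current hardware information;
In this embodiment, the out-of-band hardware devices of the server include but are not limited to: hard disks, server network ports, and device boot attributes. The fault information includes but is not limited to: no bootable device, hard disk (or USB drive) damage (including MBR partition corruption), PXE network boot failure (including port information and the network failing to respond to ping), and an abnormal ME (Management Engine) working state. Preferably, when the basic input/output system device performs fault detection on hard disk partitions in this phase, it actively issues a detection signal to read the master boot record (MBR partition) data of the hard disk (or USB drive), analyzes the boot flag, the end flag, and the error message data area, and determines according to the hardware detection mechanism provided by the server whether the hard disk (or USB drive) is bootable or damaged. It determines the state and working mode of the communication link between the server and the host by issuing a self-test command, checks whether the network is connected through DHCP (Dynamic Host Configuration Protocol) communication, and lists the board's boot devices to check whether a bootable device exists.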
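The boot flag and end flag analysis of the master boot record can be sketched as below. The layout used is the classic MBR format (a 512-byte sector, four 16-byte partition entries starting at offset 446, and the end signature 0x55 0xAA in the last two bytes); the function name and fault strings are hypothetical, and treating a disk without an active partition as not bootable is a simplifying assumption.

```python
from typing import List

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # four 16-byte entries follow the boot code
BOOT_SIGNATURE = b"\x55\xaa"   # end flag in the last two bytes of the sector


def check_mbr(sector: bytes) -> List[str]:
    """Return the faults found in a master boot record (empty list = no fault)."""
    if len(sector) != SECTOR_SIZE:
        return ["MBR unreadable: wrong sector size"]
    faults = []
    if sector[510:512] != BOOT_SIGNATURE:
        faults.append("MBR damaged: end flag missing")
    # The first byte of each partition entry is its boot flag:
    # 0x80 marks the active partition, 0x00 an inactive one.
    boot_flags = [sector[PARTITION_TABLE_OFFSET + 16 * i] for i in range(4)]
    if any(flag not in (0x00, 0x80) for flag in boot_flags):
        faults.append("MBR damaged: invalid boot flag")
    elif 0x80 not in boot_flags:
        faults.append("no bootable partition")
    return faults
```

For example, a sector with the end signature in place and one entry flagged 0x80 yields an empty list, while clearing that flag yields `["no bootable partition"]`.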
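As shown in FIG. 5, hardware fault detection and analysis in the operating system service running phase includes the following steps:
S501: determine whether a hardware interrupt signal of the server has arrived;
S502: if so, the basic input/output system device checks the hardware related to the operating system;
S503: the fault information of the hardware is acquired;
In the above analysis, when it is determined that the hardware interrupt signal has arrived, the basic input/output system device performs fault detection on the hardware related to the running service, analyzes, classifies, and tallies the detected hardware fault information, and then stores it. The fault information detected in this phase includes but is not limited to: CPU errors and alarms, CBO errors and alarms, QPI errors and alarms, VT-D errors and alarms, IIO port errors and alarms, memory errors and alarms, PCIE errors and alarms, PCU errors and alarms, and Ubox (Utility Box) errors and alarms. Preferably, during hardware fault detection in this phase, the BIOS device enables the MCA (Machine Check Architecture) function and the AER (Advanced Error Reporting) enhanced error logging function, turns on the error check bank (Machine Check Error Bank) switch for each component, and hooks in the fault identification and classification function and the error handling hook function of each component. When an MCE (Machine-Check Exception) occurs, the hardware pulls the error status pin low and generates a system management interrupt (SMI). The BIOS device then gains control, reads the error status registers of the CPU and the bridge chip through the hardware fault identification and classification function, obtains the specific contents of the error check bank (Machine Check Error Bank), parses them in detail according to the chip manual, and separates out and interprets the specific hardware error information.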
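As an illustration of parsing an error check bank, the sketch below decodes a 64-bit machine-check bank status value. It assumes an x86 platform whose status registers follow the Intel IA32_MCi_STATUS layout (bit 63 valid, bit 62 overflow, bit 61 uncorrected, bit 58 address valid, bit 57 processor context corrupt, bits 31:16 model-specific code, bits 15:0 MCA error code). The function name, the dictionary keys, and the mapping of uncorrected errors to "error" and corrected ones to "warning" are hypothetical, and the further per-chip interpretation the text mentions is left out.

```python
from typing import Dict, Optional


def decode_mc_status(bank: int, status: int) -> Optional[Dict[str, object]]:
    """Decode one machine-check bank status value; None if nothing is logged."""
    if not (status >> 63) & 1:          # VAL bit clear: the bank holds no error
        return None
    uncorrected = bool((status >> 61) & 1)
    return {
        "bank": bank,
        "overflow": bool((status >> 62) & 1),         # an earlier error was lost
        "uncorrected": uncorrected,
        "address_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
        "severity": "error" if uncorrected else "warning",
    }
```

A handler running after the SMI could call this once per bank and store each non-`None` result; reading the registers themselves is platform-specific and not shown.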
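Embodiment 2:
This embodiment provides a basic input/output system device. The BIOS device can be installed in any server to detect hardware faults in any working phase of the server. As shown in FIG. 6, the basic input/output system device 60 includes:
the fault information detection triggering module 61, configured to detect that the server enters a startup phase;
the fault information detecting module 62, configured to begin fault detection on the hardware of the server in each working phase, where the working phase includes the startup phase;
the fault information storage module 63, configured to store the hardware fault information obtained by detection and analysis.
In this embodiment, in the startup phase of the server, the fault information detecting module 62 pre-detects at least one of the CPU, memory, chipset, and power supply of the server according to the hardware detection mechanism provided by the server to acquire current hardware information, then filters the faulty hardware information out of it and analyzes it to obtain the corresponding hardware fault information.
In the device enumeration phase of the server, the fault information detecting module 62 acquires the state information and resource information of each piece of hardware on the server and identifies from it the fault information of the faulty hardware.
In the operating system pre-boot phase of the server, the fault information detecting module 62 pre-detects the out-of-band hardware devices of the server that are about to be booted;
the current hardware information of the hardware devices is acquired;
the fault information of any faulty hardware device is filtered out of the current hardware information;
In the operating system service running phase of the server, hardware fault detection by the fault information detecting module 62 includes: the basic input/output system device determines whether a hardware interrupt signal of the server has arrived and, if so, checks the hardware related to the operating system and acquires the fault information of that hardware.
In this embodiment, the device further includes a storage setting module 64, configured to allocate, on the serial flash memory of the server, a fault storage area for storing the hardware fault information before the fault information storage module stores the fault information.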
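The way modules 61 to 64 cooperate can be sketched in a few lines. This is only one possible arrangement under simplifying assumptions: the trigger is reduced to two boolean inputs, each working phase is represented by a hypothetical detector callable that returns a list of fault strings, and the fault storage area is a plain Python list. None of these names come from the patent.

```python
from typing import Callable, Dict, List, Optional


def detect_startup(power_on: bool, reset_signal: bool) -> Optional[str]:
    """Trigger module (61): report a cold start, a hot start, or nothing yet."""
    if reset_signal:
        return "hot"
    if power_on:
        return "cold"
    return None


class BiosFaultPipeline:
    """Detecting module (62) plus storage modules (63, 64) in one object."""

    def __init__(self, detectors: Dict[str, Callable[[], List[str]]]) -> None:
        self.detectors = detectors              # one detector per working phase
        self.storage: Optional[List[dict]] = None

    def allocate_storage(self) -> None:
        """Storage setting module (64): reserve the fault storage area."""
        self.storage = []

    def run(self, startup: Optional[str]) -> int:
        """Run every phase detector once the trigger fires; return fault count."""
        if startup is None:
            return 0
        if self.storage is None:
            self.allocate_storage()             # allocate before the first store
        for phase, detect in self.detectors.items():
            for fault in detect():
                self.storage.append(
                    {"start": startup, "phase": phase, "fault": fault}
                )
        return len(self.storage)
```

A caller would build the pipeline with one detector per phase, for example `{"initialization": ..., "device enumeration": ...}`, and pass the result of `detect_startup` to `run`.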
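The present invention further provides a server that includes the basic input/output system device described above.
The technical solution provided by the invention can be widely applied to computers, network communication devices, and similar equipment. Fault detection by the basic input/output system device on the hardware devices throughout the operating cycle of the server can prevent faults during operation and improves the stability and reliability of the server.
The above is a further detailed description of the invention in connection with specific embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. For a person of ordinary skill in the art to which the invention belongs, several simple deductions or substitutions can be made without departing from the concept of the invention, and all of them should be regarded as falling within the scope of protection of the invention.
Industrial Applicability
The invention is applicable to the fields of computers and communications, and is used to detect, record, and store real-time fault information for the hardware in each working phase of the server.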

Claims (10)

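  1. A server hardware fault detection method, comprising:
    a basic input/output system device of a server detecting that the server enters a startup phase;
    the basic input/output system device beginning hardware fault detection on the server in each working phase, the working phase including the startup phase;
    the basic input/output system device storing the detected hardware fault information.
  2. The server fault detection method of claim 1, wherein the startup phase includes an initialization phase, and hardware fault detection on the server by the basic input/output system device in the initialization phase comprises:
    the basic input/output system device pre-detecting at least one of the CPU, memory, chipset, and power supply of the server according to a hardware detection mechanism provided by the server to acquire current hardware information, then filtering the faulty hardware information out of it and analyzing it to obtain the corresponding hardware fault information.
  3. The server hardware fault detection method of claim 2, wherein the startup phase further includes a device enumeration phase, and hardware fault detection on the server by the basic input/output system device in the device enumeration phase comprises:
    the basic input/output system device acquiring the state information and resource information of each piece of hardware on the server and identifying from it the fault information of the faulty hardware.
  4. The server hardware fault detection method of any one of claims 1 to 3, wherein the startup phase is a cold start phase or a hot start phase.
  5. The server hardware fault detection method of any one of claims 1 to 3, wherein the working phase further includes at least one of an operating system pre-boot phase and an operating system service running phase.
  6. The server hardware fault detection method of claim 5, wherein, when the working phase includes an operating system pre-boot phase, hardware fault detection on the server by the basic input/output system device in that phase comprises:
    the basic input/output system device pre-detecting the out-of-band hardware devices of the server that are about to be booted;
    acquiring the current hardware information of the hardware devices;
    filtering the fault information of any faulty hardware device out of the current hardware information;
    and, when the working phase includes an operating system service running phase, hardware fault detection on the server by the basic input/output system device in that phase comprises: the basic input/output system device determining whether a hardware interrupt signal of the server has arrived and, if so, checking the hardware related to the operating system and acquiring the fault information of that hardware.
  7. The server hardware fault detection method of any one of claims 1 to 3, further comprising, before the basic input/output system device stores the detected fault information, allocating, on the serial flash memory of the server, a fault storage area for storing the hardware fault information.
  8. A basic input/output system device, comprising:
    a fault information detection triggering module, configured to detect whether a server enters a startup phase;
    a fault information detecting module, configured to begin hardware fault detection on the server in each working phase when the fault information detection triggering module detects that the server has entered the startup phase, the working phase including the startup phase;
    a fault information storage module, configured to store the hardware fault information detected by the fault information detecting module.
  9. The basic input/output system device of claim 8, further comprising a storage setting module, configured to allocate, on the serial flash memory of the server, a fault storage area for storing the hardware fault information before the fault information storage module stores the hardware fault information.
  10. A server, comprising the basic input/output system device of claim 8 or 9.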
PCT/CN2016/100618 2015-10-16 2016-09-28 Server hardware fault detection method, apparatus thereof, and server WO2017063505A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510673005.5 2015-10-16
CN201510673005.5A CN106598790A (zh) 2015-10-16 2015-10-16 Server hardware fault detection method, apparatus thereof, and server

Publications (1)

Publication Number Publication Date
WO2017063505A1 true WO2017063505A1 (zh) 2017-04-20

Family

ID=58517771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/100618 WO2017063505A1 (zh) 2015-10-16 2016-09-28 Server hardware fault detection method, apparatus thereof, and server

Country Status (2)

Country Link
CN (1) CN106598790A (zh)
WO (1) WO2017063505A1 (zh)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
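CN109117299B (zh) * 2017-06-23 2022-04-05 佛山市顺德区顺达电脑厂有限公司 Server debugging apparatus and debugging method thereof
CN107291584A (zh) * 2017-06-27 2017-10-24 郑州云海信息技术有限公司 Chassis fault detection method and system
CN109426606A (zh) * 2017-08-23 2019-03-05 东软集团股份有限公司 Kernel fault diagnosis information processing method, apparatus, storage medium, and electronic device
CN109918257B (zh) * 2017-12-12 2022-11-04 杭州海康威视数字技术股份有限公司 Hard disk exception handling method and apparatus
CN109697144B (zh) * 2018-11-22 2022-09-23 合肥联宝信息技术有限公司 Hard disk detection method for an electronic device, and electronic device
CN109783283A (zh) * 2018-12-11 2019-05-21 中国长城科技集团股份有限公司 Method and apparatus for processing hardware detection information, and terminal device
CN111722954A (zh) * 2020-06-30 2020-09-29 曙光信息产业(北京)有限公司 Server anomaly locating method, apparatus, storage medium, and server
CN111767184A (zh) * 2020-09-01 2020-10-13 苏州浪潮智能科技有限公司 Fault diagnosis method and apparatus, electronic device, and storage medium
CN112148576B (zh) * 2020-09-28 2021-06-08 北京基调网络股份有限公司 Application performance monitoring method, system, and storage medium
CN113190278B (zh) * 2021-03-18 2023-03-17 山东英信计算机技术有限公司 Multi-scenario fault handling method, system, and medium
CN113064747B (zh) * 2021-03-26 2022-10-28 山东英信计算机技术有限公司 Method, system, and apparatus for locating faults during server startup
CN115495301A (zh) * 2021-06-18 2022-12-20 华为技术有限公司 Fault handling method, apparatus, device, and system
CN115047322B (zh) * 2022-08-17 2022-10-25 中诚华隆计算机技术有限公司 Method and system for identifying faulty chips in intelligent medical equipment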

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090063509A1 (en) * 2007-08-30 2009-03-05 Sqlalert Corporation Method and Apparatus for Monitoring Network Servers
US20120221885A1 (en) * 2011-02-24 2012-08-30 Fujitsu Limited Monitoring device, monitoring system and monitoring method
CN103166773A (zh) * 2011-12-09 2013-06-19 国家电网公司 Method and system for monitoring server running status
CN103713981A (zh) * 2013-12-31 2014-04-09 国网山东省电力公司 Database server performance detection and early-warning method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10153998A1 (de) * 2001-11-02 2003-05-22 Siemens Ag Method for displaying error messages on a small computer
WO2012119432A1 (zh) * 2011-08-31 2012-09-13 华为技术有限公司 Method for improving computer system stability, and computer system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
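CN110187994A (zh) * 2019-05-28 2019-08-30 北京星网锐捷网络技术有限公司 Fault isolation method, device, and fault isolation system
CN110737560A (zh) * 2019-10-22 2020-01-31 北京百度网讯科技有限公司 Service status detection method, apparatus, electronic device, and medium
CN110737560B (zh) * 2019-10-22 2023-10-20 北京百度网讯科技有限公司 Service status detection method, apparatus, electronic device, and medium
CN113220407A (zh) * 2020-02-04 2021-08-06 北京京东振世信息技术有限公司 Method and apparatus for fault drills
CN113220407B (zh) * 2020-02-04 2023-09-26 北京京东振世信息技术有限公司 Method and apparatus for fault drills
CN113590413A (zh) * 2021-06-29 2021-11-02 浪潮商用机器有限公司 Unix server, and Unix server fault early-warning method and apparatus
CN113590413B (zh) * 2021-06-29 2024-05-10 浪潮商用机器有限公司 Unix server, and Unix server fault early-warning method and apparatus
WO2023178923A1 (zh) * 2022-03-23 2023-09-28 苏州浪潮智能科技有限公司 Intelligent monitoring fine-tuning method, apparatus, device, and storage medium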

Also Published As

Publication number Publication date
CN106598790A (zh) 2017-04-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16854884

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16854884

Country of ref document: EP

Kind code of ref document: A1