WO2015039598A1 - Fault locating method and device - Google Patents

Fault locating method and device

Info

Publication number
WO2015039598A1
Authority
WO
WIPO (PCT)
Prior art keywords
abnormal
information
detected
trigger condition
fault
Prior art date
Application number
PCT/CN2014/086684
Other languages
French (fr)
Chinese (zh)
Inventor
刘通良
姜广吉
陈俊杰
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2015039598A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2284 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing by power-on test, e.g. power-on self test [POST]

Definitions

  • the embodiments of the present invention relate to computer technologies, and in particular, to a fault location method and apparatus.
  • CRT cathode ray tube
  • LCD liquid crystal display
  • A logical block diagram of a typical computer or server is shown in FIG. 1.
  • the main board 100 includes: a memory 10, a CPU 20, a north bridge 30, a south bridge 40, a peripheral 50, and a basic input and output system (BIOS) 60.
  • the motherboard 100 is also connected to a peripheral expansion device (hard disk, graphics card, etc.) 70.
  • an apparatus 80 is added, wherein the apparatus 80 includes a memory 81, a controller 82 and a display module 83, as shown in FIG. 2.
  • the structure shown in FIG. 2 is used to perform the following steps to display the self-test information of the POST phase:
  • the POST phase self-test information is sent to the controller 82 according to a certain data structure
  • the controller 82 sends the decoded data to the display module 83 to display the POST phase self-test information in real time.
  • the prior art requires an additional controller 82 and display module 83 to display the POST phase self-test information, which includes fault information; in addition, the user must visit the equipment site in person to obtain the self-test information.
  • the embodiment of the invention provides a fault location method and device for realizing the viewing and positioning of fault information in a remote manner.
  • an embodiment of the present invention provides a fault location method, including:
  • collecting abnormal information when the abnormal trigger condition is detected, where the abnormal information includes at least fault information of the central processing unit (CPU);
  • the abnormal information is reported to the monitoring server through the network.
  • collecting abnormal information includes:
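  • the abnormal information is collected using the entry function.
  • triggering the entry function corresponding to each abnormal information trigger source includes:
  • monitoring the hardware abnormal trigger condition includes:
  • when the power-on boot device executes the BIOS program, monitoring whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determining that an abnormal trigger condition is detected, wherein the abnormal event or abnormal message is an event or message that is generated while the BIOS program is executed and that triggers an exception.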
  • collecting the abnormal information includes:
  • the fault information is collected, and the software call stack relationship and/or the value of the program counter and the program status register at the time of the fault occurrence are collected to indicate the location of the fault information.
  • the reporting the abnormal information to the monitoring server by using the network includes:
  • the abnormal information is reported to the monitoring server by using the intelligent platform management interface IPMI or standard Ethernet.
  • any one of the first to fifth possible implementation manners of the first aspect, in the sixth possible implementation, before the reporting the abnormality information to the monitoring server by using the network also includes:
  • the exception information is encapsulated into capsules including header information, hardware error information, program run stack information, program counters, program status register information, and trailer information.
  • an embodiment of the present invention provides a fault location apparatus, including:
  • the monitoring driving module is configured to monitor a hardware abnormal triggering condition when the power-on starting device executes the basic input/output system BIOS program;
  • an abnormality information collecting module, configured to collect abnormal information when the monitoring driving module detects an abnormal triggering condition, where the abnormality information includes at least fault information of the central processing unit (CPU);
  • the information reporting module is configured to report the abnormality information to the monitoring server through the network.
  • the abnormal information collecting module is specifically configured to: when an abnormal trigger condition is detected in the process of executing the BIOS program, trigger an entry function corresponding to each abnormal information trigger source; and use the entry function to collect abnormal information.
  • the abnormal information collection module is also used to:
  • the monitoring driving module is specifically configured to: when the power-on boot device executes the BIOS program, monitor whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determine that an abnormal trigger condition is detected, wherein the abnormal event or abnormal message is an event or message that is generated while the BIOS program is executed and that triggers an exception.
  • the abnormal information collecting module is specifically configured to collect fault information when an abnormal trigger condition is detected, and to collect the software call stack relationship and/or the values of the program counter and the program status register at the time the fault occurs, to indicate the location of the fault information.
  • the fault location method and device of the embodiment of the present invention report the abnormal information to the monitoring server, so that the operator can view the abnormal information remotely; that is, the operator can perform fault location and troubleshooting on the monitoring server side according to the reported abnormal information, which reduces operation and maintenance costs.
  • Figure 1 is a logical block diagram of a typical computer or server
  • FIG. 2 is a schematic structural diagram of a self-checking information display for a POST phase in the prior art
  • FIG. 3 is a flowchart of Embodiment 1 of a fault location method according to the present invention.
  • FIG. 4 is a flowchart of Embodiment 2 of a fault location method according to the present invention.
  • FIG. 5 is a diagram showing an example of a capsule encapsulation format in Embodiment 2 of the fault location method of the present invention.
  • FIG. 6 is a schematic structural diagram of Embodiment 1 of a fault locating device according to the present invention.
  • FIG. 7 is a schematic structural diagram of Embodiment 1 of a fault location system according to the present invention.
  • FIG. 3 is a flowchart of Embodiment 1 of a fault location method according to the present invention.
  • Embodiments of the present invention provide a fault location method, which may be performed by a fault location apparatus, which may be integrated in a computer or a server, and implemented by software and/or hardware. As shown in FIG. 3, the method in this embodiment includes:
  • Step 301 Monitor a hardware abnormal trigger condition when the power-on boot device executes the BIOS program.
  • the BIOS program is executed to initialize the hardware components of the device, such as the central processing unit (CPU), the memory, the north bridge, and the south bridge shown in the block diagram of FIG. 1, and to query the working status of these hardware components.
  • if a fault occurs in the POST phase, first check whether the hardware environment is normal.
  • in the POST phase, the BIOS program is executed, and a callback function is registered in the BIOS program. If the hardware environment of the device is abnormal, for example, the CPU initialization is abnormal, an abnormal trigger condition is generated and the callback function is triggered.
  • the callback function then collects the abnormal information, that is, step 302 is executed: the abnormal trigger condition caused by the abnormal hardware environment of the device triggers the sub-functions that collect the abnormal information, and the abnormal trigger condition is used as an input parameter of the callback function.
  • Step 302 Collect abnormal information when the abnormal trigger condition is detected, where the abnormal information includes at least fault information of the CPU.
  • the BIOS program includes at least one sub-function for collecting abnormal information, and the sub-functions are called by the above-mentioned callback function.
  • the fault information includes fault information of hardware components such as the CPU, memory, north bridge, and south bridge, for example, CPU fault information (Core or Uncore), which is used to determine whether the current fault is a CPU fault and to locate the cause of the fault according to the fault phenomenon;
  • the fault information includes fault information of the root port (Root Port) and of Peripheral Component Interconnect Express (PCIe) devices.
  • the fault information is used to check whether there is a north bridge or PCIe device fault, especially when an input/output occurs.
  • IO Input/Output
  • SMI System Management Interrupt
  • DDR Double Data Rate Synchronous Dynamic Random Access Memory
  • Step 303 The abnormality information is reported to the monitoring server through the network.
  • the monitoring server receives the abnormal information through Ethernet communication or LPC communication, parses and records it, and saves it to a local storage medium, including but not limited to a hard disk and non-volatile random access memory (NVRAM), for fault management and long-term maintenance and as important data for subsequent fault location; at the same time, the operator can be informed of the fault in a readable, visual form, and maintenance personnel can also query the fault library based on the abnormal information to obtain more detailed fault location information.
  • NVRAM Non-Volatile Random Access Memory
  • the manner in which the operator knows that the fault is generated is arbitrary.
  • the monitoring server may perform the alarm, or may be learned by the operator in real time, and is not limited herein.
  • the self-test information of the computer or the server in the POST phase is displayed through a display module, such as a display device such as an LCD or a VFT.
  • the display modules are installed on the front panel of the chassis, and therefore, the operator is required to come to the device site.
  • the abnormality information is reported to the monitoring server, which allows the operator to view the abnormal information remotely through the monitoring server.
  • the abnormality information is reported to the monitoring server, so that the operator can view the abnormal information remotely; that is, the operator can perform fault location and troubleshooting on the monitoring server side according to the reported abnormal information, thereby reducing operation and maintenance costs.
  • collecting the abnormal information can be further refined into:
  • the abnormal information trigger source is a hardware component whose state is abnormal, that is, the hardware component mentioned above for which an abnormal trigger condition is detected.
  • the abnormal trigger condition is taken as the input parameter of the entry function corresponding to the trigger source of the abnormal information.
  • triggering an entry function corresponding to each abnormal information trigger source may include:
  • if a CPU abnormal trigger condition is detected, the entry function corresponding to the CPU is triggered; if a memory abnormal trigger condition is detected, the entry function corresponding to the memory is triggered; if a north bridge abnormal trigger condition is detected, the entry function corresponding to the north bridge is triggered.
  • monitoring the hardware abnormal triggering condition may include: monitoring whether to generate an SMI, an abnormal event (Event), or an abnormal message (Message) when the power-on booting device executes the BIOS program. If yes, it is determined that an abnormal trigger condition is detected, wherein the abnormal event or abnormal message is an event or message generated in the BIOS program that triggers an abnormality.
  • the trigger modes for triggering the collection of abnormal information according to the detected abnormal trigger condition include the SMI mode, the abnormal event mode, and the abnormal message mode. The following explains, for each trigger mode, how the entry function corresponding to each abnormal information trigger source is called:
  • if the abnormal trigger condition is detected in the SMI mode, an SMI is triggered, and the entry function corresponding to each abnormal information trigger source is called in the SMI Handler;
  • if the abnormal trigger condition is detected in the abnormal event mode, an Event is sent, and the entry function corresponding to each abnormal information trigger source is called in the callback function of the Event;
  • if the abnormal trigger condition is detected in the abnormal message mode, a message is sent, and the entry function corresponding to each abnormal information trigger source is called in the callback function of the message.
  • collecting the abnormality information may include: when the abnormal trigger condition is detected, collecting the fault information, and collecting the software call stack relationship and/or the values of the program counter and the program status register at the time the fault occurs, to indicate the location of the fault information.
  • in general, reporting only the fault information is not enough to complete fault location and troubleshooting, so more precise information is needed to assist. The present invention therefore borrows the concept of the kernel dump in the Linux OS: while collecting the fault information, it also collects the software call stack relationship at the time the fault occurs and saves it as the data basis for precise location. In addition, it collects and saves the values of the current program counter and program status register, which preserve the CPU running state at the time of the abnormality and help in analyzing the program's execution trail when the exception occurred and whether the processor's register state was correct.
  • reporting the abnormality information to the monitoring server by using the network may include: reporting the abnormality information to the monitoring server by using an Intelligent Platform Management Interface (IPMI) or a standard Ethernet.
  • IPMI Intelligent Platform Management Interface
  • abnormal information can be reported by other communication methods.
  • FIG. 4 is a flowchart of Embodiment 2 of a fault location method according to the present invention. As shown in FIG. 4, on the basis of the foregoing embodiment, the fault location method may further include the following steps:
  • Step 401 Monitor a hardware abnormal trigger condition when the power-on boot device executes the BIOS program.
  • This step refers to step 301 of the embodiment shown in FIG. 3, and details are not described herein again.
  • Step 402 Collect abnormal information when the abnormal trigger condition is detected, where the abnormal information includes at least fault information of the CPU.
  • This step refers to step 302 of the embodiment shown in FIG. 3, and details are not described herein again.
  • Step 403 Encapsulate the abnormality information into capsules.
  • FIG. 5 is a schematic diagram of a capsule encapsulation format in Embodiment 2 of the fault location method of the present invention.
  • the capsule may include header information, hardware error information, program run stack information, a program counter, program status register information, and trailer information.
  • the header information and the tail information are indispensable, and the middle portion of the capsule may include any combination of hardware error information, program run stack information, program counter and program status register information.
  • the hardware includes a CPU, a memory, a north bridge, and a south bridge;
  • the program running stack information includes a stack information of the current execution function and a stack information of an internal calling function of the current function, wherein the number of internal calling functions is not limited,
  • the fault location analysis should be performed by checking the running thread of the program; the values of the registers and counters involved in the running of the program, for example, the values of the program counter and the program status register, are used to analyze whether the environment parameters of the currently running program are abnormal, for example, whether the value of the program pointer register is illegal and whether the stack has overflowed.
  • Step 404 Report the abnormality information to the monitoring server through the network.
  • the abnormal information is encapsulated into a capsule or another information format and then reported to the monitoring server through IPMI, standard Ethernet, or another communication method; the monitoring server parses the received abnormal information according to the corresponding encapsulation format and obtains the hardware, stack, program counter, and other information.
  • more detailed fault location information is provided by collecting fault information, a software call stack relationship when the fault occurs, and values of the current program counter and the program status register, thereby further ensuring reliability of fault location.
  • the technical solution of the present invention can be used in the product development stage, and accurate fault information can accelerate the positioning of faults in the development of the computer system/product, reduce the research and development cost, and ensure the product quality; the technical solution of the present invention can also be used in the product operation and maintenance stage. Accurate fault information reduces the difficulty of operation and maintenance.
  • FIG. 6 is a schematic structural diagram of Embodiment 1 of the fault locating device of the present invention.
  • the device of the present embodiment includes: a monitoring driving module 61, an abnormality information collecting module 62, and an information reporting module 63.
  • the monitoring driving module 61 is configured to monitor a hardware abnormal triggering condition when the power-on starting device executes the basic input/output system (BIOS) program; the abnormality information collecting module 62 is configured to collect abnormality information when the monitoring driving module detects an abnormal triggering condition, where the abnormality information includes at least the fault information of the CPU; and the information reporting module 63 is configured to report the abnormality information to the monitoring server through the network.
  • the fault locating device of this embodiment can be used to implement the technical solution of the method embodiment shown in FIG. 3; the implementation principle and technical effects are similar, and details are not described herein again.
  • the abnormality information collecting module 62 may be specifically configured to: when an abnormal trigger condition is detected, trigger an entry function corresponding to each abnormal information trigger source; and use the entry function to collect abnormal information.
  • the abnormality information collecting module 62 can also be configured to: if a CPU abnormal trigger condition is detected, trigger the entry function corresponding to the CPU; if a memory abnormal trigger condition is detected, trigger the entry function corresponding to the memory; if a north bridge abnormal trigger condition is detected, trigger the entry function corresponding to the north bridge; if a south bridge abnormal trigger condition is detected, trigger the entry function corresponding to the south bridge.
  • the monitoring driving module 61 may be specifically configured to monitor whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated when the power-on booting device executes the BIOS program, and if so, determine that an abnormal trigger condition is detected, wherein the abnormal event or abnormal message is an event or message that is generated while the BIOS program is executed and that triggers an exception.
  • the abnormality information collecting module 62 is specifically configured to collect fault information when an abnormal trigger condition is detected, and to collect the software call stack relationship and/or the values of the program counter and the program status register at the time the fault occurs, to indicate the location of the fault information.
  • the information reporting module 63 can be specifically used to report the abnormal information to the monitoring server by using the intelligent platform management interface IPMI or the standard Ethernet mode.
  • the information reporting module 63 can also be configured to: encapsulate the abnormal information into capsules, the capsule includes header information, hardware error information, program running stack information, a program counter, program status register information, and tail information.
  • the fault locating device of this embodiment can be used to execute the technical method of any of the foregoing method embodiments.
  • the implementation principle and technical effect are similar, and will not be described here.
  • FIG. 7 is a schematic structural diagram of Embodiment 1 of the fault locating system of the present invention.
  • the system of the present embodiment includes: a main board 100, a fault locating device 110, and a monitoring server 200.
  • the main board 100 can adopt the logic block diagram of the typical computer or server shown in FIG. 1.
  • the fault locating device 110 can be integrated into the BIOS 60 in the main board 100 and can adopt the structure of the apparatus embodiment shown in FIG. 6; correspondingly, it can execute the technical solution of the foregoing method embodiments. The implementation principle and technical effects are similar and are not described here again.
  • the monitoring server 200 integrates the abnormal information parsing module 210, which is configured to parse the abnormality information reported by the information reporting module 113 in the fault locating device 110; the dotted line between the main board 100 and the monitoring server 200 indicates a wireless connection, and the two communicate through Ethernet communication or LPC.
  • the aforementioned program can be stored in a computer readable storage medium.
  • the program when executed, performs the steps including the foregoing method embodiments; and the foregoing storage medium includes various media that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Abstract

Provided are a fault locating method and device. The fault locating method of the present invention comprises: monitoring a hardware exception trigger condition when a powered-on start device executes a basic input and output system (BIOS) program; collecting exception information when the exception trigger condition is monitored, the exception information at least comprising fault information about a central processing unit (CPU); and uploading the exception information to a monitoring server through a network. In the embodiments of the present invention, the viewing and accurate locating of fault information is realized in a remote manner.

Description

Fault location method and device
The present application claims priority to Chinese Patent Application No. 201310425373.9, entitled "Fault location method and device", filed with the Chinese Patent Office on September 17, 2013, the entire contents of which are incorporated herein by reference.
Technical field
The embodiments of the present invention relate to computer technologies, and in particular, to a fault location method and apparatus.
Background
The reliability of computer systems, especially server products, has always been a hot topic. If a server fails, the fault needs to be detected, located, and eliminated in time. This requires that fault information can be conveniently presented to the user even when the server has no standard cathode ray tube (CRT) or liquid crystal display (LCD) monitor. In this case, collecting and locating fault information in the power-on self-test (POST) phase is very important.
A logical block diagram of a typical computer or server is shown in FIG. 1. The main board 100 includes: a memory 10, a CPU 20, a north bridge 30, a south bridge 40, a peripheral 50, and a basic input and output system (BIOS) 60. In addition, the main board 100 is also connected to a peripheral expansion device (hard disk, graphics card, etc.) 70.
In the prior art, an apparatus 80 is added on the basis of FIG. 1, wherein the apparatus 80 includes a memory 81, a controller 82, and a display module 83, as shown in FIG. 2. The structure shown in FIG. 2 is used to perform the following steps to display the self-test information of the POST phase:
1) Power on the computer device and execute the BIOS 60 program;
2) The POST phase self-test information is sent to the controller 82 according to a certain data structure;
3) The controller 82 sends the decoded data to the display module 83, which displays the POST phase self-test information in real time.
This prior art requires an additional controller 82 and display module 83 to display the POST phase self-test information, which includes fault information; in addition, the user must visit the equipment site in person to obtain the self-test information.
Summary of the invention
The embodiments of the present invention provide a fault location method and device for viewing and locating fault information remotely.
In a first aspect, an embodiment of the present invention provides a fault location method, including:
monitoring a hardware abnormal trigger condition when the power-on boot device executes the basic input/output system (BIOS) program;
collecting abnormal information when the abnormal trigger condition is detected, where the abnormal information includes at least fault information of the central processing unit (CPU); and
reporting the abnormal information to the monitoring server through the network.
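As an illustration only, the three steps of the first aspect can be modelled as a single monitor, collect, report pass. The sketch below is written in Python for readability and is not BIOS firmware code. The function `locate_fault`, its parameters `probe`, `collect`, and `report`, and the field name `cpu_fault` are hypothetical names that do not appear in the application.

```python
from typing import Callable, Dict, Optional


def locate_fault(
    probe: Callable[[], Optional[str]],
    collect: Callable[[str], Dict[str, object]],
    report: Callable[[Dict[str, object]], None],
) -> bool:
    """Run one monitor -> collect -> report pass.

    probe   returns the name of a detected abnormal trigger condition, or None.
    collect gathers abnormal information for that trigger condition.
    report  sends the abnormal information to the monitoring server.
    All three callables are hypothetical stand-ins for platform code.
    """
    trigger = probe()            # step 1: monitor for an abnormal trigger condition
    if trigger is None:
        return False             # nothing abnormal was detected in this pass
    info = collect(trigger)      # step 2: collect the abnormal information
    if "cpu_fault" not in info:
        # The method requires at least the CPU fault information.
        raise ValueError("abnormal information must include CPU fault information")
    report(info)                 # step 3: report over the network
    return True
```

Passing the three stages in as callables keeps the sketch independent of how a real platform detects faults or reaches the monitoring server.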
In a first possible implementation manner of the first aspect, collecting abnormal information when the abnormal trigger condition is detected includes:
during the execution of the BIOS program, if an abnormal trigger condition is detected, triggering the entry function corresponding to each abnormal information trigger source; and
collecting the abnormal information using the entry function.
According to the first possible implementation manner of the first aspect, in a second possible implementation manner, triggering the entry function corresponding to each abnormal information trigger source if an abnormal trigger condition is detected during the execution of the BIOS program includes:
if a CPU abnormal trigger condition is detected, triggering the entry function corresponding to the CPU;
if a memory abnormal trigger condition is detected, triggering the entry function corresponding to the memory;
if a north bridge abnormal trigger condition is detected, triggering the entry function corresponding to the north bridge;
if a south bridge abnormal trigger condition is detected, triggering the entry function corresponding to the south bridge.
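One way to picture the mapping from trigger source to entry function is a lookup table. The following Python sketch is an assumption-laden model, not the claimed implementation: `make_entry_function`, `ENTRY_FUNCTIONS`, `trigger_entry_function`, and the source keys are hypothetical, and the placeholder entry functions only echo their input, whereas real ones would read the error state of the component.

```python
from typing import Callable, Dict

EntryFunction = Callable[[str], Dict[str, str]]


def make_entry_function(source: str) -> EntryFunction:
    """Build a placeholder entry function for one abnormal information trigger source."""
    def entry(condition: str) -> Dict[str, str]:
        # A real entry function would read the error state of this component.
        return {"source": source, "condition": condition}
    return entry


# One entry function per trigger source named in the text (keys are hypothetical).
ENTRY_FUNCTIONS: Dict[str, EntryFunction] = {
    source: make_entry_function(source)
    for source in ("cpu", "memory", "north_bridge", "south_bridge")
}


def trigger_entry_function(source: str, condition: str) -> Dict[str, str]:
    """Trigger the entry function of the source, passing the trigger condition as input."""
    entry = ENTRY_FUNCTIONS.get(source)
    if entry is None:
        raise ValueError(f"no entry function for trigger source {source!r}")
    return entry(condition)
```

A table keeps the four cases symmetrical, and supporting a further component would only require one more entry.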
According to the first aspect or either of the first and second possible implementation manners of the first aspect, in a third possible implementation manner, monitoring the hardware abnormal trigger condition when the power-on boot device executes the basic input/output system (BIOS) program includes:
when the power-on boot device executes the BIOS program, monitoring whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determining that an abnormal trigger condition is detected, wherein the abnormal event or abnormal message is an event or message that is generated while the BIOS program is executed and that triggers an exception.
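The detection rule can be read as a simple classification of the signals seen during BIOS execution. The Python sketch below models it under stated assumptions: the `Signal` enumeration, the benign `PROGRESS` member, and the function `first_abnormal_trigger` are hypothetical and only illustrate the decision, not how firmware actually observes an SMI, event, or message.

```python
from enum import Enum
from typing import Iterable, Optional


class Signal(Enum):
    SMI = "smi"            # system management interrupt
    EVENT = "event"        # abnormal event
    MESSAGE = "message"    # abnormal message
    PROGRESS = "progress"  # hypothetical benign signal, e.g. a POST progress code


# Any of these three counts as an abnormal trigger condition.
ABNORMAL_SIGNALS = {Signal.SMI, Signal.EVENT, Signal.MESSAGE}


def first_abnormal_trigger(signals: Iterable[Signal]) -> Optional[Signal]:
    """Return the first signal that is an abnormal trigger condition, or None."""
    for signal in signals:
        if signal in ABNORMAL_SIGNALS:
            return signal
    return None
```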
In a fourth possible implementation manner of the first aspect, collecting abnormal information when the abnormal trigger condition is detected includes:
when the abnormal trigger condition is detected, collecting the fault information, and collecting the software call stack relationship and/or the values of the program counter and the program status register at the time the fault occurs, to indicate the location of the fault information.
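To make the idea of attaching execution context to a fault concrete, the sketch below captures a call stack alongside the fault information. It is only an analogy in Python: `traceback.extract_stack` stands in for walking the firmware call stack, and because Python cannot read a hardware program counter or program status register, the `pc` and `psr` values are supplied by the caller. The names `snapshot_fault_context`, `init_cpu`, and `run_post`, and the register values, are hypothetical.

```python
import traceback
from typing import Dict, List


def snapshot_fault_context(fault: str, pc: int, psr: int) -> Dict[str, object]:
    """Bundle fault information with the context needed to locate it.

    pc and psr stand for the program counter and program status register
    values; Python cannot read them, so the caller supplies them here.
    """
    # extract_stack() lists frames from outermost to innermost; drop the last
    # entry, which is this helper itself.
    frames = traceback.extract_stack()[:-1]
    call_stack: List[str] = [frame.name for frame in frames]
    return {"fault": fault, "call_stack": call_stack, "pc": pc, "psr": psr}


def init_cpu() -> Dict[str, object]:
    # Hypothetical initialization routine that hits a fault.
    return snapshot_fault_context("cpu_init_failed", pc=0xFFFF0000, psr=0x2)


def run_post() -> Dict[str, object]:
    # Hypothetical POST driver that calls the failing routine.
    return init_cpu()
```

Calling `run_post()` returns a record whose `call_stack` ends with `run_post` followed by `init_cpu`, which is the calling relationship a maintainer would use to locate the fault.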
在第一方面的第五种可能的实现方式中,所述将所述异常信息通过网络上报给监控服务器包括:In a fifth possible implementation manner of the foregoing aspect, the reporting the abnormal information to the monitoring server by using the network includes:
采用智能平台管理接口IPMI或标准以太网方式上报所述异常信息给监控服务器。The abnormal information is reported to the monitoring server by using the intelligent platform management interface IPMI or standard Ethernet.
根据第一方面、第一方面的第一种至第五种可能的实现方式的任意一种,在第六种可能的实现方式中,所述将所述异常信息通过网络上报给监控服务器之前,还包括:According to the first aspect, any one of the first to fifth possible implementation manners of the first aspect, in the sixth possible implementation, before the reporting the abnormality information to the monitoring server by using the network, Also includes:
封装所述异常信息成胶囊,所述胶囊包括头部信息、硬件错误信息、程序运行堆栈信息、程序计数器、程序状态寄存器信息和尾部信息。The exception information is encapsulated into capsules including header information, hardware error information, program run stack information, program counters, program status register information, and trailer information.
In a second aspect, an embodiment of the present invention provides a fault locating apparatus, including:
A monitoring driver module, configured to monitor a hardware abnormal trigger condition when the device is powered on and executes the basic input/output system (BIOS) program;
An abnormal information collection module, configured to collect abnormal information when the monitoring driver module detects an abnormal trigger condition, where the abnormal information includes at least fault information of the central processing unit (CPU);
An information reporting module, configured to report the abnormal information to a monitoring server through a network.
In a first possible implementation manner of the second aspect, the abnormal information collection module is specifically configured to: during execution of the BIOS program, if an abnormal trigger condition is detected, trigger the entry function corresponding to each abnormal information trigger source; and collect the abnormal information by using the entry function.
According to the first possible implementation manner of the second aspect, in a second possible implementation manner, the abnormal information collection module is further configured to:
If a CPU abnormal trigger condition is detected, trigger the entry function corresponding to the CPU;
If a memory abnormal trigger condition is detected, trigger the entry function corresponding to the memory;
If a north bridge abnormal trigger condition is detected, trigger the entry function corresponding to the north bridge;
If a south bridge abnormal trigger condition is detected, trigger the entry function corresponding to the south bridge.
According to the second aspect or either of the first and second possible implementation manners of the second aspect, in a third possible implementation manner, the monitoring driver module is specifically configured to: when the device is powered on and executes the BIOS program, monitor whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determine that an abnormal trigger condition has been detected, where the abnormal event or abnormal message is an event or message that is generated during execution of the BIOS program and that triggers an exception.
In a fourth possible implementation manner of the second aspect, the abnormal information collection module is specifically configured to: when an abnormal trigger condition is detected, collect fault information, and collect the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate the location of the fault information.
The fault locating method and apparatus of the embodiments of the present invention report abnormal information to a monitoring server so that an operator can view the abnormal information remotely. That is, the operator can locate and troubleshoot faults on the monitoring server side based on the reported abnormal information, which reduces operation and maintenance costs.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a logical block diagram of a typical computer or server;
FIG. 2 is a schematic structural diagram of the prior-art display of self-test information in the POST phase;
FIG. 3 is a flowchart of Embodiment 1 of the fault locating method of the present invention;
FIG. 4 is a flowchart of Embodiment 2 of the fault locating method of the present invention;
FIG. 5 is an example diagram of the capsule encapsulation format in Embodiment 2 of the fault locating method of the present invention;
FIG. 6 is a schematic structural diagram of Embodiment 1 of the fault locating apparatus of the present invention;
FIG. 7 is a schematic structural diagram of Embodiment 1 of the fault locating system of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are some but not all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As servers become widely used, a data center or equipment room hosts a large number of servers, and their running state is usually monitored remotely from outside the equipment room. Fault information in the POST phase therefore needs to be collected and monitored remotely.
FIG. 3 is a flowchart of Embodiment 1 of the fault locating method of the present invention. This embodiment provides a fault locating method that may be performed by a fault locating apparatus. The apparatus may be integrated in a computer or server and implemented in software and/or hardware. As shown in FIG. 3, the method of this embodiment includes:
Step 301: When the device is powered on and executes the BIOS program, monitor a hardware abnormal trigger condition.
Normally, in the POST phase, the BIOS program is executed to initialize the hardware components of the device, for example the central processing unit (CPU), memory, north bridge, and south bridge in the block diagram shown in FIG. 1, and to query the working state of these hardware components. When a fault occurs in the POST phase, the first step is to check whether the hardware environment is normal.
In this embodiment, the BIOS program is executed in the POST phase, and a callback function is registered in the BIOS program. If the hardware environment of the device becomes abnormal, for example if CPU initialization fails, an abnormal trigger condition has been generated, and the callback function is invoked to trigger collection of abnormal information. That is, step 302 is performed to collect the abnormal information that results from the abnormal hardware environment of the device and that is generated by the sub-functions triggered by the abnormal trigger condition. The abnormal trigger condition serves as the input parameter of the callback function.
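The callback mechanism of step 301 can be illustrated with a minimal Python sketch. It is an illustration only and not firmware code from the embodiments: the names `PostMonitor`, `register_callback`, `report_condition`, and the condition string `CPU_INIT_ERROR` are all hypothetical, and the sketch only shows the shape of "register one callback, then invoke it with the abnormal trigger condition as its input parameter".

```python
from typing import Callable, List, Optional

# Hypothetical type: the callback receives the abnormal trigger condition
# as its input parameter, as the text describes.
Callback = Callable[[str], None]


class PostMonitor:
    """Toy stand-in for the monitoring logic that runs during POST."""

    def __init__(self) -> None:
        self._callback: Optional[Callback] = None

    def register_callback(self, callback: Callback) -> None:
        # Corresponds to registering one callback in the BIOS program.
        self._callback = callback

    def report_condition(self, condition: str) -> bool:
        # Called when the hardware environment is found to be abnormal.
        # Returns True if a registered callback was invoked.
        if self._callback is None:
            return False
        self._callback(condition)
        return True


collected: List[str] = []


def collect_abnormal_info(condition: str) -> None:
    # Stand-in for step 302: real sub-functions would read hardware
    # status here; this sketch only records the condition.
    collected.append(f"abnormal info for: {condition}")


monitor = PostMonitor()
monitor.register_callback(collect_abnormal_info)
monitor.report_condition("CPU_INIT_ERROR")  # hypothetical condition name
print(collected)
```

Passing the condition into the callback keeps the monitoring code independent of how the abnormal information is actually collected.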
Step 302: When an abnormal trigger condition is detected, collect abnormal information, where the abnormal information includes at least fault information of the CPU.
Specifically, the BIOS program includes at least one sub-function for collecting abnormal information, and these sub-functions are called through the callback function described above. The fault information includes fault information of hardware components such as the CPU, memory, north bridge, and south bridge. For example: CPU fault information (Core or Uncore) is used to determine whether the current fault has caused a CPU failure and to locate the cause from the fault symptoms. North bridge fault information includes fault information of the root port (Root Port) and of Peripheral Component Interconnect Express (PCIe) devices, and is used to check whether a north bridge or PCIe device has failed; checking the relevant PCIe registers is especially important when an input/output (I/O) error occurs. South bridge fault information is used to check whether a device attached to the south bridge is abnormal. Memory fault information includes system management interrupt (SMI) and double data rate (DDR) synchronous dynamic random access memory fault information, for example SMI channel bit errors, DIMM ECC errors, or DIMM detection failures.
Step 303: Report the abnormal information to the monitoring server through the network.
The monitoring server receives the abnormal information through Ethernet communication or LPC communication, parses and records it, and saves it to a local storage medium, which includes but is not limited to a hard disk and non-volatile random access memory (NVRAM). The saved information supports fault management and long-term maintenance and serves as important data for subsequent fault locating. At the same time, the monitoring server informs the operator in a readable, visual form that a fault has occurred, and maintenance personnel can also query a fault library based on the abnormal information to obtain more detailed fault locating information.
It should also be noted that the operator may learn of a fault in any manner: for example, the monitoring server may raise an alarm, or the operator may watch the server in real time. No limitation is imposed here.
In the prior art, the self-test information of a computer or server in the POST phase is shown on a display module, for example a display device such as an LCD or VFT. These display modules are mounted on the front panel of the chassis, so the operator must be physically present at the device. In this embodiment, by contrast, the abnormal information is reported to the monitoring server, so the operator can view it remotely through the monitoring server.
By reporting the abnormal information to the monitoring server, this embodiment of the present invention lets the operator view the abnormal information remotely. That is, the operator can locate and troubleshoot faults on the monitoring server side based on the reported abnormal information, which reduces operation and maintenance costs.
On the basis of the foregoing embodiment, the collecting of abnormal information when an abnormal trigger condition is detected may be further refined as follows:
1. During execution of the BIOS program, if an abnormal trigger condition is detected, trigger the entry function corresponding to each abnormal information trigger source;
2. Collect the abnormal information by using the entry function.
Specifically, a person skilled in the art may understand an abnormal information trigger source as a hardware component whose state is abnormal, that is, the hardware component mentioned above for which an abnormal trigger condition has been detected. The abnormal trigger condition is used as the input parameter of the entry function of the corresponding abnormal information trigger source.
Specifically, during startup of the BIOS chip, triggering the entry function corresponding to each abnormal information trigger source if an abnormal trigger condition is detected may include:
If a CPU abnormal trigger condition is detected, triggering the entry function corresponding to the CPU; if a memory abnormal trigger condition is detected, triggering the entry function corresponding to the memory; if a north bridge abnormal trigger condition is detected, triggering the entry function corresponding to the north bridge; if a south bridge abnormal trigger condition is detected, triggering the entry function corresponding to the south bridge; and so on. If an abnormal trigger condition of any other hardware component is detected, the entry function corresponding to that hardware component, that is, to that abnormal information trigger source, is triggered. The cases are not enumerated one by one here.
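The per-component dispatch described above amounts to a lookup from trigger source to entry function. The following Python sketch shows one way to express it, under stated assumptions: the source keys (`cpu`, `memory`, `north_bridge`, `south_bridge`), the function names, and the returned strings are hypothetical placeholders, and real entry functions would read hardware registers rather than format text.

```python
from typing import Callable, Dict


def cpu_entry(condition: str) -> str:
    return f"CPU fault info ({condition})"


def memory_entry(condition: str) -> str:
    return f"memory fault info ({condition})"


def north_bridge_entry(condition: str) -> str:
    return f"north bridge fault info ({condition})"


def south_bridge_entry(condition: str) -> str:
    return f"south bridge fault info ({condition})"


# One entry function per abnormal information trigger source.
# Supporting another hardware component means adding one more entry.
ENTRY_FUNCTIONS: Dict[str, Callable[[str], str]] = {
    "cpu": cpu_entry,
    "memory": memory_entry,
    "north_bridge": north_bridge_entry,
    "south_bridge": south_bridge_entry,
}


def trigger_entry_function(source: str, condition: str) -> str:
    # The abnormal trigger condition is passed through as the input
    # parameter of the entry function, as the text describes.
    try:
        entry = ENTRY_FUNCTIONS[source]
    except KeyError:
        raise ValueError(f"no entry function registered for {source!r}") from None
    return entry(condition)


print(trigger_entry_function("memory", "ECC"))
```

A table keeps the "and so on" case from the text cheap: other hardware components are handled by registration rather than by another branch.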
On this basis, monitoring a hardware abnormal trigger condition when the device is powered on and executes the BIOS program may include: when the device is powered on and executes the BIOS program, monitoring whether an SMI, an abnormal event (Event), or an abnormal message (Message) is generated, and if so, determining that an abnormal trigger condition has been detected, where the abnormal event or abnormal message is an event or message that is generated during execution of the BIOS program and that triggers an exception. In this case, the detected abnormal trigger condition can trigger collection of abnormal information in one of three trigger modes: SMI mode, abnormal Event mode, or abnormal message mode. The following describes how the entry function corresponding to each abnormal information trigger source is called in each trigger mode:
If the detected abnormal trigger condition is of the SMI mode, an SMI is triggered, and the entry function corresponding to each abnormal information trigger source is called in the SMI handler;
If the detected abnormal trigger condition is of the abnormal Event mode, an Event is sent, and the entry function corresponding to each abnormal information trigger source is called in the callback function of the Event;
If the detected abnormal trigger condition is of the abnormal message mode, a message is sent, and the entry function corresponding to each abnormal information trigger source is called in the callback function of the message.
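The three trigger modes differ only in which handler runs; each handler then calls the same entry functions. A rough Python sketch of that structure follows. It is illustrative only: `TriggerMode`, the handler names, and the `log` list are hypothetical, and a real SMI handler runs in system management mode rather than as an ordinary function call.

```python
from enum import Enum
from typing import Callable, Dict, List


class TriggerMode(Enum):
    SMI = "smi"
    EVENT = "event"
    MESSAGE = "message"


log: List[str] = []


def call_entry_functions(sources: List[str]) -> None:
    # Stand-in for calling the entry function of each trigger source.
    for source in sources:
        log.append(f"entry function for {source}")


def smi_handler(sources: List[str]) -> None:
    log.append("SMI handler")
    call_entry_functions(sources)


def event_callback(sources: List[str]) -> None:
    log.append("Event callback")
    call_entry_functions(sources)


def message_callback(sources: List[str]) -> None:
    log.append("message callback")
    call_entry_functions(sources)


HANDLERS: Dict[TriggerMode, Callable[[List[str]], None]] = {
    TriggerMode.SMI: smi_handler,
    TriggerMode.EVENT: event_callback,
    TriggerMode.MESSAGE: message_callback,
}


def on_abnormal_trigger(mode: TriggerMode, sources: List[str]) -> None:
    # Route the detected condition to the handler for its trigger mode.
    HANDLERS[mode](sources)


on_abnormal_trigger(TriggerMode.SMI, ["cpu", "memory"])
print(log)
```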
The collecting of abnormal information when an abnormal trigger condition is detected may include: when an abnormal trigger condition is detected, collecting fault information, and collecting the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate the location of the fault information. Normally, reporting the fault information alone is not enough to locate and troubleshoot a fault; more precise information is needed to assist. The present invention therefore borrows the kernel dump concept from the Linux OS: while collecting the fault information, it also collects and saves the software call stack relationship at the time of the fault as the data basis for precise locating. In addition, it collects and saves the current values of the program counter and the program status register, which helps analyze the program's execution trail at the time of the exception and whether the processor's register state is correct, and preserves the CPU running state at the time of the exception.
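As a loose analogy for capturing the call stack together with the fault, the Python sketch below records the chain of calling functions at the moment a fault is noted. It is only an analogy under stated assumptions: Python cannot read a hardware program counter or program status register, so the source line number stands in for the location, and the names `collect_fault_context`, `init_memory`, and `run_post` are hypothetical.

```python
import traceback
from typing import Dict


def collect_fault_context(fault: str) -> Dict[str, object]:
    # Capture the current call stack and drop this helper's own frame,
    # so the last entry is the function that noticed the fault.
    frames = traceback.extract_stack()[:-1]
    return {
        "fault": fault,
        # Outermost caller first, innermost last.
        "call_stack": [frame.name for frame in frames],
        # Line number used as a stand-in for a program counter value.
        "location": f"{frames[-1].name}:{frames[-1].lineno}",
    }


def init_memory() -> Dict[str, object]:
    # Hypothetical initialization step that detects a fault.
    return collect_fault_context("DIMM detection failed")


def run_post() -> Dict[str, object]:
    return init_memory()


context = run_post()
print(context["call_stack"][-2:], context["location"])
```

Saving the stack alongside the fault shows which initialization path led to the fault, which the fault code alone does not.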
Further, reporting the abnormal information to the monitoring server through the network may include: reporting the abnormal information to the monitoring server by using the Intelligent Platform Management Interface (IPMI) or standard Ethernet. The abnormal information may also be reported through other communication methods.
FIG. 4 is a flowchart of Embodiment 2 of the fault locating method of the present invention. As shown in FIG. 4, on the basis of the foregoing embodiment, the fault locating method of this embodiment may include the following steps:
Step 401: When the device is powered on and executes the BIOS program, monitor a hardware abnormal trigger condition.
For this step, refer to step 301 of the embodiment shown in FIG. 3. Details are not described again here.
Step 402: When an abnormal trigger condition is detected, collect abnormal information, where the abnormal information includes at least fault information of the CPU.
For this step, refer to step 302 of the embodiment shown in FIG. 3. Details are not described again here.
Step 403: Encapsulate the abnormal information into a capsule.
FIG. 5 is an example diagram of the capsule encapsulation format in Embodiment 2 of the fault locating method of the present invention. As shown in FIG. 5, the capsule may include header information, hardware error information, program run stack information, a program counter, program status register information, and trailer information. The header information and the trailer information are mandatory, and the middle part of the capsule may include any combination of hardware error information, program run stack information, the program counter, and program status register information. Here, the hardware includes the CPU, memory, north bridge, south bridge, and so on. The program run stack information includes the stack information of the currently executing function and the stack information of the functions it calls internally, where the number of internally called functions is not limited; when no hardware fault has occurred, the fault must be located by examining the program's execution trail. The values of the registers and counters involved while the program runs, for example the values of the program counter and the program status register, are collected to analyze whether the environment parameters of the running program are abnormal, for example whether the value of the program pointer register is illegal or whether the stack has overflowed.
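The capsule can be pictured as a mandatory header and trailer around any combination of optional sections. The Python sketch below shows one possible byte layout and is an assumption throughout: the actual format of FIG. 5 is not reproduced here, so the magic values `CAPS` and `SPAC`, the section type codes, the little-endian field widths, and the CRC-32 checksum in the trailer are all hypothetical choices made for illustration.

```python
import struct
import zlib
from typing import Dict

# Hypothetical markers and section type codes (not taken from FIG. 5).
HEADER_MAGIC = b"CAPS"
TRAILER_MAGIC = b"SPAC"
SECTION_HW_ERROR = 1
SECTION_RUN_STACK = 2
SECTION_PROGRAM_COUNTER = 3
SECTION_STATUS_REGISTER = 4


def build_capsule(sections: Dict[int, bytes]) -> bytes:
    # Header: 4-byte magic + 1-byte section count (always present).
    header = struct.pack("<4sB", HEADER_MAGIC, len(sections))
    # Body: each section is 1-byte type + 2-byte length + payload.
    # Any combination of sections, including none, is allowed.
    body = b"".join(
        struct.pack("<BH", section_type, len(payload)) + payload
        for section_type, payload in sorted(sections.items())
    )
    # Trailer: CRC-32 over header and body + 4-byte end marker.
    trailer = struct.pack("<I4s", zlib.crc32(header + body), TRAILER_MAGIC)
    return header + body + trailer


capsule = build_capsule({
    SECTION_HW_ERROR: b"CPU core error",
    SECTION_PROGRAM_COUNTER: struct.pack("<Q", 0xFFFF0000),
})
print(len(capsule), capsule[:4], capsule[-4:])
```

Tagging each section with a type and length lets a receiver skip sections it does not recognize, which fits the "any combination" wording above.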
Step 404: Report the abnormal information to the monitoring server through the network.
Specifically, after the abnormal information is encapsulated into a capsule or another information format, it is reported to the monitoring server through IPMI, standard Ethernet, or another communication method. The monitoring server parses the received abnormal information according to the corresponding encapsulation format and obtains the hardware, stack, program counter, and other information separately.
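On the monitoring server side, parsing reverses the encapsulation. The Python sketch below assumes a hypothetical layout chosen only for illustration: a 4-byte magic `CAPS` plus a 1-byte section count as header, sections of 1-byte type, 2-byte length, and payload, and a trailer of CRC-32 plus the marker `SPAC`. None of these values come from the source, and the section names are placeholders.

```python
import struct
import zlib
from typing import Dict

HEADER_MAGIC = b"CAPS"   # hypothetical header marker
TRAILER_MAGIC = b"SPAC"  # hypothetical trailer marker
SECTION_NAMES = {
    1: "hardware_error",
    2: "run_stack",
    3: "program_counter",
    4: "program_status_register",
}


def parse_capsule(data: bytes) -> Dict[str, bytes]:
    if len(data) < 13:  # 5-byte header + 8-byte trailer
        raise ValueError("capsule too short")
    body_end = len(data) - 8
    magic, count = struct.unpack_from("<4sB", data, 0)
    crc, end_magic = struct.unpack_from("<I4s", data, body_end)
    if magic != HEADER_MAGIC or end_magic != TRAILER_MAGIC:
        raise ValueError("missing header or trailer")
    if zlib.crc32(data[:body_end]) != crc:
        raise ValueError("checksum mismatch")

    sections: Dict[str, bytes] = {}
    offset = 5
    for _ in range(count):
        if offset + 3 > body_end:
            raise ValueError("truncated section header")
        section_type, length = struct.unpack_from("<BH", data, offset)
        offset += 3
        if offset + length > body_end:
            raise ValueError("truncated section payload")
        name = SECTION_NAMES.get(section_type, f"unknown_{section_type}")
        sections[name] = data[offset:offset + length]
        offset += length
    if offset != body_end:
        raise ValueError("unexpected bytes after last section")
    return sections


# Build a one-section sample by hand to exercise the parser.
head = struct.pack("<4sB", HEADER_MAGIC, 1)
body = struct.pack("<BH", 1, 3) + b"ECC"
sample = head + body + struct.pack("<I4s", zlib.crc32(head + body), TRAILER_MAGIC)
parsed = parse_capsule(sample)
print(parsed)
```

Checking the header, trailer, and checksum before reading any section lets the server reject a damaged report instead of recording misleading fault data.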
In this embodiment, collecting the fault information, the software call stack relationship at the time of the fault, and the current values of the program counter and the program status register provides more detailed fault locating information and further ensures the reliability of fault locating.
The technical solution of the present invention can be used in the product development phase, where precise fault information speeds up the locating of faults during development of a computer system or product, reduces development costs, and helps ensure product quality. It can also be used in the product operation and maintenance phase, where precise fault information reduces the difficulty of operation and maintenance.
FIG. 6 is a schematic structural diagram of Embodiment 1 of the fault locating apparatus of the present invention. As shown in FIG. 6, the apparatus of this embodiment includes a monitoring driver module 61, an abnormal information collection module 62, and an information reporting module 63.
The monitoring driver module 61 is configured to monitor a hardware abnormal trigger condition when the device is powered on and executes the basic input/output system (BIOS) program. The abnormal information collection module 62 is configured to collect abnormal information when the monitoring driver module detects an abnormal trigger condition, where the abnormal information includes at least fault information of the CPU. The information reporting module 63 is configured to report the abnormal information to the monitoring server through the network.
The fault locating apparatus of this embodiment can be used to execute the technical solution of the method embodiment shown in FIG. 3. Its implementation principles and technical effects are similar and are not described again here.
In the foregoing embodiment, the abnormal information collection module 62 may be specifically configured to: during execution of the BIOS program, if an abnormal trigger condition is detected, trigger the entry function corresponding to each abnormal information trigger source; and collect the abnormal information by using the entry function.
On this basis, the abnormal information collection module 62 may be further configured to: if a CPU abnormal trigger condition is detected, trigger the entry function corresponding to the CPU; if a memory abnormal trigger condition is detected, trigger the entry function corresponding to the memory; if a north bridge abnormal trigger condition is detected, trigger the entry function corresponding to the north bridge; and if a south bridge abnormal trigger condition is detected, trigger the entry function corresponding to the south bridge.
Further, the monitoring driver module 61 may be specifically configured to: when the device is powered on and executes the BIOS program, monitor whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determine that an abnormal trigger condition has been detected, where the abnormal event or abnormal message is an event or message that is generated during execution of the BIOS program and that triggers an exception.
Preferably, the abnormal information collection module 62 may be specifically configured to: when an abnormal trigger condition is detected, collect fault information, and collect the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate the location of the fault information.
On this basis, the information reporting module 63 may be specifically configured to report the abnormal information to the monitoring server by using the Intelligent Platform Management Interface (IPMI) or standard Ethernet.
On this basis, the information reporting module 63 may be further configured to encapsulate the abnormal information into a capsule, where the capsule includes header information, hardware error information, program run stack information, a program counter, program status register information, and trailer information.
The fault locating apparatus of this embodiment can be used to execute the technical solution of any of the foregoing method embodiments. Its implementation principles and technical effects are similar and are not described again here.
FIG. 7 is a schematic structural diagram of Embodiment 1 of the fault locating system of the present invention. As shown in FIG. 7, the system of this embodiment includes a main board 100, a fault locating apparatus 110, and a monitoring server 200. The main board 100 may follow the logical block diagram of the typical computer or server shown in FIG. 1. The fault locating apparatus 110 may use the structure of the apparatus embodiment shown in FIG. 6 and is integrated in the BIOS 60 on the main board 100; correspondingly, it can execute the technical solution of any of the foregoing method embodiments, with similar implementation principles and technical effects, which are not described again here. The monitoring server 200 integrates an abnormal information parsing module 210, which is configured to parse the abnormal information reported by the information reporting module 113 of the fault locating apparatus 110. The dashed line between the main board 100 and the monitoring server 200 indicates a wireless connection, and the two communicate through Ethernet communication or LPC communication.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (12)

  1. A fault locating method, comprising:
    monitoring a hardware abnormal trigger condition when a device is powered on and executes a basic input/output system (BIOS) program;
    when an abnormal trigger condition is detected, collecting abnormal information, wherein the abnormal information comprises at least fault information of a central processing unit (CPU); and
    reporting the abnormal information to a monitoring server through a network.
  2. The method according to claim 1, wherein collecting abnormal information when an abnormal trigger condition is detected comprises:
    during execution of the BIOS program, if an abnormal trigger condition is detected, triggering the entry function corresponding to each abnormal information trigger source; and
    collecting the abnormal information by using the entry function.
  3. The method according to claim 2, wherein, during execution of the BIOS program, triggering the entry function corresponding to each abnormal information trigger source if an abnormal trigger condition is detected comprises:
    if a CPU abnormal trigger condition is detected, triggering the entry function corresponding to the CPU;
    if a memory abnormal trigger condition is detected, triggering the entry function corresponding to the memory;
    if a north bridge abnormal trigger condition is detected, triggering the entry function corresponding to the north bridge; and
    if a south bridge abnormal trigger condition is detected, triggering the entry function corresponding to the south bridge.
  4. The method according to claim 1, 2, or 3, wherein monitoring for a hardware abnormal trigger condition when the device is powered on and executes the BIOS program comprises:
    when the device is powered on and executes the BIOS program, monitoring whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determining that an abnormal trigger condition is detected, wherein the abnormal event or abnormal message is an event or message that is generated during execution of the BIOS program and that triggers an exception.
  5. The method according to claim 1, wherein collecting abnormal information when an abnormal trigger condition is detected comprises:
    when an abnormal trigger condition is detected, collecting fault information, and collecting the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate the location of the fault information.
  6. The method according to claim 1, wherein reporting the abnormal information to the monitoring server over the network comprises:
    reporting the abnormal information to the monitoring server by using the Intelligent Platform Management Interface (IPMI) or standard Ethernet.
  7. The method according to any one of claims 1 to 6, wherein, before reporting the abnormal information to the monitoring server over the network, the method further comprises:
    encapsulating the abnormal information into a capsule, wherein the capsule includes header information, hardware error information, program run stack information, program counter information, program status register information, and tail information.
  8. A fault locating device, comprising:
    a monitoring driver module, configured to monitor for a hardware abnormal trigger condition when a device is powered on and executes a basic input/output system (BIOS) program;
    an abnormal information collection module, configured to collect abnormal information when the monitoring driver module detects an abnormal trigger condition, wherein the abnormal information includes at least fault information of a central processing unit (CPU); and
    an information reporting module, configured to report the abnormal information to a monitoring server over a network.
  9. The device according to claim 8, wherein the abnormal information collection module is specifically configured to: during execution of the BIOS program, if an abnormal trigger condition is detected, trigger the entry function corresponding to each abnormal information trigger source; and collect the abnormal information by using the entry function.
  10. The device according to claim 9, wherein the abnormal information collection module is further configured to:
    if a CPU abnormal trigger condition is detected, trigger the entry function corresponding to the CPU;
    if a memory abnormal trigger condition is detected, trigger the entry function corresponding to the memory;
    if a north bridge abnormal trigger condition is detected, trigger the entry function corresponding to the north bridge; and
    if a south bridge abnormal trigger condition is detected, trigger the entry function corresponding to the south bridge.
  11. The device according to claim 8, 9, or 10, wherein the monitoring driver module is specifically configured to: when the device is powered on and executes the BIOS program, monitor whether a system management interrupt (SMI), an abnormal event (Event), or an abnormal message is generated, and if so, determine that an abnormal trigger condition is detected, wherein the abnormal event or abnormal message is an event or message that is generated during execution of the BIOS program and that triggers an exception.
  12. The device according to claim 8, wherein the abnormal information collection module is specifically configured to: when an abnormal trigger condition is detected, collect fault information, and collect the software call stack relationship at the time of the fault and/or the values of the program counter and the program status register, to indicate the location of the fault information.
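
For illustration only, the flow recited in claims 1 to 3 and 7 (detect a trigger, call the entry function for the trigger source, encapsulate the result, report it) might be sketched as follows. The sketch is written in Python rather than firmware code, and every name in it (`TriggerSource`, `ENTRY_FUNCTIONS`, `build_capsule`, `parse_capsule`, `on_trigger`, the marker bytes, and the byte layout) is a hypothetical chosen for this example. The claims name the capsule fields but do not define a layout or an API.

```python
import struct
from enum import IntEnum


class TriggerSource(IntEnum):
    """Hardware components named in the claims as abnormal trigger sources."""
    CPU = 1
    MEMORY = 2
    NORTH_BRIDGE = 3
    SOUTH_BRIDGE = 4


# Hypothetical entry functions, one per trigger source. Real firmware would
# read error registers here; these stubs only return placeholder bytes.
def collect_cpu() -> bytes:
    return b"cpu-fault"


def collect_memory() -> bytes:
    return b"memory-fault"


def collect_north_bridge() -> bytes:
    return b"north-bridge-fault"


def collect_south_bridge() -> bytes:
    return b"south-bridge-fault"


ENTRY_FUNCTIONS = {
    TriggerSource.CPU: collect_cpu,
    TriggerSource.MEMORY: collect_memory,
    TriggerSource.NORTH_BRIDGE: collect_north_bridge,
    TriggerSource.SOUTH_BRIDGE: collect_south_bridge,
}

HEADER_MAGIC = b"CAPS"  # assumed header marker, not taken from the patent
TAIL_MAGIC = b"SPAC"    # assumed tail marker, not taken from the patent
_LENGTHS = struct.Struct("<BHH")   # source id, hw-error length, stack length
_REGISTERS = struct.Struct("<QQ")  # program counter, program status register


def build_capsule(source, hw_error, call_stack, pc, psr) -> bytes:
    """Pack header, hardware error, call stack, PC/PSR values, and tail."""
    stack = "\n".join(call_stack).encode("utf-8")
    header = HEADER_MAGIC + _LENGTHS.pack(source, len(hw_error), len(stack))
    return header + hw_error + stack + _REGISTERS.pack(pc, psr) + TAIL_MAGIC


def parse_capsule(capsule: bytes) -> dict:
    """Inverse of build_capsule, as a monitoring server might apply it."""
    if not (capsule.startswith(HEADER_MAGIC) and capsule.endswith(TAIL_MAGIC)):
        raise ValueError("not a capsule")
    offset = len(HEADER_MAGIC)
    source, hw_len, stack_len = _LENGTHS.unpack_from(capsule, offset)
    offset += _LENGTHS.size
    hw_error = capsule[offset:offset + hw_len]
    offset += hw_len
    stack_text = capsule[offset:offset + stack_len].decode("utf-8")
    offset += stack_len
    pc, psr = _REGISTERS.unpack_from(capsule, offset)
    return {
        "source": TriggerSource(source),
        "hw_error": hw_error,
        "call_stack": stack_text.split("\n") if stack_text else [],
        "pc": pc,
        "psr": psr,
    }


def on_trigger(source, call_stack, pc, psr, report) -> bytes:
    """Run the entry function for `source`, encapsulate, and report."""
    hw_error = ENTRY_FUNCTIONS[source]()
    capsule = build_capsule(source, hw_error, call_stack, pc, psr)
    report(capsule)  # stand-in for the IPMI or Ethernet transport of claim 6
    return capsule
```

Here `report` is any callable that accepts the capsule bytes, so the transport (IPMI or standard Ethernet in claim 6) stays outside the sketch. A dictionary keyed by trigger source is one simple way to map each source to its entry function; a firmware implementation would more likely register handlers in a table of function pointers.
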
PCT/CN2014/086684 2013-09-17 2014-09-17 Fault locating method and device WO2015039598A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310425373.9A CN103500133A (en) 2013-09-17 2013-09-17 Fault locating method and device
CN201310425373.9 2013-09-17

Publications (1)

Publication Number Publication Date
WO2015039598A1 (en) 2015-03-26

Family

ID=49865348

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/086684 WO2015039598A1 (en) 2013-09-17 2014-09-17 Fault locating method and device

Country Status (2)

Country Link
CN (1) CN103500133A (en)
WO (1) WO2015039598A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107018035A (en) * 2016-01-27 2017-08-04 北京新唐思创教育科技有限公司 A kind of teaching equipment monitoring management system and its method
CN110874279A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Fault positioning method, device and system
CN111130919A (en) * 2019-11-13 2020-05-08 贵州医渡云技术有限公司 Interface monitoring method, device and system and storage medium
CN111130934A (en) * 2019-12-20 2020-05-08 国铁吉讯科技有限公司 Monitoring method, device and system of communication system
CN111209164A (en) * 2020-01-03 2020-05-29 杭州迪普科技股份有限公司 Abnormal information storage method and device, electronic equipment and storage medium
CN112015681A (en) * 2020-08-19 2020-12-01 苏州鑫信腾科技有限公司 IO port processing method, device, equipment and medium
CN112214373A (en) * 2020-09-17 2021-01-12 上海金仕达软件科技有限公司 Hardware monitoring method and device and electronic equipment
CN113391611A (en) * 2020-03-12 2021-09-14 中国移动通信集团河北有限公司 Early warning method, device and system for dynamic environment monitoring system
CN114189552A (en) * 2021-10-29 2022-03-15 济南浪潮数据技术有限公司 Data reporting method and system
CN115225470A (en) * 2022-07-28 2022-10-21 天翼云科技有限公司 Business abnormity monitoring method and device, electronic equipment and storage medium
CN115225469A (en) * 2022-07-28 2022-10-21 深圳市基纳控制有限公司 Network monitoring system and method based on network special-shaped interface
CN116133357A (en) * 2023-04-14 2023-05-16 合肥安迅精密技术有限公司 Vacuum pressure monitoring system and method for chip mounter

Families Citing this family (22)

Publication number Priority date Publication date Assignee Title
CN103500133A (en) * 2013-09-17 2014-01-08 华为技术有限公司 Fault locating method and device
CN105224426A (en) * 2014-06-09 2016-01-06 中兴通讯股份有限公司 Physical host fault detection method, device and empty machine management method, system
CN104991801A (en) * 2015-07-06 2015-10-21 青岛海信宽带多媒体技术有限公司 Bootloader debugging information acquisition method, device and system
CN105183575A (en) * 2015-08-24 2015-12-23 浪潮(北京)电子信息产业有限公司 Processor fault diagnosis method, device and system
CN106569904A (en) * 2015-10-09 2017-04-19 中兴通讯股份有限公司 Information storage method and device and server
CN105808398A (en) * 2016-03-08 2016-07-27 浪潮电子信息产业股份有限公司 Method for rapidly analyzing and positioning hardware exceptions
TWI582586B (en) * 2016-06-01 2017-05-11 神雲科技股份有限公司 Method For Outputting Information Related To Machine Check Exception of Computer System
CN106227672B (en) * 2016-08-10 2019-07-09 中车株洲电力机车研究所有限公司 A kind of built-in application program failure captures and processing method
CN106789306B (en) * 2016-12-30 2021-01-26 深圳市风云实业有限公司 Method and system for detecting, collecting and recovering software fault of communication equipment
CN108628694B (en) * 2017-03-20 2023-03-28 腾讯科技(深圳)有限公司 Data processing method and device based on programmable hardware
CN108628726B (en) * 2017-03-22 2021-02-23 比亚迪股份有限公司 CPU state information recording method and device
CN107168815B (en) * 2017-05-19 2020-09-18 苏州浪潮智能科技有限公司 Method for collecting hardware error information
CN108376107A (en) * 2018-03-01 2018-08-07 郑州云海信息技术有限公司 A kind of method, apparatus, equipment and the storage medium of server failure detection
CN108287775A (en) * 2018-03-01 2018-07-17 郑州云海信息技术有限公司 A kind of method, apparatus, equipment and the storage medium of server failure detection
CN110097683A (en) * 2018-07-20 2019-08-06 深圳怡化电脑股份有限公司 A kind of equipment self-inspection method, apparatus, ATM and storage medium
CN109086155A (en) * 2018-07-27 2018-12-25 郑州云海信息技术有限公司 Server failure localization method, device, equipment and computer readable storage medium
CN109522057A (en) * 2018-11-27 2019-03-26 无锡睿勤科技有限公司 A kind of equipment starting method and equipment
CN110008056A (en) * 2019-03-28 2019-07-12 联想(北京)有限公司 EMS memory management process, device, electronic equipment and computer readable storage medium
CN110289981A (en) * 2019-05-14 2019-09-27 中山大学 A kind of high-performance calculation Internet monitoring method and system
CN112506693A (en) * 2020-12-14 2021-03-16 曙光信息产业(北京)有限公司 Method and device for recording abnormal information, storage medium and electronic equipment
CN112860516A (en) * 2021-02-04 2021-05-28 展讯通信(上海)有限公司 Log saving method, communication device, chip and module equipment
CN113568777B (en) * 2021-09-27 2022-04-22 新华三半导体技术有限公司 Fault processing method, device, network chip, equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN1506821A (en) * 2002-12-11 2004-06-23 联想(北京)有限公司 Detection and display method and device for computer self-test information
CN1746859A (en) * 2004-09-09 2006-03-15 英业达股份有限公司 Alarming system and method for intelligent platform event
CN101714111A (en) * 2008-10-03 2010-05-26 富士通株式会社 Computer apparatus and processor diagnostic method
CN103500133A (en) * 2013-09-17 2014-01-08 华为技术有限公司 Fault locating method and device

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US5379342A (en) * 1993-01-07 1995-01-03 International Business Machines Corp. Method and apparatus for providing enhanced data verification in a computer system
CN1324464C (en) * 2004-08-04 2007-07-04 英业达股份有限公司 Method for real-time presentation of solution for error condition of computer device
CN1869947A (en) * 2005-05-24 2006-11-29 乐金电子(昆山)电脑有限公司 Auto-diagnostic system of personal computer
CN100543693C (en) * 2006-11-22 2009-09-23 英业达股份有限公司 Power-on self-detection method
CN102402473A (en) * 2011-10-28 2012-04-04 武汉供电公司变电检修中心 Computer hardware and software fault diagnosis and repair system
CN102609350A (en) * 2012-02-15 2012-07-25 浪潮电子信息产业股份有限公司 Server memory failure alarm method
CN102708015A (en) * 2012-05-15 2012-10-03 江苏中科梦兰电子科技有限公司 Debugging method based on diagnosis of CPU (central processing unit) non-maskable interrupt system problems

Cited By (18)

Publication number Priority date Publication date Assignee Title
CN107018035A (en) * 2016-01-27 2017-08-04 北京新唐思创教育科技有限公司 A kind of teaching equipment monitoring management system and its method
CN110874279A (en) * 2018-08-29 2020-03-10 阿里巴巴集团控股有限公司 Fault positioning method, device and system
CN110874279B (en) * 2018-08-29 2023-05-30 阿里巴巴集团控股有限公司 Fault positioning method, device and system
CN111130919A (en) * 2019-11-13 2020-05-08 贵州医渡云技术有限公司 Interface monitoring method, device and system and storage medium
CN111130934A (en) * 2019-12-20 2020-05-08 国铁吉讯科技有限公司 Monitoring method, device and system of communication system
CN111209164A (en) * 2020-01-03 2020-05-29 杭州迪普科技股份有限公司 Abnormal information storage method and device, electronic equipment and storage medium
CN111209164B (en) * 2020-01-03 2023-09-26 杭州迪普科技股份有限公司 Abnormality information storage method and device, electronic equipment and storage medium
CN113391611B (en) * 2020-03-12 2022-11-29 中国移动通信集团河北有限公司 Early warning method, device and system for power environment monitoring system
CN113391611A (en) * 2020-03-12 2021-09-14 中国移动通信集团河北有限公司 Early warning method, device and system for dynamic environment monitoring system
CN112015681B (en) * 2020-08-19 2022-08-26 苏州鑫信腾科技有限公司 IO port processing method, device, equipment and medium
CN112015681A (en) * 2020-08-19 2020-12-01 苏州鑫信腾科技有限公司 IO port processing method, device, equipment and medium
CN112214373A (en) * 2020-09-17 2021-01-12 上海金仕达软件科技有限公司 Hardware monitoring method and device and electronic equipment
CN114189552A (en) * 2021-10-29 2022-03-15 济南浪潮数据技术有限公司 Data reporting method and system
CN115225470A (en) * 2022-07-28 2022-10-21 天翼云科技有限公司 Business abnormity monitoring method and device, electronic equipment and storage medium
CN115225469A (en) * 2022-07-28 2022-10-21 深圳市基纳控制有限公司 Network monitoring system and method based on network special-shaped interface
CN115225470B (en) * 2022-07-28 2023-10-13 天翼云科技有限公司 Business abnormality monitoring method and device, electronic equipment and storage medium
CN116133357A (en) * 2023-04-14 2023-05-16 合肥安迅精密技术有限公司 Vacuum pressure monitoring system and method for chip mounter
CN116133357B (en) * 2023-04-14 2023-06-09 合肥安迅精密技术有限公司 Vacuum pressure monitoring system and method for chip mounter

Also Published As

Publication number Publication date
CN103500133A (en) 2014-01-08

Similar Documents

Publication Publication Date Title
WO2015039598A1 (en) Fault locating method and device
EP3121726B1 (en) Fault processing method, related device and computer
US9954727B2 (en) Automatic debug information collection
WO2017063505A1 (en) Method for detecting hardware fault of server, apparatus thereof, and server
US9778988B2 (en) Power failure detection system and method
US20150106660A1 (en) Controller access to host memory
TWI632462B (en) Switching device and method for detecting i2c bus
TWI588660B (en) Method of detecting fault on communication bus using baseboard management controller and fault detector for network system
CN104639380A (en) Server monitoring method
WO2012046293A1 (en) Fault monitoring device, fault monitoring method and program
US11853150B2 (en) Method and device for detecting memory downgrade error
US10430267B2 (en) Determine when an error log was created
TW201415213A (en) Self-test system and method thereof
CN104320308A (en) Method and device for detecting anomalies of server
JP2012003651A (en) Virtualized environment motoring device, and monitoring method and program for the same
JP5529686B2 (en) Computer apparatus abnormality inspection method and computer apparatus using the same
JP5689783B2 (en) Computer, computer system, and failure information management method
CN107133130B (en) Computer operation monitoring method and device
JP2012150661A (en) Processor operation inspection system and its inspection method
CN115599617A (en) Bus detection method and device, server and electronic equipment
CN114138600A (en) Storage method, device, equipment and storage medium for firmware key information
CN109062718B (en) Server and data processing method
TW201324115A (en) Computer system and boot managing method of computer system
TW201314576A (en) Method for accessing pre-boot information
CN113867994B (en) Cabinet VPD information processing method and device, storage equipment and readable storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 14846019; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 EP: PCT application non-entry in European phase
    Ref document number: 14846019; Country of ref document: EP; Kind code of ref document: A1

Kind code of ref document: A1