CN117271182A - High-performance computer central test system - Google Patents

Publication number
CN117271182A
Authority
CN
China
Prior art keywords
self
checking
bmc
information
cpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311162777.3A
Other languages
Chinese (zh)
Inventor
张忠军
叶懋刚
肖安泰
车世界
计萨萨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 52 Research Institute
Priority to CN202311162777.3A
Publication of CN117271182A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0769 Readable error formats, e.g. cross-platform generic formats, human understandable formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2236 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2273 Test methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26 Functional testing
    • G06F11/273 Tester hardware, i.e. output processing circuits
    • G06F11/2733 Test interface between tester and unit under test

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a high-performance computer central test system. The computer comprises a main board, an expansion board and a power board, and the main board comprises a CPU together with a BMC, an FPGA and a network card chip, each of which is connected to the CPU. The central test system comprises: an acquisition layer, which acquires the working state information and self-checking information of each functional module of the computer and summarizes them to the BMC; a management layer, which converts the data format of the working state information and self-checking information summarized at the BMC, stores the converted data in the BMC, analyzes the converted information from bottom to top based on a fault diagnosis tree, obtains a fault diagnosis result and raises an alarm; and a display layer, which displays a UI through a display screen or an upper computer connected to the BMC and shows the working state information, self-checking information and fault diagnosis result. The invention collects information comprehensively, diagnoses faults with accurate and rigorous logic, and displays faults intuitively.

Description

High-performance computer central test system
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a high-performance computer central test system.
Background
High-performance computers are an important component of modern weapon equipment systems in the military and national defense fields. As these systems develop, the board-level design of high-performance computers becomes increasingly complex, and a single circuit may carry a large number of electronic components. Once a high-performance computer fails, fault detection and localization require a great deal of manpower and material resources, and if the fault cannot be located in time, the best window for maintenance is often missed, causing huge losses. Testing circuit boards and other electronic components and determining faults efficiently has therefore become an important and very challenging task in the design of domestic high-performance computers.
The following technical schemes are currently used for high-performance computer test systems:
1) An external detection function card is connected to the high-performance computer to read some basic state information. However, the external card is cumbersome to use and cannot detect faults in time, so key information is easily missing when a fault occurs;
2) A network remote power-on self-checking system, based on the BMC chip of the high-performance computer main board, obtains only self-checking information related to device power-on. Current systems of this kind address only the power-on process of the main board, and the information acquired through the BMC chip is limited to self-checking information from that process, so they cannot help the device user quickly identify fault information;
3) Status indicator lamps indicate the working state of each voltage rail of the computer. This approach is too limited: the information obtained is restricted, and the comprehensive detection requirements of a computer system cannot be met.
Disclosure of Invention
In view of the above problems, the invention provides a high-performance computer central test system that collects information comprehensively, diagnoses faults with accurate and rigorous logic, and displays faults intuitively.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
The invention provides a high-performance computer central test system applied to a computer. The computer comprises a main board, an expansion board and a power board for supplying power; the main board comprises a BMC, an FPGA, a CPU and a network card chip, and the BMC, the FPGA and the network card chip are all connected to the CPU. The high-performance computer central test system comprises an acquisition layer, a management layer and a display layer, wherein:
the acquisition layer is used for acquiring the working state information and self-checking information of each functional module of the computer and summarizing them to the BMC, which specifically comprises:
using the BMC to collect the power supply timing sequence and the detection signals of each voltage sensor, each current sensor and each temperature sensor;
using the CPU to collect the CPU utilization rate, disk utilization rate, memory capacity, disk capacity, memory speed, SATA rate, network interface IP, network interface MAC, network interface bandwidth, network interface rate, network card chip firmware, and the connection bandwidth, connection rate and in-place (presence) state of the external devices, and to send them to the BMC, wherein the network interface is output externally by the network card chip, and the external devices are the FPGA and the network card chip connected to the CPU;
using the FPGA to collect the serial interface self-loop state, the serial interface rate and the discrete quantity interface state and to send them to the BMC through the CPU, wherein the serial interface and the discrete quantity interface are output externally by the FPGA;
using the CPU to perform self-checking on the input/output interfaces and to collect the corresponding self-checking information and send it to the BMC, wherein the input/output interfaces comprise the serial interface, the discrete quantity interface and the network interface;
using the FPGA to perform self-checking on the watchdog and the timer and to send the corresponding self-checking information to the BMC through the CPU;
using the BMC to obtain and analyze the power-on log collected by the CPU, wherein the power-on log comprises power-on state information from the main board power-on stage and the firmware log from the BIOS firmware operation stage within a preset time;
the management layer is used for converting the data format of the working state information and self-checking information collected and summarized at the BMC, storing the converted data in the local RAM of the BMC, analyzing the converted working state information and self-checking information from bottom to top based on a fault diagnosis tree, obtaining a fault diagnosis result and raising an alarm;
and the display layer is used for displaying a UI through a display screen or an upper computer connected to the BMC, and for displaying the working state information, self-checking information and fault diagnosis result.
Preferably, the high-performance computer central test system comprises three detection modes: power-on detection, periodic detection and maintenance detection, wherein:
the power-on detection is carried out after the computer is powered on and started for the first time, and checks and records whether the state parameters of the power supply, the CPU, the memory and the external devices of the computer are normal;
the periodic detection is carried out after power-on is finished and monitors the working state information of the computer in real time;
the maintenance detection is carried out in the warehouse support stage and is used for testing the external test points of the computer so as to export and analyze the self-checking information.
Preferably, the high-performance computer central test system acquires the self-checking result corresponding to the self-checking information through self-checking units managed by the BMC. The description data structure of a self-checking unit comprises a self-checking unit serial number, a self-checking unit name, an attribution source, a self-checking unit type and a self-checking unit grade; the attribution source is one of test, BMC diagnosis, CPU diagnosis and other, and the self-checking unit type is either interrupt type or polling type;
the self-checking result supports unit-by-unit acquisition. When the self-checking result state identifier is null or self-checking is in progress, the BMC node may report just the first 2 bytes; when a new self-checking result exists, the BMC reports the complete self-checking result. The self-checking result comprises a self-checking unit serial number, a fault identifier, a fault code, a time stamp and fault details. The fault identifier is one of 0x00, 0x01, 0x02, 0x03, 0x04 and 0x05, wherein 0x00 represents no self-checking result, 0x01 represents self-checking in progress, 0x02 represents a power-on self-checking result, 0x03 represents a periodic self-checking result, 0x04 represents an instruction-controlled self-checking result and 0x05 represents an interrupt-type self-checking result. The fault details are character string data no longer than 32 bytes.
Preferably, the BMC supports managing at most 254 self-checking units.
Preferably, the working state information after data format conversion is managed in the form of data elements. A data element consists of a data ID and a value, wherein each data ID corresponds to one type of state information and the corresponding value has a fixed length; data elements with data IDs 0x01 to 0x0F are static data elements, and data elements with data IDs 0x10 to 0x6F are dynamic data elements.
Preferably, the static data element is used for identifying the serial number, the production date and the version number of the corresponding functional module, and the dynamic data element is used for identifying the dynamically changed working state information.
Preferably, the detection signal of the voltage sensor includes a DC-DC secondary power signal of each functional module, the detection signal of the current sensor includes a differential current signal of the main board and a differential current signal of the extension board, and the detection signal of the temperature sensor includes a temperature of the main board and a temperature of the extension board.
Preferably, the DC-DC secondary power supply signal comprises one or more of a 12V, 5V, 3.8V, 3.3V, 1.8V, 1.2V, 1.0V, 0.9V power supply signal.
Preferably, the BMC also performs signal amplification on the collected and summarized working state information before the data format conversion.
Preferably, the main board further comprises a CPLD chip, and the BMC is used to collect the CPLD reset state.
Compared with the prior art, the invention has the beneficial effects that:
The system combines hardware and software to achieve comprehensive state information acquisition and data management for the computer, so that the current working state of the computer is obtained accurately and fault detection and localization of the high-performance computer are greatly simplified. A visual UI combines the self-checking information with the results of fault tree logic processing and displays the alarm content intuitively. With the power-on detection, periodic detection and maintenance detection design, the system can detect and record the state of the computer in real time and provides an effective means for fault analysis. In summary, the system comprehensively acquires the software and hardware working state information of the computer, performs logical fault diagnosis through a fault tree decision strategy, and outputs the diagnosis result through the UI, achieving accurate and rigorous logical diagnosis and intuitive fault display.
Drawings
FIG. 1 is a hierarchical block diagram of a high performance computer central test system of the present invention;
FIG. 2 is a block diagram of a motherboard according to the present invention;
FIG. 3 is a functional block diagram of a POL log of the present invention;
FIG. 4 is a logical block diagram of a fault tree of the present invention;
FIG. 5 is a block diagram showing the connection of LCD display screens according to the present invention;
FIG. 6 is a flow chart showing the LCD information display according to the present invention;
FIG. 7 is a schematic diagram of the detection method of the present invention.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It will be understood that when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
The terms in this application are explained as follows:
BMC: Baseboard Management Controller, a small independent system outside the CPU system. Its programs do not affect the normal operation of the CPU system, so it can serve as a management system outside the CPU system.
IPMI: Intelligent Platform Management Interface, responsible for collecting and recording platform health information.
FPGA: Field Programmable Gate Array, a semi-custom application-specific integrated circuit that implements hardware circuit functions through a programming language.
CPU: Central Processing Unit, the final execution unit for information processing and program running.
As shown in FIGS. 1-7, a high-performance computer central test system is applied to a computer. The computer comprises a main board, an expansion board and a power board for supplying power; the main board comprises a BMC, an FPGA, a CPU and a network card chip, and the BMC, the FPGA and the network card chip are all connected to the CPU. The high-performance computer central test system comprises an acquisition layer, a management layer and a display layer, wherein:
the acquisition layer is used for acquiring the working state information and self-checking information of each functional module of the computer and summarizing them to the BMC, which specifically comprises:
using the BMC to collect the power supply timing sequence and the detection signals of each voltage sensor, each current sensor and each temperature sensor;
using the CPU to collect the CPU utilization rate, disk utilization rate, memory capacity, disk capacity, memory speed, SATA rate, network interface IP, network interface MAC, network interface bandwidth, network interface rate, network card chip firmware, and the connection bandwidth, connection rate and in-place (presence) state of the external devices, and to send them to the BMC, wherein the network interface is output externally by the network card chip, and the external devices are the FPGA and the network card chip connected to the CPU;
using the FPGA to collect the serial interface self-loop state, the serial interface rate and the discrete quantity interface state and to send them to the BMC through the CPU, wherein the serial interface and the discrete quantity interface are output externally by the FPGA;
using the CPU to perform self-checking on the input/output interfaces and to collect the corresponding self-checking information and send it to the BMC, wherein the input/output interfaces comprise the serial interface, the discrete quantity interface and the network interface;
using the FPGA to perform self-checking on the watchdog and the timer and to send the corresponding self-checking information to the BMC through the CPU;
using the BMC to obtain and analyze the power-on log collected by the CPU, wherein the power-on log comprises power-on state information from the main board power-on stage and the firmware log from the BIOS firmware operation stage within a preset time;
the management layer is used for converting the data format of the working state information and self-checking information collected and summarized at the BMC, storing the converted data in the local RAM of the BMC, analyzing the converted working state information and self-checking information from bottom to top based on a fault diagnosis tree, obtaining a fault diagnosis result and raising an alarm;
and the display layer is used for displaying a UI through a display screen or an upper computer connected to the BMC, and for displaying the working state information, self-checking information and fault diagnosis result.
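The bottom-up analysis over a fault diagnosis tree performed by the management layer can be illustrated with a minimal sketch. The text does not reproduce the tree of FIG. 4, so the tree shape, the AND/OR gate types and every node name below are assumptions chosen only for illustration: leaf nodes stand for basic events derived from the converted state and self-checking data, and each intermediate node is evaluated only after its children.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FaultNode:
    name: str
    gate: str = "LEAF"  # "LEAF", "OR" or "AND" (gate types are an assumption)
    children: List["FaultNode"] = field(default_factory=list)


def evaluate(node: FaultNode, leaf_faults: Dict[str, bool], active: List[str]) -> bool:
    """Evaluate the tree bottom-up (post-order) and record every faulty node."""
    if node.gate == "LEAF":
        faulty = leaf_faults.get(node.name, False)
    else:
        # Children are evaluated first, so results propagate from the leaves upward.
        results = [evaluate(child, leaf_faults, active) for child in node.children]
        faulty = any(results) if node.gate == "OR" else all(results)
    if faulty:
        active.append(node.name)
    return faulty


# Hypothetical tree: node names are illustrative, not taken from the patent.
tree = FaultNode("computer_fault", "OR", [
    FaultNode("power_fault", "OR", [
        FaultNode("rail_3v3_out_of_range"),
        FaultNode("rail_12v_out_of_range"),
    ]),
    FaultNode("network_fault", "AND", [
        FaultNode("nic_not_present"),
        FaultNode("nic_selftest_failed"),
    ]),
])

active: List[str] = []
top = evaluate(tree, {"rail_3v3_out_of_range": True, "nic_selftest_failed": True}, active)
print(top, active)
```

The list of active nodes, ordered from leaf to root, is the kind of result a display layer could present as the alarm path.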
In one embodiment, the high-performance computer central test system includes three detection modes, namely power-on detection, periodic detection and maintenance detection, wherein:
the power-on detection is carried out after the computer is powered on and started for the first time, and checks and records whether the state parameters of the power supply, the CPU, the memory and the external devices of the computer are normal;
the periodic detection is carried out after power-on is finished and monitors the working state information of the computer in real time;
the maintenance detection is carried out in the warehouse support stage and is used for testing the external test points of the computer so as to export and analyze the self-checking information.
The high-performance computer central test system is deployed on the computer and can complete detection and fault diagnosis of different functional modules. The three detection modes correspond to different detection contents and different execution times, as shown in FIG. 7.
1) The power-on detection is performed after the high-performance computer is powered on and started for the first time; it checks and records whether the state parameters of the power supply, CPU, memory and external devices (peripheral components) are normal;
2) The periodic detection starts after power-on is finished and monitors the working state of the high-performance computer in real time. The periodic detection content can cover all detection content except the functions currently in use by the system, so that periodic detection does not affect normal use of the computer; for example, if the computer is currently communicating over the network interface, the periodic detection does not cover network interface state detection;
3) The maintenance detection is carried out in the warehouse support stage and can be used for testing the external test points of the high-performance computer so as to export and analyze the self-checking information. The warehouse support stage is the stage in which equipment that cannot work normally after a failure is brought into the warehouse for maintenance.
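The way the three detection modes select different detection content can be sketched as follows. The check names and the `in_use` parameter are hypothetical; the only behavior taken from the description is that the periodic mode skips checks on functions the system is currently using, such as a network interface that is carrying traffic.

```python
from enum import Enum
from typing import FrozenSet, List


class Mode(Enum):
    POWER_ON = "power-on"
    PERIODIC = "periodic"
    MAINTENANCE = "maintenance"


# Hypothetical check names per mode; the patent lists the categories, not identifiers.
CHECKS = {
    Mode.POWER_ON: ["power", "cpu", "memory", "external_devices"],
    Mode.PERIODIC: ["voltage", "current", "temperature",
                    "network_interface", "serial_interface"],
    Mode.MAINTENANCE: ["external_test_points", "export_self_check_info"],
}


def select_checks(mode: Mode, in_use: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the checks to run; periodic detection skips functions in use."""
    checks = CHECKS[mode]
    if mode is Mode.PERIODIC:
        checks = [c for c in checks if c not in in_use]
    return list(checks)


# While the network interface is communicating, periodic detection leaves it alone.
print(select_checks(Mode.PERIODIC, frozenset({"network_interface"})))
```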
In one embodiment, the high-performance computer central test system acquires the self-checking result corresponding to the self-checking information through self-checking units managed by the BMC. The description data structure of a self-checking unit comprises a self-checking unit serial number, a self-checking unit name, an attribution source, a self-checking unit type and a self-checking unit grade; the attribution source is one of test, BMC diagnosis, CPU diagnosis and other, and the self-checking unit type is either interrupt type or polling type;
the self-checking result supports unit-by-unit acquisition. When the self-checking result state identifier is null or self-checking is in progress, the BMC node may report just the first 2 bytes; when a new self-checking result exists, the BMC reports the complete self-checking result. The self-checking result comprises a self-checking unit serial number, a fault identifier, a fault code, a time stamp and fault details. The fault identifier is one of 0x00, 0x01, 0x02, 0x03, 0x04 and 0x05, wherein 0x00 represents no self-checking result, 0x01 represents self-checking in progress, 0x02 represents a power-on self-checking result, 0x03 represents a periodic self-checking result, 0x04 represents an instruction-controlled self-checking result and 0x05 represents an interrupt-type self-checking result. The fault details are character string data no longer than 32 bytes. The power-on self-checking and periodic self-checking correspond to the power-on detection and periodic detection, respectively.
In one embodiment, the BMC supports managing at most 254 self-checking units. The number of self-checking units can be set according to actual requirements.
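A possible encoding of the self-checking result can be sketched as below. The description names the fields, the six fault identifier values, the 2-byte short report and the 32-byte limit on fault details, but it does not give field widths or byte order. The layout here is therefore an assumption: a 1-byte unit serial number (assumed range 1 to 254) and a 1-byte fault identifier form the first 2 bytes, followed by an assumed little-endian 16-bit fault code, 32-bit time stamp and ASCII details.

```python
import struct
from dataclasses import dataclass

# Fault identifier values as listed in the description.
STATUS = {
    0x00: "no self-checking result",
    0x01: "self-checking in progress",
    0x02: "power-on self-checking result",
    0x03: "periodic self-checking result",
    0x04: "instruction-controlled self-checking result",
    0x05: "interrupt-type self-checking result",
}

HEADER = struct.Struct("<BB")  # assumed: unit serial number, fault identifier
BODY = struct.Struct("<HI")    # assumed: 16-bit fault code, 32-bit time stamp
MAX_UNITS = 254
MAX_DETAILS = 32


@dataclass
class SelfCheckResult:
    unit: int
    status: int
    fault_code: int = 0
    timestamp: int = 0
    details: str = ""


def encode(result: SelfCheckResult) -> bytes:
    if not 1 <= result.unit <= MAX_UNITS:
        raise ValueError("unit serial number out of range")
    if result.status not in STATUS:
        raise ValueError("unknown fault identifier")
    head = HEADER.pack(result.unit, result.status)
    if result.status in (0x00, 0x01):
        return head  # short report: only the first 2 bytes
    details = result.details.encode("ascii")
    if len(details) > MAX_DETAILS:
        raise ValueError("fault details exceed 32 bytes")
    return head + BODY.pack(result.fault_code, result.timestamp) + details


def decode(data: bytes) -> SelfCheckResult:
    unit, status = HEADER.unpack_from(data, 0)
    if status in (0x00, 0x01):
        return SelfCheckResult(unit, status)
    fault_code, timestamp = BODY.unpack_from(data, HEADER.size)
    details = data[HEADER.size + BODY.size:].decode("ascii")
    return SelfCheckResult(unit, status, fault_code, timestamp, details)
```

Under these assumptions, an in-progress unit costs 2 bytes per poll, while a completed result is at most 40 bytes.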
In one embodiment, the working state information after data format conversion is managed in the form of data elements. A data element consists of a data ID and a value, wherein each data ID corresponds to one type of state information and the corresponding value has a fixed length; data elements with data IDs 0x01 to 0x0F are static data elements, and data elements with data IDs 0x10 to 0x6F are dynamic data elements.
In an embodiment, the static data element is used for identifying the serial number, the production date and the version number of the corresponding functional module, and the dynamic data element is used for identifying the dynamically changed working state information.
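The data element scheme can be illustrated with a short sketch. The ID ranges for static and dynamic elements come from the description; the per-ID value lengths in `VALUE_LENGTH` and the meaning attached to each example ID are hypothetical, since the text states only that each ID has a fixed-length value.

```python
from typing import List, Tuple

# Hypothetical fixed value lengths per data ID (bytes); not taken from the patent.
VALUE_LENGTH = {
    0x02: 4,  # e.g. a version number (static range)
    0x10: 2,  # e.g. a voltage reading (dynamic range)
}


def classify_data_id(data_id: int) -> str:
    """Classify a data ID using the ranges given in the description."""
    if 0x01 <= data_id <= 0x0F:
        return "static"
    if 0x10 <= data_id <= 0x6F:
        return "dynamic"
    return "undefined"


def parse_elements(buf: bytes) -> List[Tuple[int, str, bytes]]:
    """Split a byte stream of (data ID, fixed-length value) pairs into elements."""
    elements = []
    i = 0
    while i < len(buf):
        data_id = buf[i]
        length = VALUE_LENGTH[data_id]  # KeyError for IDs without a known length
        value = buf[i + 1:i + 1 + length]
        if len(value) != length:
            raise ValueError("truncated data element")
        elements.append((data_id, classify_data_id(data_id), value))
        i += 1 + length
    return elements
```

Because every value length is fixed by its ID, no separate length field is needed in the stream.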
In an embodiment, the detection signal of the voltage sensor includes a DC-DC secondary power signal of each functional module, the detection signal of the current sensor includes a differential current signal of the main board and a differential current signal of the extension board, and the detection signal of the temperature sensor includes a temperature of the main board and a temperature of the extension board.
In one embodiment, the DC-DC secondary power signal includes one or more of a 12V, 5V, 3.8V, 3.3V, 1.8V, 1.2V, 1.0V, 0.9V power signal.
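A check on the secondary power rails could look like the sketch below. The nominal rail voltages are those listed in the description; the ±5% tolerance and the shape of the `readings` mapping are assumptions, since the text does not state acceptance limits.

```python
from typing import Dict, List

# Nominal DC-DC secondary rails named in the description (volts).
RAILS = (12.0, 5.0, 3.8, 3.3, 1.8, 1.2, 1.0, 0.9)


def out_of_range(readings: Dict[float, float], tolerance: float = 0.05) -> List[float]:
    """Return the nominal rails whose measured value deviates beyond the tolerance.

    `readings` maps a nominal rail voltage to the measured voltage.
    The 5% default tolerance is an illustrative assumption.
    """
    faults = []
    for nominal, measured in readings.items():
        if nominal not in RAILS:
            raise KeyError(f"unknown rail: {nominal}")
        if abs(measured - nominal) > nominal * tolerance:
            faults.append(nominal)
    return faults


# The 3.3 V rail reads 3.0 V, which is outside the assumed 5% window.
print(out_of_range({12.0: 11.9, 3.3: 3.0, 0.9: 0.91}))
```

Each out-of-range rail could then feed a leaf event of the fault diagnosis tree.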
In one embodiment, the BMC also performs signal amplification on the collected and summarized working state information before the data format conversion.
In an embodiment, the motherboard further includes a CPLD chip, and the BMC is utilized to collect a CPLD reset state.
Specifically, the high-performance computer mainly comprises a main board, an expansion board and a power board, and can be divided by function into a calculation function, an input/output interface function, a BMC function and an internal module power supply function. The high-performance computer central test system is built on the BMC, FPGA, CPU and network card chip, and supports computer state acquisition, data management and judgment (fault log recording and logic diagnosis) and interactive display. The system architecture is divided into three layers, namely an acquisition layer, a management layer and a display layer, as shown in FIG. 1. The main functions are as follows:
1.1 acquisition layer
The acquisition layer is responsible for collecting and summarizing working state information, such as the voltage information, current information, in-place information, running speed, connection bandwidth, real-time temperature and working log information (namely power-on log information) of each functional module of the computer, and for collecting and summarizing self-checking information. The hardware connection block diagram is shown in FIG. 2: the CPU is connected to the SATA storage disk through the SATA bus to provide storage; the CPU is connected to the CPLD chip through the LPC bus for data communication; the CPU is connected to the memory through the DDR data bus and address bus to provide the memory function; the CPU is connected to the FPGA and the network card chip through the PCIe bus; the network card chip is connected to Flash through the SPI bus; the FPGA outputs the serial interface and the discrete quantity interface externally; and the network card chip outputs the network interface externally.
1.1.1 State acquisition
The BMC is connected to the voltage sensors, current sensors and temperature sensors through the I2C bus. There may be any number of voltage, current and temperature sensors. The information collected by each sensor can also be amplified.
1) The voltage sensor is connected with DC-DC secondary power supply detection signals of 12V, 5V, 3.8V, 3.3V, 1.8V, 1.2V, 1.0V and 0.9V of different functional modules to complete the state acquisition of all secondary power supplies of the high-performance computer and obtain corresponding voltage information;
2) The current sensors are respectively connected to differential current signals of the main board and the expansion board to complete current signal acquisition and obtain corresponding voltage information;
3) The temperature sensors are deployed at key locations of the main board and the expansion board of the high-performance computer (for example, near core components) to complete key temperature information acquisition, such as the CPU temperature, the FPGA temperature and the board card temperature (including the main board, the expansion board and the power board);
4) The CPU collects information such as CPU utilization rate, disk utilization rate, memory capacity, disk capacity, memory speed, SATA rate and the like through a driving interface;
5) The CPU collects the current connection bandwidth, connection rate and in-place state of the PCIe external equipment through the PCIe bus driving interface. Here the external equipment refers to the equipment connected under the PCIe bus of the CPU, namely the FPGA and the network card chip. Specifically, the collected information comprises the network interface IP, network interface MAC, network interface bandwidth, network interface rate, network card chip firmware, and the connection bandwidth, connection rate and in-place state of the external equipment. The FPGA is used to collect the self-loop state of the serial port interface, the serial port interface rate and the discrete quantity interface state, and to send them to the BMC. The network card chip firmware is the software configuration information necessary for the network card chip to work. The serial port interface self-loop state is the result of a wrapping (loopback) test of the external serial port interface and indicates whether the serial port interface is good or bad;
6) The CPU triggers the self-check of the input/output interface, the FPGA executes the self-checking program and reports the self-checking information to the CPU, which completes the acquisition of the self-loop test results of the peripheral interfaces;
7) BMC log processing: the BMC is connected with the CPU through a UART interface, the system power-on log is forwarded to the BMC over the UART bus, and the BMC stores the power-on log information in the external Flash through an SPI bus to complete the power-on log record. As shown in fig. 3, the POL (power-on log) function provides power-on log recording, log export and intelligent log analysis. It mainly records, in the external Flash of the BMC, the power-on state information of the main board in the power-on stage and the firmware log of the BIOS firmware operation stage. The log records can be exported through log export software and analyzed through log analysis software, which displays common information and gives analysis prompts for fault logs.
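The log analysis step in item 7) can be illustrated with a minimal Python sketch. The text does not specify the log format, so the line layout (`<seconds> <STAGE> <LEVEL> <message>`), the keyword table `HINTS` and the sample log are all hypothetical and serve only to show how an exported log could be scanned for fault entries and paired with analysis prompts.

```python
import re
from dataclasses import dataclass

# Hypothetical line layout: "<seconds> <STAGE> <LEVEL> <message>".
LINE_RE = re.compile(
    r"^(?P<t>\d+\.\d+)\s+(?P<stage>POWER|BIOS)\s+"
    r"(?P<level>INFO|WARN|ERROR)\s+(?P<msg>.*)$"
)

# Hypothetical keyword -> analysis prompt table.
HINTS = {
    "PGOOD": "check the secondary power rail named in the message",
    "DDR": "check memory seating and memory power rails",
    "PCIe": "check the in-place state of the FPGA or network card chip",
}


@dataclass
class Finding:
    time_s: float
    stage: str
    message: str
    hint: str


def analyze_power_on_log(text):
    """Return one Finding per ERROR line of an exported power-on log."""
    findings = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None or m["level"] != "ERROR":
            continue  # skip malformed lines and non-fault entries
        hint = next(
            (h for key, h in HINTS.items() if key in m["msg"]),
            "no hint available",
        )
        findings.append(Finding(float(m["t"]), m["stage"], m["msg"], hint))
    return findings


# Hypothetical exported log, for illustration only.
SAMPLE_LOG = """\
0.010 POWER INFO 12V rail up
0.025 POWER ERROR PGOOD timeout on 1.8V rail
1.200 BIOS INFO memory init start
1.950 BIOS ERROR DDR training failed
"""

findings = analyze_power_on_log(SAMPLE_LOG)
```

Keeping the prompts in a table separate from the parser means new fault patterns can be added without touching the parsing logic.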
1.1.2 State summarization
In a high-performance computer central test system, the BMC is used as an information summarizing unit in consideration of the possibility of operating system faults under extreme conditions.
8) The CPU sends the dynamic information such as CPU utilization rate, memory utilization rate, disk utilization rate, current in-place state of PCIe external equipment and the like to the BMC through a UART interface;
9) The CPU executes the self-checking program of the input/output interface, including serial interface wrapping test program, network interface wrapping program, etc., and sends the detection result to the BMC through UART interface;
10) The FPGA performs self-checks on the watchdog, the timer and the like, and the self-checking results are summarized to the BMC through a private UART protocol.
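The summarization path over UART can be sketched as a simple framed message. The private UART protocol is not disclosed in the text, so everything below is an assumption made for illustration: the start byte `SOF`, the source IDs, the one-byte length field and the additive checksum.

```python
SOF = 0xA5                      # hypothetical start-of-frame byte
SRC_CPU, SRC_FPGA = 0x01, 0x02  # hypothetical source identifiers


def checksum(data: bytes) -> int:
    """Additive 8-bit checksum (an assumption, not the patent's scheme)."""
    return sum(data) & 0xFF


def encode_frame(source: int, payload: bytes) -> bytes:
    """Build SOF | source | length | payload | checksum."""
    if len(payload) > 255:
        raise ValueError("payload too long for a one-byte length field")
    body = bytes([SOF, source, len(payload)]) + payload
    return body + bytes([checksum(body)])


def decode_frame(frame: bytes):
    """Validate a frame and return (source, payload)."""
    if len(frame) < 4 or frame[0] != SOF:
        raise ValueError("bad start of frame")
    source, length = frame[1], frame[2]
    if len(frame) != 4 + length:
        raise ValueError("length mismatch")
    if checksum(frame[:-1]) != frame[-1]:
        raise ValueError("checksum mismatch")
    return source, frame[3:-1]


# Example: the FPGA reports two self-check status bytes to the BMC.
frame = encode_frame(SRC_FPGA, b"\x01\x00")
source, payload = decode_frame(frame)
```

Rejecting frames with a wrong length or checksum lets the BMC discard corrupted reports instead of storing them as state.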
1.2 management layer
1.2.1 management layer data definition
The BMC collects and summarizes the working state information and self-checking information, converts them into a specific data format, and stores them in the local RAM of the BMC. The summarized information can then be acquired from the BMC through an IPMI interface.
1) Operating state information
The working state information is managed in the form of data elements, the data elements are composed of data IDs and numerical values, wherein each ID corresponds to one type of state information, and the numerical value is fixed in length.
The data elements of ID identifiers 0x01 to 0x0F are static data elements, and these data elements are loaded with information that does not change in actual use of the functional module, such as a serial number, a production date, and a version number of the functional module.
The data elements of ID identifiers 0x10 to 0x6F are dynamic data elements, which are used for identifying dynamically changing state data such as voltage, temperature, current and CPU state.
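The data element layout described above (a data ID followed by a fixed-length value, with 0x01 to 0x0F static and 0x10 to 0x6F dynamic) can be sketched as follows. The text does not give the value length or any concrete ID assignment, so the 4-byte `VALUE_LEN`, the ID `ID_VOLTAGE_3V3` and the millivolt encoding are assumptions.

```python
import struct

STATIC_IDS = range(0x01, 0x10)   # 0x01..0x0F, from the text
DYNAMIC_IDS = range(0x10, 0x70)  # 0x10..0x6F, from the text
VALUE_LEN = 4                    # assumption: fixed 4-byte value


def classify(data_id: int) -> str:
    """Return 'static' or 'dynamic' for a valid data ID."""
    if data_id in STATIC_IDS:
        return "static"
    if data_id in DYNAMIC_IDS:
        return "dynamic"
    raise ValueError(f"data ID 0x{data_id:02X} is outside the defined ranges")


def pack_element(data_id: int, value: bytes) -> bytes:
    """Serialize one data element as ID byte + fixed-length value."""
    classify(data_id)  # validates the ID range
    if len(value) != VALUE_LEN:
        raise ValueError("value must have the fixed length")
    return bytes([data_id]) + value


def unpack_elements(buf: bytes):
    """Split a buffer of concatenated elements into (ID, value) pairs."""
    step = 1 + VALUE_LEN
    if len(buf) % step:
        raise ValueError("buffer is not a whole number of elements")
    return [(buf[i], buf[i + 1:i + step]) for i in range(0, len(buf), step)]


ID_VOLTAGE_3V3 = 0x10  # hypothetical ID assignment
# Hypothetical encoding: little-endian millivolts.
buf = pack_element(ID_VOLTAGE_3V3, struct.pack("<I", 3300))
elements = unpack_elements(buf)
```

Because every value has the same length, a reader can walk the buffer without any per-element length field.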
2) Self-checking information
The basic unit of fault diagnosis is the self-checking unit. Table 1 defines the description data structure of the self-checking unit. The self-checking units are uniformly managed by the BMC, which supports the management of at most 254 self-checking units.
Table 1 self-test unit
The self-checking result describes the result information of a self-checking unit after its self-check is finished. Self-checking results can only be acquired unit by unit, and the data structure is defined in table 2 below. When the self-checking result state is marked as empty or as self-checking in progress, the BMC node is allowed to report only the first 2 bytes; when a new self-checking result exists, the BMC must report the complete self-checking result. The fault codes in table 2 may follow a fault code definition table defined inside the enterprise, for example, 001 represents a voltage abnormality fault.
Table 2 self-test results
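The reporting rule for self-checking results can be sketched in Python. The field order (unit serial number, fault identification, fault code, time stamp, fault details), the identification values 0x00 to 0x05, the 254-unit limit and the 32-byte limit on fault details come from the description and claim 3. The field widths (1-byte serial number, 2-byte fault code, 4-byte time stamp), the little-endian layout and the 1 to 254 numbering are assumptions.

```python
import struct
from enum import IntEnum


class ResultState(IntEnum):
    NONE = 0x00         # no self-checking result
    IN_PROGRESS = 0x01  # self-checking in progress
    POWER_ON = 0x02     # power-on self-checking result
    PERIODIC = 0x03     # periodic self-checking result
    COMMANDED = 0x04    # instruction-controlled self-checking result
    INTERRUPT = 0x05    # interrupt self-checking result


MAX_UNITS = 254   # from the text
MAX_DETAIL = 32   # from claim 3


def encode_result(unit, state, fault_code=0, timestamp=0, details=""):
    """Encode a result; empty/in-progress states use the 2-byte short form."""
    if not 1 <= unit <= MAX_UNITS:  # assumption: units are numbered 1..254
        raise ValueError("self-checking unit number out of range")
    state = ResultState(state)
    head = bytes([unit, state])
    if state in (ResultState.NONE, ResultState.IN_PROGRESS):
        return head
    text = details.encode("ascii")
    if len(text) > MAX_DETAIL:
        raise ValueError("fault details exceed 32 bytes")
    # Assumed widths: 2-byte fault code, 4-byte time stamp, little-endian.
    return head + struct.pack("<HI", fault_code, timestamp) + text


def decode_result(buf: bytes) -> dict:
    """Decode either the 2-byte short form or a complete result."""
    result = {"unit": buf[0], "state": ResultState(buf[1])}
    if len(buf) > 2:
        fault_code, timestamp = struct.unpack_from("<HI", buf, 2)
        result.update(
            fault_code=fault_code,
            timestamp=timestamp,
            details=buf[8:].decode("ascii"),
        )
    return result


short = encode_result(3, ResultState.IN_PROGRESS)
full = encode_result(3, ResultState.PERIODIC, fault_code=1,
                     timestamp=1000, details="voltage abnormal")
```

The short form keeps polling traffic small while a self-check is still running, and the complete form is only sent once a new result exists.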
1.2.2 management layer logic design
The management layer designs a fault diagnosis tree for the currently acquired state data, performs logic analysis on the fault tree from bottom to top, comprehensively judges which system functions may be affected according to the information collected by the acquisition layer, and raises an alarm. The logic design of the management layer fault tree is shown in fig. 4. Specifically, the fault types comprise internal module power supply abnormality, computing processing capacity abnormality and input/output interface function abnormality:
1) Internal module power supply abnormality is mainly judged according to the detected voltage of each functional module;
2) Computing processing capacity abnormality is represented by CPLD abnormality, CPU abnormality, memory abnormality, SATA disk abnormality and the like. For example, CPLD abnormality is mainly judged according to the detected reset information and voltage information; CPU abnormality is mainly judged according to the detected temperature information, utilization rate, voltage information and current information; memory abnormality is mainly judged according to the detected voltage information, operation rate, memory capacity and utilization rate; and SATA disk abnormality is mainly judged according to the detected PCIe connection rate, PCIe connection bandwidth, voltage information, capacity, operation rate, utilization rate and PCIe in-place state;
3) Input/output interface function abnormality corresponds to abnormality of the network interface, the serial port interface and the discrete quantity interface. Network interface abnormality is expressed as gigabit network communication function abnormality and is mainly judged according to the detected voltage information, PCIe connection rate, PCIe connection bandwidth, firmware state, IP/MAC address and PCIe in-place state. Serial port interface abnormality is expressed as RS232 function abnormality and RS422 function abnormality and is mainly judged according to the detected voltage information, serial port wrapping test state, PCIe connection rate, PCIe connection bandwidth and PCIe in-place state. Discrete quantity interface abnormality is expressed as discrete quantity input interface abnormality and discrete quantity output interface abnormality and is mainly judged according to the detected voltage information, wrapping test state, PCIe connection rate, PCIe connection bandwidth and PCIe in-place state.
The judgment logic can be threshold-based: the acquired value is compared with a preset normal range, and a fault is judged when the value falls outside that range. For example, if the normal voltage is 3.3V plus or minus 0.1V and 3.5V is collected, the voltage is judged abnormal; if the normal network interface rate is 1000Mbps and the actual test result is 100Mbps, the rate is judged abnormal. The above abnormal conditions are given only for ease of understanding. Those skilled in the art can make simple adjustments to the specific abnormal state detection and judgment according to actual requirements, which are not limited herein.
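The threshold judgment and the bottom-up evaluation of the fault tree can be sketched as below. The thresholds (3.3V plus or minus 0.1V, 1000Mbps) and the fault type names come from the text. The reading keys, the reduced tree, the OR-style propagation to parent nodes and the treatment of a missing reading as a fault are assumptions of this sketch, not the tree of fig. 4.

```python
from dataclasses import dataclass, field


@dataclass
class RangeCheck:
    key: str      # name of the collected reading (hypothetical keys)
    low: float    # lower bound of the normal range
    high: float   # upper bound of the normal range

    def is_fault(self, readings) -> bool:
        value = readings.get(self.key)
        # Assumption: a missing reading is treated as a fault.
        return value is None or not (self.low <= value <= self.high)


@dataclass
class FaultNode:
    name: str
    checks: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def evaluate(self, readings):
        """Return the names of faulty nodes, lowest levels first."""
        faults = []
        for child in self.children:
            faults.extend(child.evaluate(readings))
        own_fault = any(c.is_fault(readings) for c in self.checks)
        if own_fault or faults:
            faults.append(self.name)  # a parent is affected if any child is
        return faults


# Reduced tree using the two threshold examples from the text.
tree = FaultNode("system", children=[
    FaultNode("internal module power supply",
              checks=[RangeCheck("voltage_3v3", 3.2, 3.4)]),
    FaultNode("input/output interface function", children=[
        FaultNode("network interface",
                  checks=[RangeCheck("net_rate_mbps", 1000, 1000)]),
    ]),
])

alarms = tree.evaluate({"voltage_3v3": 3.3, "net_rate_mbps": 100})
```

Listing the lowest faulty node first points the operator at the most specific cause, while the higher-level names show which system functions may be affected.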
1.3 display layer
The display layer provides two interactive display modes: 1) an LCD display screen externally connected through the BMC; 2) an upper computer connected through the BMC serial port interface, which performs the UI interface display. The UI interface supports a multi-level menu and performs the menu display function by recognizing key IO signals.
1.3.1 LCD display screen display
As shown in fig. 5, the BMC is connected with the LCD display screen through the SPI bus and provides a concise UI interface that can display the real-time working state information, self-checking information and fault diagnosis results (such as fault codes) of the computer. Key operation information is read through GPIO, and the BMC judges and identifies the health information (such as fault codes) and controls its display on the LCD screen. If a multi-level menu is supported, functions such as selecting a secondary menu, paging up and down and returning to the previous menu can be realized through the keys. The LCD information display flow chart is shown in fig. 6: after the BMC completes the identification of the health information state, the health information is displayed on the LCD screen; a key event is judged to be a long press or a short press by identifying the key I/O signal, and menu display functions such as selection and paging up and down are executed according to the key event and finally shown on the LCD display screen. These techniques are well known to those skilled in the art and are not repeated herein.
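The key handling in the flow of fig. 6 can be sketched as a small state machine. The text only says that a key event is classified as a long or short press and that the menu supports selection, paging and returning to the previous level. The 1-second threshold, the key names, the mapping of short press to "enter" and long press to "back", and the menu contents are all assumptions.

```python
LONG_PRESS_S = 1.0  # assumption: presses of 1 s or more count as long


def classify_press(pressed_at: float, released_at: float) -> str:
    """Classify a key event from its press and release times in seconds."""
    return "long" if released_at - pressed_at >= LONG_PRESS_S else "short"


# Hypothetical two-level menu; leaves stand for display pages.
MENU = {
    "Working state": {"Voltage": "voltage page", "Temperature": "temperature page"},
    "Self-check": {"Results": "result page"},
    "Fault diagnosis": {"Fault codes": "fault code page"},
}


class MenuState:
    def __init__(self, tree):
        self.tree = tree
        self.path = []    # titles of the submenus entered so far
        self.cursor = 0   # highlighted entry in the current menu

    def _current(self):
        node = self.tree
        for title in self.path:
            node = node[title]
        return node

    def entries(self):
        return list(self._current())

    def handle(self, key: str, press: str) -> None:
        """Apply one key event (hypothetical keys: up, down, ok)."""
        entries = self.entries()
        if key == "up":
            self.cursor = (self.cursor - 1) % len(entries)
        elif key == "down":
            self.cursor = (self.cursor + 1) % len(entries)
        elif key == "ok" and press == "short":
            target = self._current()[entries[self.cursor]]
            if isinstance(target, dict):  # only submenus can be entered
                self.path.append(entries[self.cursor])
                self.cursor = 0
        elif key == "ok" and press == "long" and self.path:
            self.path.pop()  # return to the previous menu level
            self.cursor = 0

    def render(self):
        """Lines that would be drawn on the LCD, cursor marked with '>'."""
        return [("> " if i == self.cursor else "  ") + e
                for i, e in enumerate(self.entries())]


state = MenuState(MENU)
state.handle("down", "short")
state.handle("ok", classify_press(0.0, 0.2))
```

Keeping the menu as data and the navigation as a separate class means the same logic could drive either the LCD or the upper computer UI.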
1.3.2 Upper computer display
The BMC is connected with the upper computer through the serial port interface, the current state information is transmitted to the upper computer, and the upper computer completes display through the UI interface.
By combining hardware and software, the system realizes comprehensive state information acquisition and data management for the computer and accurately obtains its current working state, which greatly facilitates fault detection and localization for the high-performance computer. A visual UI interface combines the self-checking information with the fault tree logic processing results and displays the alarm content intuitively. The system also adopts power-on detection, period detection and maintenance detection, so that it can detect and record the computer state in real time and provide an effective means for fault analysis. In short, the system comprehensively collects the software and hardware working state information of the computer, performs fault logic diagnosis through the fault tree judgment strategy, and outputs the diagnosis result through the UI interface, achieving accurate and rigorous logic diagnosis and intuitive fault display.
The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.
The above-described embodiments are merely representative of the more specific and detailed embodiments described herein and are not to be construed as limiting the claims. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A high performance computer central test system applied to a computer, the computer comprising a main board, an expansion board and a power board for supplying power, characterized in that: the main board comprises a BMC, an FPGA, a CPU and a network card chip, the BMC, the FPGA and the network card chip are all connected with the CPU, and the high performance computer central test system comprises an acquisition layer, a management layer and a display layer, wherein:
the acquisition layer is used for acquiring the working state information and the self-checking information of each functional module of the computer and summarizing the working state information and the self-checking information to the BMC, and specifically comprises the following steps:
using the BMC to collect the detection signals of the power supply time sequence, each voltage sensor, each current sensor and each temperature sensor;
using the CPU to collect the CPU utilization rate, disk utilization rate, memory capacity, disk capacity, memory speed, SATA rate, network interface IP, network interface MAC, network interface bandwidth, network interface rate, network card chip firmware, connection bandwidth of the external equipment, connection rate of the external equipment and in-place state of the external equipment, and to send them to the BMC, wherein the network interface is output externally by the network card chip, and the external equipment is the FPGA and the network card chip connected with the CPU;
using the FPGA to collect the self-loop state of the serial port interface, the serial port interface rate and the discrete quantity interface state, and to send them to the BMC through the CPU, wherein the serial port interface and the discrete quantity interface are output externally by the FPGA;
using the CPU to perform self-checking on the input/output interface, and to collect the corresponding self-checking information and send it to the BMC, wherein the input/output interface comprises the serial port interface, the discrete quantity interface and the network interface;
using the FPGA to perform self-checking on the watchdog and the timer, and to send the corresponding self-checking information to the BMC through the CPU;
using the BMC to obtain and analyze the power-on log collected by the CPU, wherein the power-on log comprises the power-on state information of the main board power-on stage and the firmware log of the BIOS firmware operation stage within a preset time;
the management layer is used for converting the data format of the working state information and self-checking information collected and summarized by the BMC, storing the converted working state information and self-checking information in the local RAM of the BMC, analyzing the currently converted working state information and self-checking information from bottom to top based on a fault diagnosis tree, obtaining a fault diagnosis result and giving an alarm;
and the display layer is used for displaying the working state information, the self-checking information and the fault diagnosis result through a display screen or an upper computer connected with the BMC.
2. The high performance computer central test system according to claim 1, wherein: the high-performance computer central test system comprises three detection modes of power-on detection, period detection and maintenance detection, wherein:
the power-on detection is carried out after the computer is powered on and started for the first time, and whether the state parameters of the power supply, the CPU, the memory and the external equipment of the computer are normal or not is checked and recorded;
the period detection is carried out after the power-on is finished, and the working state information of the computer is monitored in real time;
the maintenance detection is carried out in the warehouse guarantee stage, and is used for testing the external test points of the computer and for exporting and analyzing the self-checking information.
3. The high performance computer central test system according to claim 1, wherein: the high-performance computer central test system acquires the self-checking result corresponding to the self-checking information through self-checking units managed by the BMC, wherein the description data structure of a self-checking unit comprises a self-checking unit serial number, a self-checking unit name, an attribution source, a self-checking unit type and a self-checking unit grade, the attribution source is divided into test, BMC diagnosis, CPU diagnosis and others, and the self-checking unit type comprises an interrupt type and a polling type;
the self-checking result supports unit-by-unit acquisition, when the self-checking result state identification is empty or the self-checking is in progress, the BMC node is allowed to report the first 2 bytes, when a new self-checking result exists, the BMC reports a complete self-checking result, the self-checking result consists of a self-checking unit serial number, a fault identification, a fault code, a time stamp and fault details, the fault identification is one of 0x00, 0x01, 0x02, 0x03, 0x04 and 0x05, wherein 0x00 represents no self-checking result, 0x01 represents the self-checking in progress, 0x02 represents the power-on self-checking result, 0x03 represents the periodic self-checking result, 0x04 represents the instruction-controlled self-checking result and 0x05 represents the interrupt self-checking result, and the fault details are in the form of character string data, and the length is not more than 32 bytes.
4. The high performance computer central test system according to claim 3, wherein: the BMC maximally supports management of 254 self-test units.
5. The high performance computer central test system according to claim 1, wherein: the working state information after the data format conversion is managed in a data element form, wherein the data elements consist of data IDs and numerical values, each data ID corresponds to one type of state information, the corresponding numerical value is of fixed length, the data elements with data IDs 0x01 to 0x0F are static data elements, and the data elements with data IDs 0x10 to 0x6F are dynamic data elements.
6. The high performance computer central test system according to claim 5, wherein: the static data element is used for identifying the serial number, the production date and the version number of the corresponding functional module, and the dynamic data element is used for identifying the dynamic change working state information.
7. The high performance computer central test system according to claim 1, wherein: the detection signals of the voltage sensor comprise DC-DC secondary power signals of all the functional modules, the detection signals of the current sensor comprise differential current signals of a main board and differential current signals of an expansion board, and the detection signals of the temperature sensor comprise the temperature of the main board and the temperature of the expansion board.
8. The high performance computer central test system according to claim 7, wherein: the DC-DC secondary power supply signal includes one or more of a 12V, 5V, 3.8V, 3.3V, 1.8V, 1.2V, 1.0V, 0.9V power supply signal.
9. The high performance computer central test system according to claim 1, wherein: after collecting and summarizing the working state information, the BMC performs signal amplification processing before performing the data format conversion.
10. The high performance computer central test system according to claim 1, wherein: the main board also comprises a CPLD chip, and the BMC is utilized to acquire the CPLD reset state.
CN202311162777.3A 2023-09-08 2023-09-08 High-performance computer central test system Pending CN117271182A (en)


Publications (1)

Publication Number Publication Date
CN117271182A true CN117271182A (en) 2023-12-22

Family

ID=89211457


Country Status (1)

Country Link
CN (1) CN117271182A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination