CN116225802A - Fault testing method and device and computing equipment - Google Patents

Publication number
CN116225802A
Authority
CN
China
Prior art keywords
fault
hardware
programmable logic
logic unit
tested
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310126796.4A
Other languages
Chinese (zh)
Inventor
赵树梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd
Priority to CN202310126796.4A
Publication of CN116225802A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2247 Verification or detection of system hardware configuration

Abstract

A fault testing method, relating to the field of computer technology. The method includes: setting a programmable logic unit of a server to a test mode; while the programmable logic unit is in the test mode, obtaining a fault signal that simulates a fault of hardware under test in the server; according to the fault signal, setting the state of the hardware under test recorded in the programmable logic unit to a fault state; and, after a management unit of the server reads from the programmable logic unit that the hardware under test is in the fault state, sending, by the management unit, fault information indicating that the hardware under test has failed. The fault detection performance of the computing device can thus be tested automatically, without manually constructing a real hardware fault, which improves test efficiency and reduces cost.

Description

Fault testing method and device and computing equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a fault testing method, a fault testing device, and a computing device.
Background
Servers typically include multiple hardware components, such as processors, memory, hard disks, and fans. These components may fail in many different ways during operation, and a hardware failure affects the use of the server and can disrupt the services running on it. The server therefore needs to detect hardware faults in real time and either perform self-healing or report the fault to the end user so that the faulty hardware can be replaced, thereby ensuring stable and safe operation.
The fault detection performance of the server is therefore critical. To ensure that it is reliable, the fault detection performance needs to be tested before the server goes into service on the network.
Disclosure of Invention
The application provides a fault testing method, a fault testing device, a computing device, a storage medium, and a program product, which can improve the reliability of fault testing for computing devices such as servers.
In a first aspect, the present application provides a fault testing method. The method includes: setting a programmable logic unit of a server to a test mode; while the programmable logic unit is in the test mode, obtaining a fault signal that simulates a fault of hardware under test in the server; according to the fault signal, setting the state of the hardware under test recorded in the programmable logic unit to a fault state; and, after a management unit of the server reads from the programmable logic unit that the hardware under test is in the fault state, sending, by the management unit, fault information indicating that the hardware under test has failed.
In this embodiment, the computing device (i.e., the server) is communicatively connected to a management terminal. The hardware under test may be a processor, a memory, a hard disk, a fan, a network card, or similar hardware. A user may enter an instruction at the management terminal to simulate the signal produced when a certain type of fault (such as a voltage abnormality or a temperature abnormality) occurs in hardware on the computing device, and the management terminal sends the constructed fault signal to the computing device. The programmable logic unit of the computing device then sets the recorded state of the hardware under test to a fault state, and the management unit of the server reads the fault state from the programmable logic unit in order to diagnose the fault and report fault information. The management terminal can therefore test the fault detection performance (such as accuracy and timeliness) of the computing device for the hardware under test by verifying whether the fault information is consistent with the fault represented by the fault signal. No real hardware fault needs to be constructed manually, which reduces test cost and improves efficiency. The programmable logic unit may be a complex programmable logic device (CPLD) and the management unit may be a baseboard management controller (BMC), but neither is limited thereto.
In some possible implementations, the programmable logic unit includes a register for storing a state of the hardware under test, the fault signal includes a fault flag, the fault flag is used to indicate that the hardware under test is in a fault state, the register is set to a readable and writable mode in a case that the programmable logic unit is in a test mode, the fault flag is written into the register corresponding to the hardware under test, and the management unit reads the fault flag from the register.
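The register behaviour described in this implementation can be illustrated with a minimal Python sketch. It is only a software model under stated assumptions, not the patented logic: the class names, the flag encoding (1 for fault, 0 for normal), and the rule that software writes are rejected outside test mode are all hypothetical choices made for illustration.

```python
from enum import Enum


class Mode(Enum):
    RUN = "run"
    TEST = "test"


FAULT_FLAG = 1  # hypothetical encoding: 1 = fault state, 0 = normal state


class StatusRegister:
    """One status register for one piece of hardware under test."""

    def __init__(self, name):
        self.name = name
        self.value = 0
        self.writable = False  # assumption: not software-writable outside test mode


class ProgrammableLogicUnit:
    """Toy model of the programmable logic unit and its status registers."""

    def __init__(self, hardware_names):
        self.mode = Mode.RUN
        self.registers = {n: StatusRegister(n) for n in hardware_names}

    def set_mode(self, mode):
        self.mode = mode
        # In test mode the registers are placed in a readable and writable mode.
        for reg in self.registers.values():
            reg.writable = mode is Mode.TEST

    def write_fault_flag(self, hardware):
        reg = self.registers[hardware]
        if not reg.writable:
            raise PermissionError(
                f"{hardware} register is not writable in {self.mode.value} mode"
            )
        reg.value = FAULT_FLAG

    def read(self, hardware):
        # This is the value the management unit would read back.
        return self.registers[hardware].value


cpld = ProgrammableLogicUnit(["cpu", "fan"])
cpld.set_mode(Mode.TEST)
cpld.write_fault_flag("fan")
```

After these calls, reading the `fan` register returns the fault flag while the `cpu` register still reads as normal.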
In some possible implementations, the fault signal includes at least one fault type of the hardware under test, the programmable logic unit includes at least one register, and setting the state of the hardware under test recorded in the programmable logic unit to the fault state includes: placing a register corresponding to each fault type into a queue to be operated;
and setting the state of the hardware to be tested recorded in all the registers in the queue to be operated as a fault state.
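The queue-based handling of multiple fault types can be sketched as follows. This is an illustrative model only: the register addresses, the fault-type names, and the function `apply_fault_signal` are hypothetical and do not come from the application.

```python
from collections import deque

# Hypothetical register addresses for each (hardware, fault type) pair.
REGISTER_MAP = {
    ("cpu", "temperature_abnormal"): 0x10,
    ("cpu", "voltage_abnormal"): 0x11,
    ("fan", "not_in_place"): 0x20,
}

FAULT_STATE = 1
NORMAL_STATE = 0


def apply_fault_signal(registers, hardware, fault_types):
    """Queue the register for each fault type, then set every queued register."""
    # Step 1: place the register corresponding to each fault type into a queue.
    pending = deque(REGISTER_MAP[(hardware, ft)] for ft in fault_types)
    # Step 2: set the state recorded in every queued register to the fault state.
    written = []
    while pending:
        address = pending.popleft()
        registers[address] = FAULT_STATE
        written.append(address)
    return written


registers = {address: NORMAL_STATE for address in REGISTER_MAP.values()}
written = apply_fault_signal(
    registers, "cpu", ["temperature_abnormal", "voltage_abnormal"]
)
```

Separating the queueing step from the write step means that one fault signal carrying several fault types is resolved to registers first and then applied in order.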
In some possible implementations, the fault types include one or more of: not powered on, not in place, temperature abnormality, voltage abnormality, or power abnormality.
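Because a fault signal may carry one or more of these fault types, one way to represent them in test software is a bit-flag enumeration. The sketch below assumes this representation for illustration; the enumeration and member names are hypothetical, and the application does not specify an encoding.

```python
from enum import Flag, auto


class FaultType(Flag):
    """Hypothetical bit-flag encoding of the listed fault types."""

    NOT_POWERED_ON = auto()
    NOT_IN_PLACE = auto()
    TEMPERATURE_ABNORMAL = auto()
    VOLTAGE_ABNORMAL = auto()
    POWER_ABNORMAL = auto()


# A single simulated fault signal may combine several fault types.
combined = FaultType.TEMPERATURE_ABNORMAL | FaultType.VOLTAGE_ABNORMAL
```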
In some possible implementations, the method further includes: while the programmable logic unit is in the test mode, obtaining a fault recovery signal that simulates recovery of the hardware under test in the server from the fault; according to the fault recovery signal, setting the state of the hardware under test recorded in the programmable logic unit to a normal state; and, after the management unit reads from the programmable logic unit that the hardware under test is in the normal state, sending, by the management unit, fault recovery information indicating that the hardware under test has recovered from the fault.
Therefore, after the test is completed, the management terminal can control the computing device to restore its factory settings. Unlike a manually constructed real fault, a simulated fault does not need to be repaired by hand, which lowers the expertise required of testers, reduces labor cost, and increases the degree of test automation.
In some possible implementations, the method further includes: setting the programmable logic unit to a running mode;
and, while the programmable logic unit is in the running mode, restoring the registers in the programmable logic unit to their factory-default state. The method may further include sending the fault information and the fault recovery information to the management terminal.
The server further includes a software/hardware interface connected to the programmable logic unit, and the software/hardware interface is used to obtain the fault signal.
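The fault and fault-recovery reporting described above can be sketched as a management unit that compares each register reading with the previous one. This is a simplified model under assumptions: the `ManagementUnit` class, the polling approach, and the message tuples are hypothetical and only illustrate the state transitions.

```python
FAULT_STATE = 1
NORMAL_STATE = 0


class ManagementUnit:
    """Toy management unit that reports state changes read from the registers."""

    def __init__(self, registers):
        self.registers = registers          # shared view of the logic unit's registers
        self.last_seen = dict(registers)    # snapshot of the previous reading
        self.messages = []                  # fault / fault-recovery information sent out

    def poll(self):
        for hardware, state in self.registers.items():
            previous = self.last_seen[hardware]
            if previous == NORMAL_STATE and state == FAULT_STATE:
                self.messages.append(("fault", hardware))
            elif previous == FAULT_STATE and state == NORMAL_STATE:
                self.messages.append(("fault_recovered", hardware))
            self.last_seen[hardware] = state
        return self.messages


registers = {"hard_disk": NORMAL_STATE}
bmc = ManagementUnit(registers)

registers["hard_disk"] = FAULT_STATE   # simulated fault signal applied
bmc.poll()
registers["hard_disk"] = NORMAL_STATE  # simulated fault recovery signal applied
bmc.poll()
```

In this model the recovery message is produced by the same read path as the fault message, so no manual repair step is involved.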
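Leaving the test mode and restoring the registers can be sketched as below. The factory-default values, the `Cpld` class, and its method names are assumptions made for illustration; the application does not state what the factory-default state contains.

```python
# Hypothetical factory-default register contents: every component reads as normal.
FACTORY_DEFAULTS = {"cpu": 0, "memory": 0, "fan": 0}


class Cpld:
    """Toy model of switching between test mode and running mode."""

    def __init__(self):
        self.mode = "run"
        self.registers = dict(FACTORY_DEFAULTS)

    def enter_test_mode(self):
        self.mode = "test"

    def inject_fault(self, hardware):
        if self.mode != "test":
            raise RuntimeError("simulated faults are accepted only in test mode")
        self.registers[hardware] = 1

    def enter_run_mode(self):
        # Switching back to running mode discards every simulated state.
        self.mode = "run"
        self.registers = dict(FACTORY_DEFAULTS)


cpld = Cpld()
cpld.enter_test_mode()
cpld.inject_fault("fan")
state_during_test = dict(cpld.registers)
cpld.enter_run_mode()
```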
In a second aspect, embodiments of the present application provide a server comprising a programmable logic unit and a management unit, the programmable logic unit being connected to the management unit, the server being configured to perform a method as described in the first aspect or any one of the possible implementations.
In a third aspect, the present application provides an electronic device, comprising: at least one memory for storing a program; at least one processor for executing programs stored in the memory; wherein the processor is adapted to perform the method described in the first aspect or any one of the possible implementations thereof, when the memory-stored program is executed.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which, when run on a processor, causes the processor to perform the method described in the first aspect or any one of the possible implementations thereof.
In a fifth aspect, the present application provides a computer program product, characterized in that the computer program product, when run on a processor, causes the processor to perform the method described in the first aspect or any one of the possible implementations thereof.
In a sixth aspect, the present application provides a chip comprising at least one processor and an interface; the at least one processor obtains program instructions or data through the interface; and the at least one processor is configured to execute the program instructions to implement the method described in the first aspect or any one of the possible implementations thereof.
It will be appreciated that the advantages of the second to sixth aspects may be found in the relevant description of the first aspect, and are not described here again.
Drawings
FIG. 1 is a schematic diagram of a test scenario provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a management terminal according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a fault testing method provided in an embodiment of the present application;
FIG. 5 is a schematic illustration of a user interface provided by an embodiment of the present application;
FIG. 6 is a flow chart of a fault testing method in one specific example of the present application;
FIG. 7 is a schematic illustration of another user interface provided by an embodiment of the present application;
FIG. 8 is a schematic flow chart of a fault testing method according to another embodiment of the present application;
FIG. 9 is a schematic illustration of yet another user interface provided by an embodiment of the present application;
FIG. 10 is a schematic illustration of yet another user interface provided by an embodiment of the present application;
FIG. 11 is a flow chart of a fault testing method in an example of yet another embodiment of the present application;
FIG. 12 is a schematic structural diagram of a fault testing device according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The term "and/or" herein describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. The symbol "/" herein indicates an "or" relationship between the associated objects; for example, "A/B" means A or B.
The terms "first" and "second" and the like in the description and in the claims are used for distinguishing between different objects and not for describing a particular sequential order of objects. For example, the first user interface and the second user interface, etc., are used to distinguish between different user interfaces, rather than to describe a particular order of user interfaces.
In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise specified, the meaning of "a plurality of" means two or more, for example, a plurality of processing units means two or more processing units and the like; the plurality of elements means two or more elements and the like.
In order to facilitate understanding of the technical solutions of the embodiments of the present application, technical terms referred to herein are explained below.
BMC (baseboard management controller): a controller that implements a series of monitoring and control functions for system hardware. For example, it monitors the temperature, voltage, fans, and power supplies of the system and makes corresponding adjustments to keep the system healthy, and it can record information and logs for the various hardware components in order to alert the user and support later fault localization. The BMC is a stand-alone system that may have its own power module on the computer and may communicate with other hardware on the computer (e.g., CPU, memory) over a physical channel.
CPLD (complex programmable logic device): a digital integrated circuit whose logic functions are constructed by the user as needed; the corresponding logic function units can be implemented on the CPLD through programmable I/O, registers, and the like.
The server's ability to detect faults in its integrated hardware can be implemented by the BMC integrated on the server. Testing the fault detection performance of the server (for example, whether the capability exists, and how accurate and timely it is) therefore amounts to verifying the BMC's performance in detecting faults of hardware components. One approach is to perform a hardware fault injection test (HFIT) on a new board, that is, to verify the detection performance of the BMC by manually constructing a real hardware fault on the board. For example, a tester pulls out a hardware component to construct a not-in-place fault for that component; if the tester then sees the BMC report the corresponding fault information, the BMC is judged capable of detecting that fault, and its ability to detect other faults is tested in the same way. However, this approach is costly and inefficient: every test requires a fault to be constructed by hand, the process cannot be automated, and its reliability is low.
To improve the reliability of testing the fault detection performance of the server and to reduce test cost, the embodiments of the present application provide a fault testing method. In this method, a management terminal triggers a test mode in the fault detection device of a computing device (such as a server) and, in that mode, sends the computing device a simulated signal representing a hardware fault of the computing device. The ability of the fault detection device to detect and raise an alarm for such fault signals is thereby verified, and the fault detection performance of the computing device is tested. The test is automated and no real hardware fault needs to be constructed manually, which reduces test cost and improves test efficiency.
In order to facilitate understanding of the technical solutions of the present application, a scenario of testing the fault detection performance of the computing device in the embodiments of the present application is first described below.
By way of example, FIG. 1 is a schematic diagram of a fault detection performance test for a computing device. The testing method provided in this embodiment is mainly used to test the software performance of the hardware fault detection device 22 of the computing device 20 when the hardware under test 21 (such as a processor, a memory, a fan, or a hard disk) on the board of the computing device 20 is stable. To this end, as shown in FIG. 1, the fault detection performance of the computing device 20 can be tested through the management terminal 10.
The management terminal 10 can provide a test function for the fault detection performance of the computing device 20 by having test software installed on it. The management terminal 10 can also be connected to input/output devices such as a display 101, a keyboard 102, and a mouse 103, so that when the test software runs, the graphical user interface (GUI) it provides is shown on the display 101, and instructions or information are entered through that interface using the keyboard 102, the mouse 103, and the like.
In this embodiment, the management terminal 10 and the computing device 20 are communicatively connected. By entering instructions or information through the test software on the management terminal 10, the user can simulate a fault signal of the hardware under test 21 on the computing device 20 and send the simulated hardware fault signal to the fault detection device 22, so as to test the detection performance of the fault detection device 22. By way of example and not limitation, the fault detection device 22 of the computing device may include a baseboard management controller (BMC), a complex programmable logic device (CPLD), and the like.
Next, a detailed description will be given of a management terminal and a computing device provided in the embodiments of the present application, respectively.
FIG. 2 is a schematic structural diagram of a management terminal according to an embodiment of the present application. The management terminal may be a personal computer (PC), a server, a super terminal, or the like, but is not limited thereto. As shown in FIG. 2, the management terminal 10 may include: a processor 110, a memory 120, an input/output interface 130, and a network interface 140. The processor 110, the network interface 140, the input/output interface 130, and the memory 120 may be connected by a bus or by other means.
In this embodiment, the processor 110 (also referred to as a central processing unit, CPU) is the computing core and control core of the management terminal 10. In some embodiments, the processor 110 may perform some or all of the steps of the methods provided in the present embodiments.
The memory 120 is a storage device of the management terminal 10 used to store programs and data. The memory 120 in this example may be a high-speed random access memory (RAM) or a non-volatile memory, such as at least one disk memory. The memory 120 provides storage space that holds the operating system and executable program code of the management terminal 10; the operating system may include, but is not limited to, Windows, Linux, or HarmonyOS (Hongmeng). In addition, the memory 120 may store a program 121 for testing the fault detection performance of the computing device. The test software program 121 is the computer-executable file or instruction set produced by compiling and packaging computer program code; when executed, it implements at least some steps of the test method described below, in order to test the reliability of the fault detection performance of the computing device 20.
The input/output interface 130 may connect output devices such as the display 101 and input devices such as the keyboard 102 and the mouse 103 to the processor 110. In this way, external instructions and information (i.e., input information) can be obtained through input devices such as the keyboard 102 and the mouse 103 and passed to the processor 110. The processor 110 may process the input information to generate output information, which may be stored temporarily or permanently in the memory 120, displayed on the display 101 for the user, or transmitted to an external device (e.g., the computing device 20 shown in FIG. 1).
The network interface 140 may include a wired interface or a wireless interface (e.g., Wi-Fi or a mobile communication interface). It is controlled by the processor 110 and is used to send and receive data, for example to transmit local data to the computing device 20, or to receive data transmitted by the computing device 20 and pass it to the processor 110.
In this embodiment, the processor 110 may run the test software program 121 stored in the memory 120, and display a user interface provided by the program on the display 101, where the user interface may include characters, symbols, graphics, icons, controls, and the like, to display certain information and receive instructions and information input by a user.
By way of example, FIG. 3 shows a schematic structural diagram of a computing device according to an embodiment of the present application. The computing device may be any device capable of providing data processing, computing, or storage functions, including but not limited to a PC, a hardware server, or a cloud server. As shown in FIG. 3, the computing device 20 may include a plurality of hardware under test 21 (i.e., the hardware resources of the device), together with a management unit 221 and a programmable logic unit 222 that are capable of detecting the state of the hardware 21. The management unit 221 and the programmable logic unit 222 serve as the fault detection device for the hardware 21 and may be communicatively and/or electrically connected to the hardware under test 21 through a physical channel.
By way of example, the hardware resources of computing device 20 may include a processor, memory, a communication interface, which may be connected via a bus to accomplish communication with each other, and a power supply, which may be used to power all of the hardware resources of computing device 20.
The processor of the computing device 20 may include various processing devices, such as a central processing unit (CPU), a system on chip (SoC), a processor integrated on an SoC, or a separate processor chip or controller. The processor 210 may also include special-purpose processing devices such as an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or a digital signal processor (DSP). The processor 210 may be a processor group of multiple processors coupled to each other by one or more buses. As an example, the processor may include the CPU211 shown in FIG. 3.
The memory of the computing device 20 may be coupled to the processor of the computing device 20 through one or more memory controllers. The memory may be used to store computer program instructions, including a computer operating system (OS) and various programs. The memory may include non-volatile memory, which retains its contents when powered down, such as an embedded multimedia card (eMMC), universal flash storage (UFS), or read-only memory (ROM); volatile memory, which loses its contents when powered down, such as random access memory (RAM) or other types of dynamic storage devices that can store information and instructions; or electrically erasable programmable read-only memory, magnetic disk storage media, or any other computer-readable storage medium that can carry or store program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. By way of example, the storage may include the memory 212 and the hard disk 213 shown in FIG. 3.
The communication interface of the computing device 20 may be used to implement communication between the modules, apparatuses, units, and/or devices in the embodiments of the present application. By way of example, the communication interface of the computing device 20 may be connected to the network card 215 shown in FIG. 3 to enable communication between the computing device 20 and the management terminal 10.
In addition, a fan 214 may be included on the computing device 20 to provide heat dissipation capabilities. A power supply 216 may also be included to power at least the hardware 21 of the computing device.
In this embodiment, the computing device 20 may also include a management unit 221 and a programmable logic unit 222. The programmable logic unit 222 may be a CPLD or another suitable digital integrated circuit with logic functions; the following embodiments take the programmable logic unit 222 being a CPLD as an example. Specifically, the CPLD222 includes a plurality of registers A1 to An, which can temporarily store calculation data or operation results during operation of the computing device 20 and output them to the relevant devices. As an example, the CPLD222 may receive a fault signal reported by the hardware under test 21 and perform a corresponding set operation on the registers A1 to An according to the fault signal. In addition, a test module 2221 is built into the CPLD222. The test module 2221 supports the CPLD222 in entering a test mode, in which the CPLD222 may obtain, through the management unit 221, a hardware fault signal simulated by the management terminal 10 and perform a corresponding set operation on the registers A1 to An.
The management unit 221 may be referred to as an out-of-band management module. For example, the management unit 221 may remotely maintain and manage the computing device through a dedicated data channel, may obtain the operating state of the hardware 21 (such as whether a fault has occurred, its voltage, or its temperature), and may report alarm information when the hardware 21 is in a fault state. By way of example, the management unit 221 may include a management system in a management chip outside the processor, or a baseboard management controller (BMC) of the computing device, but is not limited thereto. The embodiments of the present application do not limit the specific form of the management unit; the forms named here are merely examples. The following embodiments take the management unit 221 being a BMC as an example.
As an example, as shown in FIG. 3, the management unit 221 may include a control module 2211 and a detection module 2212. The control module 2211 may receive an instruction or information from the management terminal 10 and perform a corresponding control operation on the programmable logic unit 222. The detection module 2212 may obtain the values in the registers A1 to An of the CPLD222 and perform hardware fault diagnosis, identification, and alarming. It is understood that the control module 2211 and the detection module 2212 may be software, hardware, or a combination of software and hardware.
In this embodiment, with reference to FIGS. 1 to 3, a user may use devices such as the keyboard 102 and the mouse 103 to enter, through the user interface shown on the display 101, instructions (including an instruction that directs the CPLD of the computing device 20 to enter the test mode) or information about a hardware fault to be configured, such as a temperature abnormality fault of the CPU or a voltage abnormality fault of the fan. Specifically, after the management terminal 10 receives the instruction directing the CPLD222 to enter the test mode, it may transmit the instruction to the computing device 20 through the network interface 140. After the BMC221 of the computing device 20 receives the instruction, it controls the CPLD222 to run the test module 2221 and enter the test mode. At this point, when the management terminal 10 receives the information about the hardware fault to be configured, it may generate a corresponding simulated fault signal and send it to the computing device 20 through the network interface 140. The signal is then passed to the CPLD222 over the information channel between the BMC221 and the CPLD222, and the CPLD222 performs a write operation on the corresponding register. The detection module 2212 of the BMC221 may then read the values in these registers, thereby performing fault diagnosis, and generate fault information (which may also be referred to as "alarm information") to report to the management terminal 10.
If the fault information received by the management terminal 10 is consistent with the fault represented by the simulation fault signal sent at this time, it can be verified that the BMC221 and the CPLD222 of the computing device 20 can realize fault detection and alarm of hardware, so as to realize automatic test of the hardware fault detection performance of the computing device, without constructing a real fault manually, thereby being beneficial to reducing test cost and improving test efficiency.
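The consistency check performed on the management terminal can be sketched as a comparison between the injected faults and the reported fault information. The function name, the `(hardware, fault_type)` tuple format, and the result dictionary are hypothetical; the application states only that the two must be consistent.

```python
def verify_detection(injected, reported):
    """Compare simulated faults sent by the terminal with fault info reported back.

    Both arguments are iterables of (hardware, fault_type) tuples.
    """
    injected_set = set(injected)
    reported_set = set(reported)
    missed = injected_set - reported_set    # injected but never reported
    spurious = reported_set - injected_set  # reported but never injected
    return {
        "passed": not missed and not spurious,
        "missed": sorted(missed),
        "spurious": sorted(spurious),
    }


injected = [("cpu", "temperature_abnormal"), ("fan", "voltage_abnormal")]
reported = [("cpu", "temperature_abnormal")]
result = verify_detection(injected, reported)
```

Here the fan fault was injected but not reported, so the check fails and lists it under `missed`. Timeliness could be checked in a similar way by recording timestamps, which this sketch omits.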
In addition, in the normal running mode (non-test mode) of the CPLD222, the CPLD222 may receive a fault signal (real fault) reported by the hardware 21 and perform a set operation on the corresponding registers A1 to An, and then the detection module 2212 of the BMC221 may read the values in these registers, so as to perform fault diagnosis, generate fault alarm information and report, so as to implement fault diagnosis in the running process of the hardware 21 in the computing device.
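The relationship between the two modes can be sketched as two input paths that write the same registers, so that the detection path exercised in test mode is the one used for real faults. The class and function names are hypothetical, and rejecting simulated signals outside test mode is an assumption of this sketch.

```python
class Cpld:
    """Toy model with two input paths that end at the same registers."""

    def __init__(self, hardware):
        self.test_mode = False
        self.registers = {hw: 0 for hw in hardware}

    def on_hardware_signal(self, hw):
        """Running-mode path: a real fault reported by the hardware itself."""
        self.registers[hw] = 1

    def on_simulated_signal(self, hw):
        """Test-mode path: a fault signal simulated by the management terminal."""
        if not self.test_mode:
            raise RuntimeError("simulated signals are accepted only in test mode")
        self.registers[hw] = 1


def detect_faults(cpld):
    """Detection logic reads the registers and does not depend on the signal source."""
    return sorted(hw for hw, state in cpld.registers.items() if state == 1)


real = Cpld(["cpu", "fan"])
real.on_hardware_signal("fan")

simulated = Cpld(["cpu", "fan"])
simulated.test_mode = True
simulated.on_simulated_signal("fan")
```

Both instances yield the same detection result, which is why a simulated signal can stand in for a real fault when testing the detection and alarm path.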
It should be understood that the illustrated structure of the embodiments of the present application does not constitute a particular limitation on the management terminal 10 or the computing device 20. In other embodiments of the present application, the management terminal 10 or the computing device 20 may each include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Next, a fault test method provided in the embodiment of the present application is described based on the above description. It will be appreciated that the method is set forth based on what has been described above, some or all of which may be found in the description above.
Referring to fig. 4, fig. 4 is a flow chart of a fault testing method according to an embodiment of the present application. It will be appreciated that the method may be implemented by the computing device of fig. 1 or 2 described above, or may be performed by other suitable computing, processing capable apparatus, devices, platforms, clusters of devices, etc. As shown in fig. 4, the fault testing method may include:
S401, the management terminal acquires a first instruction, where the first instruction is used to instruct the fault detection device of the computing device to enter a test mode.
In this embodiment, a test module may be configured in a fault detection device of a computing device, where the test module is configured to support the fault detection device to enter a test mode, and in the test mode, the computing device may receive, through a software/hardware interface, a hardware fault signal simulated by a management terminal, so as to detect and report an alarm of the fault signal. As an example, the computing device may be the computing device 20 shown in fig. 1 or fig. 3, the management terminal may be the management terminal 10 shown in fig. 1 or fig. 3, the fault detection device may include the management unit 221 and the logic unit 222 shown in fig. 3, and the test module may be the test module 2221 shown in fig. 3, which are described below with reference to the management terminal 10, the computing device 20, and the BMC221 and the CPLD222 for convenience of understanding.
For example, as shown in connection with fig. 1-3, after the management terminal 10 and the computing device 20 are powered on and initialized, the step S401 described above may be performed by the management terminal 10 before the test module 2221 is triggered to enter the test mode. Specifically, the management terminal 10 runs the test software program 121 and displays the first user interface. The user may input a first instruction based on the first user interface, and after the terminal 10 obtains the first instruction, the first instruction is sent to the computing device 20, so as to trigger the CPLD222 to enter the test mode through the BMC 221.
Illustratively, the user interface displayed on the terminal 10 provides a software interface for obtaining instructions from a user. For example, the interface may have controls such as a menu (MenuStrip), buttons (Button), and check boxes (CheckBox) to support input operations by the user. For example, as shown in fig. 5, a button 151 is provided on the first user interface 150, and a first instruction is generated when the user clicks the button 151 with the mouse 103.
S402, the computing equipment receives a first instruction sent by the management terminal and controls the fault detection device to enter a test mode.
In this embodiment, as shown with reference to fig. 6, the CPLD222 may default to the run mode after the computing device 20 is powered up. In the run mode, the CPLD222 may receive the real fault signals reported by each piece of hardware to be tested (including, but not limited to, the CPU211, the memory 212, the hard disk 213, the fan 214, the network card 215, etc.) on the computing device 20, and record the fault state of the hardware to be tested 21 in the corresponding registers A1 to An for the BMC221 to read. The BMC221 reads the real fault state from the registers A1 to An and identifies it, so as to generate corresponding alarm information and report it for the user's information. It will be appreciated that the alarm information may be reported to a management device of the computing device 20 (the management device may be the management terminal 10 described above or another management server).
When the CPLD222 is in the run mode, if the computing device 20 receives a first instruction, then the BMC221 may control the CPLD222 to execute the test module 2221 to enter the test mode according to the instruction. In test mode, CPLD222 may control the read and write states of registers A1-An to be set to a read-write (RW) mode, so that the corresponding registers may be read and written.
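The mode switch in S402 can be pictured with a small software model. This is only a sketch, not the CPLD logic of the embodiment: the class name `CpldModel`, the `Mode` enum, and the rule that injected writes are rejected outside the test mode are illustrative assumptions. The description itself states only that registers A1 to An are set to read-write (RW) in the test mode.

```python
from enum import Enum


class Mode(Enum):
    RUN = "run"    # default after power-up: registers track real hardware signals
    TEST = "test"  # registers are read-write so simulated faults can be injected


class CpldModel:
    """Hypothetical software stand-in for the CPLD222 register file."""

    def __init__(self, register_names):
        self.mode = Mode.RUN
        self.registers = {name: 0 for name in register_names}

    def enter_test_mode(self):
        # Corresponds to the BMC acting on the first instruction.
        self.mode = Mode.TEST

    def write(self, name, value):
        # Assumption: injected writes are accepted only in test mode.
        if self.mode is not Mode.TEST:
            raise PermissionError("register writes require test mode")
        self.registers[name] = value

    def read(self, name):
        return self.registers[name]


cpld = CpldModel(["A1", "A2", "A3"])
cpld.enter_test_mode()
cpld.write("A1", 1)
```

Rejecting writes in the run mode is one possible way to keep simulated faults from being mistaken for real ones; an actual design may gate access differently.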
S403, the management terminal acquires a second instruction, and the second instruction is used for constructing the fault type of hardware to be tested on the computing equipment so as to generate a corresponding fault signal.
In this embodiment, based on the second user interface displayed when the terminal 10 runs the test software program 121, the user may input the hardware object to be tested and the fault type to be constructed, so as to generate a corresponding fault signal; that is, the fault signal may be used to simulate a fault of the hardware to be tested. The terminal 10 also stores information on the currently constructed fault (hereinafter also referred to as "construction failure information"). The fault types may include, but are not limited to, not powered on, not in place (i.e., hardware not connected), temperature abnormality (e.g., overheating), voltage abnormality, power abnormality, and the like. In some examples, the second user interface may be an interface reached by jumping from the first user interface described above.
In some possible implementations, after the CPLD222 is triggered to enter the test mode, the second user interface displayed on the terminal 10 may be as shown in fig. 7. As shown in fig. 7 (7a), the second user interface 152a includes a test object (hardware to be tested) menu 153a and a fault type menu 154a, and the user clicks these menus to make a selection; as shown in fig. 7 (7b), one hardware object and one fault type may be selected from the menu 153a and the menu 154a, respectively. For example, after the "memory" object is selected from the menu 153a, the "not in place" option is selected from the menu 154a, and the "test" button 155a is clicked, the terminal 10 obtains a second instruction, according to which the processor 110 of the terminal 10 may generate a fault signal that characterizes a fault in which the memory of the computing device 20 is not in place.
For example, a fault identifier may be carried in the fault signal, where the fault identifier represents the fault state of the hardware to be tested. For instance, the identifier of the "not in place" fault of the memory is set to "1", so that the identifier "1" can be written into a designated register to indicate that the memory is in the fault state. It may be understood that if the fault identifier of a piece of hardware is "1", the default value of the corresponding register (such as the value "0") may serve as the identifier indicating that the hardware is in the normal state (hereinafter also referred to as the "normal identifier").
Thus, through the second user interface, the user can simulate various types of faults occurring in various hardware on the computing device, and signals of the various types of faults occurring in the hardware are constructed, so that the detection performance of the computing device on the various faults can be tested efficiently.
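The construction of a fault signal from the user's selection can be sketched as follows. The `FaultSignal` dataclass, the `build_fault_signal` helper, and the string keys for the fault types are hypothetical names chosen for illustration; only the fault types themselves and the "1"/"0" identifier convention come from the description above.

```python
from dataclasses import dataclass

FAULT_ID = 1   # fault identifier, as in the "1" example above
NORMAL_ID = 0  # default register value, used as the normal identifier

# Fault types named in the description; the string keys are hypothetical.
FAULT_TYPES = {"not_powered", "not_in_place", "temperature_abnormal",
               "voltage_abnormal", "power_abnormal"}


@dataclass(frozen=True)
class FaultSignal:
    hardware: str    # e.g. "memory"
    fault_type: str  # one of FAULT_TYPES
    identifier: int = FAULT_ID


def build_fault_signal(hardware, fault_type):
    """Turn a user selection (second instruction) into a simulated fault signal."""
    if fault_type not in FAULT_TYPES:
        raise ValueError(f"unknown fault type: {fault_type}")
    return FaultSignal(hardware, fault_type)


# Mirrors the fig. 7 example: memory selected, "not in place" chosen.
signal = build_fault_signal("memory", "not_in_place")
```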
S404, acquiring a fault signal under the condition that the programmable logic unit of the computing device is in a test mode.
In this embodiment, the terminal 10 may transmit the generated fault signal to the CPLD222 via the information channel interface of the BMC221 to the CPLD222. Alternatively, the fault signal is transmitted to the CPLD222 via a software and hardware interface. Alternatively, the fault signal is transmitted to the CPLD222 via the in-band processor of the computing device, and so on.
S405, according to the fault signal, setting the state of the hardware to be tested recorded in the programmable logic unit as a fault state.
In this embodiment, each of the registers A1 to An may be predefined and used to temporarily store status data of different hardware and/or different types of faults, so that when the actual fault signal or the simulated fault signal is transmitted to the CPLD222, the corresponding register may be set to record the corresponding fault status. For example, the default value of the register may be set to "0", and when the simulated fault signal is received and the fault flag "1" is included in the fault signal, the value of the register may be rewritten to "1", that is, the setting is completed once, and the state of the corresponding hardware 21 is recorded as the fault state. In some examples, a reset may be made to a "0" when the value on the register is read away.
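A register that is set on a fault signal and optionally reset when it is read can be modeled in a few lines. The class `FaultRegister` and its `clear_on_read` flag are assumptions made for this sketch; the description says only that a reset to "0" may occur in some examples once the value has been read away.

```python
class FaultRegister:
    """Hypothetical one-value status register with optional clear-on-read."""

    def __init__(self, default=0, clear_on_read=True):
        self.default = default
        self.value = default
        self.clear_on_read = clear_on_read

    def set_fault(self, identifier=1):
        # Setting: overwrite the default value with the fault identifier.
        self.value = identifier

    def read(self):
        value = self.value
        if self.clear_on_read:
            self.value = self.default  # reset once the value has been read away
        return value


a1 = FaultRegister()
a1.set_fault()
first = a1.read()   # the fault identifier written above
second = a1.read()  # the default value after the clear-on-read reset
```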
The following gives an example of a configuration of a plurality of registers on the CPLD222, and it should be understood that, in this embodiment, the registers on the CPLD222 may also perform configuration of other temporary fault information functions according to an actual test scenario, which is not enumerated one by one.
TABLE 1
[Table 1 is an image in the original publication: example configuration of registers A1 to A5 on the CPLD222]
Referring to the example shown in table 1, the CPLD222 has a plurality of registers A1 to A5. After the registers are configured according to table 1, they can be set according to a real fault signal reported by the hardware 21 or a fault signal simulated by the terminal 10, so that the set valid value represents the corresponding event. For example, with the CPLD222 in the run mode, if the CPU211 of the current device reports a temperature abnormality fault signal, the value of the register A1 may be rewritten from "0" to "1" according to the configuration of table 1, that is, the fault identifier is written. With the CPLD222 in the test mode, if the simulated fault signal S1 is received, where the signal S1 indicates that the CPU has a temperature abnormality, the test module 2221 of the CPLD222 rewrites the value of the register A1 from "0" to the fault identifier "1", thereby recording the fault state of the CPU; after the BMC221 subsequently reads the value, the value of the register A1 can be reset to "0", that is, restored to the normal identifier.
Thus, the terminal 10 constructs various types of faults for various hardware; if, for each of these different faults on different hardware, the corresponding fault state (i.e., the valid value of the register) is correctly recorded in the register of the CPLD222, the CPLD222 can be said to perform reliably in the fault detection process. Whether the registers of the CPLD222 record the correct fault states can be verified by the BMC221 subsequently reading the values of these registers.
S406, the management unit reads the fault state in the programmable logic unit and sends out fault information according to the fault state, wherein the fault information is used for indicating that hardware to be tested breaks down.
In this embodiment, the BMC221 may read data from the register of the CPLD222 in a polling manner, and the detection module 2212 of the BMC221 may identify a corresponding hardware fault according to the read register value (fault state), so as to generate fault information, so as to send the fault information to the terminal 10 for performing fault alarm. It is understood that the detection module 2212 may be implemented according to a predefined rule or table when diagnosing and identifying hardware faults, for example, but not limited to, a mapping table between register values and hardware faults.
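The polling and table lookup described here might look like the sketch below. The mapping table `REGISTER_FAULT_MAP` and the function `poll_registers` are hypothetical; only the A1 entry follows the CPU temperature example given for table 1, and the other entries are invented for illustration because the table itself is not reproduced in the text.

```python
# Hypothetical mapping table: register name -> (hardware, fault type).
# Only the A1 entry mirrors the example in the text; the rest are made up.
REGISTER_FAULT_MAP = {
    "A1": ("CPU", "temperature_abnormal"),
    "A2": ("memory", "not_in_place"),
    "A3": ("fan", "voltage_abnormal"),
}


def poll_registers(registers, fault_map=REGISTER_FAULT_MAP, fault_id=1):
    """One polling pass: return a fault record for every register holding fault_id."""
    faults = []
    for name, value in registers.items():
        if value == fault_id and name in fault_map:
            hardware, fault_type = fault_map[name]
            faults.append({"register": name, "hardware": hardware,
                           "fault_type": fault_type})
    return faults


detected = poll_registers({"A1": 1, "A2": 0, "A3": 1})
```

A real management unit would repeat such a pass on a timer; the single pass keeps the lookup logic easy to follow.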
Illustratively, when the detection module 2212 generates fault information according to the identified fault, the fault information may be specifically implemented by modifying a configuration file, where the configuration file may be used to describe detailed information of the hardware fault, including, but not limited to, a hardware name (or identity ID number), a fault type, and the like. The configuration file may be an XML format file, or may also be an INI format, a JSON format, or the like, which is not limited in this example.
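As one illustration of the JSON option mentioned above, fault information could be serialized as follows. The field names `hardware`, `fault_type`, and `hardware_id` and the helper `make_fault_info` are assumptions for this sketch; the description does not define a schema.

```python
import json


def make_fault_info(hardware, fault_type, hardware_id=None):
    """Serialize one detected fault as a JSON document (field names are hypothetical)."""
    record = {"hardware": hardware, "fault_type": fault_type}
    if hardware_id is not None:
        record["hardware_id"] = hardware_id
    return json.dumps(record, sort_keys=True)


fault_info = make_fault_info("CPU", "temperature_abnormal", hardware_id="CPU1")
parsed = json.loads(fault_info)
```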
And S407, the management unit reports the fault information to the management terminal.
In this embodiment, the BMC221 may report the fault information to the management terminal 10 and display it on its display 101, informing the user of the current hardware fault.
And S408, the management terminal verifies the fault detection performance of the computing equipment according to the fault information.
In this embodiment, the management terminal 10 stores the construction failure information when it generates the fault signal. After the management terminal 10 receives the fault information from the BMC221, it may retrieve the construction failure information and compare it with the fault information. If the faults represented by the two are consistent, this shows that the CPLD222 and the BMC221 on the current computing device 20 detect the fault reliably, that is, the fault can be accurately detected. In this way, the test process can fully cover the capability of the CPLD222 to detect various fault signals and the capability of the BMC221 to detect and alarm on various faults. In particular, in a test scenario after the CPLD and/or BMC version is updated, it can be verified comprehensively, conveniently, and efficiently whether the updated logic and software of these fault detection devices detect all faults reliably, which avoids missed or false fault detection once the updated devices are put into operation.
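The consistency check in S408 can be sketched as a set comparison. The function `verify_detection` and the representation of each fault as a `(hardware, fault_type)` pair are assumptions for illustration; the description requires only that the stored constructed fault and the reported fault be compared.

```python
def verify_detection(constructed, reported):
    """Compare constructed faults with reported faults.

    Both arguments are iterables of (hardware, fault_type) pairs.
    Returns (passed, missed, spurious).
    """
    constructed_set = set(constructed)
    reported_set = set(reported)
    missed = constructed_set - reported_set    # constructed but never reported
    spurious = reported_set - constructed_set  # reported but never constructed
    return (not missed and not spurious), missed, spurious


ok, missed, spurious = verify_detection(
    [("CPU", "temperature_abnormal")],
    [("CPU", "temperature_abnormal")],
)
bad, bad_missed, bad_spurious = verify_detection(
    [("fan", "voltage_abnormal")],
    [("CPU", "voltage_abnormal")],
)
```

Returning the missed and spurious sets separately distinguishes a missed detection from a false detection, the two failure cases named above.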
For example, the management terminal 10 may also start timing after receiving the second instruction, generating the simulated fault signal or sending the fault signal, and end when receiving the corresponding fault information, and test the timeliness of the fault detection by calculating the time length of the timing. If the time length of the timing is within the preset threshold range, a timely test result can be obtained, otherwise, the fact that the fault detection of the computing equipment is not timely is indicated, and the reliability of the fault detection performance is low.
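A minimal timing sketch for the timeliness test follows. The callables `send_signal` and `wait_for_fault_info` are hypothetical hooks standing in for the terminal's transport layer, and the preset threshold range is simplified here to a single upper bound in seconds.

```python
import time


def measure_detection(send_signal, wait_for_fault_info, threshold_s):
    """Time one inject-and-report cycle and compare it with a threshold.

    send_signal and wait_for_fault_info are caller-supplied callables
    (hypothetical hooks for the terminal's transport layer).
    """
    start = time.monotonic()
    send_signal()
    info = wait_for_fault_info()
    elapsed = time.monotonic() - start
    return {"info": info, "elapsed_s": elapsed, "timely": elapsed <= threshold_s}


# Stub hooks that return immediately, so the cycle is trivially timely.
result = measure_detection(
    send_signal=lambda: None,
    wait_for_fault_info=lambda: "CPU temperature_abnormal",
    threshold_s=5.0,
)
```

`time.monotonic()` is used rather than wall-clock time so that clock adjustments during the test do not distort the measured duration.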
Therefore, through steps S401 to S408, the user needs to trigger the test only once or twice for the fault detection performance test of the computing device to complete automatically, so the manual effort is small and the test efficiency is high. Compared with manually constructing a real fault, the method of this embodiment is more convenient, places lower professional demands on testers, is suitable for general-purpose testing of various versions of fault detection devices, and offers a high degree of test automation.
By way of example, referring again to FIG. 4, after the test is completed, the method may further comprise:
S409, obtaining a third instruction to generate a fault recovery signal, where the fault recovery signal is used to simulate recovery from the fault of the hardware to be tested in the computing device.
In this embodiment, after the test of the fault detection performance is completed, the terminal may receive an instruction input by the user (i.e., a third instruction) through a third user interface, generate a fault recovery signal, and trigger a reset operation on the fault detection device of the computing device 20. The third user interface may jump from the second user interface, but is not limited thereto. The third instruction may be obtained in a manner similar to the first instruction, which is not described again here.
S410, the computing device acquires a fault recovery signal, and sets the state of the hardware to be tested recorded in the programmable logic unit as a normal state according to the fault recovery signal.
In this embodiment, the fault recovery signal may include a normal identifier, where the normal identifier is used to indicate that the hardware to be tested is in a normal state. The terminal 10 may send the fault recovery signal to the CPLD222 via the channel from the BMC221 to the CPLD222 to control the CPLD222 to restore factory settings, that is, to cause the CPLD222 to write the normal identifier (e.g. the value "0") into the register corresponding to the hardware to be tested, so that each register is restored to its default value, the recorded state of the hardware to be tested 21 is set back to the normal state, and the CPLD222 exits the test mode. The CPLD222 may then enter the run mode to monitor for real faults of the hardware 21. Alternatively, the fault recovery signal is transmitted to the CPLD222 via a software and hardware interface. Alternatively, the fault recovery signal may be transmitted to the CPLD222 via an in-band processor of the computing device, and so on.
S411, after the management unit reads from the programmable logic unit that the hardware to be tested is in a normal state, the management unit sends out fault recovery information, where the fault recovery information is used to indicate that the hardware to be tested has recovered from the fault;
S412, the management unit sends the fault recovery information to the management terminal.
In this embodiment, after the CPLD222 exits the test mode, the BMC221 polls each register and reads the normal identifier in the register, that is, determines that the hardware to be tested is in a normal state. It may then generate corresponding fault recovery information by filling in a configuration file and send it out, so as to inform the management terminal and the like that the CPLD222 has restored factory settings and that the hardware to be tested has recovered from the fault.
Therefore, after the test is completed, the management terminal can control the computing equipment to restore factory settings, compared with the manual construction of the real fault, the real fault does not need to be repaired manually, the professional requirements on testers are reduced, the labor input cost is reduced, and the test automation degree is improved.
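The recovery path of S409 to S412 can be sketched with two small helpers. The names `apply_recovery` and `all_normal`, the dictionary representation of the registers, and the string "run" for the mode are assumptions made for illustration only.

```python
NORMAL_ID = 0  # normal identifier, i.e. the default register value


def apply_recovery(registers, defaults=None):
    """Write the normal identifier back to every register and leave test mode.

    registers is a dict of register name -> value; returns the new mode name.
    """
    defaults = defaults or {}
    for name in registers:
        registers[name] = defaults.get(name, NORMAL_ID)
    return "run"


def all_normal(registers):
    """What the polling side checks before emitting fault recovery information."""
    return all(value == NORMAL_ID for value in registers.values())


regs = {"A1": 1, "A2": 0, "A3": 1}
mode = apply_recovery(regs)
recovered = all_normal(regs)
```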
In other possible implementations of the present embodiment, the batch configuration of hardware faults may be performed through the second user interface to improve the test efficiency.
Specifically, in the present implementation, the method may include:
S801, a management terminal acquires a first instruction, wherein the first instruction is used for indicating a fault detection device of computing equipment to enter a test mode;
S802, the computing equipment receives a first instruction sent by the management terminal and controls the fault detection device to enter a test mode.
In this implementation, steps S801 to S802 may refer to steps S401 to S402 in the above embodiment, and will not be described herein.
S803, the management terminal acquires a second instruction, wherein the second instruction is used for constructing the fault type of hardware to be tested on the computing equipment so as to generate a corresponding fault signal.
In this implementation, S803 differs from S403 described above in that the user can construct hardware faults in batches based on the second user interface.
Illustratively, a schematic diagram of another second user interface is shown in fig. 9. Referring to fig. 9 (9a), the second user interface 152b includes a plurality of test object check boxes 156 and at least one fault mode menu 154b. The fault mode menu 154b may include, but is not limited to, fault types such as not in place, not powered on, temperature abnormality, voltage abnormality, and power abnormality.
Next, as shown in fig. 9 (9b), the user may click these check boxes 156 to select one or more test objects, and may also select a fault type from the menu 154b. For example, when objects such as "CPU1", "memory", "hard disk 3", and "network card 2" are selected with the check boxes 156 and the "voltage abnormality" option is selected from the menu 154b, the terminal 10 acquires the second instruction, and the processor 110 of the terminal 10 generates the corresponding fault signal according to the second instruction. The fault signal may be used to characterize that the processor identified as "CPU1", the memory, the hard disk identified as "hard disk 3", and the network card identified as "network card 2" in the computing device 20 all have a voltage abnormality fault. In this way, hardware faults can be constructed in a user-defined batch in a single operation, which makes it convenient to test in one pass whether the fault detection device can detect faults of all these objects, thereby improving the efficiency of testing the fault detection device's detection performance for different hardware.
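Batch construction amounts to expanding one selection into one fault entry per selected object. The helper `build_batch_signals` and the dictionary fields are hypothetical; the hardware labels are taken from the fig. 9 example.

```python
def build_batch_signals(selected_hardware, fault_type, fault_id=1):
    """Expand one batch selection into one simulated fault entry per hardware item."""
    return [
        {"hardware": hw, "fault_type": fault_type, "identifier": fault_id}
        for hw in selected_hardware
    ]


# Mirrors the fig. 9 example: four objects checked, one fault type chosen.
batch = build_batch_signals(
    ["CPU1", "memory", "hard disk 3", "network card 2"], "voltage_abnormal"
)
```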
S804, under the condition that the programmable logic unit of the computing device is in a test mode, a fault signal is obtained, and then the computing device sets the state of the hardware to be tested recorded by the programmable logic unit as a fault state according to the fault signal.
In this implementation, S804 is different from S405 above in that S804 specifically may include:
S8041, when the fault signal is used for simulating faults of a plurality of hardware to be tested, setting the states of the hardware to be tested recorded in all registers corresponding to the plurality of hardware to be tested as fault states;
S8042, placing all of these registers in the queue to be operated.
In an example, after the CPLD222 of the computing device 20 receives a fault signal that currently simulates that a certain type of fault has occurred in the plurality of hardware 21, all corresponding registers may be set separately. The CPLD222 then places all registers involved in the currently constructed fault type into an operation queue. As shown in fig. 8, registers A1, A3, …, an are set, and then registers A1, A3, …, an (or the values of the registers) are placed in the operation queue.
S805, the management unit reads the fault state in the programmable logic unit and generates fault information according to the fault state;
And S806, the management unit reports the fault information to the management terminal.
In this implementation, S806 is different from S406 described above in that, when the BMC221 reads data, the values of the registers may be read from the operation queue one by one, so that when S806 is executed, the BMC221 generates and reports the corresponding fault information one by one. As in fig. 8, the operation queue includes registers A1, A3, …, An, and fault information T1, T2, …, Tk is generated corresponding to these registers.
For example, the BMC221 may read the value of one register to generate a fault information for reporting, or may read the values of all registers at one time, and then generate corresponding fault information one by one for reporting together, so as to improve the test efficiency.
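The operation queue of S8041 to S806 can be sketched with a double-ended queue. The first-in-first-out ordering, the function names `set_and_enqueue` and `drain_queue`, and the record layout are assumptions; the description states only that the registers are placed in a queue and read one by one.

```python
from collections import deque


def set_and_enqueue(registers, targets, fault_id=1):
    """Set every targeted register, then place the register names in a FIFO queue."""
    queue = deque()
    for name in targets:
        registers[name] = fault_id
        queue.append(name)
    return queue


def drain_queue(queue, registers):
    """Read queued registers one by one, producing one record per register."""
    reports = []
    while queue:
        name = queue.popleft()
        reports.append({"register": name, "value": registers[name]})
    return reports


regs = {"A1": 0, "A2": 0, "A3": 0}
pending = set_and_enqueue(regs, ["A1", "A3"])
reports = drain_queue(pending, regs)
```

Reading only the queued registers means the management unit does not have to scan registers that the current test never touched.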
S807, the management terminal verifies the fault detection performance of the computing device according to the fault information.
In this implementation, the processor 110 of the terminal 10 may compare the obtained fault information with the construction failure information previously recorded in the current test process, so as to verify that the faults are consistent and thereby test whether the detection result is accurate; it may also test timeliness by timing, and so on. For the verification process, refer to step S408 in the above embodiment, which is not described again here.
In other possible implementations, after the CPLD222 is triggered to enter the test mode, when a fault signal is generated, a fault of the tested hardware of the complete machine may also be configured through the second user interface, so as to implement default detection of all hardware. For example, a schematic diagram of yet another second user interface is shown in fig. 10, and referring to fig. 10 (10 a), the second user interface 152c includes at least one failure mode menu 154c thereon, the failure modes including not in place, not powered on, temperature anomalies, voltage anomalies, power anomalies, and so forth. As shown in fig. 10 (10 b), the user may then select one of the failure modes on menu 154 c. For example, after selecting the "voltage abnormality" option in the menu 154c, clicking the "test" button 155c, the terminal 10 obtains a second instruction, and according to the second instruction, the processor 110 of the terminal 10 may correspondingly generate a corresponding fault signal, where the fault signal may be used to characterize a fault that all the tested hardware in the computing device 20 has abnormal voltage. Therefore, fault construction operation can be carried out on the tested hardware of the whole machine by default at one time, and the testing efficiency is improved.
It can be understood that, in this implementation, after the fault signals representing faults of all hardware are obtained in the foregoing manner, the subsequent testing process may be completed according to steps S804 to S807 in the foregoing implementation, which is not described again here.
Therefore, all hardware faults can be constructed in batches by one-time triggering, so that the fault detection performance of the computing equipment, such as the detection capability, the alarm capability and the like of a certain fault of all hardware, is tested, and the accuracy and the timeliness of the detection and alarm capability are tested, so that more reliable test results are obtained on the fault detection performance of the computing equipment, the test efficiency is improved, and the automation of performance test is realized.
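Constructing one fault type for the whole machine differs from the batch case only in that the target list is the full hardware inventory. The inventory below and the helper `build_whole_machine_signals` are hypothetical; how a terminal would learn the inventory is not specified in the description.

```python
def build_whole_machine_signals(inventory, fault_type, fault_id=1):
    """Apply one fault type to every hardware item in the inventory."""
    return {hw: {"fault_type": fault_type, "identifier": fault_id} for hw in inventory}


# Hypothetical inventory; a real terminal would obtain it from the computing device.
inventory = ["CPU1", "CPU2", "memory", "hard disk 1", "fan 1", "network card 1"]
signals = build_whole_machine_signals(inventory, "voltage_abnormal")
```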
It will be appreciated that the test objects and fault types shown on the user interfaces in fig. 7, 9, and 10 are examples only and not limiting. In other examples, the computing device may have fewer or more types and numbers of hardware, and correspondingly fewer or more fault types, which are not listed here.
In some possible implementations, the computing device may be a server, as shown in fig. 11, and when the fault testing method provided in the embodiment of the present application is executed on the server, the method may include the following steps:
S1101, the programmable logic unit of the server is set to the test mode.
In this embodiment, the server includes hardware such as a processor and a memory, and may further include a programmable logic unit (CPLD) and a management unit (BMC). When the fault detection performance of the CPLD and the BMC of the server with respect to the hardware to be tested is tested, the CPLD may be set to the test mode through step S1101. As an example, for the manner in which the CPLD is set to the test mode, refer to the description of S401 to S402 in the above embodiment.
S1102, under the condition that the programmable logic unit is in a test mode, a fault signal is obtained, and the fault signal is used for simulating the fault of hardware to be tested in the server.
In this embodiment, when the CPLD is in the test mode, a fault signal simulating that the hardware to be tested in the current server is faulty may be obtained, and the manner of obtaining the fault signal may be described with reference to S403 to S404 in the above embodiment, but is not limited thereto.
S1103, according to the fault signal, the state of the hardware to be tested recorded in the programmable logic unit is set as a fault state.
In this embodiment, after the CPLD obtains the fault signal, a register on the CPLD for recording the state of the hardware to be tested may be set to indicate that the hardware to be tested is in the fault state. Specifically, for setting the state of the hardware to be tested recorded in the CPLD as the fault state, reference may be made to the description of S405 in the above embodiment, but this is not limiting.
S1104, after the management unit of the server reads that the hardware to be tested is in a fault state from the programmable logic unit, the management unit sends out fault information, and the fault information is used for indicating that the hardware to be tested is faulty.
In this embodiment, after the CPLD of the current server records the fault state of the hardware to be tested, the BMC on the server may read the fault state from the CPLD, generate corresponding fault information, and send the fault information to the management terminal, for example, so that the management terminal verifies the fault detection performance of the server according to the fault information, thereby implementing an automatic test. For example, the process of transmitting the fault information for performance verification may refer to the descriptions of S406 to S408 in the above embodiments, but is not limited thereto.
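Steps S1101 to S1104 can be strung together in a toy model. The `Server` class, its method names, and the hardware-to-register mapping are all hypothetical, and the rule that fault signals are refused outside the test mode is an assumption of this sketch rather than something the description states.

```python
class Server:
    """Toy model of the S1101 to S1104 flow; all names are hypothetical."""

    def __init__(self, register_map):
        self.register_map = register_map  # hardware -> register name
        self.registers = {r: 0 for r in register_map.values()}
        self.test_mode = False

    def set_test_mode(self):  # S1101
        self.test_mode = True

    def receive_fault_signal(self, hardware):  # S1102 and S1103
        # Assumption: simulated fault signals are refused outside test mode.
        if not self.test_mode:
            raise RuntimeError("fault signals are accepted only in test mode")
        self.registers[self.register_map[hardware]] = 1

    def report_faults(self):  # S1104
        by_register = {r: hw for hw, r in self.register_map.items()}
        return [by_register[r] for r, v in self.registers.items() if v == 1]


server = Server({"CPU": "A1", "memory": "A2"})
server.set_test_mode()
server.receive_fault_signal("memory")
faults = server.report_faults()
```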
Based on the method in the above embodiment, the embodiment of the present application provides a fault testing device. Referring to fig. 12, fig. 12 is a schematic structural diagram of a fault detection performance device according to an embodiment of the present application.
As shown in fig. 12, the fault detection performance apparatus 1100 may be applied to a computing device side, where the computing device includes hardware to be tested, a programmable logic unit, and a management unit, and the apparatus 1100 may include: an acquisition module 1101 and a processing module 1102. Wherein the processing module 1102 may be configured to set the programmable logic unit of the server to a test mode; the obtaining module 1101 may be configured to obtain a fault signal when the programmable logic unit is in the test mode, where the fault signal is used to simulate a fault occurring in hardware to be tested in the server; the processing module 1102 may be further configured to set the state of the hardware to be tested recorded in the programmable logic unit to be a fault state according to the fault signal; and then after the management unit of the server reads that the hardware to be tested is in a fault state from the programmable logic unit, the management unit sends out fault information, wherein the fault information is used for indicating that the hardware to be tested is in a fault state.
The obtaining module 1101 may be further configured to obtain a fault recovery signal when the programmable logic unit is in the test mode, where the fault recovery signal is used to simulate recovery from the fault of the hardware to be tested in the server. The processing module 1102 may be further configured to set the state of the hardware to be tested recorded in the programmable logic unit to a normal state according to the fault recovery signal. Then, after the management unit reads from the programmable logic unit that the hardware to be tested is in a normal state, the management unit sends out fault recovery information, where the fault recovery information is used to indicate that the hardware to be tested has recovered from the fault.
The programmable logic unit includes a register for storing a state of the hardware under test, the fault signal includes a fault flag, the fault flag is used to indicate a fault state of the hardware under test, the register is set to a readable and writable mode when the programmable logic unit is in the test mode, the fault flag is written into the register corresponding to the hardware under test, and the management unit reads the fault flag from the register.
In some other implementations, as shown in fig. 12, the fault detection performance apparatus 1100 may be applied on the management terminal side, and the apparatus 1100 may include an obtaining module 1101 and a processing module 1102. The obtaining module 1101 may be configured to obtain a second instruction of a user, where the second instruction is used to construct a fault of the hardware to be tested on the computing device, so as to generate a fault signal. The processing module 1102 may be configured to send the fault signal to the computing device, so that the computing device generates corresponding fault information according to the fault signal. The obtaining module 1101 may be further configured to obtain the fault information, and the processing module 1102 may be further configured to verify the fault detection performance of the computing device based on the fault information and the fault signal.
For example, the computing device further includes a programmable logic unit and a management unit. The obtaining module 1101 may be further configured to obtain a first instruction of the user, where the first instruction is used to instruct the programmable logic unit to enter a test mode, and the test mode enables the programmable logic unit to record the fault state of the hardware to be tested corresponding to the fault signal, so that the management unit generates the fault information according to the fault state of the hardware.
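The verification step on the management terminal side can be sketched as follows. This is a hedged illustration only: the `FaultSignal` and `FaultInfo` record layouts, the function names, and the rule that verification passes when the reported hardware and fault type match the injected ones are assumptions. The fault type names are taken from the list in claim 4.

```python
from dataclasses import dataclass

# Fault types listed in claim 4 of this application.
FAULT_TYPES = {
    "unpowered",
    "out-of-place",
    "temperature anomaly",
    "voltage anomaly",
    "power anomaly",
}


@dataclass(frozen=True)
class FaultSignal:
    """Hypothetical simulated fault sent from the terminal to the computing device."""
    hardware: str
    fault_type: str


@dataclass(frozen=True)
class FaultInfo:
    """Hypothetical fault information reported back by the management unit."""
    hardware: str
    fault_type: str


def build_fault_signal(hardware, fault_type):
    # Construct a fault signal from the user's instruction, rejecting unknown types.
    if fault_type not in FAULT_TYPES:
        raise ValueError(f"unsupported fault type: {fault_type}")
    return FaultSignal(hardware, fault_type)


def verify_fault_detection(signal, reported):
    # Verification passes if any reported fault matches the injected signal.
    return any(
        info.hardware == signal.hardware and info.fault_type == signal.fault_type
        for info in reported
    )
```

A terminal using this sketch would build a signal such as `build_fault_signal("fan0", "temperature anomaly")`, send it to the computing device by whatever transport is available, collect the returned fault information, and pass both to `verify_fault_detection`. An empty or mismatched report indicates that the simulated fault was not detected as expected.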
It should be understood that the foregoing apparatus is used to perform the method in the foregoing embodiment. The implementation principles and technical effects of the corresponding program modules in the apparatus are similar to those described for the foregoing method, and for the working process of the apparatus, reference may be made to the corresponding process in the foregoing method; details are not repeated here.
Based on the method in the above embodiment, an embodiment of the present application provides an electronic device. The electronic device may include: at least one memory for storing a program; at least one processor for executing programs stored in the memory; wherein the processor is adapted to perform the methods of the above embodiments when the program stored in the memory is executed.
Based on the method in the above embodiment, the present application provides a computer-readable storage medium storing a computer program, which when executed on a processor, causes the processor to perform the method in the above embodiment.
Based on the method in the above embodiment, the present application provides a computer program product that, when run on a processor, causes the processor to perform the method in the above embodiment.
Based on the method in the above embodiment, the embodiment of the present application further provides a chip. Referring to fig. 13, fig. 13 is a schematic structural diagram of a chip according to an embodiment of the present application. As shown in fig. 13, the chip 1200 includes one or more processors 1201 and interface circuitry 1202. Optionally, the chip 1200 may also include a bus 1203. Wherein:
The processor 1201 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 1201 or by instructions in the form of software. The processor 1201 may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods and steps disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The interface circuit 1202 may be used to transmit or receive data, instructions, or information. The processor 1201 may perform processing using the data, instructions, or other information received by the interface circuit 1202, and may transmit the information produced by the processing through the interface circuit 1202.
Optionally, the chip 1200 also includes memory, which may include read only memory and random access memory, and provides operating instructions and data to the processor. A portion of the memory may also include non-volatile random access memory (NVRAM).
Optionally, the memory stores executable software modules or data structures, and the processor may perform corresponding operations by invoking operational instructions stored in the memory (the operational instructions may be stored in an operating system).
Optionally, the interface circuit 1202 may be configured to output the execution result of the processor 1201.
It should be noted that, the functions corresponding to the processor 1201 and the interface circuit 1202 may be implemented by a hardware design, a software design, or a combination of hardware and software, which is not limited herein.
It will be appreciated that the steps of the method embodiments described above may be performed by logic circuitry in the form of hardware in a processor or instructions in the form of software.
It should be understood that, the sequence number of each step in the foregoing embodiment does not mean the execution sequence, and the execution sequence of each process should be determined by the function and the internal logic of each process, and should not limit the implementation process of the embodiment of the present application in any way. In addition, in some possible implementations, each step in the foregoing embodiments may be selectively performed according to practical situations, and may be partially performed or may be performed entirely, which is not limited herein.
It is to be appreciated that the processor in embodiments of the present application may be a central processing unit (central processing unit, CPU), but may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), field programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. The general purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
The method steps in the embodiments of the present application may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.

Claims (10)

1. A method of fault testing, the method comprising:
setting a programmable logic unit of a server to a test mode;
acquiring a fault signal under the condition that the programmable logic unit is in the test mode, wherein the fault signal is used for simulating the fault of hardware to be tested in the server;
according to the fault signal, setting the state of the hardware to be tested recorded in the programmable logic unit as a fault state;
after the management unit of the server reads that the hardware to be tested is in a fault state from the programmable logic unit, the management unit sends out fault information, and the fault information is used for indicating that the hardware to be tested is faulty.
2. The method according to claim 1, wherein the programmable logic unit comprises a register for storing a state of the hardware under test, the fault signal comprises a fault flag for indicating that the hardware under test is in a fault state, the register is set to a readable and writable mode with the programmable logic unit in the test mode, the fault flag is written into a register corresponding to the hardware under test, and the management unit reads the fault flag from the register.
3. The method according to claim 1 or 2, wherein the hardware under test is plural, the programmable logic unit includes at least one register, and the setting the state of the hardware under test recorded in the programmable logic unit to the failure state includes:
when the fault signal is used for simulating faults of a plurality of hardware to be tested, setting the states of the hardware to be tested recorded in all registers corresponding to the plurality of hardware to be tested as fault states;
and placing all the registers in a queue to be operated.
4. The method of claim 3, wherein the fault type comprises one or more of unpowered, out-of-place, temperature anomaly, voltage anomaly, or power anomaly.
5. The method of any one of claims 1-4, further comprising:
under the condition that the programmable logic unit is in the test mode, acquiring a fault recovery signal, wherein the fault recovery signal is used for simulating recovery of the hardware to be tested in the server from a fault;
according to the fault recovery signal, setting the state of the hardware to be tested recorded in the programmable logic unit as a normal state;
after the management unit reads that the hardware to be tested is in a normal state from the programmable logic unit, the management unit sends out fault recovery information, wherein the fault recovery information is used for indicating that the hardware to be tested has recovered from the fault.
6. The method according to any one of claims 1 to 5, further comprising:
setting the programmable logic unit to an operation mode;
and restoring the register in the programmable logic unit to a factory setting state under the condition that the programmable logic unit is in the running mode.
7. The method of claim 6, further comprising:
monitoring a real fault signal reported by the hardware to be tested under the condition that the programmable logic unit is in the running mode.
8. The method of claim 5, further comprising: sending the fault information and the fault recovery information to a management terminal.
9. The method according to any one of claims 1-8, wherein the server further comprises a software-hardware interface, the software-hardware interface being connected to the programmable logic unit, the software-hardware interface being configured to obtain the fault signal.
10. A server, the server comprising: a programmable logic unit and a management unit, the programmable logic unit being connected to the management unit, the server being adapted to perform the fault testing method according to any one of claims 1 to 9.
CN202310126796.4A 2023-02-16 2023-02-16 Fault testing method and device and computing equipment Pending CN116225802A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310126796.4A CN116225802A (en) 2023-02-16 2023-02-16 Fault testing method and device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310126796.4A CN116225802A (en) 2023-02-16 2023-02-16 Fault testing method and device and computing equipment

Publications (1)

Publication Number Publication Date
CN116225802A true CN116225802A (en) 2023-06-06

Family

ID=86588651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310126796.4A Pending CN116225802A (en) 2023-02-16 2023-02-16 Fault testing method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN116225802A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117176606A (en) * 2023-09-04 2023-12-05 上海合芯数字科技有限公司 Initialization abnormality detection method, system, server and medium for intelligent network card

Similar Documents

Publication Publication Date Title
US7340649B2 (en) System and method for determining fault isolation in an enterprise computing system
CN110502374A (en) The traffic capture debugging tool of the basic reason of equipment fault when identification is tested automatically
CN107391333B (en) OSD disk fault testing method and system
CN116225802A (en) Fault testing method and device and computing equipment
CN111881014A (en) System test method, device, storage medium and electronic equipment
CN111324502A (en) Batch test system and method thereof
CN110674034A (en) Health examination method and device, electronic equipment and storage medium
CN111858201A (en) BMC (baseboard management controller) comprehensive test method, system, terminal and storage medium
CN114510381A (en) Fault injection method, device, equipment and storage medium
US7188275B2 (en) Method of verifying a monitoring and responsive infrastructure of a system
CN112817869A (en) Test method, test device, test medium, and electronic apparatus
US9354962B1 (en) Memory dump file collection and analysis using analysis server and cloud knowledge base
CN113708986B (en) Server monitoring apparatus, method and computer-readable storage medium
CN114372003A (en) Test environment monitoring method and device and electronic equipment
US10996270B1 (en) System and method for multiple device diagnostics and failure grouping
CN116382968B (en) Fault detection method and device for external equipment
CN111966599A (en) Virtualization platform reliability testing method, system, terminal and storage medium
CN113094221B (en) Fault injection method, device, computer equipment and readable storage medium
CN115657633A (en) Electronic control unit electric detection method and device, storage medium and electronic equipment
CN116915583B (en) Communication abnormality diagnosis method, device and electronic equipment
KR102307997B1 (en) Method of automating a test of PLC program and PLC server program
CN115629931A (en) HBA card stability testing method, device, terminal and storage medium
CN116560921A (en) RAID card testing method and device, electronic equipment and storage medium
CN116662085A (en) Disk fault simulation test method, test device and electronic equipment
CN117608887A (en) Fault determination method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination