CN116137603A - Link fault detection method and device, storage medium and electronic device - Google Patents

Publication number
CN116137603A
Authority
CN
China
Prior art keywords
target
link
current connection
fault
protocol link
Prior art date
Legal status
Pending
Application number
CN202310159294.1A
Other languages
Chinese (zh)
Inventor
郭伯亚
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310159294.1A
Publication of CN116137603A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/12 Network monitoring probes

Abstract

The application discloses a method and a device for detecting link failure, a storage medium and an electronic device. The method for detecting link failure comprises the following steps: acquiring the current connection state of a target protocol link corresponding to a target communication protocol in a server; generating a target fault call instruction when the current connection state indicates that the target protocol link is currently faulty; and issuing the target fault call instruction to a target device in the target protocol link, wherein the target device responds to the target fault call instruction by returning fault information. The target device comprises a first link device and a second link device, wherein the first link device is a device for reducing signal attenuation in the target protocol link, and the second link device is a device for expanding an interface in the target protocol link.

Description

Link fault detection method and device, storage medium and electronic device
Technical Field
The embodiment of the application relates to the field of computers, in particular to a method and a device for detecting link faults, a storage medium and an electronic device.
Background
In the communication field, a server contains many hardware devices, including network cards, graphics processor cards, hard disks and the like, and these devices may be connected to one another by links. A correctly connected link is the basis for establishing communication and data transmission between devices. When a link fails, the cause of the failure needs to be collected and analyzed. A link failure may include an abnormal link transmission rate, an abnormal link bandwidth, or a link error report, and it may depend on the data actually being transmitted, so in practice it is difficult to reproduce a link failure quickly and find its cause.
Currently, link failure detection in a server may use a debug interface to collect link failure information, where the debug interface is an interface for debugging hardware and software problems of a chip. In general, after a chip design is completed, functional verification and debugging are required to ensure that the chip works according to the design requirements, and the debug interface plays an important role in this process. In existing system designs, to prevent server faults caused by false triggering of the debug interface, and for reasons of server security, the debug interface of a chip is placed beside the chip; the chassis cover must be opened and a corresponding jig attached before the debug interface can be reached. Consequently, when a server mounted in a cabinet fails, it is difficult to collect the required debugging information while preserving the fault condition unless the chassis cover was opened and the debug interface connected in advance. Collecting enough debugging information while the fault is present is very important for analyzing and handling the problem in time. Because detection depends on a maintainer opening the chassis cover and connecting the debug interface, the process is cumbersome, and faults that are hard to reproduce in the laboratory cannot be examined in time, so fault information is lost and subsequent fault analysis and resolution are affected.
No effective solution has yet been proposed for problems in the related art such as the poor timeliness of link fault detection.
Disclosure of Invention
The embodiment of the application provides a method and a device for detecting link faults, a storage medium and an electronic device, which are used for at least solving the problems of poor detection timeliness and the like of the link faults in the related technology.
According to an embodiment of the present application, a method for detecting a link failure is provided, including:
acquiring a current connection state of a target protocol link corresponding to a target communication protocol in a server, wherein the current connection state indicates the current connection condition of the target protocol link, and the target protocol link connects devices in the server that communicate using the target communication protocol;
generating a target fault call instruction when the current connection state indicates that the target protocol link is currently faulty, wherein the target fault call instruction is used to retrieve the current fault information of the target protocol link;
and issuing the target fault call instruction to a target device in the target protocol link, wherein the target device responds to the target fault call instruction by returning the fault information, the target device comprises a first link device and a second link device, the first link device is a device for reducing signal attenuation in the target protocol link, and the second link device is a device for expanding an interface in the target protocol link.
Optionally, acquiring the current connection state of the target protocol link corresponding to the target communication protocol in the server includes:
acquiring a target connection parameter of the target protocol link, wherein the target connection parameter is the connection parameter that applies when the target protocol link is normally connected;
detecting a current connection parameter of the target protocol link, wherein the current connection parameter is the connection parameter observed while the target protocol link is currently connected;
and determining the current connection state of the target protocol link according to the target connection parameter and the current connection parameter.
Optionally, determining the current connection state of the target protocol link according to the target connection parameter and the current connection parameter includes:
comparing the target connection parameter with the current connection parameter;
when the target connection parameter is consistent with the current connection parameter, determining that the current connection state indicates that the target protocol link is currently connected normally;
and when the target connection parameter is inconsistent with the current connection parameter, determining that the current connection state indicates that the target protocol link is currently faulty.
Optionally, determining the current connection state of the target protocol link according to the target connection parameter and the current connection parameter includes:
detecting the heartbeat lamp signal frequencies corresponding respectively to the first link device and the second link device in the target protocol link;
when the heartbeat lamp signal frequency falls within a target frequency threshold range, determining that the current connection state indicates that the target protocol link is currently connected normally;
and when the heartbeat lamp signal frequency does not fall within the target frequency threshold range, determining that the current connection state indicates that the target protocol link is currently faulty.
Optionally, acquiring the target connection parameter of the target protocol link includes:
detecting a startup attribute corresponding to the target startup operation most recently executed after the server was powered on, wherein the startup attribute indicates when the target startup operation occurred;
when the startup attribute indicates that the target startup operation is the first startup operation after the server was powered on, retrieving the link connection parameter of the target protocol link from the basic input/output system (BIOS) corresponding to the server, and retrieving the device operation parameter of the devices connected through the target protocol link;
and determining the link connection parameter and the device operation parameter as the target connection parameter.
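The baseline capture described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the reader callbacks, the parameter names, and the choice to reuse a cached baseline on later startups are all assumptions, since the text only specifies what happens on the first startup after power-on.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class TargetConnectionParams:
    """Baseline recorded while the link is known to be normally connected."""
    link_params: Dict[str, float]    # link connection parameters read from the BIOS
    device_params: Dict[str, float]  # operation parameters of the devices on the link


def capture_baseline(
    is_first_boot_after_power_on: bool,
    read_bios_link_params: Callable[[], Dict[str, float]],  # hypothetical reader
    read_device_params: Callable[[], Dict[str, float]],     # hypothetical reader
    cached: Optional[TargetConnectionParams] = None,
) -> Optional[TargetConnectionParams]:
    """Record the target connection parameter on the first startup after power-on.

    On any other startup the previously cached baseline is returned unchanged;
    that fallback is an assumption made for this sketch.
    """
    if is_first_boot_after_power_on:
        return TargetConnectionParams(
            link_params=dict(read_bios_link_params()),
            device_params=dict(read_device_params()),
        )
    return cached
```

Passing the readers in as callbacks keeps the sketch independent of any particular BIOS or BMC interface.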
Optionally, generating the target fault call instruction includes:
generating an initial fault call instruction according to the current connection state;
obtaining a target device model of the target device;
and editing the initial fault call instruction according to the target device model to obtain the target fault call instruction, wherein the target fault call instruction is an instruction that the device corresponding to the target device model is able to recognize.
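One way to picture the two-stage instruction generation is a generic request that is then encoded per device model. The sketch below is hypothetical throughout: the model names, the transports, and the byte layouts are invented for illustration and do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class FaultCall:
    link_id: str
    state: str
    transport: str = ""   # filled in once the device model is known
    payload: bytes = b""  # model-specific encoding of the request


# Hypothetical per-model encoders: model name -> (transport, encoder function).
MODEL_ENCODERS = {
    "model-a": ("i2c", lambda call: bytes([0x10]) + call.link_id.encode()),
    "model-b": ("uart", lambda call: f"dump {call.link_id}\n".encode()),
}


def build_initial_call(link_id: str, state: str) -> FaultCall:
    """Generate the model-independent initial fault call instruction."""
    return FaultCall(link_id=link_id, state=state)


def adapt_call(call: FaultCall, device_model: str) -> FaultCall:
    """Edit the initial instruction into a form the given device model recognizes."""
    if device_model not in MODEL_ENCODERS:
        raise ValueError(f"no encoder registered for device model {device_model!r}")
    transport, encode = MODEL_ENCODERS[device_model]
    return FaultCall(call.link_id, call.state, transport, encode(call))
```

Keeping the encoders in a table means supporting a new device model only requires a new table entry.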
Optionally, before generating the target fault call instruction, the method includes:
when a plurality of target devices are deployed in the target protocol link, identifying a plurality of target device identifiers in one-to-one correspondence with the plurality of target devices;
and creating a plurality of target communication addresses in one-to-one correspondence with the plurality of target device identifiers, wherein each target communication address indicates the address used to communicate with the corresponding target device.
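The one-to-one mapping from device identifiers to communication addresses can be sketched as below. The sequential allocation and the base address `0x20` are assumptions chosen for illustration; the application does not say how addresses are assigned.

```python
from typing import Dict, Iterable


def assign_addresses(device_ids: Iterable[str], base: int = 0x20) -> Dict[str, int]:
    """Create one distinct communication address per target device identifier.

    Addresses are handed out sequentially from `base` (a hypothetical scheme).
    Duplicate identifiers are rejected so the mapping stays one-to-one.
    """
    addresses: Dict[str, int] = {}
    for offset, device_id in enumerate(device_ids):
        if device_id in addresses:
            raise ValueError(f"duplicate target device identifier: {device_id!r}")
        addresses[device_id] = base + offset
    return addresses
```

With such a table, a controller can address each target device individually when issuing the fault call instruction.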
According to another embodiment of the present application, a device for detecting a link failure is also provided, including:
an acquisition module, configured to acquire the current connection state of a target protocol link corresponding to a target communication protocol in a server, wherein the current connection state indicates the current connection condition of the target protocol link, and the target protocol link connects devices in the server that communicate using the target communication protocol;
a generating module, configured to generate a target fault call instruction when the current connection state indicates that the target protocol link is currently faulty, wherein the target fault call instruction is used to retrieve the current fault information of the target protocol link;
and an issuing module, configured to issue the target fault call instruction to a target device in the target protocol link, wherein the target device is configured to respond to the target fault call instruction by returning the fault information, the target device includes a first link device and a second link device, the first link device is a device for reducing signal attenuation in the target protocol link, and the second link device is a device for expanding an interface in the target protocol link.
According to yet another aspect of the embodiments of the present application, a computer readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the above method for detecting a link failure when run.
According to still another aspect of the embodiments of the present application, an electronic device is further provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor performs the above method for detecting a link failure through the computer program.
In the embodiments of the present application, the current connection state of the target protocol link corresponding to the target communication protocol in the server is acquired, wherein the current connection state indicates the current connection condition of the target protocol link, and the target protocol link connects devices in the server that communicate using the target communication protocol. When the current connection state indicates that the target protocol link is currently faulty, a target fault call instruction is generated to retrieve the current fault information of the target protocol link.
The target fault call instruction is then issued to a target device in the target protocol link, and the target device responds to it by returning the fault information. The target device includes a first link device, which reduces signal attenuation in the target protocol link, and a second link device, which expands interfaces in the target protocol link. With this method, when the target protocol link fails, maintenance personnel do not need to open the chassis cover and connect a debug interface; the fault information of the target protocol link can be acquired automatically as soon as the fault occurs, which avoids losing information because it was collected too late. This technical solution addresses problems in the related art such as the poor timeliness of link fault detection, and achieves the technical effect of improving the timeliness of link fault detection.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a hardware environment of a method for detecting a link failure according to an embodiment of the present application;
Fig. 2 is a flow chart of a method for detecting a link failure according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a server link connection according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a link failure detection flow according to an embodiment of the present application;
Fig. 5 is a block diagram of a link failure detection apparatus according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below in detail with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method embodiments provided in the embodiments of the present application may be performed in a computer terminal, a device terminal, or a similar computing apparatus. Taking a computer terminal as an example, Fig. 1 is a schematic diagram of a hardware environment of a method for detecting a link failure according to an embodiment of the present application. As shown in Fig. 1, the computer terminal may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. In one exemplary embodiment, it may also include a transmission device 106 for communication functions and an input-output device 108. Those skilled in the art will appreciate that the configuration shown in Fig. 1 is merely illustrative and does not limit the configuration of the computer terminal described above. For example, a computer terminal may include more or fewer components than shown in Fig. 1, or may have a different configuration with functions equivalent to, or more extensive than, those shown in Fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a method for detecting a link failure in an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a method for detecting a link failure is provided and applied to the above computer terminal. Fig. 2 is a flowchart of a method for detecting a link failure according to an embodiment of the present application. As shown in Fig. 2, the flow includes the following steps:
Step S202: acquire the current connection state of a target protocol link corresponding to a target communication protocol in a server, wherein the current connection state indicates the current connection condition of the target protocol link, and the target protocol link connects devices in the server that communicate using the target communication protocol;
Step S204: generate a target fault call instruction when the current connection state indicates that the target protocol link is currently faulty, wherein the target fault call instruction is used to retrieve the current fault information of the target protocol link;
Step S206: issue the target fault call instruction to a target device in the target protocol link, wherein the target device is configured to respond to the target fault call instruction by returning the fault information, the target device includes a first link device and a second link device, the first link device is a device for reducing signal attenuation in the target protocol link, and the second link device is a device for expanding an interface in the target protocol link.
Through the above steps, the current connection state of the target protocol link corresponding to the target communication protocol in the server is acquired, and this state indicates the current connection condition of the target protocol link, which connects devices in the server that communicate using the target communication protocol. If the current connection state indicates that the target protocol link is currently faulty, a target fault call instruction is generated to retrieve the current fault information of the target protocol link. The instruction is issued to a target device in the target protocol link, and the target device responds by returning the fault information. The target device includes a first link device, which reduces signal attenuation in the target protocol link, and a second link device, which expands interfaces in the target protocol link. With this method, when the target protocol link fails, maintenance personnel do not need to open the chassis cover and connect a debug interface; the fault information can be acquired automatically as soon as the fault occurs, which avoids losing information because it was collected too late. This technical solution addresses problems in the related art such as the poor timeliness of link fault detection, and achieves the technical effect of improving the timeliness of link fault detection.
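Steps S202 to S206 can be sketched as a small control flow. This is a minimal illustration under stated assumptions: the state strings, the instruction text `collect-fault-info`, and the idea of modelling each target device as a callable are all hypothetical and only stand in for the real device interaction.

```python
from typing import Callable, Dict


def detect_link_fault(
    get_state: Callable[[], str],                    # S202: returns "normal" or "fault"
    target_devices: Dict[str, Callable[[str], str]], # device name -> responder (hypothetical)
) -> Dict[str, str]:
    """Run S202 to S206 once and return the fault information per target device."""
    state = get_state()  # S202: acquire the current connection state
    if state != "fault":
        return {}        # link is normal, nothing to collect
    instruction = "collect-fault-info"  # S204: generate the fault call instruction
    # S206: issue the instruction to every target device and gather the replies.
    return {name: respond(instruction) for name, respond in target_devices.items()}
```

Because collection is triggered by the detected state rather than by a person, the fault information is gathered while the fault is still present.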
It should be noted that the above steps may be executed by, but are not limited to, a BMC (Baseboard Management Controller) in a server, or a device or apparatus with functions similar to those of the BMC. The BMC can, for example, upgrade the firmware of a machine and inspect its devices while the machine is not powered on.
In the technical solution provided in step S202, the target communication protocol may include, but is not limited to, the following:
USB (Universal Serial Bus): the USB protocol is a standard for communication between a computer and peripheral devices, and is used to connect the computer to peripherals such as printers, keyboards and mice. Like the PCIe protocol, the USB protocol is an architecture based on point-to-point connections; it uses synchronous and asynchronous communication modes and offers multiple rate and bandwidth options;
SATA (Serial ATA): SATA is a communication protocol used to connect storage devices such as hard disk drives and optical disk drives. The SATA protocol is also an architecture based on point-to-point connections; it uses a serial communication mode and offers high rate and bandwidth options;
Ethernet: the Ethernet protocol is a standard for local area network communication, and is used to connect devices such as computers, switches and routers. The Ethernet protocol is also an architecture based on point-to-point connections; it uses a broadcast communication mode and offers multiple rate and bandwidth options.
Optionally, in this embodiment, different communication protocols correspond to different types of communication links. The target protocol link may be, but is not limited to, a link that uses the target communication protocol. This application takes PCIe (Peripheral Component Interconnect Express) as the target communication protocol and a PCIe link as the target protocol link to describe the link fault detection process, with PCIe serving as the specific example of a target communication protocol type. PCIe is a high-speed serial computer expansion bus standard that provides high-speed serial, point-to-point, dual-channel, high-bandwidth transmission. Each connected device is allocated its own channel bandwidth and does not share bus bandwidth. PCIe mainly supports functions such as active power management, error reporting, end-to-end reliable transmission, hot plug, and quality of service (QoS).
Optionally, in this embodiment, a plurality of devices may be deployed in the server, and these devices use protocol links for communication and data transmission. Several different communication protocols may exist in the server at the same time, and the link fault detection mechanism of this application may be applied to each of them.
Optionally, in this embodiment, Fig. 3 is a schematic diagram of a server link connection according to an embodiment of the present application. As shown in Fig. 3, a device link in the server contains two response chips (response 0, response 1), one PCIe switch chip and two PCIe devices (DEVICE 0, DEVICE 1). The debug interface of the response chip is an I2C (Inter-Integrated Circuit, two-wire serial bus) interface, and the interface of the PCIe switch chip is a UART (Universal Asynchronous Receiver/Transmitter) interface. The server link may further include a BMC and a CPU. On the hardware lines, the I2C interface of the response chip is connected to the BMC, and the interface of the PCIe switch chip is connected to the BMC, over which communication may be performed based on the UART protocol. The BMC (or the CPU) can acquire the current connection state of the target protocol link corresponding to the target communication protocol (the PCIe protocol) in the server and thereby determine the current connection condition of the target protocol link.
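The topology described for Fig. 3 can be written down as a small data model. The sketch below only restates what the paragraph says; the class name, the field names, and the use of an empty string for devices whose debug interface is not described are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LinkDevice:
    name: str
    kind: str
    debug_interface: str  # "" where the text names no debug interface


# Devices on the link as listed for Fig. 3 (names taken from the description).
FIG3_LINK: List[LinkDevice] = [
    LinkDevice("response 0", "response chip", "I2C"),
    LinkDevice("response 1", "response chip", "I2C"),
    LinkDevice("PCIe switch", "PCIe switch chip", "UART"),
    LinkDevice("DEVICE 0", "PCIe device", ""),
    LinkDevice("DEVICE 1", "PCIe device", ""),
]


def debug_targets(link: List[LinkDevice]) -> List[LinkDevice]:
    """Return the devices the BMC can query over a debug interface."""
    return [device for device in link if device.debug_interface]
```

For example, `debug_targets(FIG3_LINK)` selects the two response chips and the PCIe switch chip, which are the devices wired to the BMC in the description.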
In one exemplary embodiment, the current connection state of the target protocol link corresponding to the target communication protocol in the server may be acquired in, but is not limited to, the following manner: acquiring a target connection parameter of the target protocol link, wherein the target connection parameter is the connection parameter that applies when the target protocol link is normally connected; detecting the current connection parameter of the target protocol link, wherein the current connection parameter is the connection parameter observed while the target protocol link is currently connected; and determining the current connection state of the target protocol link according to the target connection parameter and the current connection parameter.
Optionally, in this embodiment, the server may contain multiple target protocol links corresponding to the target communication protocol. For each of them, the corresponding target connection parameter may be acquired and the current connection parameter detected. The target connection parameters of different target protocol links may be the same or different; for example, the links may have different target connection parameters because they carry different data or use different configuration parameters.
Optionally, in this embodiment, the target connection parameter may be, but is not limited to being, read directly from stored data, or taken from the historical connection records of the target protocol link, using the connection parameter recorded while the target protocol link was normally connected.
Optionally, in this embodiment, the target connection parameter may be, but is not limited to, one or more parameters, each of which can indicate the connection condition of the target protocol link, for example the link rate or the link bandwidth of the target protocol link.
In one exemplary embodiment, the current connection state of the target protocol link may be determined from the target connection parameter and the current connection parameter in, but not limited to, the following manner: comparing the target connection parameter with the current connection parameter; when the target connection parameter is consistent with the current connection parameter, determining that the current connection state indicates that the target protocol link is currently connected normally; and when the target connection parameter is inconsistent with the current connection parameter, determining that the current connection state indicates that the target protocol link is currently faulty.
Optionally, in this embodiment, the target connection parameter and the current connection parameter are compared, and in the case that they are inconsistent, it is determined that the current connection state indicates a current connection fault of the target protocol link. Inconsistency may be defined by, but is not limited to, setting an interval threshold: for example, if the difference between the target connection parameter and the current connection parameter is greater than the interval threshold, the two are determined to be inconsistent.
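The comparison described above can be sketched as follows. This is only an illustrative sketch: the `LinkParams` type, the field names, the state strings and the `speed_tolerance` argument are assumptions introduced here, and the embodiment does not prescribe which parameters are compared or how the interval threshold is chosen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkParams:
    """Hypothetical container for the connection parameters of one link."""
    speed_gts: float  # link rate, in GT/s
    width: int        # link bandwidth expressed as a lane count


def link_state(target: LinkParams,
               current: Optional[LinkParams],
               speed_tolerance: float = 0.0) -> str:
    """Return "normal" if the current parameters match the target, else "fault".

    speed_tolerance plays the role of the interval threshold: a rate
    difference greater than it counts as inconsistent.
    """
    if current is None:  # no parameters could be detected for the link
        return "fault"
    speed_ok = abs(target.speed_gts - current.speed_gts) <= speed_tolerance
    width_ok = target.width == current.width
    return "normal" if speed_ok and width_ok else "fault"
```

With a tolerance of zero the check reduces to strict equality; a non-zero tolerance lets a caller accept small deviations in the rate while still requiring the same lane count.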
In one exemplary embodiment, the current connection state of the target protocol link may be determined from the target connection parameter and the current connection parameter, but is not limited to, by: detecting the heartbeat lamp signal frequency corresponding to the first link device and the second link device in the target protocol link respectively; under the condition that the frequency of the heartbeat lamp signal falls into a target frequency threshold range, determining that the current connection state indicates that the current connection of the target protocol link is normal; and under the condition that the heartbeat lamp signal frequency does not fall into the target frequency threshold range, determining that the current connection state indicates the current connection fault of the target protocol link.
Alternatively, in this embodiment, instead of detecting the connection parameters as described above, the heartbeat lamp signal frequencies corresponding to the first link device and the second link device in the target protocol link may be detected. As shown in fig. 3, in the case where the target communication protocol is PCIe, the first link device may be a Retimer chip (Retimer 0, Retimer 1), and the second link device may be, but is not limited to, a PCIe Switch chip. The Retimer chip and the PCIe Switch chip are both chips for the PCI Express (PCIe) bus, but their roles are different. The Retimer chip is used to extend the transmission distance of PCIe signals: it compensates for signal attenuation by regenerating and re-timing the PCIe signals, thereby improving the quality and reliability of the signals. The PCIe Switch chip is used to connect multiple PCIe devices: it typically has multiple PCIe ports between which forwarding and routing can take place, thereby enabling communication and data transfer between the multiple PCIe devices. PCIe Switch chips may also support other high speed bus protocols such as Ethernet and InfiniBand.
Alternatively, in this embodiment, the heartbeat lamp signal generally refers to a signal indicating that the device is operating normally, similar to the heartbeat of a living being. Specifically, the heartbeat lamp is usually a periodic flashing signal sent by an LED indicator lamp, and the frequency and period of the flashing signal can be adjusted according to the working state of the specific device. Normally, when the device is in a normal working state, the heartbeat lamp will flash periodically to indicate that the device is working normally, and provide services for users. If the equipment is abnormal or fails, the flicker frequency and period of the heartbeat lamp may change or the flicker is stopped, and at this time, a user can judge whether the equipment works normally according to the change of the heartbeat lamp signal. By observing the state of the heartbeat lamp, whether the equipment works normally or not can be judged quickly, and the reliability and stability of the equipment can be improved.
Optionally, in this embodiment, the heartbeat lamp signals of the Retimer chip and the PCIe switch chip are connected to GPIOs of the BMC so that the BMC can monitor them (a GPIO expander may further be used to save the GPIO resources of the BMC). In the S5 state (soft off, i.e. software shutdown, triggered for example by the POWER BUTTON or by an upper computer), the server does not supply power to the Retimer chip and the PCIe switch chip. In the S0 state (normal operating state, in which the server is formally running and the CPU, DIMM, PCH and hard disks have all started to work), the Retimer chip and the PCIe switch chip start to operate, and if they are in a normal working state their heartbeat lamps flash at a fixed frequency. After the server is started, the BMC monitors the heartbeat lamp signals of the Retimer chip and the PCIe switch chip to judge whether each chip is in a normal working state.
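As a rough illustration of the heartbeat check, the sketch below estimates the flashing frequency from evenly spaced 0/1 GPIO samples and tests it against a frequency range. The sampling scheme, the function names and the example range of 0.8 Hz to 1.2 Hz are assumptions made for this example; the embodiment does not specify how the BMC measures the frequency or what the target frequency threshold range is.

```python
from typing import Sequence


def heartbeat_frequency_hz(samples: Sequence[int], sample_period_s: float) -> float:
    """Estimate the blink frequency from evenly spaced 0/1 GPIO samples.

    The estimate is (number of rising edges - 1) divided by the time between
    the first and the last rising edge. Fewer than two rising edges (for
    example a line stuck high or low) yields 0.0.
    """
    edges = [i for i in range(1, len(samples))
             if samples[i - 1] == 0 and samples[i] == 1]
    if len(edges) < 2:
        return 0.0
    span_s = (edges[-1] - edges[0]) * sample_period_s
    return (len(edges) - 1) / span_s


def heartbeat_state(freq_hz: float, low_hz: float, high_hz: float) -> str:
    """Return "normal" if the frequency falls into [low_hz, high_hz], else "fault"."""
    return "normal" if low_hz <= freq_hz <= high_hz else "fault"


# Usage: a 1 Hz square wave sampled every 0.1 s, checked against 0.8-1.2 Hz.
blinking = ([0] * 5 + [1] * 5) * 4
state = heartbeat_state(heartbeat_frequency_hz(blinking, 0.1), 0.8, 1.2)
```

Measuring between the first and last rising edge avoids the bias that partial periods at the start and end of the sampling window would otherwise introduce.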
In one exemplary embodiment, the target connection parameters of the target protocol link may be, but are not limited to, obtained by: detecting a startup attribute corresponding to a last executed target startup operation after the server is electrified, wherein the startup attribute is used for indicating the occurrence time of the target startup operation; under the condition that the starting attribute indicates that the target starting operation is the first starting operation after the server is electrified, the link connection parameter of the target protocol link is called from a basic input/output system corresponding to the server, and the equipment operation parameter of equipment connected through the target protocol link is called; and determining the link connection parameter and the equipment operation parameter as the target connection parameter.
Optionally, in this embodiment, the target connection parameter may include, but is not limited to, a link connection parameter and a device operation parameter, where the link connection parameter is used to indicate a connection condition of a link, and the device operation parameter is used to indicate an operation condition of a device.
Optionally, in this embodiment, when the startup attribute indicates that the target startup operation is the first startup operation after the server is powered on, it is assumed by default that the target protocol link is connected normally and that the devices connected through the target protocol link are operating normally, so that the link connection parameter and the device operation parameter may be determined as the target connection parameter.
Optionally, in this embodiment, when the server is started, the BMC may acquire the information (i.e. the device operation parameter) and the link state (i.e. the link connection parameter) of each PCIe device from the BIOS. After AC (alternating current) power-on, the BMC may record the target connection parameter at the first startup, and compare the parameters acquired at each later startup with the target connection parameter recorded at the first startup, so as to determine whether a PCIe fault has occurred, for example a PCIe device with no link or with a link status error.
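The baseline handling described above might look like the following sketch. The `BaselineRecorder` class, its method names and the tuple layout of the BIOS report are hypothetical; the sketch only assumes that the BIOS hands the BMC one record per PCIe device at every startup and that the record taken at the first startup after AC power-on is treated as known good.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-device record: (link_up, speed_gts, width)
DeviceRecord = Tuple[bool, float, int]


class BaselineRecorder:
    """Keeps the records from the first startup after AC power-on as the reference."""

    def __init__(self) -> None:
        self.baseline: Optional[Dict[str, DeviceRecord]] = None

    def on_ac_power_on(self) -> None:
        """AC power was (re)applied: the next startup becomes the new reference."""
        self.baseline = None

    def on_startup(self, bios_report: Dict[str, DeviceRecord]) -> List[str]:
        """Return the devices whose record differs from the reference.

        On the first startup after AC power-on the report is stored as the
        reference and nothing is reported as changed.
        """
        if self.baseline is None:
            self.baseline = dict(bios_report)
            return []
        return sorted(dev for dev, rec in self.baseline.items()
                      if bios_report.get(dev) != rec)
```

A device that disappears from a later report is flagged as well, since `bios_report.get(dev)` then returns `None`, which matches the "no link" case mentioned in the text.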
In the technical solution provided in step S204, when the BMC detects that the heartbeat lamp signal frequency of the Retimer chip or the PCIe switch chip is abnormal, or that the connection parameters obtained after startup are inconsistent with the information recorded when the AC was powered on for the first time, the BMC sends a command for obtaining debug information to the debug interface of the corresponding Retimer chip or PCIe switch chip, and obtains the corresponding debug information (fault information).
In one exemplary embodiment, the target fault call instruction may be generated, but is not limited to, by: generating an initial fault calling instruction according to the current connection state; obtaining a target equipment model of the target equipment; and editing the initial fault calling instruction according to the target equipment model to obtain the target fault calling instruction, wherein the target fault calling instruction is an instruction which is allowed to be identified by equipment corresponding to the target equipment model.
Optionally, in this embodiment, the BMC may, but is not limited to, obtain the debug information according to the target device model of the target device, the software interface built into the BMC, and the target fault call instruction.
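One way to picture the model-specific editing step is a lookup table of encoders keyed by device model, as in the sketch below. The model names `RETIMER_X` and `SWITCH_Y`, the byte layout of the I2C command and the text of the UART command are all invented for illustration; real chips define their own debug commands, which the source does not disclose.

```python
from typing import Callable, Dict, Optional


def initial_instruction(state: Dict[str, object]) -> Dict[str, object]:
    """Build a model-independent request from the current connection state."""
    return {"action": "dump_debug", "port": state["port"]}


# Hypothetical encoders: each turns the generic request into bytes that a
# device of that model is assumed to recognize.
ENCODERS: Dict[str, Callable[[Dict[str, object]], bytes]] = {
    "RETIMER_X": lambda ins: bytes([0x10, int(ins["port"])]),
    "SWITCH_Y": lambda ins: f"debug dump port {ins['port']}\r\n".encode("ascii"),
}


def target_instruction(state: Dict[str, object], model: str) -> Optional[bytes]:
    """Return the model-specific instruction, or None if the link is not faulty."""
    if state["status"] != "fault":
        return None
    try:
        encoder = ENCODERS[model]
    except KeyError:
        raise ValueError(f"no debug command known for model {model!r}") from None
    return encoder(initial_instruction(state))
```

Keeping the generic request separate from the per-model encoding means that supporting a new chip model only requires adding one table entry.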
In one exemplary embodiment, prior to the generating the target fault fetch instruction, the method may further include, but is not limited to, the following: under the condition that a plurality of target devices are deployed in the target protocol link, a plurality of target device identifiers corresponding to the target devices one by one are identified; and creating a plurality of target communication addresses in one-to-one correspondence with the plurality of target device identifications, wherein each target communication address is used for indicating an address for communicating with each corresponding target device.
Optionally, in this embodiment, a plurality of target devices may be deployed in one target protocol link. As shown in fig. 3, Retimer 1 and the PCIe switch are both deployed in the target protocol link where DEVICE 1 is located. In this case, corresponding target fault call instructions may be sent to Retimer 1 and to the PCIe switch respectively, and the fault information returned by each of them may be obtained respectively. Retimer 1 and the PCIe switch are both target devices, but each has its own unique target device identifier and target communication address, and the BMC uses the target communication address to send instructions to, and receive data from, the corresponding target device.
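The one-to-one mapping between device identifiers and communication addresses can be sketched as a dictionary that rejects duplicates. The address tuples (`("i2c", bus, address)` and `("uart", port)`) and the `transport` callback are assumptions made for this example; the source does not describe the address format or the BMC driver interface.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple


def create_address_map(devices: Iterable[Tuple[str, Hashable]]) -> Dict[str, Hashable]:
    """Map each target device identifier to exactly one communication address.

    Raises ValueError if an identifier or an address appears twice, so the
    resulting mapping is one-to-one.
    """
    address_map: Dict[str, Hashable] = {}
    for device_id, address in devices:
        if device_id in address_map:
            raise ValueError(f"duplicate device identifier: {device_id}")
        if address in address_map.values():
            raise ValueError(f"duplicate communication address: {address}")
        address_map[device_id] = address
    return address_map


def issue(address_map: Dict[str, Hashable], device_id: str, instruction: bytes,
          transport: Callable[[Hashable, bytes], bytes]) -> bytes:
    """Send the instruction to the device's address and return its reply."""
    return transport(address_map[device_id], instruction)


# Usage with the two target devices on the link of DEVICE 1 (addresses invented).
amap = create_address_map([
    ("Retimer 1", ("i2c", 3, 0x21)),
    ("PCIe switch", ("uart", 1)),
])
```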
In the technical solution provided in step S206, the target device returns the fault information collected by the target device to the BMC when receiving the target fault call instruction, where the collection mode and timing of the fault information are not limited, and the fault information may be collected and stored in the target device when the fault occurs, or the target device may collect the fault information from the corresponding link or device after receiving the target fault call instruction.
In order to better understand the process of detecting the link failure, the following description is given with reference to an alternative embodiment, which is not intended to limit the technical solution of the embodiments of the present application.
In this embodiment, a method for detecting a link failure is provided, and fig. 4 is a schematic diagram of a flow for detecting a link failure according to an embodiment of the present application, as shown in fig. 4, mainly including the following steps:
Step S401: the I2C interface of the Retimer chip is connected to the BMC, and the UART interface of the PCIe switch chip is connected to the BMC; according to the specific models of the Retimer chip and the PCIe switch chip, the software interface and the instruction for acquiring debug information (namely the target fault calling instruction) are built into the BMC; when the AC is powered on for the first time, the BMC records the PCIe device information (namely the target connection parameter) transmitted by the BIOS as a reference standard;
Step S402: the BMC compares the PCIe device information (i.e. the current connection parameter) transmitted by the BIOS at a later startup with the PCIe device information (i.e. the target connection parameter) recorded when the AC was powered on for the first time;
Step S403: if the PCIe device information is the same, no debug information collection action is performed;
Step S404: if the PCIe device information is different, it is first determined, according to the PCIe device that has changed, which link the device belongs to; the debug interface of the Retimer/PCIe switch on that link is then initialized, and a command for acquiring the debug information is sent.
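Steps S402 to S404 can be summarized in a short sketch. The dictionaries that map devices to links and links to debug-capable chips, and the `fetch_debug` callback standing in for the I2C/UART access, are hypothetical; the topology used in the usage example is an assumption loosely modeled on fig. 3.

```python
from typing import Callable, Dict, List


def collect_debug_info(baseline: Dict[str, object],
                       current: Dict[str, object],
                       link_of: Dict[str, str],
                       chips_on_link: Dict[str, List[str]],
                       fetch_debug: Callable[[str], str]) -> Dict[str, str]:
    """Compare PCIe device information and fetch debug data for changed links.

    Returns a mapping from chip name to its debug information; the mapping is
    empty when nothing changed (step S403).
    """
    # S402: find the devices whose information differs from the reference.
    changed = [dev for dev, info in baseline.items() if current.get(dev) != info]
    collected: Dict[str, str] = {}
    for dev in changed:
        # S404: locate the link of the changed device, then query each
        # Retimer/PCIe switch on that link once.
        for chip in chips_on_link[link_of[dev]]:
            if chip not in collected:
                collected[chip] = fetch_debug(chip)
    return collected


# Usage with an assumed topology: each device sits behind its own Retimer
# and the shared PCIe switch.
link_of = {"DEVICE 0": "link0", "DEVICE 1": "link1"}
chips_on_link = {"link0": ["Retimer 0", "PCIe switch"],
                 "link1": ["Retimer 1", "PCIe switch"]}
baseline = {"DEVICE 0": ("up", 16.0, 16), "DEVICE 1": ("up", 16.0, 16)}
current = {"DEVICE 0": ("up", 16.0, 16), "DEVICE 1": ("down", 0.0, 0)}
result = collect_debug_info(baseline, current, link_of, chips_on_link,
                            lambda chip: f"debug:{chip}")
```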
It should be noted that the above link fault detection flow requires no additional components and relies on mature circuitry. When a problem occurs, debug information can be collected simply and quickly, which speeds up problem analysis. PCIe-related debug information is collected automatically, so first-hand field information is captured; when a fault occurs, the required log output can be conveniently collected without opening the chassis cover, which helps problem analysis.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the embodiments of the present application.
Fig. 5 is a block diagram of a link failure detection apparatus according to an embodiment of the present application; as shown in fig. 5, the apparatus includes:
an obtaining module 502, configured to obtain a current connection state of a target protocol link corresponding to a target communication protocol in a server, where the current connection state is used to indicate a current connection situation of the target protocol link, and the target protocol link is used to connect a device in the server that uses the target communication protocol to perform communication;
a generating module 504, configured to generate a target failure call instruction when the current connection state indicates a current connection failure of the target protocol link, where the target failure call instruction is used to call current failure information of the target protocol link;
and an issuing module 506, configured to issue the target fault call instruction to a target device in the target protocol link, where the target device is configured to respond to the target fault call instruction and return the fault information, and the target device includes a first link device and a second link device, where the first link device is a device for reducing signal attenuation in the target protocol link, and the second link device is a device for expanding an interface in the target protocol link.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
Through the above embodiment, the current connection state of the target protocol link corresponding to the target communication protocol in the server is obtained, where the connection state is used to indicate the current connection situation of the target protocol link. The target protocol link connects devices in the server that communicate using the target communication protocol. If the current connection state indicates that the current connection of the target protocol link has a fault, a target fault call instruction is generated. The target fault call instruction is used to retrieve the current fault information of the target protocol link. The target fault call instruction is issued to a target device in the target protocol link, and the target device responds to the instruction and returns the fault information. The target device includes a first link device and a second link device. The first link device is located in the target protocol link and is used for reducing signal attenuation; the second link device is also located in the target protocol link and is used for expanding an interface. With this method, when the target protocol link fails, maintenance personnel do not need to open the chassis cover to connect a debug interface. The fault information of the target protocol link can be acquired automatically as soon as the link fails, which avoids the loss of information caused by collecting fault information too late. This technical solution addresses problems in the related art such as the poor timeliness of link fault detection, and achieves the technical effect of improving the timeliness of link fault detection.
In one exemplary embodiment, the acquisition module includes:
the first acquisition unit is used for acquiring the target connection parameters of the target protocol link, wherein the target connection parameters are corresponding connection parameters under the condition that the target protocol link is normally connected;
the detection unit is used for detecting the current connection parameter of the target protocol link, wherein the current connection parameter is a connection parameter in the current connection process of the target protocol link;
and the determining unit is used for determining the current connection state of the target protocol link according to the target connection parameter and the current connection parameter.
In an exemplary embodiment, the determining unit is further configured to:
comparing the target connection parameter with the current connection parameter;
under the condition that the target connection parameter is consistent with the current connection parameter, determining that the current connection state indicates that the current connection of the target protocol link is normal;
and under the condition that the target connection parameter is inconsistent with the current connection parameter, determining that the current connection state indicates the current connection fault of the target protocol link.
In an exemplary embodiment, the determining unit is further configured to:
Detecting the heartbeat lamp signal frequency corresponding to the first link device and the second link device in the target protocol link respectively;
under the condition that the frequency of the heartbeat lamp signal falls into a target frequency threshold range, determining that the current connection state indicates that the current connection of the target protocol link is normal;
and under the condition that the heartbeat lamp signal frequency does not fall into the target frequency threshold range, determining that the current connection state indicates the current connection fault of the target protocol link.
In an exemplary embodiment, the first obtaining unit is further configured to:
detecting a startup attribute corresponding to a last executed target startup operation after the server is electrified, wherein the startup attribute is used for indicating the occurrence time of the target startup operation;
under the condition that the starting attribute indicates that the target starting operation is the first starting operation after the server is electrified, the link connection parameter of the target protocol link is called from a basic input/output system corresponding to the server, and the equipment operation parameter of equipment connected through the target protocol link is called;
and determining the link connection parameter and the equipment operation parameter as the target connection parameter.
In one exemplary embodiment, the generating module includes:
the generating unit is used for generating an initial fault calling instruction according to the current connection state;
a second obtaining unit, configured to obtain a target device model of the target device;
and the editing unit is used for editing the initial fault calling instruction according to the target equipment model to obtain the target fault calling instruction, wherein the target fault calling instruction is an instruction which is allowed to be identified by equipment corresponding to the target equipment model.
In one exemplary embodiment, the apparatus includes:
the identifying module is used for identifying a plurality of target device identifiers corresponding to the plurality of target devices one by one under the condition that the plurality of target devices are deployed in the target protocol link before the target fault calling instruction is generated;
the creation module is used for creating a plurality of target communication addresses corresponding to the target device identifiers one by one, wherein each target communication address is used for indicating an address for communicating with each corresponding target device.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.
Embodiments of the present application also provide an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.
Specific examples in this embodiment may refer to the examples described in the foregoing embodiments and the exemplary implementation, and this embodiment is not described herein.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for detecting a link failure, comprising:
acquiring a current connection state of a target protocol link corresponding to a target communication protocol in a server, wherein the current connection state is used for indicating the current connection condition of the target protocol link, and the target protocol link is used for connecting equipment which uses the target communication protocol to communicate in the server;
generating a target fault calling instruction under the condition that the current connection state indicates the current connection fault of the target protocol link, wherein the target fault calling instruction is used for calling the current fault information of the target protocol link;
and sending the target fault calling instruction to target equipment in the target protocol link, wherein the target equipment is used for responding to the target fault calling instruction and returning the fault information, the target equipment comprises first link equipment and second link equipment, the first link equipment is equipment for reducing signal attenuation in the target protocol link, and the second link equipment is equipment for expanding an interface in the target protocol link.
2. The method according to claim 1, wherein the obtaining the current connection state of the target protocol link corresponding to the target communication protocol in the server includes:
acquiring a target connection parameter of the target protocol link, wherein the target connection parameter is a corresponding connection parameter under the condition that the target protocol link is normally connected;
detecting the current connection parameter of the target protocol link, wherein the current connection parameter is a connection parameter in the current connection process of the target protocol link;
and determining the current connection state of the target protocol link according to the target connection parameter and the current connection parameter.
3. The method of claim 2, wherein said determining a current connection state of the target protocol link based on the target connection parameter and the current connection parameter comprises:
comparing the target connection parameter with the current connection parameter;
under the condition that the target connection parameter is consistent with the current connection parameter, determining that the current connection state indicates that the current connection of the target protocol link is normal;
and under the condition that the target connection parameter is inconsistent with the current connection parameter, determining that the current connection state indicates the current connection fault of the target protocol link.
4. The method of claim 2, wherein said determining a current connection state of the target protocol link based on the target connection parameter and the current connection parameter comprises:
detecting the heartbeat lamp signal frequency corresponding to the first link device and the second link device in the target protocol link respectively;
under the condition that the frequency of the heartbeat lamp signal falls into a target frequency threshold range, determining that the current connection state indicates that the current connection of the target protocol link is normal;
and under the condition that the heartbeat lamp signal frequency does not fall into the target frequency threshold range, determining that the current connection state indicates the current connection fault of the target protocol link.
5. The method according to claim 2, wherein the obtaining the target connection parameters of the target protocol link comprises:
detecting a startup attribute corresponding to a last executed target startup operation after the server is electrified, wherein the startup attribute is used for indicating the occurrence time of the target startup operation;
under the condition that the starting attribute indicates that the target starting operation is the first starting operation after the server is electrified, the link connection parameter of the target protocol link is called from a basic input/output system corresponding to the server, and the equipment operation parameter of equipment connected through the target protocol link is called;
And determining the link connection parameter and the equipment operation parameter as the target connection parameter.
6. The method of any one of claims 1 to 5, wherein generating the target fault fetch instruction comprises:
generating an initial fault calling instruction according to the current connection state;
obtaining a target equipment model of the target equipment;
and editing the initial fault calling instruction according to the target equipment model to obtain the target fault calling instruction, wherein the target fault calling instruction is an instruction which is allowed to be identified by equipment corresponding to the target equipment model.
7. The method of any of claims 1-5, wherein prior to the generating a target fault fetch instruction, the method comprises:
under the condition that a plurality of target devices are deployed in the target protocol link, a plurality of target device identifiers corresponding to the target devices one by one are identified;
and creating a plurality of target communication addresses in one-to-one correspondence with the plurality of target device identifications, wherein each target communication address is used for indicating an address for communicating with each corresponding target device.
8. A device for detecting a link failure, comprising:
an acquisition module, configured to acquire the current connection state of a target protocol link corresponding to a target communication protocol in a server, wherein the current connection state is used for indicating the current connection condition of the target protocol link, and the target protocol link is used for connecting equipment which uses the target communication protocol to communicate in the server;
the generating module is used for generating a target fault calling instruction under the condition that the current connection state indicates the current connection fault of the target protocol link, wherein the target fault calling instruction is used for calling the current fault information of the target protocol link;
the issuing module is configured to issue the target fault call instruction to a target device in the target protocol link, where the target device is configured to respond to the target fault call instruction and return the fault information, and the target device includes a first link device and a second link device, where the first link device is a device for reducing signal attenuation in the target protocol link, and the second link device is a device for expanding an interface in the target protocol link.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program when run performs the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of claims 1 to 7 by means of the computer program.
CN202310159294.1A 2023-02-23 2023-02-23 Link fault detection method and device, storage medium and electronic device Pending CN116137603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310159294.1A CN116137603A (en) 2023-02-23 2023-02-23 Link fault detection method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310159294.1A CN116137603A (en) 2023-02-23 2023-02-23 Link fault detection method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN116137603A true CN116137603A (en) 2023-05-19

Family

ID=86332772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310159294.1A Pending CN116137603A (en) 2023-02-23 2023-02-23 Link fault detection method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN116137603A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116436526A (en) * 2023-06-13 2023-07-14 Suzhou Inspur Intelligent Technology Co Ltd Method, device, system, storage medium and electronic equipment for controlling signal transmission
CN116436526B (en) * 2023-06-13 2024-02-20 Suzhou Inspur Intelligent Technology Co Ltd Method, device, system, storage medium and electronic equipment for controlling signal transmission

Similar Documents

Publication Publication Date Title
CN110380907B (en) Network fault diagnosis method and device, network equipment and storage medium
CN101800675B (en) Failure monitoring method, monitoring equipment and communication system
CN102571498B (en) Fault injection control method and device
CN103812726A (en) Automated testing method and device for data communication equipment
CN112055096B (en) Method and device for automatically setting communication address of equipment
TW201500935A (en) System and method of controlling shutdown and booting of servers
CN110740072A (en) fault detection method, device and related equipment
CN105242980A (en) Complementary watchdog system and complementary watchdog monitoring method
CN105183575A (en) Processor fault diagnosis method, device and system
CN102354261A (en) Remote control system for power supply switches of machine room servers
CN116137603A (en) Link fault detection method and device, storage medium and electronic device
CN102664755B (en) Control channel fault determining method and device
CN110445932B (en) Abnormal card dropping processing method and device, storage medium and terminal
WO2020088351A1 (en) Method for sending device information, computer device and distributed computer device system
CN115858221A (en) Management method and device of storage equipment, storage medium and electronic equipment
CN101667953B (en) Reporting method of rapid looped network physical link state and device therefor
CN104639358A (en) Batched network port switching method and system
CN101136756B (en) Electric self-checking method, system and BMC chip on network long-range control host machine
CN115599617B (en) Bus detection method and device, server and electronic equipment
CN109446002B (en) Jig plate, system and method for grabbing SATA hard disk by server
CN115098342A (en) System log collection method, system, terminal and storage medium
CN113645048B (en) Network card switching method and device and field programmable gate array FPGA
CN106649002A (en) Server and method for automatically overhauling baseboard management controller
CN103532728A (en) Method and device for resetting fault digital signal processor (DSP) chip
CN114201439A (en) Server signal identification optimization method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination