CN118051375B - Method and system for fault diagnosis of direct links between computing devices - Google Patents

Publication number
CN118051375B
Authority
CN
China
Prior art keywords
link
direct
data
direct link
state machine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410451366.4A
Other languages
Chinese (zh)
Other versions
CN118051375A (en
Inventor
Name withheld on request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd
Priority to CN202410451366.4A
Publication of CN118051375A
Application granted
Publication of CN118051375B

Abstract

Embodiments of the present invention relate to a method and system for fault diagnosis of a direct link between computing devices. The method comprises: determining whether data is completely transmitted via the direct link; in response to the data being completely transmitted via the direct link, determining whether the link training state machines respectively corresponding to the two direct connection ports at the two ends of the direct link are both in a normal state; in response to the link training state machines both being in a normal state, determining whether the data is correct after being transmitted via the direct link; in response to the data being correct after being transmitted via the direct link, determining whether the data throughput of the direct link meets a threshold condition; and in response to the data throughput of the direct link meeting the threshold condition, performing link retraining for the direct link to determine whether the direct link has failed. The invention makes it possible to determine, when the direct link is in an abnormal working state, whether the direct link itself has failed, which facilitates diagnosing and debugging the faulty direct link.

Description

Method and system for fault diagnosis of direct links between computing devices
Technical Field
Embodiments of the present invention relate generally to the field of chip technology and, more particularly, to a method and system for fault diagnosis of a direct link between computing devices.
Background
The PCIe (Peripheral Component Interconnect Express) bus is a high-bandwidth expansion bus commonly used to connect a host, such as a Central Processing Unit (CPU), with various computing devices, such as a Graphics Processing Unit (GPU) or a General-Purpose Graphics Processing Unit (GPGPU), to enable interactions, e.g., data transfer, between the host and the computing devices. In other words, data transfer between a host and a computing device implemented via a PCIe bus conforms to the PCIe bus standard (also known as the PCIe protocol).
On this basis, if one of the plurality of computing devices connected with the CPU is regarded as a main device, the main device and other computing devices can realize direct connection between the computing devices based on the PCIe protocol, and data transmission between the computing devices can be performed without passing through the CPU.
However, in the prior art, if a direct link based on PCIe protocol between computing devices is in an abnormal working state, it cannot be determined whether the direct link itself fails or other modules in the computing devices fail. That is, the prior art has disadvantages in that: it cannot be determined whether the abnormal operation state of the direct link is caused by a fault at the direct link.
Disclosure of Invention
In view of the above problems, the present invention provides a method and a system for diagnosing a failure of a direct link between computing devices, so that when the direct link is in an abnormal working state, whether the direct link itself fails or not can be determined, so as to diagnose and debug the failed direct link.
According to a first aspect of the present invention, there is provided a method for fault diagnosis of a direct link between computing devices, comprising: determining whether data is completely transmitted via the direct link; in response to the data being completely transmitted via the direct link, determining whether the link training state machines respectively corresponding to the two direct connection ports at the two ends of the direct link are both in a normal state; in response to the link training state machines both being in a normal state, determining whether the data is correct after being transmitted via the direct link; in response to the data being correct after being transmitted via the direct link, determining whether the data throughput of the direct link meets a threshold condition; and in response to the data throughput of the direct link meeting the threshold condition, performing link retraining for the direct link to determine whether the direct link has failed.
In some embodiments, determining whether the data is completely transmitted via the direct link comprises: comparing the amounts of data transmitted through the two direct connection ports at the two ends of the direct link to determine whether the data is completely transmitted.
In some embodiments, the two direct-connect ports at both ends of the direct-connect link are a first direct-connect port and a second direct-connect port. In these embodiments, comparing the amount of data transmitted through two direct-connect ports on both ends of the direct-connect link includes: calculating a first amount of data transmitted by the first direct port; calculating a second amount of data received by the second direct port; and responsive to the first number being equal to the second number, determining that the data is fully transmitted via the direct link.
In some embodiments, determining whether the link training state machines respectively corresponding to the two direct connection ports at the two ends of the direct link are both in a normal state comprises: determining whether a critical event has occurred for the link training state machine; determining whether a jump anomaly has occurred in the link training state machine; and determining that the link training state machine is in a normal state in response to no critical event having occurred for the link training state machine and no jump anomaly having occurred in the link training state machine.
In some embodiments, determining whether a critical event has occurred for the link training state machine comprises: reading the value of a counter for counting critical events related to the link training state machine; and in response to the read value being zero, determining that no critical event has occurred for the link training state machine, wherein the critical event comprises any one of: link disabled, link down, and hot reset.
In some embodiments, determining whether a jump anomaly has occurred in the link training state machine comprises: reading log information associated with the link training state machine to determine whether a jump anomaly has occurred in the link training state machine.
In some embodiments, determining whether the data is correct after transmission via the direct link comprises: generating an original first sequence at either direct connection port of the direct link; transmitting the original first sequence via the direct link to the other direct connection port of the direct link, and receiving the transmitted original first sequence at that other direct connection port to obtain a received sequence; and determining, based at least on the received sequence, whether the data is correct after transmission via the direct link.
In some embodiments, the threshold condition comprises: the data throughput of the direct link reaches the maximum theoretical bandwidth.
In some embodiments, performing link retraining for the direct link to determine whether the direct link has failed includes: performing link reconnection for the direct link so as to acquire parameters related to the reconnected direct link; and determining that the direct link has failed in response to an acquired value of the parameters not meeting a target value.
In some embodiments, the parameters include: link speed and link width.
According to a second aspect of the present invention, there is provided a system for fault diagnosis of a direct link between computing devices, comprising: a fault diagnosis circuit configured to be connected with the direct connection port at either end of a direct link, wherein the fault diagnosis circuit comprises: a data full transmission determination module configured to determine whether data is completely transmitted via the direct link; a link training state machine state detection module configured to determine whether the link training state machine corresponding to the direct connection port is in a normal state; a data correctness determination module configured to determine the correctness of the data after being transmitted via the direct link; a data throughput determination module configured to determine whether the data throughput of the direct link satisfies a threshold condition; and a link retraining module configured to perform link reconnection for the direct link and collect parameters related to the reconnected direct link so as to determine whether the direct link has failed.
In some embodiments, the data full transmission determination module comprises: a transmission data counter configured to count the amount of data transmitted or received by the direct connection port.
In some embodiments, the link training state machine state detection module comprises: a critical event counter configured to count the number of critical events that have occurred for the link training state machine; and a link training state machine log storage unit configured to store log information related to the link training state machine.
In some embodiments, the data correctness determination module comprises: a sequence generator configured to generate an original first sequence, the original first sequence being transmitted via the direct link; and a sequence verifier configured to receive the transmitted original first sequence as a received sequence and verify whether the received sequence is correct.
In some embodiments, the data throughput determination module comprises: a throughput calculator configured to calculate the amount of data transmitted or received by the direct connection port per unit time.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
The above and other features, advantages and aspects of embodiments of the present invention will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.
Fig. 1 shows a schematic topology of a PCIe protocol based system according to the present invention.
Fig. 2 shows a schematic diagram of a fault diagnosis circuit according to an embodiment of the invention.
Fig. 3 shows an architectural diagram of an exemplary fault diagnosis circuit according to an embodiment of the present invention.
FIG. 4 illustrates a flowchart of a method for fault diagnosis of a direct link between computing devices according to an embodiment of the invention.
FIG. 5 sets forth a flow chart illustrating an exemplary method for fault diagnosis of a direct link between computing devices according to embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "comprising" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions are also possible below.
As described above, interaction, such as data transfer, between the host (e.g., CPU) and the computing device (e.g., GPU, GPGPU, etc.) may be implemented via a PCIe bus. In addition, data transmission between computing devices can also be realized through a direct link based on the PCIe protocol. Fig. 1 shows a schematic topology of a PCIe protocol based system 100 according to the present invention.
As shown in fig. 1, the system 100 includes a host (CPU) 110, a Root Complex (RC) 120, at least one PCIe switch (i.e., a first PCIe switch 130A and a second PCIe switch 130B, which may be collectively referred to as PCIe switches 130), and at least one computing device (GPGPU or GPU) (i.e., a first computing device 140A, a second computing device 140B, a third computing device 140C, and a fourth computing device 140D, which may be collectively referred to as computing devices 140).
As shown in fig. 1, at least one Root Port (RP) may be configured on the RC 120, i.e., a first Root Port RP1 and a second Root Port RP2.
As shown in fig. 1, each PCIe switch 130 may be configured with an upstream port (USP) and at least one downstream port (DSP). For example, the first PCIe switch 130A is configured with one upstream port (i.e., the first upstream port USP1) and two downstream ports (i.e., the first downstream port DSP1 and the second downstream port DSP2); similarly, the second PCIe switch 130B is configured with a second upstream port USP2, a third downstream port DSP3, and a fourth downstream port DSP4.
In general, the root port of RC 120 may be connected to an upstream port of PCIe switch 130 to form a link for data transmission between RC 120 and PCIe switch 130. For example, as shown in FIG. 1, a first root port RP1 of the RC 120 is coupled to a first upstream port USP1 of a first PCIe switch 130A and a second root port RP2 of the RC 120 is coupled to a second upstream port USP2 of a second PCIe switch 130B.
As shown in fig. 1, each computing device 140 may have an upper port configured thereon for connection to a PCIe switch 130. Taking the first computing device 140A as an example, a first upper port P1 is configured thereon, and the first upper port P1 is connected to the first downstream port DSP1 of the first PCIe switch 130A, thereby forming a link (i.e., Link0) for data transmission between the first computing device 140A and the first PCIe switch 130A. Thus, the first computing device 140A may interact with the host CPU and system memory (not shown) via the link between its first upper port P1 and the first downstream port DSP1 of the first PCIe switch 130A, the link between the first upstream port USP1 of the first PCIe switch 130A and the first root port RP1 of the RC 120, and the RC 120. Likewise, the second computing device 140B, the third computing device 140C, and the fourth computing device 140D may interact with the host CPU and system memory (not shown) in a similar fashion.
Further, each computing device 140 may also be configured with direct connection ports for interconnecting with other computing devices 140. As shown in fig. 1, for example, the first computing device 140A is configured with a plurality of direct connection ports, i.e., a first direct connection port PIP1, a second direct connection port PIP2, and a third direct connection port PIP3. The first direct connection port PIP1 of the first computing device 140A is connected with the direct connection port PIP4 of the second computing device 140B to form a direct link (i.e., Link1) for data transmission between the first computing device 140A and the second computing device 140B. Similarly, the second direct connection port PIP2 of the first computing device 140A is connected with the direct connection port PIP5 of the third computing device 140C to form a direct link (i.e., Link2) for data transfer between the first computing device 140A and the third computing device 140C; the third direct connection port PIP3 of the first computing device 140A is connected with the direct connection port PIP6 of the fourth computing device 140D to form a direct link (i.e., Link3) for data transfer between the first computing device 140A and the fourth computing device 140D. In other words, in the system 100 shown in fig. 1, the first computing device 140A may be directly connected with the second computing device 140B, the third computing device 140C, and the fourth computing device 140D via Link1, Link2, and Link3, respectively, based on the PCIe protocol, and the computing devices may interact through Link1, Link2, and Link3, for example, by accessing each other's memories.
Typically, each direct connection port of a computing device is configured with an Advanced Error Reporting register (AER register for short) for recording AER errors, such as completion timeout, unsupported request, completer abort, receiver overflow, flow control protocol error, and so forth. Whether an AER error has occurred can be determined by reading the information in the AER register of the direct connection port. However, the occurrence of an AER error in the AER register of a direct connection port does not necessarily mean that the direct link with that direct connection port as one end has failed. In some cases, errors in other modules of the computing device may also cause AER errors in the AER register of the direct connection port. For example, suppose a module in the first computing device 140A in fig. 1 initiates a read to the fourth computing device 140D via Link3, and no response to the read is returned to the first computing device 140A because the memory of the fourth computing device 140D is corrupted. In this case, Link3 is in an abnormal working state because the third direct connection port PIP3 of the first computing device 140A does not receive a response within a predetermined time, and a completion timeout AER error will be recorded in the AER register of the third direct connection port PIP3. Obviously, this AER error is not caused by a failure of Link3 itself. However, as described above, the related art cannot determine whether the abnormal working state of a direct link is caused by a failure of the direct link itself. That is, when an AER error occurs in the AER register of a direct connection port, it cannot be determined whether the AER error is caused by a failure of the direct link itself.
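As an illustration of reading AER state, the following minimal sketch decodes a raw value of the AER Uncorrectable Error Status register into the error names listed above. The bit positions follow the PCIe AER capability layout as mirrored in the Linux `pci_regs.h` header; the function name and the sample value are hypothetical, and how a particular device exposes the register is not described in the text.

```python
from typing import List

# Bit positions in the AER Uncorrectable Error Status register
# (only the errors named in the paragraph above are listed).
UNCORRECTABLE_ERROR_BITS = {
    13: "flow control protocol error",
    14: "completion timeout",
    15: "completer abort",
    17: "receiver overflow",
    20: "unsupported request",
}


def decode_aer_uncorrectable(status: int) -> List[str]:
    """Return the names of the listed AER errors whose bits are set in `status`."""
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_ERROR_BITS.items())
        if status & (1 << bit)
    ]


# Hypothetical register value with only the completion timeout bit set.
sample_status = 1 << 14
print(decode_aer_uncorrectable(sample_status))
```

A set bit only shows that the port logged the error; as the paragraph explains, it does not by itself locate the fault on the direct link.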
To at least partially address one or more of the above problems, as well as other potential problems, example embodiments of the present invention propose a scheme for fault diagnosis of a direct link between computing devices. The scheme comprises: determining whether data is completely transmitted via the direct link; in response to the data being completely transmitted via the direct link, determining whether the link training state machines respectively corresponding to the two direct connection ports at the two ends of the direct link are both in a normal state; in response to the link training state machines both being in a normal state, determining whether the data is correct after being transmitted via the direct link; in response to the data being correct after being transmitted via the direct link, determining whether the data throughput of the direct link meets a threshold condition; and in response to the data throughput of the direct link meeting the threshold condition, performing link retraining on the direct link to determine whether the direct link has failed. In this way, when the direct link is in an abnormal working state, it can be determined whether the direct link itself has failed, which facilitates diagnosis and debugging of the failed direct link.
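The staged structure of this scheme, in which each check runs only if the previous one passed, can be sketched as follows. This is a minimal illustration, not the patented implementation: the stage names and the `run_diagnosis` helper are hypothetical, each check is a placeholder callable, and the sketch merely reports the first stage that did not pass, because the handling of a failed intermediate stage is not fully specified in this passage.

```python
from typing import Callable, List, Tuple

Check = Callable[[], bool]


def run_diagnosis(stages: List[Tuple[str, Check]]) -> str:
    """Run the checks in order and return the name of the first stage that
    does not pass, or "all_stages_passed" if every stage passes."""
    for name, check in stages:
        if not check():
            return name  # later stages are not evaluated
    return "all_stages_passed"


# Placeholder checks standing in for the five stages of the scheme.
stages = [
    ("complete_transmission", lambda: True),
    ("ltssm_normal", lambda: True),
    ("data_correct", lambda: False),  # pretend the sequence check fails
    ("throughput_threshold", lambda: True),
    ("link_retraining", lambda: True),
]
print(run_diagnosis(stages))
```

Ordering the checks this way runs the cheap counter reads first and reaches link retraining, which disturbs the link, only when the earlier checks have all passed.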
Schemes for fault diagnosis of direct links between computing devices according to embodiments of the present invention will be described in detail below in conjunction with fig. 2-5.
Fig. 2 shows a schematic diagram of a fault diagnosis circuit 200 according to an embodiment of the invention. It should be appreciated that fault diagnosis circuit 200 may also include additional modules not shown and/or may omit modules shown, the scope of the invention being not limited in this respect.
According to the inventive concept, the fault diagnosis circuit 200 is configured to be connected with the direct connection ports of the computing devices so as to enable fault diagnosis of direct connection links between the direct connection ports of different computing devices. In some embodiments, one fault diagnosis circuit 200 may be configured for each direct-connect port of each computing device.
The computing devices may be, for example, GPUs or GPGPUs.
A direct connection port is a port configured on a computing device for direct interconnection with other computing devices. According to an embodiment of the invention, the direct connection port may be a PCIe interconnection port (PCIe Interconnection Port, PIP) of the computing device.
As shown in fig. 2, the fault diagnosis circuit 200 may include: a data full transmission determination module 210, a link training state machine state detection module 220, a data correctness determination module 230, a data throughput determination module 240, and a link retraining module 250.
With respect to the data full transmission determination module 210, it may be configured to determine whether the data is fully transmitted via the direct link. According to some embodiments of the present invention, a transmission data counter may be configured in the data full transmission determination module 210 for counting the amount of data transmitted or received by the direct connection port, thereby determining whether the data is fully transmitted by comparing the amount of transmitted data with the amount of received data.
With respect to the link training state machine state detection module 220, it may be configured to determine whether the link training state machine corresponding to the direct connection port is in a normal state. According to some embodiments of the present invention, a link training state machine log storage unit may be configured in the link training state machine state detection module 220 for storing log information related to the link training state machine. According to some embodiments of the present invention, the link training state machine state detection module 220 may be further configured with a critical event counter for counting the number of critical events that have occurred for the link training state machine.
With respect to a link training state machine (LTSSM), it may be provided in a computing device for initializing a direct link, link training, and the like. According to some embodiments of the present invention, an LTSSM may be provided in each direct port of a computing device for initializing and training a direct link with the direct port as one end, etc.
Regarding critical events, they may include link disable (link disable), link down (link down), and hot reset (hot reset).
Regarding the data correctness determination module 230, it may be configured to determine the correctness of the data after transmission via the direct link. According to some embodiments of the present invention, the sequence generator and sequence verifier may be configured in the data correctness determination module 230 to determine the correctness of the data after transmission via the direct link by generating and transmitting a sequence and based on the received sequence.
With respect to the sequence generator, it may be configured to generate an original first sequence, which may be transmitted via a direct link. In some embodiments, the sequence generator may be, for example, a pseudo-random binary sequence (PRBS) generator. It should be appreciated that the sequence generator herein may be any generator suitable for generating binary sequences, as the invention is not limited in this regard.
With respect to the sequence verifier, it may be configured to receive a sequence transmitted via the direct link and verify whether the received sequence is correct. For example, when the sequence generator is a PRBS generator, the sequence verifier is accordingly a PRBS verifier for verifying whether the received data (i.e., the transmitted original first sequence) is correct.
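A generator and verifier pair of this kind can be sketched with a linear-feedback shift register. The text does not name a polynomial, so the choice of PRBS7 (x^7 + x^6 + 1) below is an assumption, as are the function names and the idea that the verifier regenerates the sequence from a seed shared with the generator; hardware checkers often synchronize to the incoming stream instead.

```python
from typing import List


def prbs7(seed: int, nbits: int) -> List[int]:
    """Generate `nbits` bits of a PRBS7 sequence (x^7 + x^6 + 1) from a nonzero 7-bit seed."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be nonzero in its low 7 bits")
    bits = []
    for _ in range(nbits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1  # XOR of the two tap positions
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits


def count_bit_errors(received: List[int], seed: int) -> int:
    """Regenerate the expected sequence and count positions where `received` differs."""
    expected = prbs7(seed, len(received))
    return sum(1 for e, r in zip(expected, received) if e != r)


sent = prbs7(seed=0x02, nbits=254)
corrupted = list(sent)
corrupted[10] ^= 1  # flip one bit to emulate a transmission error
print(count_bit_errors(sent, 0x02), count_bit_errors(corrupted, 0x02))
```

An error count of zero corresponds to the data being judged correct after transmission; any nonzero count means the received sequence differs from the original first sequence.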
With respect to the data throughput determination module 240, it may be configured to determine whether the data throughput of the direct link satisfies a threshold condition. According to some embodiments of the present invention, a throughput calculator may be configured in the data throughput determination module 240 for calculating the amount of data transmitted or received per unit time of the direct connection port. Further, the data throughput of the direct link may be determined based on the calculated amount of data transmitted or received per unit time of the direct link port, so as to determine whether the direct link fails at least according to the data throughput and the maximum theoretical bandwidth.
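The throughput calculation described above amounts to dividing the change in a data counter by the sampling interval. The sketch below assumes the transmission data counter counts bytes and is sampled twice by software; the function name, units, and sample numbers are illustrative assumptions, not details taken from the text.

```python
def throughput_bytes_per_s(count_start: int, count_end: int, interval_s: float) -> float:
    """Amount of data transferred per unit time, from two samples of a byte counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (count_end - count_start) / interval_s


# Hypothetical samples: 1,000,000 bytes counted over half a second.
print(throughput_bytes_per_s(0, 1_000_000, 0.5))
```

Deriving throughput from the same counter used for the complete-transmission check would be consistent with the data throughput determination module being connected to the data full transmission determination module in fig. 3.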
Regarding the threshold condition, according to an embodiment of the present invention, it may include, for example, the data throughput of the direct link reaching a maximum theoretical bandwidth. Taking a data write as an example, the write data is packetized by the transaction layer, the data link layer, and the physical layer and is then transmitted to the opposite end over the link; the opposite end unpacks the data packet through the physical layer, the data link layer, and the transaction layer to recover the write data. In this example, the maximum theoretical bandwidth may refer to the write data bandwidth, i.e., the effective data bandwidth. In general, the value of the effective data bandwidth may be related to parameters such as the maximum payload size (max_payload_size), the maximum read request size (max_read_request_size), the read completion boundary (read_completion_boundary), and the like. In some examples of the invention, the value of the maximum theoretical bandwidth may be preset.
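To make the notion of effective data bandwidth concrete, the sketch below estimates it for posted writes as the raw link rate scaled by the line-encoding efficiency and by the payload fraction of each packet. The formula is a common first-order approximation rather than the method of the invention, the per-packet overhead of 24 bytes is an assumed round figure, and effects such as flow-control and acknowledgement traffic are ignored.

```python
def raw_bandwidth_bytes_per_s(rate_gt_s: float, lanes: int, encoding_efficiency: float) -> float:
    """Raw link bandwidth after line encoding, in bytes per second."""
    return rate_gt_s * 1e9 * lanes * encoding_efficiency / 8


def effective_write_bandwidth_bytes_per_s(
    rate_gt_s: float,
    lanes: int,
    encoding_efficiency: float,
    payload_bytes: int,
    overhead_bytes: int,
) -> float:
    """Raw bandwidth scaled by the payload fraction of each packet."""
    raw = raw_bandwidth_bytes_per_s(rate_gt_s, lanes, encoding_efficiency)
    return raw * payload_bytes / (payload_bytes + overhead_bytes)


# Example figures: 16 GT/s per lane with 128b/130b encoding (PCIe 4.0), 16 lanes,
# max_payload_size of 256 bytes, and an ASSUMED 24 bytes of per-packet overhead.
bw = effective_write_bandwidth_bytes_per_s(16.0, 16, 128 / 130, 256, 24)
print(f"{bw / 1e9:.2f} GB/s")
```

The payload fraction shows why max_payload_size matters: with a fixed per-packet overhead, a larger payload raises the share of link bandwidth that carries write data.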
With respect to the link retraining module 250, it may be configured to perform link reconnection for a direct link and collect parameters related to the reconnected direct link to determine whether the direct link has failed.
The parameters here are parameters that reflect the performance of the direct link, such as link speed and link width. In some embodiments, the parameters may also include the current states of the link training state machines corresponding to the two direct connection ports at the two ends of the direct link.
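The comparison of the collected parameters against target values can be sketched as below. The `LinkParams` type and the function name are hypothetical, and treating "not meeting the target value" as the link coming up slower or narrower than the target is an interpretation assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkParams:
    speed_gt_s: float  # negotiated link speed in GT/s
    width: int         # negotiated link width in lanes


def link_failed_after_retrain(observed: LinkParams, target: LinkParams) -> bool:
    """Return True if the retrained link came up slower or narrower than the target."""
    return observed.speed_gt_s < target.speed_gt_s or observed.width < target.width


target = LinkParams(speed_gt_s=16.0, width=16)
print(link_failed_after_retrain(LinkParams(16.0, 16), target))  # meets the target
print(link_failed_after_retrain(LinkParams(16.0, 8), target))   # trained at reduced width
```

A link that retrains at a reduced width or speed still carries traffic, which is why the check looks at the negotiated parameters rather than only at whether the link came back up.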
Fig. 3 illustrates an architectural schematic diagram of an exemplary fault diagnosis circuit 300 according to an embodiment of the present invention. It should be appreciated that exemplary fault diagnosis circuit 300 may also include additional modules not shown and/or may omit modules shown, the scope of the present invention being not limited in this respect.
As shown in fig. 3, the fault diagnosis circuit 300 is configured to connect with a direct port PIP of a computing device.
Regarding the direct connection port PIP, it has a Controller (Controller) and a physical layer (Phy), and a PCIe physical layer interface (PIPE) is provided between the Controller and the physical layer for transmission of data and commands.
As shown in fig. 3, the fault diagnosis circuit 300 includes: a data full transmission determination module 310 configured with a transmission data counter 312, a link training state machine state detection module 320 configured with a PIPE analyzer 322 and a link training state machine log storage unit 324, a data correctness determination module 330 configured with a PRBS generator 332 and a PRBS verifier 334, a data throughput determination module 340 configured with a throughput calculator 342, and a link retraining module 350. The data complete transmission determining module 310 and the data correctness determining module 330 are connected to a controller of the direct connection port PIP, the link training state machine state detecting module 320 is connected to a PCIe physical layer interface of the direct connection port PIP, and the data throughput determining module 340 is connected to the data complete transmission determining module 310.
It should be appreciated that the architecture of the fault diagnosis circuit 300 shown in fig. 3 is merely exemplary, and the fault diagnosis circuit provided by the present invention may have other suitable architectures, which are not limited herein.
There is also provided, in accordance with an embodiment of the present invention, a system for fault diagnosis of a direct link between computing devices, which may include one or more of the fault diagnosis circuits described above (e.g., fault diagnosis circuit 200 of fig. 2, fault diagnosis circuit 300 of fig. 3). Each fault diagnosis circuit may be connected to one of the direct connection ports of the computing device, in which case, for a direct connection link, fault diagnosis and debugging of the direct connection link may be achieved by fault diagnosis circuits respectively connected to two direct connection ports at both ends of the direct connection link. In the following, with reference to fig. 4, a detailed description will be given of how the system for diagnosing a failure of a direct link between computing devices according to the present invention implements a scheme for diagnosing a failure of a direct link between computing devices.
FIG. 4 illustrates a flow chart of a method 400 for fault diagnosis of a direct link between computing devices according to an embodiment of the invention. It should be appreciated that method 400 may also include additional actions not shown and/or may omit actions shown, the scope of the invention being not limited in this respect.
In step 402, it is determined via the system whether the data is completely transmitted via the direct link.
The data is completely transmitted, which means that a request or response sent by a direct connection port at one end of a direct connection link is completely transmitted to the direct connection port at the other end of the direct connection link via the direct connection link.
According to embodiments of the present invention, the amount of data transmitted through two direct-connect ports at both ends of the direct-connect link may be compared to determine whether the data is completely transmitted. Specifically, in one example, for two direct-connect ports (a first direct-connect port and a second direct-connect port) at both ends of a direct-connect link, a first amount of data sent by the first direct-connect port is calculated, a second amount of data received by the second direct-connect port is calculated, and the first amount and the second amount are compared. If the first number is equal to the second number, it may be determined that the data sent by the first direct port is completely transmitted to the second direct port via the direct link, that is, the data is completely transmitted via the direct link. If the first number is greater than the second number, the data sent by the first direct connection port is not completely transmitted to the second direct connection port, so that the direct connection link can be determined to be faulty.
In some embodiments, the amount of data sent or received by the direct port may be counted by the transmit data counter 312 of the data full transmission determination module 310 as shown in fig. 3.
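The counter comparison described for step 402 can be sketched as follows. This is only an illustration under stated assumptions, not the circuit itself: the function name `is_completely_transmitted` is hypothetical, and the two integers stand in for values read from the transmission data counters at the sending and receiving direct ports.

```python
def is_completely_transmitted(sent_count: int, received_count: int) -> bool:
    """Return True when the receiving port counted exactly what the sender sent.

    sent_count: amount of data counted at the first (sending) direct port.
    received_count: amount of data counted at the second (receiving) direct port.
    """
    if sent_count < 0 or received_count < 0:
        raise ValueError("counter values must be non-negative")
    # Equal counts mean the data crossed the direct link in full;
    # any mismatch is treated as incomplete transmission.
    return sent_count == received_count


# Hypothetical counter readings for one direct link.
link_ok = is_completely_transmitted(sent_count=4096, received_count=4096)
link_bad = is_completely_transmitted(sent_count=4096, received_count=4000)
```

Treating any mismatch as a failure matches the flow of fig. 5, where unequal counts lead directly to the abnormal result.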
In step 404, in response to the data being completely transmitted via the direct link, it is determined via the system whether the link training state machines respectively corresponding to the two direct ports at both ends of the direct link are both in a normal state.
According to some embodiments of the present invention, whether a link training state machine is in a normal state may be determined by determining whether a critical event has occurred in the link training state machine and whether a jump anomaly has occurred in the link training state machine. Specifically, in an example, both determinations are made for the link training state machine in each of the computing devices in which the two direct connection ports at the two ends of the direct link are respectively located.
A critical event indicates an important event that has occurred on the direct link. According to embodiments of the present invention, critical events may include, but are not limited to, link disabling, link disconnection, and hot reset.
With respect to determining whether a critical event has occurred in a link training state machine, according to embodiments of the present invention, this may be determined by reading the value of a counter that is associated with the link training state machine and counts critical events. In response to the read value being zero, it is determined that no critical event has occurred in the link training state machine. In response to the read value being non-zero, it is determined that a critical event has occurred in the link training state machine, that is, that one or more critical events such as link disabling, link disconnection, and hot reset have occurred on the direct link, so that it can be determined that the direct link has failed.
With respect to determining whether a jump anomaly has occurred in a link training state machine, according to an embodiment of the present invention, this may be determined by reading log information related to the link training state machine. For example, in some embodiments, the log information may be read from the link training state machine log storage unit shown in fig. 3. In response to the read log information including information indicating a jump anomaly, it can be determined that a jump anomaly has occurred in the link training state machine, and hence that the direct link is abnormal. In response to the read log information not including information indicating a jump anomaly, it can be determined that no jump anomaly has occurred in the link training state machine.
Further, according to the embodiment of the invention, in response to no critical event and no jump anomaly having occurred in the link training state machine, it can be determined that the link training state machine is in a normal state.
According to some embodiments of the present invention, for the two direct connection ports at the two ends of a direct link, it must be determined that both corresponding link training state machines are in a normal state, that is, that neither of them has experienced a critical event or a jump anomaly. If either link training state machine is in an abnormal state (for example, a critical event and/or a jump anomaly has occurred), it is determined that the direct link has failed.
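The two-part check of step 404 can be illustrated with the following sketch. The names `PortLtssmReport`, `JUMP_ANOMALY_MARKER`, `ltssm_is_normal`, and `both_ltssm_normal` are hypothetical, and representing log information as text lines containing a marker string is an assumption made only for illustration; the source does not specify how a jump anomaly is encoded in the log.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed marker; the actual log encoding is not given in the source.
JUMP_ANOMALY_MARKER = "JUMP_ANOMALY"


@dataclass
class PortLtssmReport:
    """Values read from the fault diagnosis circuit attached to one direct port."""
    critical_event_count: int
    log_entries: List[str] = field(default_factory=list)


def ltssm_is_normal(report: PortLtssmReport) -> bool:
    """Normal means: critical event counter is zero and no log entry flags a jump anomaly."""
    no_critical_event = report.critical_event_count == 0
    no_jump_anomaly = not any(JUMP_ANOMALY_MARKER in entry for entry in report.log_entries)
    return no_critical_event and no_jump_anomaly


def both_ltssm_normal(first: PortLtssmReport, second: PortLtssmReport) -> bool:
    """The link passes this stage only if the state machines at both ends are normal."""
    return ltssm_is_normal(first) and ltssm_is_normal(second)
```

Requiring both ends to pass reflects the rule above that an abnormal state machine at either port is enough to declare the direct link failed.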
In connection with fig. 3, in an example, the state of the link training state machine (LTSSM) corresponding to a direct port PIP may be resolved by monitoring the signals of the PCIe physical layer interface (PIPE interface) disposed between the controller of the direct port PIP and the physical layer, for example by the PIPE analyzer 322 in the link training state machine state detection module 320 of fig. 3. Typically, the LTSSM status information stored in the LTSSM status register is 6 bits wide, and the PIPE interface carries a 4-bit signal (current_data_rate) that indicates the rate of the current data transmission. Further, the fault diagnosis circuit 300 can generate a time stamp based on a clock. The 6-bit LTSSM status information, the 4-bit signal representing the rate of the current data transmission, and the clock-generated time stamp may then be stored in a static random access memory (SRAM) of the fault diagnosis circuit 300 (such as the link training state machine log storage unit 324 in the link training state machine state detection module 320 shown in fig. 3). Thus, log information relating to the LTSSM may be obtained by reading the link training state machine log storage unit 324 in order to determine whether a jump anomaly has occurred in the LTSSM and thus whether the direct link has failed. Further, critical events such as link disabling, link disconnection, and hot reset, which arise during LTSSM state jumps, may be counted by a critical event counter in the link training state machine state detection module 320 of the fault diagnosis circuit 300, so that whether the direct link has failed can be determined from the number of critical events.
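One way to picture a log entry that combines the 6-bit LTSSM status, the 4-bit current_data_rate signal, and a time stamp is to pack them into a single word. The bit layout below (state in the low 6 bits, rate in the next 4 bits, time stamp above them) and the names `LtssmLogEntry`, `pack_entry`, and `unpack_entry` are assumptions for illustration; the source states only the field widths, not how they are arranged in the SRAM.

```python
from typing import NamedTuple

STATE_BITS = 6  # width of the LTSSM status field, as stated in the text
RATE_BITS = 4   # width of the current_data_rate signal, as stated in the text


class LtssmLogEntry(NamedTuple):
    timestamp: int    # clock-derived time stamp (width not specified in the source)
    data_rate: int    # 4-bit current_data_rate value
    ltssm_state: int  # 6-bit LTSSM status value


def pack_entry(entry: LtssmLogEntry) -> int:
    """Pack one entry into an integer using the assumed layout."""
    if not 0 <= entry.ltssm_state < (1 << STATE_BITS):
        raise ValueError("ltssm_state must fit in 6 bits")
    if not 0 <= entry.data_rate < (1 << RATE_BITS):
        raise ValueError("data_rate must fit in 4 bits")
    if entry.timestamp < 0:
        raise ValueError("timestamp must be non-negative")
    return (
        (entry.timestamp << (STATE_BITS + RATE_BITS))
        | (entry.data_rate << STATE_BITS)
        | entry.ltssm_state
    )


def unpack_entry(word: int) -> LtssmLogEntry:
    """Recover the three fields from a packed word."""
    state = word & ((1 << STATE_BITS) - 1)
    rate = (word >> STATE_BITS) & ((1 << RATE_BITS) - 1)
    timestamp = word >> (STATE_BITS + RATE_BITS)
    return LtssmLogEntry(timestamp=timestamp, data_rate=rate, ltssm_state=state)
```

A host-side tool reading the log storage unit could then unpack consecutive words and inspect the sequence of states over time to look for unexpected transitions.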
In step 406, in response to the link training state machines all being in a normal state, a determination is made via the system as to whether the data is correct after transmission via the direct link.
According to an embodiment of the present invention, whether data is correct after transmission may be determined by generating a sequence, such as a string, at the direct port at one end of the direct link, transmitting the sequence via the direct link to the direct port at the other end, and checking the sequence received there. Specifically, an original first sequence may be generated at either direct port of the direct link; the original first sequence is transmitted via the direct link to the other direct port of the direct link, where it is received to obtain a received sequence; and whether the data is correct after transmission via the direct link is determined based at least on the received sequence.
For example, in one example, a number of original first sequences may be generated, based on a linear feedback shift register (LFSR), by a sequence generator (such as the PRBS generator 332 of fig. 3) coupled to one direct port of the direct link, and transmitted via the direct link to the other direct port of the direct link; the transmitted original first sequences are then received by a sequence validator (such as the PRBS verifier 334 of fig. 3) coupled to the other direct port. In this case, after receiving a portion of the pattern, the PRBS verifier 334 can generate the next bit sequence based on the LFSR and compare it with the portion of the transmitted original first sequence that is received subsequently. If the generated next bit sequence is identical to the subsequently received portion, it can be determined that the received data is correct, that is, that the data is correct after transmission via the direct link. If the generated next bit sequence differs from the subsequently received portion, the received data is incorrect, so that it can be determined that the direct link has failed.
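The generate-then-predict scheme can be sketched in software as below. The source does not name a polynomial, so PRBS7 (x^7 + x^6 + 1) is used here purely as an assumption, and the function names `prbs7_bits` and `prbs7_count_errors` are hypothetical. The verifier seeds its own register from the first seven received bits, predicts each following bit, and counts mismatches.

```python
from typing import List

MASK7 = 0x7F  # the register holds 7 bits


def _next_bit(state: int) -> int:
    # Feedback taps for x^7 + x^6 + 1: XOR of the two oldest bits in the register.
    return ((state >> 6) ^ (state >> 5)) & 1


def prbs7_bits(seed: int, count: int) -> List[int]:
    """Generate `count` PRBS7 bits from a non-zero 7-bit seed."""
    state = seed & MASK7
    if state == 0:
        raise ValueError("seed must be non-zero, otherwise the LFSR stays at zero")
    bits = []
    for _ in range(count):
        bit = _next_bit(state)
        bits.append(bit)
        state = ((state << 1) | bit) & MASK7
    return bits


def prbs7_count_errors(received: List[int]) -> int:
    """Seed from the first 7 received bits, then count mispredicted bits."""
    if len(received) <= 7:
        raise ValueError("need more than 7 bits to verify")
    state = 0
    for bit in received[:7]:
        state = ((state << 1) | bit) & MASK7
    if state == 0:
        raise ValueError("all-zero window cannot synchronise the verifier")
    errors = 0
    for bit in received[7:]:
        if _next_bit(state) != bit:
            errors += 1
        # Shift in the received bit so the verifier keeps tracking the stream.
        state = ((state << 1) | bit) & MASK7
    return errors
```

Because the verifier shifts in received bits rather than its own predictions, a single corrupted bit produces a short burst of mismatches and the checker then recovers. For this diagnosis only the distinction between zero and non-zero errors matters.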
In step 408, in response to the data being correct after transmission over the direct link, a determination is made via the system whether the data throughput of the direct link satisfies a threshold condition.
The threshold condition means that the data throughput of the direct link meets expectations. For example, the threshold condition may be that the data throughput of the direct link reaches the maximum theoretical bandwidth. In still other embodiments, the threshold condition may be that the data throughput of the direct link reaches a preset value.
For example, according to an embodiment of the present invention, the data throughput of the direct link may be calculated and compared with the maximum theoretical bandwidth to determine whether the data throughput of the direct link reaches the maximum theoretical bandwidth. For example, the amount of data transmitted or received per unit time of the direct port may be calculated by a throughput calculator in the data throughput determining module 340 as shown in fig. 3.
According to an embodiment of the present invention, the fault diagnosis circuit can actively send read requests or write requests to the direct connection port connected to it at a rate exceeding the maximum theoretical bandwidth of the direct link. Take as an example the case in which the fault diagnosis circuit sends a preset number of write requests to the direct connection port connected to it: when the write requests are sent at a rate exceeding the maximum theoretical bandwidth of the direct link, the data throughput obtained by the throughput calculator is the maximum bandwidth that the direct link actually achieves. If this maximum bandwidth is lower than the maximum theoretical bandwidth, it is determined that the direct link has failed.
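The throughput check of step 408 reduces to a rate calculation and a comparison, sketched below. The function names and the choice of bytes per second as the unit are assumptions; the source says only that the throughput calculator computes the amount of data transmitted or received per unit time and that the result is compared with the maximum theoretical bandwidth or a preset value.

```python
def measured_throughput(bytes_transferred: int, elapsed_seconds: float) -> float:
    """Amount of data moved per unit time, here expressed in bytes per second."""
    if bytes_transferred < 0:
        raise ValueError("bytes_transferred must be non-negative")
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return bytes_transferred / elapsed_seconds


def throughput_meets_threshold(measured: float, threshold: float) -> bool:
    """True when the measured throughput reaches the threshold.

    `threshold` may be the maximum theoretical bandwidth of the link
    or a preset value, depending on the embodiment.
    """
    return measured >= threshold


# Hypothetical measurement: 8 GiB moved in 0.5 s against a 16 GiB/s target.
rate = measured_throughput(8 * 1024**3, 0.5)
passes = throughput_meets_threshold(rate, 16 * 1024**3)
```

Driving the port faster than the link can carry, as the text describes, ensures that the measured rate reflects the link's real ceiling rather than the offered load.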
In step 410, in response to the data throughput of the direct link meeting the threshold condition, link retraining is performed for the direct link via the system to determine whether the direct link has failed.
Link retraining refers to initiating link training on the direct link so that the direct link is re-established.
According to the embodiment of the invention, whether the direct link fails or not can be determined by analyzing parameters of the direct link after being re-linked. Specifically, in some embodiments of the present invention, link reconnection may be performed for a direct link to collect parameters related to the reconnected direct link; and determining that the direct link fails in response to the value of the acquired parameter not meeting the target value.
As described above, the parameters may be parameters that reflect the performance of the direct link, such as the link speed and the link width.
According to the embodiment of the invention, parameters related to the reconnected direct link are collected and their values are compared with target values. For example, for the reconnected direct link, the current link speed may be collected and compared with the target link speed; if the value of the current link speed fails to reach the value of the target link speed, it is determined that the direct link has failed. Similarly, the current link width may be collected and compared with the target link width; if the value of the current link width fails to reach the value of the target link width, it is determined that the direct link has failed. In yet another example, if the value of the current link speed collected for the reconnected direct link reaches the value of the target link speed and the value of the current link width reaches the value of the target link width, it may be determined that the reconnected direct link is normal.
Further, as described above, the parameters may also include the current state of the link training state machines corresponding to the two direct ports at the two ends of the direct link. Specifically, whether the direct link is normal can be determined by collecting the current state of each link training state machine and determining whether it is the L0 state. For example, for a reconnected direct link, the current states of the link training state machines corresponding to the two direct ports at the two ends of the direct link may be collected, and it may be determined whether each collected state is the L0 state. The L0 state indicates that the direct link is active, that is, that the direct link can transmit data; therefore, in response to the current state of the link training state machine being the L0 state, it can be determined that the reconnected direct link is normal. If the current state of the link training state machine is not the L0 state, that is, the reconnected direct link is not in a state in which data can be transmitted, it is determined that the direct link has failed.
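The post-retraining check of step 410 can be sketched as a comparison of collected parameters against targets. The type `LinkParams`, the function `retrained_link_ok`, the units, and the representation of the LTSSM state as a string are all assumptions for illustration; combining the speed, width, and L0 checks in one function is also a simplification, since the text presents the L0 check as an additional option.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkParams:
    """Parameters collected from one end of a reconnected direct link."""
    speed_gt_per_s: float  # current link speed (unit assumed)
    width_lanes: int       # current link width in lanes
    ltssm_state: str       # current LTSSM state name, e.g. "L0"


def retrained_link_ok(current: LinkParams, target_speed: float, target_width: int) -> bool:
    """True when speed and width reach their targets and the LTSSM is in L0."""
    speed_ok = current.speed_gt_per_s >= target_speed
    width_ok = current.width_lanes >= target_width
    active = current.ltssm_state == "L0"
    return speed_ok and width_ok and active


# Hypothetical readings after retraining.
good = LinkParams(speed_gt_per_s=32.0, width_lanes=16, ltssm_state="L0")
downgraded = LinkParams(speed_gt_per_s=16.0, width_lanes=16, ltssm_state="L0")
```

A link that retrains successfully but at a lower speed or narrower width than targeted is reported as failed here, which matches the rule that any parameter falling short of its target value indicates a fault.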
In summary, according to the inventive concept of the present invention, the direct link and the direct ports at its two ends can be determined to be normal when all of the following hold: the data is completely transmitted via the direct link; the link training state machines corresponding to the two direct ports at the two ends of the direct link are both in a normal state; the data is correct after being transmitted via the direct link; the data throughput of the direct link satisfies the threshold condition; and, after link retraining is performed for the direct link, the collected parameters related to the reconnected direct link meet their target values. If any of the above conditions is not satisfied, it can be determined that the direct link has failed.
FIG. 5 further illustrates a flowchart of an exemplary method 500 for fault diagnosis of a direct link between computing devices, according to an embodiment of the invention. It should be appreciated that method 500 may also include additional actions not shown and/or may omit actions shown, the scope of the invention being not limited in this respect.
At step 502, a transmit data counter is read via the system.
For example, the transmission data counter in the fault diagnosis circuit connected to the two direct connection ports at both ends of the direct connection link may be read by the host via the system, respectively.
At step 504, a determination is made via the system whether the data is fully transmitted via a direct link.
For example, data read from transmission data counters in a failure diagnosis circuit connected to two direct connection ports at both ends of a direct connection link may be compared. It is assumed that two direct connection ports at two ends of the direct connection link are a first direct connection port and a second direct connection port, wherein the first direct connection port is connected with the first transmission data counter, and the second direct connection port is connected with the second transmission data counter. Comparing the first number obtained by the first transmission data counter with the second number obtained by the second transmission data counter, if the first number is not equal to the second number, it is determined that the data is not completely transmitted through the direct link, i.e. the data in the direct port at one end of the direct link is not completely transmitted to the direct port at the other end, so the process proceeds to step 530, and it is determined that the direct link is abnormal, i.e. the direct link itself fails. If the first number is equal to the second number, it is determined that the data is completely transmitted via the direct link, and the process proceeds to step 506.
At step 506, a critical event counter is read via the system.
For example, the critical event counters in the fault diagnosis circuits connected to the two direct connection ports at the two ends of the direct link may each be read by the host via the system.
At step 508, a determination is made via the system whether a critical event has occurred with the LTSSM.
For example, the value read from the critical event counter at step 506 may be compared with 0. In response to the value read from the critical event counter being other than 0, it is determined that a critical event has occurred in the LTSSM, so the process proceeds to step 530, where it is determined that the direct link is abnormal. In response to the value read from the critical event counter being 0, it is determined that no critical event has occurred in the LTSSM, and the process proceeds to step 510.
In step 510, the LTSSM log storage unit is read via the system.
For example, the LTSSM log storage units in the fault diagnosis circuits connected to the two direct connection ports at the two ends of the direct link may each be read by the host via the system.
At step 512, a determination is made via the system as to whether a jump exception has occurred with the LTSSM.
For example, the log information related to the LTSSM read from the LTSSM log storage unit at step 510 may be analyzed. If the read log information includes information indicating a jump anomaly, such as the LTSSM having jumped to an abnormal state, it is determined that a jump anomaly has occurred in the LTSSM, so the process proceeds to step 530, where it is determined that the direct link is abnormal. If the read log information does not include information indicating a jump anomaly, it is determined that no jump anomaly has occurred in the LTSSM, and the process proceeds to step 514.
At step 514, a determination is made via the system as to whether the data is correct after transmission via the direct link.
For example, the host may control the data correctness determination modules connected to the two direct connection ports at the two ends of the direct link, so that the PRBS verifier in each data correctness determination module verifies the PRBS sequence transmitted via the direct link. If the PRBS verifier corresponding to either direct connection port reports an error, it is determined that the data is incorrect after being transmitted via the direct link, so the process proceeds to step 530, where it is determined that the direct link is abnormal. If neither of the PRBS verifiers corresponding to the two direct ports at the two ends of the direct link reports an error, it is determined that the data is correct after being transmitted via the direct link, and the process proceeds to step 516.
In step 516, it is determined via the system whether the data throughput of the direct link is normal.
According to embodiments of the present invention, in order to determine at step 514 whether the data is correct after transmission via the direct link, a number of PRBS sequences may be generated by the PRBS generator in the data correctness determination module and all transmitted via the direct link. On this basis, after all of these PRBS sequences have completed transmission via the direct link, the calculation result of the throughput calculator can be read by the host via the system. If the read calculation result does not meet the threshold condition, for example the data throughput of the direct link calculated by the throughput calculator does not reach the maximum theoretical bandwidth, it is determined that the data throughput of the direct link is abnormal, so the process proceeds to step 530, where it is determined that the direct link is abnormal. If the read calculation result satisfies the threshold condition, for example the data throughput of the direct link calculated by the throughput calculator reaches the maximum theoretical bandwidth, it is determined that the data throughput of the direct link is normal, and the process proceeds to step 518.
In step 518, it is determined via the system whether the parameters of the direct link after link reconnection meet the target values.
For example, the host may control the link retraining module in the system to initiate link training on the direct link so that the direct link is reconnected. Parameters such as link speed and link width are then collected from the reconnected direct link, and the values of the collected parameters are compared with the target values. If the value of a collected parameter fails to reach its target value, that is, the current link speed fails to reach the target link speed, the current link width fails to reach the target link width, or both, it is determined that the parameters of the reconnected direct link fail to meet the target values, so the process proceeds to step 530, where it is determined that the direct link is abnormal. If the values of the collected parameters reach the target values, that is, the current link speed reaches the target link speed and the current link width reaches the target link width, it is determined that the parameters of the reconnected direct link meet the target values, and the process proceeds to step 520, where it is determined that the direct link is normal. In some embodiments, it may also be determined at step 518 whether the current state of the link training state machines corresponding to the two direct ports at the two ends of the direct link is the L0 state. If the current state of a link training state machine is not the L0 state, the process proceeds to step 530, where it is determined that the direct link is abnormal; if the current state of the link training state machines is the L0 state, the process proceeds to step 520, where it is determined that the direct link is normal.
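The overall shape of method 500 is a chain of checks that stops at the first failure. The sketch below shows that control flow only. The `diagnose` function, the stage names, and the use of zero-argument callables are assumptions, and the individual checks are stand-ins for the hardware reads described above.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a label plus a zero-argument check that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def diagnose(stages: List[Stage]) -> Optional[str]:
    """Run stages in order and return the name of the first failing stage.

    Returns None when every stage passes, meaning the link is judged normal.
    Later stages are not executed once a stage has failed.
    """
    for name, check in stages:
        if not check():
            return name
    return None


# Stand-in checks; in practice each would read counters, logs, or link parameters.
example_stages: List[Stage] = [
    ("complete transmission", lambda: True),
    ("critical event", lambda: True),
    ("jump anomaly", lambda: False),  # simulated failure at this stage
    ("data correctness", lambda: True),
    ("throughput", lambda: True),
    ("retraining parameters", lambda: True),
]

first_failure = diagnose(example_stages)  # "jump anomaly"
```

Returning the name of the failing stage, rather than a bare pass or fail, gives the operator a starting point for debugging, which is the stated purpose of the staged diagnosis.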
Further, according to some embodiments of the present invention, different diagnostic functions for the direct port and the direct link may also be implemented by sending control signals to the system, such as by the host, to control the enablement of the modules in the fault diagnosis circuit. For example, in one example, the control signal may include: a data path test signal, an LTSSM test signal, and a retrain initiation signal.
The data path test signal (data_path_test_start) may indicate that the data correctness determination module and the data throughput determination module in the fault diagnosis circuit are enabled, to determine, respectively, whether the data is correct after transmission via the direct link and whether the data throughput of the direct link satisfies the threshold condition.
The LTSSM test signal (LTSSM_log_en) may indicate that the link training state machine state detection module in the fault diagnosis circuit is enabled to determine whether the link training state machine corresponding to the direct connection port is in a normal state.
A retraining initiation signal (init retraining) may indicate that a link retraining module in the fault diagnosis circuit is enabled to perform a link reconnection for the direct link and collect parameters related to the reconnected direct link to determine whether the direct link has a fault.
Therefore, the host machine can send a control signal to the system to control the current direct connection port, or the direct connection link taking the current direct connection port as one end, or both to carry out targeted fault diagnosis, thereby improving the efficiency of fault diagnosis.
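The relationship between control signals and enabled modules can be pictured with a flag set, as sketched below. The enum `ControlSignal`, the mapping `MODULES_BY_SIGNAL`, and the function `enabled_modules` are hypothetical names; the member names are adapted from the signal names in the text, and nothing here describes the actual register interface.

```python
from enum import Flag, auto
from typing import Dict, List, Tuple


class ControlSignal(Flag):
    """Control signals a host might assert; names adapted from the text."""
    DATA_PATH_TEST_START = auto()
    LTSSM_LOG_EN = auto()
    INIT_RETRAINING = auto()


# Which modules each signal enables, following the description above.
MODULES_BY_SIGNAL: Dict[ControlSignal, Tuple[str, ...]] = {
    ControlSignal.DATA_PATH_TEST_START: (
        "data correctness determination module",
        "data throughput determination module",
    ),
    ControlSignal.LTSSM_LOG_EN: (
        "link training state machine state detection module",
    ),
    ControlSignal.INIT_RETRAINING: (
        "link retraining module",
    ),
}


def enabled_modules(signals: ControlSignal) -> List[str]:
    """List the modules enabled by the asserted combination of signals."""
    modules: List[str] = []
    for signal, names in MODULES_BY_SIGNAL.items():
        if signal in signals:
            modules.extend(names)
    return modules


# Example: run only the LTSSM check and the retraining check.
selected = enabled_modules(ControlSignal.LTSSM_LOG_EN | ControlSignal.INIT_RETRAINING)
```

Selecting a subset of signals corresponds to the targeted diagnosis described above, where the host enables only the checks relevant to the suspected fault.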
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A method for fault diagnosis of a direct link between computing devices, comprising:
determining whether data is completely transmitted via the direct link;
in response to the data being completely transmitted via the direct link, determining whether the link training state machines respectively corresponding to the two direct connection ports at the two ends of the direct link are both in a normal state;
in response to the link training state machines both being in a normal state, determining whether the data is correct after being transmitted via the direct link;
in response to the data being correct after being transmitted via the direct link, determining whether a data throughput of the direct link meets a threshold condition; and
in response to the data throughput of the direct link meeting the threshold condition, performing link retraining on the direct link to determine whether the direct link has failed.
2. The method of claim 1, wherein determining whether data is fully transmitted via the direct link comprises:
comparing the amounts of data transmitted through the two direct connection ports at the two ends of the direct link to determine whether the data is completely transmitted.
3. The method of claim 2, wherein the two direct connection ports at two ends of the direct connection link are a first direct connection port and a second direct connection port;
wherein comparing the amounts of data transmitted through the two direct connection ports at the two ends of the direct link comprises:
calculating a first amount of data transmitted by the first direct connection port;
calculating a second amount of data received by the second direct connection port; and
in response to the first amount being equal to the second amount, determining that the data is completely transmitted via the direct link.
4. The method of claim 1, wherein determining whether link training state machines respectively corresponding to two direct connection ports at both ends of the direct connection link are both in a normal state comprises:
determining whether a critical event has occurred in the link training state machine;
determining whether a jump anomaly has occurred in the link training state machine; and
in response to no critical event and no jump anomaly having occurred in the link training state machine, determining that the link training state machine is in a normal state.
5. The method of claim 4, wherein determining whether a critical event has occurred with the link training state machine comprises:
reading a value of a counter that is related to the link training state machine and counts critical events; and
in response to the read value being zero, determining that no critical event has occurred in the link training state machine,
wherein the critical event comprises any one of the following: link disabling, link disconnection, and hot reset.
6. The method of claim 4, wherein determining whether a jump anomaly has occurred in the link training state machine comprises:
reading log information related to the link training state machine to determine whether a jump anomaly has occurred in the link training state machine.
7. The method of claim 1, wherein determining whether data is correct after transmission via the direct link comprises:
generating an original first sequence at either direct connection port of the direct link;
transmitting the original first sequence via the direct link to the other direct connection port of the direct link, and receiving the transmitted original first sequence at the other direct connection port to obtain a received sequence; and
determining, based at least on the received sequence, whether the data is correct after transmission via the direct link.
8. The method of claim 1, wherein the threshold condition comprises:
the data throughput of the direct link reaches the maximum theoretical bandwidth.
9. The method of claim 1, wherein performing link retraining for the direct link to determine whether the direct link is malfunctioning comprises:
performing link reconnection for the direct link so as to collect parameters related to the reconnected direct link; and
in response to the value of a collected parameter not meeting a target value, determining that the direct link has failed.
10. The method of claim 9, wherein the parameters include: link speed and link width.
11. A system for performing the method for fault diagnosis of a direct link between computing devices of claim 1, comprising:
a fault diagnosis circuit configured to be connected to a direct connection port at either end of the direct link, wherein the fault diagnosis circuit includes:
a data full transmission determination module configured to determine whether data is fully transmitted via the direct link;
a link training state machine state detection module configured to determine whether a link training state machine corresponding to the direct connection port is in a normal state;
a data correctness determination module configured to determine the correctness of the data after being transmitted via the direct link;
a data throughput determination module configured to determine whether the data throughput of the direct link satisfies a threshold condition; and
a link retraining module configured to perform link reconnection on the direct link and collect parameters related to the reconnected direct link so as to determine whether the direct link has failed.
12. The system of claim 11, wherein the data full transmission determination module comprises:
A transmission data counter configured to count the amount of data transmitted or received by the direct connection port.
13. The system of claim 11, wherein the link training state machine state detection module comprises:
a critical event counter configured to count a number of critical events that have occurred for the link training state machine; and
And a link training state machine log storage unit configured to store log information related to the link training state machine.
14. The system of claim 11, wherein the data correctness determination module comprises:
a sequence generator configured to generate an original first sequence, the original first sequence being transmitted via the direct link; and
a sequence validator configured to verify whether the original first sequence, as received after transmission via the direct link, is correct.
15. The system of claim 11, wherein the data throughput determination module comprises:
a throughput calculator configured to calculate the amount of data transmitted or received by the direct connection port per unit time.
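
The modules recited in claims 11 to 15 can be read as a staged check in which each stage runs only if the previous one passed. The following is a minimal Python sketch of that ordering, not the patented circuit. All names (`PortSnapshot`, `diagnose`, the result strings) are hypothetical. It also assumes that the threshold condition means throughput falling below a minimum, that a normal link training state machine means zero critical events, and that a parameter "not meeting" its target means being below it; the claims do not fix these details.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass
class PortSnapshot:
    """Hypothetical readings taken from the two ends of one direct link."""
    bytes_sent: int            # transmission data counter, sending port
    bytes_received: int        # transmission data counter, receiving port
    ltsm_critical_events: int  # critical event counter of the state machine
    sent_sequence: bytes       # original first sequence from the generator
    received_sequence: bytes   # sequence seen by the validator
    window_seconds: float      # measurement window, assumed positive


def throughput(s: PortSnapshot) -> float:
    """Amount of data received per unit time (throughput calculator)."""
    return s.bytes_received / s.window_seconds


def diagnose(
    s: PortSnapshot,
    min_throughput: float,
    target_speed: float,
    target_width: int,
    retrain: Callable[[], Tuple[float, int]],
) -> str:
    """Run the staged checks and report where the diagnosis stopped."""
    # Stage 1: was the data completely transmitted?
    if s.bytes_received != s.bytes_sent:
        return "incomplete_transmission"
    # Stage 2: is the link training state machine in a normal state?
    # Assumption: any recorded critical event counts as abnormal.
    if s.ltsm_critical_events > 0:
        return "ltsm_abnormal"
    # Stage 3: is the data correct after transmission?
    if s.received_sequence != s.sent_sequence:
        return "data_incorrect"
    # Stage 4: throughput threshold condition.
    # Assumption: the condition is met when throughput is too low.
    if throughput(s) >= min_throughput:
        return "no_fault_detected"
    # Stage 5: reconnect the link and compare link speed and link width
    # with their target values. `retrain` stands in for the hardware step.
    speed, width = retrain()
    if speed < target_speed or width < target_width:
        return "link_failed"
    return "retrained_ok"


# Usage: a link that moves data correctly but slowly, and comes back
# from retraining at half the target width.
slow = PortSnapshot(1024, 1024, 0, b"\x01\x02", b"\x01\x02", 4.0)
result = diagnose(slow, 512.0, 32.0, 16, lambda: (32.0, 8))
```

Passing `retrain` as a callable keeps the reconnection step lazy, so it runs only when the earlier stages have passed, which mirrors the conditional order of the method.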
CN202410451366.4A 2024-04-15 Method and system for fault diagnosis of direct links between computing devices Active CN118051375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410451366.4A CN118051375B (en) 2024-04-15 Method and system for fault diagnosis of direct links between computing devices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410451366.4A CN118051375B (en) 2024-04-15 Method and system for fault diagnosis of direct links between computing devices

Publications (2)

Publication Number Publication Date
CN118051375A CN118051375A (en) 2024-05-17
CN118051375B CN118051375B (en) 2024-07-05

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102457409A (en) * 2010-11-02 2012-05-16 中兴通讯股份有限公司 Method and system for link failure detection
CN102984011A (en) * 2012-12-04 2013-03-20 杭州华三通信技术有限公司 Link failure positioning method and equipment

Similar Documents

Publication Publication Date Title
US8151145B2 (en) Flow control timeout mechanism to detect PCI-express forward progress blockage
CN111414268B (en) Fault processing method and device and server
JPH052654A (en) Method and circuit for detecting fault of microcomputer
KR100637780B1 (en) Mechanism for field replaceable unit fault isolation in distributed nodal environment
US6845469B2 (en) Method for managing an uncorrectable, unrecoverable data error (UE) as the UE passes through a plurality of devices in a central electronics complex
CN112306766A (en) Method, electronic device, storage system and computer program product for error detection
US8914683B2 (en) Repairing high-speed serial links
CN111078492A (en) System and method for monitoring state of SoC internal bus
US11823759B2 (en) Testing of fault detection circuit
CN118051375B (en) Method and system for fault diagnosis of direct links between computing devices
US20180113779A1 (en) Intelligent packet analyzer circuits, systems, and methods
CN118051375A (en) Method and system for fault diagnosis of direct links between computing devices
US20100162269A1 (en) Controllable interaction between multiple event monitoring subsystems for computing environments
CN115766526B (en) Method and device for testing physical layer chip of switch and electronic equipment
CN114721862B (en) Watchdog circuit with signal checking function and working method thereof
CN101458624A (en) Loading method of programmable logic device, processor and apparatus
JP3883856B2 (en) Fault diagnosis method and apparatus for signal processing system
JP2007293678A (en) Apparatus for diagnosing common bus connection
CN110907857B (en) Automatic connector detection method based on FPGA
JP2013200616A (en) Information processor and restoration circuit of information processor
CN118245328A (en) Method and monitoring system for monitoring direct links between computing devices
TW202113385A (en) Boundary scan test system and method thereof
CN111367838A (en) Method and device for detecting data storage system and data storage system
US11500717B2 (en) Method for detecting data storage system, device and data storage system
US11928022B2 (en) Introduction and detection of parity error in a UART

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant