CN115904772A

CN115904772A - Error determination method, device, equipment and storage medium for PCIe link

Info

Publication number: CN115904772A
Application number: CN202211265086.1A
Authority: CN
Inventors: 贾帅帅; 李道童; 艾山彬; 陈衍东
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-10-14
Filing date: 2022-10-14
Publication date: 2023-04-04

Abstract

The present disclosure provides an error determination method, apparatus, device and storage medium for a PCIe link, which are applied to a root node device in the PCIe link, and the method includes: increasing an error event count each time an event satisfying a preset condition is detected to be generated; when the error event count is greater than or equal to a preset first threshold value, sequentially performing repair processing of at least one layer on the PCIe link; after each time of repair processing, repeating the step of increasing the error event count when each time an event meeting preset conditions is detected to be generated, wherein each time of repair processing is directed to one layer, the repair processing of the next layer is determined based on the accumulated times of the repair processing of the previous layer, and different layers correspond to different objects in the PCIe link; and when the repair cutoff condition is met, determining the error type of the PCIe link based on the repair processing of the last layer.

Description

Error determination method, device, equipment and storage medium for PCIe link

Technical Field

The present disclosure relates to the field of communications technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining an error of a PCIe link.

Background

In recent years, PCIe (Peripheral Component Interconnect Express) devices have been widely used in the server and storage fields, and various non-fatal errors (usually manifested as packet errors) may occur in a PCIe link between a motherboard and a PCIe device. The traditional PCIe early warning mechanism is a very simple repairable-error alarm, which has a low early warning accuracy and narrow coverage.

Disclosure of Invention

The disclosure provides an error determination method for a PCIe link, which is applied to a root node device in the PCIe link, and the method comprises the following steps:

increasing the error event count when detecting the generation of an event satisfying a preset condition; the preset condition is that an error event is generated in a target clock period in a current window period;

determining whether to carry out repair processing or not based on a preset error rate period and the error event count;

if yes, sequentially performing repair processing of at least one layer on the PCIe link; after each time of repair processing, repeating the step of increasing the error event count when each time an event meeting preset conditions is detected to be generated, wherein each time of repair processing is directed to one layer, the repair processing of the next layer is determined based on the accumulated times of the repair processing of the previous layer, and different layers correspond to different objects in the PCIe link;

and when the repair cutoff condition is met, determining the error type of the PCIe link based on the repair processing of the last layer.

Optionally, the determining whether to perform the repair process based on the preset error rate period and the error event count includes:

detecting whether the preset bit error rate period has currently been reached; wherein the bit error rate period is greater than the window period;

if yes, reducing the error event count according to a preset difference value;

if the reduced error event count reaches a preset first threshold value, determining to sequentially repair the PCIe link;

and if the reduced error event count does not reach a preset first threshold value, determining that the PCIe link is not repaired.

Optionally, the sequentially performing repair processing of at least one layer on the PCIe link includes:

performing first-layer repair processing on a signal transmission layer of the PCIe link;

and when the accumulated repairing times of the repairing of the first layer is larger than or equal to a preset second threshold value, performing the repairing of a second layer, wherein the second layer comprises a hardware layer.

Optionally, the performing first-layer repair processing on the signal transmission layer of the PCIe link includes:

at the signal transmission layer, respectively repairing at least one object of the PCIe link; wherein the at least one object comprises: signal quality and/or signal transmission speed.

Optionally, in the signal transmission layer, respectively performing repair processing on at least one object of the PCIe link, including:

performing cross repair processing on each object of the PCIe link, wherein the previous repair processing is different from the next repair processing;

or, in one repair, performing repair processing on at least one object of the PCIe link;

or, performing repair processing on the current object of the PCIe link, and if the number of times of repair processing on the current object reaches a preset number of times, performing repair processing on a next object of the current object until the repair is successful or the accumulated number of times of repair reaches the preset second threshold.

Optionally, the repair process of the first layer includes a speed reduction process and/or a re-equalization process.

Optionally, the root node device is connected to a plurality of terminal devices, and when the cumulative repair number of the first repair process reaches a preset second threshold, performing a second-level repair process, including:

when the accumulated repairing times of the first repairing process reaches the preset second threshold value, positioning the target terminal equipment with the fault;

performing at least one hardware repair process on the target terminal equipment; wherein different hardware repair processes are used to recover the target terminal device.

Optionally, the performing at least one hardware repair process on the port where the target terminal device is located includes:

resetting the port where the target terminal equipment is located;

and when the accumulated reset times of the reset processing exceed a preset third threshold value, powering off and then powering on the target terminal equipment.

Optionally, the window period includes multiple clock periods, and the target clock period is a first clock period in the window period.

Optionally, the method further comprises:

detecting whether a preset error rate period is reached currently; wherein the bit error rate period is greater than the window period;

if yes, reducing the error event count according to a preset difference value;

when the error event count is greater than or equal to a preset first threshold, sequentially performing repair processing of at least one layer on the PCIe link, including:

and when the reduced error event count is greater than or equal to a preset first threshold value, sequentially performing repair processing of at least one layer on the PCIe link.

Optionally, the determining the error type of the PCIe link based on the repair processing of the last layer includes:

and determining the error type of the PCIe link based on the accumulated repairing times of the repairing process of the last layer and/or based on the repairing result of the repairing process of the last layer.

The present disclosure also provides an error determination apparatus for a PCIe link, which is applied to a root node device in the PCIe link, and the apparatus includes:

the counting module is used for increasing the error event count when detecting that the event meeting the preset condition is generated; the preset condition is that an error event is generated in a target clock period in a current window period;

the repair processing module is used for sequentially performing repair processing on at least one layer of the PCIe link when the error event count is greater than or equal to a preset first threshold; after each time of repair processing, repeating the step of increasing the error event count when each time an event meeting preset conditions is detected to be generated, wherein each time of repair processing is directed to one layer, the repair processing of the next layer is determined based on the accumulated times of the repair processing of the previous layer, and different layers correspond to different objects in the PCIe link;

and the error determining module is used for determining the error type of the PCIe link based on the repair processing of the last layer when the repair cutoff condition is met.

The present disclosure also provides an electronic device storing a computer program that causes a processor to execute the error determination method for a PCIe link.

The present disclosure also provides a computer readable storage medium storing a computer program for causing a processor to execute the error determination method for a PCIe link.

By adopting the technical scheme of the embodiments of the application, the error event count can be increased each time an event meeting the preset condition is detected; whether repair processing is needed is then judged based on the error event count and the preset error rate period, and if so, repair processing of at least one layer is performed on the PCIe link in sequence so that the PCIe link is repaired at different layers; and the error type of the PCIe link is determined based on the repair processing of the last layer.

After the repair processing is executed once, the step of increasing the error event count is repeated when the event meeting the preset condition is detected to be generated, each repair processing is specific to one layer, the repair processing of the next layer is determined based on the accumulated times of the repair processing of the previous layer, and different layers correspond to different objects in the PCIe link. Therefore, when the error rate cycle is combined to judge that the error event count reaches a certain threshold value, different layers of repair are carried out on the PCIe link, and each layer is directed at one object in the PCIe link, so that which object in the PCIe link has a fault can be determined based on the results of the repair processing of the different layers, that is, the source of the error event can be determined, and thus the error type can be accurately positioned.

The foregoing description is only an overview of the technical solutions of the present disclosure, and the embodiments of the present disclosure are described below in order that the technical means of the present disclosure may be clearly understood, and the foregoing and other objects, features, and advantages of the present disclosure may be more clearly understood.

Drawings

In order to clearly illustrate the embodiments of the present disclosure or technical solutions in related arts, the drawings used in the description of the embodiments or related arts will be briefly introduced below, and it is obvious that the drawings in the description below are some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts. It should be noted that the sizes and shapes of the figures in the drawings are not to be considered true scale, but are merely intended to schematically illustrate the present invention. The same or similar reference numbers in the drawings identify the same or similar elements or elements having the same or similar functionality.

FIG. 1 schematically illustrates a communication environment of the present application;

FIG. 2 is a schematic diagram of a hardware environment supporting the error determination method for PCIe links according to the present application;

FIG. 3 schematically illustrates a flow chart of steps of a method of error determination for a PCIe link;

FIG. 4 schematically illustrates an exemplary flow diagram of an error determination method for a PCIe link;

FIG. 5 schematically shows a structural diagram of an error determination apparatus for a PCIe link according to the present application.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without inventive step, are intended to be within the scope of the present disclosure.

In the related art, server failures can be divided into two major categories, namely downtime failures and non-downtime failures. Downtime failures mainly comprise downtime during startup and downtime during operation. Non-downtime failures comprise abnormal power supply or temperature readings, main board fan abnormalities, statistics on repairable faults and non-fatal faults of a CPU/internal memory/GPU/storage device/network device/PCIe external plug-in device, the health states of components and links, and the like, all of which are subject to monitoring.

After the server is started, various non-downtime errors inevitably occur during operation. For early warning of PCIe external plug-in devices, a very simple repairable-error alarm is generally used. A repairable error is a common non-downtime error, and in the related art a CE (Correctable Error) threshold limiting policy is used to determine whether the PCIe device has a hardware fault.

However, this policy cannot reliably predict whether the PCIe device actually has a hardware failure. Moreover, the related art only provides for a repair operation such as a reset when the PCIe device has a fatal error, so repairable errors (CE) cannot be accurately predicted.

In view of this, the present application provides a PCIe error early warning mechanism. After repairable errors occur frequently in a PCIe device, gradient repair may be performed through a series of repair mechanisms, so that error repair is performed on the PCIe link from different layers in sequence. If errors continue to be reported and the repair mechanism reaches its threshold, this indicates that a fatal error such as a hardware fault has occurred; if errors are no longer reported, this indicates that the error is repairable. The accuracy of error determination for the PCIe link can thus be improved by counting error events and the number of repairs performed by the gradient repair mechanism.

Referring to FIG. 1, a schematic diagram of a communication environment of the present application is shown. As shown in FIG. 1, the environment includes a CPU, a GPU, a BMC (Baseboard Management Controller), a plurality of root node devices, and a plurality of terminal devices connected to each root node device of the server. The root node device is also called a Root port device; in practice it may be a PCIe switch. The root node device may divide a PCIe link into more segments that finally reach the terminal devices.

In this embodiment, a data transmission link where a Root port device is located may be referred to as a PCIe link. Generally, the Root port device has a plurality of downlink expansion ports, that is, the Root port device may expand the number of PCIe ports so that more terminal devices (EP in FIG. 1) may be plugged in. In other words, the Root port device and the plurality of terminal devices exist in the PCIe link, and the portion of the link serving each different device may be referred to as a link segment in the PCIe link.

Because one Root port device is plugged with a plurality of terminal devices, the data received or sent by the terminal devices can reach corresponding addresses of the CPU and the GPU through the Root port device, and therefore error early warning can be executed by the Root port device.

The method provided by the disclosure can be applied to server products of Intel product lines and can be extended to server products or platforms with RAS functions.

Referring to FIG. 2, a schematic diagram of a hardware environment supporting the error determination method for a PCIe link according to the present application is shown, where RX is a receiving end and TX is a transmitting end. Under the PCIe mechanism, a device may act as either a transmitting end or a receiving end: in one case, device A is the transmitting end and device B is the receiving end, and in another case, device A is the receiving end and device B is the transmitting end. Device A and device B communicate through a PCIe link, and in this communication signals need to be routed through a Root port device.

As shown in FIG. 2, in the present application, the Root port device may include a register. The register provides a burst protection mechanism for PCIe and may effectively prevent too many errors from being generated instantaneously: an error is allowed to be recorded only once within a prescribed time, which avoids an error storm. It is thereby possible to determine, from the error events recorded in the register, whether to perform repair processing of the various gradients, and to determine the type of the error based on the accumulated count of repair processes recorded in the register.

The error determination method for the PCIe link according to the present application may be applied to a Root node device in the PCIe link, where the Root node device may also be referred to as a Root port device, one Root node device may be connected to multiple terminal devices through an extended downlink port, and an uplink port of one Root node device may be connected to a BMC device.

Next, the error determination method for the PCIe link according to the present application is described with reference to the communication environment shown in fig. 1 and the hardware environment shown in fig. 2. Referring to fig. 3, a flowchart illustrating steps of the error determination method for PCIe links according to the present application is shown, and as shown in fig. 3, the method specifically includes the following steps:

Step S301: increasing an error event count each time an event satisfying a preset condition is detected; the preset condition is that an error event is generated in a target clock period in a current window period;

Step S302: determining whether to perform repair processing based on a preset error rate period and the error event count;

Step S303: if repair processing is to be performed, sequentially performing repair processing of at least one layer on the PCIe link;

Specifically, step S301 may be repeated after each execution of the repair process; wherein each repair process corresponds to one layer, different repair processes are used for repairing the PCIe link at different layers, and different layers correspond to different objects in the PCIe link; and the repair process of the next layer is determined based on the accumulated number of repair processes of the previous layer.

Step S304: when the repair cutoff condition is triggered, determining the error type of the PCIe link based on the accumulated repair times of the repair processing of the last layer, wherein the error type comprises a non-fatal type and a fatal type.

In this embodiment, there are generally multiple window periods, and one window period may include multiple clock cycles, where the number of clock cycles may be determined according to the operating frequency of the PCIe link. A clock cycle is a unit of time: the higher the operating frequency, the shorter the clock cycle. The window period may be set by a user according to the actual situation, and is not limited herein.

When each window period arrives, it can be detected whether an error event is generated in the target clock cycle within that window period, and if so, the error event count is increased, for example by 1. In some cases there may be multiple target clock cycles, which may be consecutive or non-consecutive. However, to avoid an error storm caused by recording multiple error events in one window period, the target clock cycle is preferably a single clock cycle.

More specifically, the target clock cycle may be selected to be any one of the clock cycles within the window period, for example, the first clock cycle, the second clock cycle, and the like.

In each window period, as long as an error event is generated in the target clock period, the error event count is incremented by 1, which is repeated, that is, as long as an event satisfying the preset condition (the error event is generated in the target clock period in the current window period) is detected, the error event count is incremented.
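The counting rule above can be sketched as follows. This is a minimal illustration only, not the implementation of the disclosure: the class name `WindowCounter`, the use of an absolute clock-cycle index, and the chosen window length are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowCounter:
    """Counts at most one error event per window period (illustrative only)."""
    cycles_per_window: int       # clock cycles in one window period (assumed value)
    target_cycle: int = 0        # index of the target clock cycle inside each window
    error_event_count: int = 0

    def observe(self, cycle: int, error_seen: bool) -> None:
        """Count the error only if it falls in the target cycle of its window."""
        if error_seen and cycle % self.cycles_per_window == self.target_cycle:
            self.error_event_count += 1


counter = WindowCounter(cycles_per_window=4, target_cycle=0)
# Hypothetical errors at absolute clock cycles 0, 1, 2 (window 0), 4 (window 1), 9 (window 2).
error_cycles = {0, 1, 2, 4, 9}
for cycle in range(12):
    counter.observe(cycle, error_seen=cycle in error_cycles)
```

In this run only the errors at cycles 0 and 4 land on the target cycle of their window, so the three errors in window 0 contribute a single count, which is the storm-avoidance behaviour the text describes.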

The error rate period serves to exclude events caused by ordinary bit errors from the error event count. An error rate period can therefore be set, and each time it arrives the error event count is reduced, so that the final error event count is determined from the preset error rate period and the current error event count. When the final error event count exceeds a threshold, this indicates that substantive error events have been generated in a plurality of window periods, so early warning needs to be carried out on the PCIe link and the type of the error determined. In the present application, as soon as it is determined that repair processing is needed, a multi-layer repair strategy for the PCIe link can be started. In the multi-layer repair strategy, multiple kinds of repair processing can be carried out on the PCIe link, and different layers correspond to different kinds of repair processing.

Specifically, when performing the repair process of at least one layer, the step of increasing the error event count each time an event satisfying the preset condition is detected to be generated may be repeated at each repair process, that is, each time the repair process is performed, the step returns to step S301 to continue detecting the error event count, and as long as the error event count exceeds the preset first threshold again, the repair process is performed again.

In a specific implementation, each repair process may be directed at only one layer. In this application, the layers may include a physical layer and a software layer: the software layer may perform repair through a soft policy, and the physical layer may perform repair through a policy of physical operations. The objects in the PCIe link addressed by the software layer may include signals, packets, transport protocols, etc., and the physical layer may address physical devices present in the PCIe link, e.g., terminal devices or downstream expansion ports of the root node device.

Because each time of repair processing is directed at one layer, the error type can be positioned through a multi-gradient repair strategy.

In a specific implementation, whether to enter the next-layer repair process and which-layer repair process to enter may be determined by the accumulated number of times of the previous-layer repair process. That is to say, the cumulative number of times of performing repair processing of one layer on the PCIe link may determine whether to enter repair processing of the next layer, and in some cases, in the case of entering repair processing of the next layer, may also determine which layer of repair processing to specifically enter.

For example, after repair processing has been performed on layer A multiple times, if the accumulated number of repair processes of layer A reaches a certain threshold, this indicates that repair processing of a next layer is needed, and the next time the error event count exceeds the preset first threshold, the repair process of that next layer is performed. Which layer is entered is determined according to the accumulated number of times: if the accumulated number of times reaches a certain threshold 1, the repair process of the next layer may be the repair process of layer B, and if the accumulated number of times reaches a certain threshold 2, the repair process of the next layer may be the repair process of layer C. When the accumulated number of repair processes of layer B or layer C in turn reaches its corresponding threshold, the repair process of a further layer D can be entered.
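One possible reading of this layer-selection rule is sketched below. The numeric thresholds, the function name `next_layer`, and the assumption that the accumulated count is inspected once at a decision point (with the larger threshold taking precedence) are illustrative choices, not values or behaviour stated in the disclosure.

```python
from typing import Optional

THRESHOLD_1 = 3      # assumed: reaching this on layer A selects layer B
THRESHOLD_2 = 5      # assumed: reaching this on layer A selects layer C
LAYER_D_ENTRY = 2    # assumed: threshold on layer B or C that selects layer D


def next_layer(current: str, cumulative_repairs: int) -> Optional[str]:
    """Return the layer to enter next, or None to stay on the current layer."""
    if current == "A":
        # Check the larger threshold first so that it is not shadowed.
        if cumulative_repairs >= THRESHOLD_2:
            return "C"
        if cumulative_repairs >= THRESHOLD_1:
            return "B"
        return None
    if current in ("B", "C"):
        return "D" if cumulative_repairs >= LAYER_D_ENTRY else None
    return None  # layer D is treated as the last layer in this sketch
```

Returning `None` models the case where the accumulated count has not yet reached any threshold, so the current layer keeps handling the next repair.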

The threshold 1 and the threshold 2 may be set by a user, and it should be noted that, when the repair process of the next layer is performed, the repair process of the previous layer may be suspended.

After each repair process, the above step S301 may be repeatedly executed, so that the error type of the PCIe link may be determined based on the repair process of the last layer. Specifically, the error type may include a correctable type and a fatal type, where the correctable type of error is also called a repairable error, which is a common non-downtime type error, and the fatal type of error is also called a fatal error, which is an error caused by a hardware failure.

Of course, in some embodiments, the error type may include other types according to requirements, such as a type including a temperature anomaly, a type including a fan anomaly, and so on, and it should be noted that, in practice, it may be determined which layer of repair processing is performed according to the type of the error to be detected. For example, if the types of errors to be detected are a repairable type and a fatal type, the at least one layer of repair processing may include software-layer repair processing and hardware-layer repair processing.

In this embodiment, the next-layer repair process is determined based on the accumulated number of times of the previous-layer repair process, that is, whether to enter the next-layer repair process is decided by the previous-layer repair process. This can be understood as a layer-by-layer progressive repair in which each repair is directed at a fault that may occur at one layer. Thus, when a repair cutoff condition is triggered, whether the fault at that layer has been repaired can be determined from the last-layer repair process, and the error type can be determined accordingly.

Wherein, the repair cutoff condition may be the following condition: all the layers of repair processing are carried out, and the last layer of repair processing is executed for a preset number of times, or the result of the repair processing is successful repair, or the error event count is changed from exceeding a preset first threshold value to being lower than the preset first threshold value; any of the above conditions is triggered, which may indicate that the repair cutoff condition is satisfied.
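The three alternative cutoff conditions can be expressed as a single predicate, as in the sketch below. The `RepairState` structure and its field names are hypothetical, and treating "exceeding" the first threshold as "greater than or equal to" is an assumption made to match the comparison used elsewhere in the method.

```python
from dataclasses import dataclass


@dataclass
class RepairState:
    """Snapshot of the repair loop; all field names are illustrative."""
    on_last_layer: bool        # every earlier layer has already been tried
    last_layer_repairs: int    # cumulative repairs performed on the last layer
    last_layer_limit: int      # preset number of last-layer repairs (assumed value)
    repair_succeeded: bool     # result of the most recent repair
    previous_count: int        # error event count before the latest update
    current_count: int         # error event count after the latest update
    first_threshold: int       # preset first threshold


def repair_cutoff_met(s: RepairState) -> bool:
    """Return True if any one of the three cutoff conditions holds."""
    layers_exhausted = s.on_last_layer and s.last_layer_repairs >= s.last_layer_limit
    # Assumption: "exceeding" is modelled as >= the first threshold.
    count_dropped = (s.previous_count >= s.first_threshold
                     and s.current_count < s.first_threshold)
    return layers_exhausted or s.repair_succeeded or count_dropped
```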

By adopting the technical scheme of the embodiment of the application, after the repair processing is executed once, the step of increasing the error event count is repeated when the event meeting the preset condition is detected to be generated, each repair processing is performed on one layer, the repair processing of the next layer is determined based on the accumulated times of the repair processing of the previous layer, and different layers correspond to different objects in the PCIe link. Therefore, when the error event count reaches a certain threshold value, different layers of repair can be performed on the PCIe link, and each layer aims at one object in the PCIe link, so that which object in the PCIe link has a fault can be determined based on the results of the repair processing of the different layers, that is, the source of the error event can be determined, and thus the error type can be accurately positioned.

In an optional example, when the error type of the PCIe link is determined based on the repair processing of the last layer, the error type of the PCIe link may be determined based on the accumulated repair times of the repair processing of the last layer; and/or determining the error type of the PCIe link based on the repair result of the repair processing of the last layer.

In one case, the error type may be determined by comparing the accumulated repair times of the repair process of the last layer with a preset threshold. For example, if the accumulated repair times of the repair process of the last layer reaches the preset threshold, it may be determined that the repair process of that layer has been performed multiple times, and the error type may be determined regardless of whether the error has been repaired. For example, if the last layer of repair is a hardware layer at which repair has been performed 3 times, then whether or not the error has been repaired, the error can be determined to be of the fatal type.

In another case, the error type may be determined based on the repair result of the repair process of the last layer. Specifically, if the repair result is successful repair, it may be determined that the error type is a repairable error, and if the repair result is unsuccessful repair, it may be determined that the error type is a fatal error. In this case, the repair processing of the last layer can be performed only once, and the error type is determined according to the result of the one-time repair processing, so that the number of times of repair processing can be reduced, and a high load on the root node device can be avoided.

In yet another case, the error type may be determined by combining the accumulated repair times and the repair result of the repair process of the last layer. Specifically, the error type may be determined according to the result of the repair processing once the accumulated repair times of the repair processing of the last layer reaches a preset threshold. For example, when the accumulated repair times of the repair process of the last layer reaches the preset threshold and the repair result is a successful repair, the error may be classified as a repairable error; if the repair result is an unsuccessful repair, the error may be classified as a fatal error. In this case, the accuracy of the determination of the error type can be improved.
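The three determination modes described above can be compared side by side in a short sketch. The enum, the function names, and the convention of returning `None` when no decision can yet be made are assumptions introduced for illustration.

```python
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    REPAIRABLE = "repairable"
    FATAL = "fatal"


def classify_by_count(repairs: int, limit: int) -> Optional[ErrorType]:
    """Mode 1: once the last layer has been repaired `limit` times, report fatal
    regardless of the repair result (as in the hardware-layer example)."""
    return ErrorType.FATAL if repairs >= limit else None


def classify_by_result(succeeded: bool) -> ErrorType:
    """Mode 2: decide from the result of a single last-layer repair."""
    return ErrorType.REPAIRABLE if succeeded else ErrorType.FATAL


def classify_combined(repairs: int, limit: int, succeeded: bool) -> Optional[ErrorType]:
    """Mode 3: wait until the count reaches the threshold, then use the result."""
    if repairs < limit:
        return None  # not enough last-layer repairs yet to decide
    return classify_by_result(succeeded)
```

Mode 2 needs only one last-layer repair and so keeps the load on the root node device low, while mode 3 trades extra repairs for a more reliable classification.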

In an optional example, based on the preset error rate period and the error event count, the determination process of determining whether to perform the repair process may be as follows:

whether the current period reaches a preset error rate period can be detected; and if so, reducing the error event count according to a preset difference value. Then, when the reduced error event count is greater than or equal to a preset first threshold, it may be determined that at least one layer of repair processing is performed on the PCIe link in sequence.

In this embodiment, the error rate period may be greater than the window period. That is, during operation of the PCIe link, whether an event meeting the preset condition is generated is determined in each arriving window period: if such an event is generated, the error event count is increased by 1, and if not, the error event count is not increased. When the error rate period arrives, however, the error event count is reduced.

Where the predetermined difference to decrement the error event count may be 1, or may be some other value.

If the error event count is increased, whether the error rate period is reached is detected, if so, the error event count is subtracted, then, the value obtained after the error event count is subtracted is compared with a preset first threshold value, if the value is greater than or equal to the preset first threshold value, the multi-layer repair processing is started, and if the value is less than the preset first threshold value, the multi-layer repair processing is not started, and the step S301 is continuously repeated.

Of course, it is also possible to periodically detect whether the error rate period has been reached, instead of increasing the error event count, and to subtract the error event count as long as the error rate period has been reached in the case of periodic detection, so that the error event count can be changed according to whether an event satisfying the preset condition occurs after subtracting the error event count.

Therefore, the problem of continuous accumulation of error event counts caused by the error rate can be avoided, and the accuracy of determining the error type of the application is improved.
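The interaction between the window period, the error rate period and the first threshold may be easier to follow in code. The following is a minimal sketch under stated assumptions: the class name `ErrorEventCounter` and its methods are hypothetical, and clamping the count at zero is an assumption that the present disclosure does not state.

```python
from dataclasses import dataclass


@dataclass
class ErrorEventCounter:
    first_threshold: int
    decrement: int = 1   # the preset difference value
    count: int = 0

    def on_window(self, error_in_target_cycle: bool) -> None:
        """Called once per window period; counts at most one event."""
        if error_in_target_cycle:
            self.count += 1

    def on_error_rate_period(self) -> bool:
        """Called when the error rate period is reached. Reduces the
        count, then reports whether repair processing should start."""
        # Assumption: the count is never allowed to go below zero.
        self.count = max(0, self.count - self.decrement)
        return self.count >= self.first_threshold
```

With a first threshold of 3, four counted events followed by one error rate period leave a count of 3, so repair processing would start; a further error rate period with no new events lowers the count to 2, and no repair is started.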

On the basis of the above embodiment, how the multi-gradient repair processing is performed is described below.

When performing the multi-gradient repair processing, in an optional embodiment, repair may be performed on the PCIe link at two layers: a signal transmission layer and a hardware layer. The signal transmission layer can be understood as a software level, that is, error events are repaired by soft means; the hardware layer can be understood as a physical level, that is, error events are repaired by physical means, such as unplugging and re-plugging the terminal device or resetting the port.

When error events are repaired by soft means, objects such as the speed, quality and bandwidth of the transmission signal in the PCIe link can be repaired, so as to repair errors caused by software, the operating system and the like; when error events are repaired by physical means, objects such as terminal devices and ports in the PCIe link can be repaired, so as to repair errors caused by abnormal temperature, abnormal power supply and the like.

In this optional embodiment, when the error event count corrected by the preset error rate period is greater than or equal to the preset first threshold, the error of the PCIe link is first repaired at the signal transmission layer by soft means. If error events still occur and the error cannot be repaired by soft means, the error of the PCIe link is repaired at the hardware layer by physical means. The error type of the PCIe link is thus determined from the repair results at the signal transmission layer and at the hardware layer.

In a specific implementation, in step S303, when the error event count corrected by the preset error rate period is greater than or equal to the preset first threshold, the repair processing of the first layer may be performed on the signal transmission layer, and when the accumulated repair count of the first layer is greater than or equal to a preset second threshold, the repair processing of the second layer may be performed.

Wherein the first layer comprises a signaling layer and the second layer comprises a hardware layer.

In this embodiment, the preset second threshold may be determined according to the actual situation and is not specifically limited herein. As described above, each repair processing is performed on one layer, and in practice one layer may be repaired multiple times; that is, the repair processing of the first layer may be performed multiple times. When the accumulated repair count of the first layer is greater than or equal to the preset second threshold, this indicates that the repair processing of the first layer cannot prevent the error event from occurring, that is, the cause of the error event may not lie in the first layer; in this case, the repair processing of the second layer may be performed.

The signal transmission layer may correspond to software-configurable objects of the PCIe link, such as the bandwidth setting, the transmission speed setting and signal processing. In this case, the repair processing of the first layer is mainly used to repair such objects, for example by adjusting the bandwidth setting, the transmission speed setting or the signal processing.

The hardware layer may correspond to physical objects of the PCIe link, such as terminal devices and ports. In this case, the repair processing of the second layer is mainly used to repair such physical objects, for example by powering them off and on, resetting them or initializing them.

After the repair processing of the first layer is performed, if it is shown that the repair is successful, the repair processing of the second layer may not be performed, that is, the repair cutoff condition is satisfied, and it may be determined that the error type of the PCIe link is a repairable type based on the repair processing of the first layer, that is, the error of the PCIe link is a repairable error.

After the repair processing of the first layer is performed, if the repair is unsuccessful, the error event count continues to accumulate and the repair processing of the first layer continues to be performed. When the accumulated repair count of the first layer exceeds the preset second threshold, the repair of the second layer is entered. If the repair of the second layer is successful, the repair cutoff condition is satisfied, and it may be determined, based on the repair processing of the second layer, that the error type of the PCIe link is a repairable type (the error is caused by hardware but is still repairable). If the repair is still unsuccessful after the second layer has been repaired multiple times, the repair cutoff condition is satisfied when the accumulated repair count exceeds a certain threshold, and at this time it may be determined that the error type of the PCIe link is a fatal type, that is, an unrepairable type.

By adopting the technical solution of this embodiment of the application, when the error event count reaches the preset first threshold, the PCIe link can be repaired at the signal transmission layer and at the hardware layer in turn, and the repair processing of the second layer is performed when the accumulated repair count of the first layer reaches the preset second threshold. Software objects and physical objects in the PCIe link can thus be repaired in sequence, the cause of the error is investigated while the error is being repaired, and both software and hardware error types in the PCIe link can be covered.
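The escalation from the first layer to the second layer, and the resulting error type, can be sketched as follows. This is a simplified illustration, not the implementation of the present disclosure: the names `Layer`, `Verdict` and `run_layers` are hypothetical, the repair actions are passed in as callables that return True on success, and the re-counting of error events between successive repairs is collapsed into a plain loop.

```python
from enum import Enum
from typing import Callable, Tuple


class Layer(Enum):
    SIGNAL = "signal transmission layer"
    HARDWARE = "hardware layer"


class Verdict(Enum):
    REPAIRABLE = "repairable"
    FATAL = "fatal"


def run_layers(repair_signal: Callable[[], bool],
               repair_hardware: Callable[[], bool],
               second_threshold: int,
               hardware_limit: int) -> Tuple[Layer, Verdict]:
    """Try the first layer up to `second_threshold` times, then the
    second layer up to `hardware_limit` times (a hypothetical cutoff)."""
    for _ in range(second_threshold):
        if repair_signal():
            return Layer.SIGNAL, Verdict.REPAIRABLE
    for _ in range(hardware_limit):
        if repair_hardware():
            return Layer.HARDWARE, Verdict.REPAIRABLE
    # Neither layer could stop the error events.
    return Layer.HARDWARE, Verdict.FATAL
```

The returned layer indicates where the repair finally took effect, which is what allows the error to be attributed to either the software or the hardware side of the link.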

When the repair processing of the first layer is performed, that is, when the repair processing of the signal transmission layer is performed, the repair processing can be performed on the signal transmission speed and the quality of the transmitted signal, so that at least one object of the PCIe link can be repaired; wherein, the repair processing of the signal transmission speed comprises speed reduction processing.

Wherein the at least one object comprises: signal quality and/or signal transmission speed.

In one case, the repair processing for signal quality may include re-equalization processing, which is used to improve the quality of the transmitted signal, and the repair processing for signal transmission speed may include speed reduction processing, which is used to reduce the data transmission speed of the PCIe link.

In the speed reduction processing, the current signal transmission speed may be obtained and then reduced to a lower level of signal transmission speed; for example, if the current signal transmission speed is Gen5, it may be reduced from Gen5 to Gen4 or Gen3.

Of course, in an optional example, if the accumulated repair count of the speed reduction processing reaches a preset number, the level of the speed reduction processing may be increased, where the higher the level of the speed reduction processing, the lower the resulting signal transmission speed. For example, if error events are still generated after the signal transmission speed has been reduced from Gen5 to Gen4 and the repair has been performed 3 times in total, the signal transmission speed may be further reduced from Gen4 to Gen3, and so on, until the accumulated repair count of the first layer reaches the preset second threshold.
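The stepwise speed reduction can be illustrated with a short sketch. The function name `next_speed` and the parameter `attempts_per_level` are hypothetical; the sketch assumes that one generation is dropped each time the preset number of repairs has accumulated at the current speed, and that Gen1 is the lowest speed.

```python
GENERATIONS = ["Gen1", "Gen2", "Gen3", "Gen4", "Gen5"]


def next_speed(current: str, attempts_at_current: int,
               attempts_per_level: int = 3) -> str:
    """Keep the current speed until `attempts_per_level` repairs have
    accumulated at it, then drop one generation (never below Gen1)."""
    idx = GENERATIONS.index(current)
    if attempts_at_current >= attempts_per_level and idx > 0:
        return GENERATIONS[idx - 1]
    return current
```

For the example in the text, `next_speed("Gen4", 3)` gives `"Gen3"`, while `next_speed("Gen4", 2)` keeps the link at `"Gen4"`.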

Of course, in other cases, the repair process of the first layer may also include a bandwidth reduction process, and the like, which is not limited herein.

In one embodiment, the repair processing may be performed on the at least one object in one of the following modes:

< first-layer repair processing >

Mode 1: cross repair processing is performed on the objects of the PCIe link, where the object of each repair differs from the object of the next repair.

In the first mode, since the repair processing of the first layer may involve a plurality of objects, for example signal quality and signal transmission speed, the objects may be repaired alternately. For example, signal quality is repaired in the first repair and signal transmission speed in the second repair, and the two alternate in this way; or signal transmission speed is repaired in the first repair and signal quality in the second repair, and the two alternate in this way.

By adopting the first mode, since the objects are repaired alternately, the probability of successfully repairing the error can be improved, and when a given repair succeeds, the error type can be determined more accurately and the cause of the error can be located at a finer granularity.

Mode 2: in a single repair, repair processing is performed on every object of the PCIe link.

In the second mode, repair processing may be performed on every object within a single repair. Specifically, in a single repair, both signal quality and signal transmission speed may be repaired, that is, both re-equalization processing and speed reduction processing are performed. The two may be performed one after the other within that single repair, for example re-equalization followed by speed reduction, or speed reduction followed by re-equalization.

By adopting the second mode, since every object is repaired in a single repair, the probability of successfully repairing the error can be improved.

Mode 3: repair processing is performed on the current object of the PCIe link; if the number of repairs performed on the current object reaches a preset first number, repair processing is performed on the next object, until the repair is successful or the accumulated repair count reaches the preset second threshold.

In the third mode, a certain object may be repaired first, and when the repair of the object reaches a preset number of times, the next object of the object is repaired, and the process is repeated until the repair is successful or the accumulated number of times of repair reaches a preset second threshold.

In a specific implementation, the signal quality may be repaired first, for example by re-equalization processing. If the number of re-equalization operations reaches the preset number, the signal transmission speed may be repaired, for example by speed reduction processing, and if the number of speed reduction operations reaches the preset number, the signal quality may be repaired again. The repair processing of the first layer ends when the repair is successful or the accumulated repair count reaches the preset second threshold; it should be noted that satisfying either of these two conditions is sufficient.

It should be noted that, in the third mode, the preset number is smaller than the preset second threshold; for example, if the preset second threshold is set to 6, the preset number may be set to 2 or 3. In practice, the preset number may be set according to actual requirements and is not specifically limited herein.

By adopting the third mode, each object is repaired several times in succession and independently of the others, so that the probability of successfully repairing the error can be improved and the error type can be located more accurately.
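The three modes above differ only in the order in which the objects are repaired. The following sketch builds the repair schedule for each mode, assuming the two objects named in the text (signal quality, repaired by re-equalization, and signal transmission speed, repaired by speed reduction). The function names are hypothetical, and `total` stands for the preset second threshold.

```python
from itertools import cycle, islice
from typing import List

ACTIONS = ["re-equalization", "speed reduction"]


def mode_one(total: int) -> List[List[str]]:
    """Cross repair: consecutive repairs target different objects."""
    return [[action] for action in islice(cycle(ACTIONS), total)]


def mode_two(total: int) -> List[List[str]]:
    """Every repair handles all objects, one after the other."""
    return [list(ACTIONS) for _ in range(total)]


def mode_three(total: int, per_object: int) -> List[List[str]]:
    """Repair one object `per_object` times, then move to the next."""
    return [[ACTIONS[(i // per_object) % len(ACTIONS)]]
            for i in range(total)]
```

With a second threshold of 6 and a preset number of 2, `mode_three(6, 2)` schedules two re-equalizations, two speed reductions, and then two re-equalizations again. In every mode the schedule would be abandoned as soon as a repair succeeds.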

< second-layer repair processing >

When the repair process of the second layer is performed, that is, when the repair process of the hardware layer is performed, at least one hardware repair process may be used to perform the repair.

In a specific implementation, when the accumulated repair count of the first-layer repair processing reaches the preset second threshold, the faulty target terminal device is located, and at least one hardware repair process is performed on the target terminal device.

Wherein different hardware repair processes are used to recover the target terminal device.

The hardware repair process may refer to performing a physical operation on the target terminal device, for example, sending an electrical signal to the target terminal device to power on or power off the target terminal device. In this way, the target terminal device with a fault can be subjected to hardware repair through various physical means, so that the target terminal device can be repaired as far as possible, physical correctable errors of the target terminal can be repaired, and errors such as temporary faults caused by poor power supply contact and temperature rise can be repaired.

To locate the faulty device, for example, when an error event occurs, the B/D/F (Bus/Device/Function) number carried in the error event may be read, and the device corresponding to that B/D/F number is determined to be the faulty device.
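As an illustration of reading a B/D/F number, the sketch below decodes a 16-bit identifier laid out as 8 bits of bus, 5 bits of device and 3 bits of function, which is the conventional PCIe Requester ID layout for non-ARI devices. That the error event stores the number in exactly this form is an assumption, and the function names are hypothetical.

```python
from typing import Tuple


def decode_bdf(requester_id: int) -> Tuple[int, int, int]:
    """Split a 16-bit identifier into (bus, device, function)."""
    if not 0 <= requester_id <= 0xFFFF:
        raise ValueError("identifier must fit in 16 bits")
    bus = (requester_id >> 8) & 0xFF      # bits 15..8
    device = (requester_id >> 3) & 0x1F   # bits 7..3
    function = requester_id & 0x07        # bits 2..0
    return bus, device, function


def format_bdf(bus: int, device: int, function: int) -> str:
    """Render in the common bb:dd.f hexadecimal notation."""
    return f"{bus:02x}:{device:02x}.{function:x}"
```

For example, the identifier `0x0119` decodes to bus 1, device 3, function 1, which is rendered as `01:03.1`.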

In one implementation, to predict the error type of the target terminal device more accurately, the target terminal device may be recovered by two kinds of hardware repair processing. In a specific implementation, the port where the target terminal device is located may be reset, and when the accumulated number of resets exceeds a preset third threshold, the target terminal device is powered off and then powered on.

In this embodiment, the port where the target terminal device is located may refer to the port, among the downstream expansion ports of the root node device, into which the target terminal device is plugged. In practice, the port may be reset, that is, initialized, so that it reconnects to the target terminal device. When the accumulated number of resets exceeds the preset third threshold, this indicates that resetting the port cannot repair the error. At this time, the target terminal device may be powered off and then powered on: specifically, the target terminal device may be removed from the root node device, that is, powered off, and then reconnected to the root node device in cooperation with the operating system, that is, powered on, so that the connection to the target terminal device is restored.

When the accumulated number of power-off and power-on operations exceeds a certain threshold, the error type can be determined to be a fatal type, that is, a fatal error. If errors no longer accumulate after the power-off and power-on processing, the error type can be determined to be a repairable error. In this way, not only can possible hardware errors be repaired by several hardware repair methods, but error types of finer granularity can also be determined from the result of each hardware repair process.
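The choice between the two hardware repair processes can be sketched as a small decision function. The function name, the returned action labels and the parameter `power_cycle_limit` (standing for the "certain threshold" on power-off and power-on operations) are hypothetical.

```python
def hardware_action(reset_count: int, power_cycle_count: int,
                    third_threshold: int, power_cycle_limit: int) -> str:
    """Pick the next hardware repair step for the target device."""
    if reset_count <= third_threshold:
        return "reset_port"    # keep resetting the downstream port
    if power_cycle_count <= power_cycle_limit:
        return "power_cycle"   # resets exhausted: power off, then on
    return "fatal"             # both means exhausted: fatal error
```

With a third threshold of 2 and a power cycle limit of 1, the port is reset while the reset count is at most 2, the device is then power cycled, and the error is declared fatal once the power cycle count exceeds 1.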

In an optional example, the window period includes a plurality of clock cycles, and the target clock cycle is the first clock cycle within the window period. If an error event is generated in the first clock cycle, the error event count is increased by 1; if no error event is generated in the first clock cycle, the error event count is not increased, regardless of whether error events are generated in subsequent clock cycles of that window period. An error is therefore recorded at most once within the specified time, which avoids an error storm.
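The effect of counting only the first clock cycle of each window period can be shown with a short sketch. The function name is hypothetical, and each window period is modeled as a list of booleans, one per clock cycle, where True means an error event occurred in that cycle.

```python
from typing import List


def count_window_errors(windows: List[List[bool]]) -> int:
    """Count one error per window, and only if the error occurred in
    the first (target) clock cycle of that window."""
    count = 0
    for cycles in windows:
        if cycles and cycles[0]:
            count += 1
    return count
```

A window in which every cycle reports an error still contributes only 1 to the count, and a window whose first cycle is clean contributes nothing even if later cycles report errors.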

In the following, a specific example is used to describe the error determination method for a PCIe link according to the present application:

Referring to fig. 4, an exemplary flowchart of the error determination method for a PCIe link of the present application is shown. As shown in fig. 4 and fig. 2, the method includes the following steps:

S1: A root node device is located on the PCIe link between the receiving end and the sending end, and the root node device has a register.

S2: In the first clock cycle of a window period, as shown in fig. 4, in the case where Error Event = 1 / Aggr_Cnt = 0, that is, if an error event occurs, ERR_Count is increased by 1 to obtain ERR_Cnt.

S3: The current value of ERR_Cnt is compared with the preset first threshold, and whether the error rate period has been reached is determined. If the error rate period has been reached, ERR_Count is reduced by 1 and the reduced value is compared with the preset first threshold; if it does not exceed the preset first threshold, the multi-layer repair processing is not entered and the flow returns to step S2.

S4: If the value exceeds the preset first threshold, the multi-layer repair processing is performed, which specifically includes the following steps:

S41: As shown in fig. 2, G3 b level is set to 1 and digradeen is set to 1, that is, G3 b level and digradeen are enabled so that re-equalization processing and speed reduction processing are performed at the same time, and the flow returns to step S2 after each round of re-equalization and speed reduction processing. If the error has been repaired, no error event is generated and the error event count is not accumulated. It should be noted that, because whether to accumulate the error event count is decided in the first clock cycle of each window period, the error event count is not accumulated once the error has been repaired, and the re-equalization processing and speed reduction processing are not continued.

S42: When the accumulated count of the re-equalization and speed reduction processing exceeds the preset second threshold, the second-layer repair processing is performed, such as the EDPC processing shown in fig. 4; that is, the faulty target terminal device may be removed and a recovery action performed in cooperation with the operating system.

S43: If the accumulated repair count of the second-layer repair processing exceeds the preset third threshold, early warning information is generated and reported to the BMC for handling. If it does not exceed the preset third threshold, the EDPC processing continues; each time the EDPC processing is performed, the EDPC count is increased by 1, and error information is generated and reported to the operating system to indicate that the error is of a hardware type.

As shown in fig. 4, regardless of whether the accumulated repair count of the second layer exceeds the preset third threshold, it may be determined that the error is caused by a hardware device, so alarm information for a hardware error may be reported to the operating system, and the error type may be determined to be a hardware device error.
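The reporting behavior of step S43 can be summarized in a short sketch. The function name `edpc_step` and the action labels are hypothetical; the sketch only mirrors the two branches described above and the fact that a hardware error is reported to the operating system in both of them.

```python
from typing import List, Tuple


def edpc_step(edpc_count: int, third_threshold: int) -> Tuple[int, List[str]]:
    """Return the updated EDPC count and the actions for this step."""
    if edpc_count > third_threshold:
        # Threshold exceeded: raise an early warning to the BMC.
        return edpc_count, ["warn_bmc", "report_os_hardware_error"]
    # Otherwise run EDPC once more and count it.
    return edpc_count + 1, ["run_edpc", "report_os_hardware_error"]
```

In both branches the operating system is told that the error is of a hardware type; only the additional action (another EDPC round or an early warning to the BMC) differs.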

By adopting the technical scheme of the embodiment of the application, the method has the following advantages:

1. When the error event count reaches a certain threshold, repair can be performed on the PCIe link at different layers, each layer targeting one kind of object in the PCIe link. Based on the results of the repair processing at the different layers, it can be determined which object in the PCIe link is faulty, that is, the source of the error event can be determined, so that the error type can be located accurately.

2. Since the error event count can be reduced every error rate period, accumulation of the error count caused by the inherent error rate can be avoided, which improves the accuracy of determining the error type.

3. Since the error event is counted only when the error event is generated in the first clock cycle in each window cycle, and the error event is not counted when the error event is generated in other clock cycles in the window cycle, the problem of error storm can be avoided, the fault which really generates errors can be checked, and therefore the accuracy of error type judgment is improved.

Based on the same inventive concept, the present application further provides an error determination apparatus for a PCIe link. Fig. 5 shows a schematic structural diagram of the apparatus; as shown in fig. 5, the apparatus may specifically include the following modules:

a counting module 501, configured to increment an error event count each time an event meeting a preset condition is detected to be generated; the preset condition is that an error event is generated in a target clock period in a current window period;

a determining module 502, configured to determine whether to perform a repair process based on a preset error rate period and the error event count;

a repair processing module 503, configured to sequentially perform repair processing on at least one layer on the PCIe link when the error event count is greater than or equal to a preset first threshold; after each time of repair processing, repeating the step of increasing the error event count when each time an event meeting preset conditions is detected to be generated, wherein each time of repair processing is directed to one layer, the repair processing of the next layer is determined based on the accumulated times of the repair processing of the previous layer, and different layers correspond to different objects in the PCIe link;

an error determining module 504, configured to determine an error type of the PCIe link based on the repair processing of the last layer when the repair cutoff condition is satisfied.

Optionally, the repair processing module 503 includes:

the first processing unit is used for performing first-layer repair processing on a signal transmission layer of the PCIe link;

and the second processing unit is used for performing the repair processing of a second layer when the accumulated repair times of the repair processing of the first layer is greater than or equal to a preset second threshold, wherein the second layer comprises a hardware layer.

Optionally, the first processing unit is specifically configured to, in the signal transmission layer, respectively perform repair processing on at least one object of the PCIe link;

Optionally, the first processing unit is specifically configured to:

Optionally, the root node device is connected to a plurality of terminal devices, and the second processing unit includes:

the positioning unit is used for positioning the target terminal equipment with the fault when the accumulated repairing times of the first repairing process reaches the preset second threshold;

a recovery unit, configured to perform at least one hardware repair process on the target terminal device; wherein different hardware repair processes are used to recover the target terminal device.

Optionally, the recovery unit includes:

the first recovery subunit is used for resetting the port where the target terminal device is located;

and the second recovery subunit is used for powering off and then powering on the target terminal equipment when the accumulated reset times of the reset processing exceeds a preset third threshold value.

Optionally, the determining module 502 includes:

the detection unit is used for detecting whether the current period reaches a preset error rate period or not; wherein the bit error rate period is greater than the window period;

the counting and subtracting unit is used for reducing the error event counting according to a preset difference value when a preset error rate period is reached;

and the determining unit is used for determining that the PCIe link is to be repaired in sequence when the reduced error event count reaches a preset first threshold value.

Optionally, the error determination module 504 is specifically configured to:

Based on the same inventive concept, the present application further provides a computer-readable storage medium storing a computer program for causing a processor to execute the error determination method for a PCIe link.

Based on the same inventive concept, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the error determination method for a PCIe link when executing the computer program.

Finally, it should also be noted that, unless otherwise defined, the terms "first," "second," and the like, as used herein do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus comprising the element. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.

The method, the apparatus, the device and the storage medium for determining an error of a PCIe link provided in the present disclosure are described in detail above, and a specific example is applied in the present disclosure to illustrate the principle and the implementation of the present disclosure, and the description of the above embodiment is only used to help understanding the method and the core idea of the present disclosure; meanwhile, for a person skilled in the art, based on the idea of the present disclosure, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present disclosure should not be construed as a limitation to the present disclosure.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Moreover, it is noted that instances of the word "in one embodiment" are not necessarily all referring to the same embodiment.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The disclosure may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.

Claims

1. An error determination method for a PCIe link, applied to a root node device in the PCIe link, the method comprising:

increasing an error event count each time an event satisfying a preset condition is detected to be generated; the preset condition is that an error event is generated in a target clock period in a current window period;

determining whether to carry out restoration processing or not based on a preset error rate period and the error event count;

if yes, sequentially performing repair processing of at least one layer on the PCIe link; after each time of repair processing, repeating the step of increasing the error event count when each event meeting preset conditions is detected to be generated, wherein each time of repair processing is performed on one layer, and the repair processing of the next layer is determined based on the accumulated times of the repair processing of the previous layer;

2. The method of claim 1, wherein determining whether to perform repair processing based on the preset error rate period and the error event count comprises:

detecting whether the current period reaches the preset error rate period or not; wherein the bit error rate period is greater than the window period;

if so, reducing the error event count according to a preset difference value;

and if the reduced error event count reaches a preset first threshold value, determining to sequentially repair the PCIe link.

3. The method of claim 1, wherein sequentially performing repair processing of at least one layer on the PCIe link comprises:

and when the accumulated number of repairs of the first-layer repair processing is greater than or equal to a preset second threshold value, performing repair processing of a second layer, wherein the second layer comprises a hardware layer.

4. The method of claim 3, wherein performing the first-layer repair processing on the signal layer of the PCIe link comprises:

5. The method of claim 4, wherein respectively performing repair processing on at least one object of the PCIe link at the signal layer comprises:

6. The method according to any one of claims 3 to 5, wherein the first-layer repair processing comprises rate reduction processing and/or re-equalization processing.

7. The method according to any one of claims 3 to 5, wherein the root node device is connected to a plurality of terminal devices, and performing the repair processing of the second layer when the accumulated number of repairs of the first-layer repair processing reaches the preset second threshold value comprises:

8. The method according to claim 7, wherein the performing at least one hardware repair process on the port where the target terminal device is located includes:

resetting the port where the target terminal equipment is located;

9. The method of claim 1, wherein the window period comprises a plurality of clock periods, and wherein the target clock period is a first clock period within the window period.

10. The method of claim 1, wherein determining the error type for the PCIe link based on the last-layer repair process comprises:

11. An apparatus for error determination of a PCIe link, applied to a root node device in the PCIe link, the apparatus comprising:

a judging module, configured to determine whether to perform repair processing based on a preset error rate period and the error event count;

a repair processing module, configured to sequentially perform repair processing of at least one layer on the PCIe link when it is determined to perform repair processing; after each repair processing, the step of increasing the error event count each time an event satisfying the preset condition is detected is repeated, wherein each repair processing is directed to one layer, the repair processing of a next layer is determined based on the accumulated number of times of the repair processing of a previous layer, and different layers correspond to different objects in the PCIe link;

12. An electronic device storing a computer program that causes a processor to execute the error determination method for a PCIe link according to any one of claims 1 to 10.

13. A computer-readable storage medium storing a computer program for causing a processor to execute the error determination method for a PCIe link according to any one of claims 1 to 10.
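
For illustration only, the counting and escalation steps recited in claims 1 to 3 may be sketched in Python as follows. The class name LinkMonitor, its field names, the default values, and the reset of the count after each repair are assumptions made for this sketch and are not taken from the disclosure. The repair actions themselves are represented only as log entries ("signal" for the first layer, "hardware" for the second layer).

```python
from dataclasses import dataclass, field


@dataclass
class LinkMonitor:
    """Hypothetical root-node monitor; every name and default is illustrative."""
    window_clocks: int = 4        # clock periods per window period
    target_clock: int = 0         # index of the target clock period in a window
    error_rate_period: int = 16   # in clock periods; chosen larger than the window
    decrement: int = 1            # preset difference removed each error rate period
    first_threshold: int = 3      # count that triggers repair processing
    second_threshold: int = 2     # first-layer repairs before escalating
    count: int = 0
    first_layer_repairs: int = 0
    log: list = field(default_factory=list)

    def tick(self, clock: int, error: bool) -> None:
        """Process one clock period; `clock` starts at 0."""
        # Count an error only when it falls in the target clock period of its window.
        if error and clock % self.window_clocks == self.target_clock:
            self.count += 1
        # When an error rate period ends, reduce the count, then test the threshold.
        if (clock + 1) % self.error_rate_period == 0:
            self.count = max(0, self.count - self.decrement)
            if self.count >= self.first_threshold:
                self._repair()

    def _repair(self) -> None:
        # Escalate to the hardware layer once enough first-layer repairs accumulated.
        if self.first_layer_repairs >= self.second_threshold:
            self.log.append("hardware")
        else:
            self.first_layer_repairs += 1
            self.log.append("signal")
        self.count = 0  # assumption: counting restarts after each repair


monitor = LinkMonitor(window_clocks=2, error_rate_period=4, first_threshold=1)
for clock in range(16):
    monitor.tick(clock, error=True)
print(monitor.log)  # ['signal', 'signal', 'hardware', 'hardware']
```

In this sketch, reducing the count once per error rate period means that sparse errors decay away without triggering repair, while a sustained error rate pushes the count to the first threshold. Escalation depends only on how many first-layer repairs have already been performed, which mirrors the accumulated-count condition of claim 3.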