CN112463521B - Fault-tolerant method and device for improving reliability of serial high-speed bus equipment

Info

Publication number: CN112463521B
Application number: CN202011227058.1A
Authority: CN
Inventors: 史文举; 吴学荣
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2022-07-19
Anticipated expiration: 2040-11-06
Also published as: CN112463521A

Abstract

The invention provides a fault-tolerant method and device for improving the reliability of serial high-speed bus equipment. The method comprises the following steps: setting the serial high-speed bus device to run at the highest speed; starting error monitoring to obtain the number of correctable errors; when the number of correctable errors accumulated over a first time threshold reaches a first threshold, retraining the serial high-speed bus link and counting the number of retrainings; when the accumulated number of retrainings reaches a second threshold, forcibly reducing the speed of the serial high-speed bus device to a first running speed and counting the number of forced speed reductions; and when the accumulated number of forced speed reductions reaches a third threshold, fixing the serial high-speed bus device at the first running speed and outputting alarm information to prompt replacement. The method improves the reliability of the storage product, reduces frequent replacement of equipment, and saves resources.

Description

Fault-tolerant method and device for improving reliability of serial high-speed bus equipment

Technical Field

The invention relates to the technical field of cloud computing data center storage, in particular to a fault-tolerant method and a fault-tolerant device for improving the reliability of serial high-speed bus equipment.

Background

With the advent of the data explosion era, the number of users has grown hundreds of times over, and the volume of real-time concurrent access keeps increasing. NVME SSD all-flash storage systems are used more and more widely, and, to reduce maintenance workload, data centers place ever higher reliability requirements on storage equipment. An all-flash storage system uses many PCIE interface devices, such as NVME SSDs, network cards, hard disk expansion cards and optical fiber cards, all of which use the PCIE bus as the physical interface to the CPU. In particular, to achieve a large storage capacity, a storage system may contain up to hundreds of NVME SSDs. PCIE3.0 is currently in wide use, and PCIE is gradually evolving to higher rates such as PCIE 4.0/5.0/6.0.

The PCIE3.0 signal rate can reach 8Gb/s and the routing of the link channel may be very long, so the high-speed signal may be attenuated so much that the receiving end cannot recover the information. Equalization is therefore applied at both the Tx and Rx ends of PCIE3.0 to compensate for the attenuation of high-speed signals over long links. The lengths of the PCIE transmission links in an actual product vary: in some cases equalization at the Tx end alone gives good compensation and none is needed at the Rx end; in others the optimal eye diagram is obtained only with a matching combination of equalization levels at the Tx and Rx ends. The Tx end has 11 Preset equalization settings, while the Rx end has several behavioral equalization algorithms and settings such as CTLE and DFE (whose parameter settings may differ from case to case), as well as CDR clock recovery, which makes choosing equalization settings for different links even more complicated. Consequently, when a PCIE device is powered on and negotiates for the first time, it is difficult to obtain the optimal equalization parameters, and with non-optimal equalization and sampling parameters some errors are inevitable under a large data volume.

To improve reliability, some existing methods monitor errors and apply a tiered alarm-and-replacement strategy according to the error count, while others, to avoid a kernel crash, directly disable the corresponding PCIE device and prompt replacement when a certain error is detected. Another approach fixes on the relevant devices a set of equalization parameters selected for a good eye diagram in signal integrity tests. However, a large system contains many PCIE devices at different positions in the machine, so the parameter screening workload is very large; in addition, add-in cards from different manufacturers are routed differently on the circuit board, and customers may add PCIE devices from other manufacturers, so one selected parameter set cannot cover all devices. As described above, because there are so many combinations of equalization parameters, it is difficult for a PCIE device to negotiate the optimal parameters at first power-on. When alarming and disabling are triggered directly by error detection, some PCIE devices are flagged and replaced after producing a certain number of errors merely because the optimal equalization parameters were not negotiated the first time. This brings a heavy maintenance workload to data centers and customers, reduces the reliability of the storage product, and wastes resources.

Disclosure of Invention

When alarming and disabling are triggered directly by error detection, some PCIE devices are flagged and replaced after producing a certain number of errors merely because the optimal equalization parameters were not negotiated the first time. This brings a heavy maintenance workload to data centers and customers, reduces the reliability of storage products, and wastes resources. To solve this problem, the invention provides a fault-tolerant method and a fault-tolerant device for improving the reliability of serial high-speed bus equipment.

The technical scheme of the invention is as follows:

In one aspect, the technical scheme of the invention provides a fault-tolerant method for improving the reliability of serial high-speed bus equipment in a storage system, which comprises the following steps:

setting the serial high-speed bus equipment to run at the highest speed;

starting error code monitoring to obtain the number of correctable error codes;

when the number of correctable errors accumulated over a first time threshold reaches a first threshold, retraining the serial high-speed bus link and counting the number of retrainings;

when the accumulated number of retrainings of the serial high-speed bus link reaches a second threshold, forcibly reducing the speed of the serial high-speed bus device to a first running speed and counting the number of forced speed reductions;

and when the accumulated number of forced speed reductions reaches a third threshold, fixing the serial high-speed bus device at the first running speed and outputting alarm information to prompt replacement.

The method monitors the correctable errors of the serial high-speed bus device, has the device perform multiple hot resets to renegotiate its equalization parameters so that better parameters are eventually obtained, and monitors errors while running at reduced speed; on this basis it comprehensively judges whether the device has a real fault. This improves the reliability of the storage product, reduces frequent alarms and equipment replacement, and saves maintenance labor and equipment resources.

Further, in the step of starting error monitoring to obtain the number of correctable errors, the correctable errors include damaged link transport layer packets and damaged data link layer packets.

Further, the step of starting error monitoring to obtain the number of correctable errors comprises:

reading the initial values of the damaged link transport layer packet register and the damaged data link layer packet register of the serial high-speed bus device at the corresponding port, and starting timing;

and after the first time threshold has elapsed, reading the values of the damaged link transport layer packet register and the damaged data link layer packet register of the serial high-speed bus device again.

Further, after the step of reading the values of the damaged link transport layer packet register and the damaged data link layer packet register of the serial high-speed bus device again once the first time threshold has elapsed, the method further includes:

calculating, for each register, the difference between the two readings;

adding the differences of the two registers to obtain the accumulated number of correctable errors. Link errors of the serial high-speed bus are thus accumulated over a time window, and once they reach a certain level the serial high-speed bus device performs a hot reset to renegotiate its parameters.
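The two-reading calculation above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `ErrorCounters` and `correctable_errors_in_window` are hypothetical, and the modular subtraction that tolerates a counter wrapping around (with an assumed 32-bit width) is an addition that the method itself does not mention.

```python
from typing import NamedTuple


class ErrorCounters(NamedTuple):
    """One reading of the two error registers (hypothetical container)."""
    damaged_tlp: int   # damaged link transport layer packet count
    damaged_dllp: int  # damaged data link layer packet count


def correctable_errors_in_window(start: ErrorCounters,
                                 end: ErrorCounters,
                                 counter_bits: int = 32) -> int:
    """Sum the per-register differences between two readings.

    counter_bits is an assumption: the subtraction is done modulo
    2**counter_bits so that a counter that wrapped between the two
    readings still yields a non-negative difference.
    """
    modulus = 1 << counter_bits
    tlp_delta = (end.damaged_tlp - start.damaged_tlp) % modulus
    dllp_delta = (end.damaged_dllp - start.damaged_dllp) % modulus
    return tlp_delta + dllp_delta


if __name__ == "__main__":
    first = ErrorCounters(damaged_tlp=10, damaged_dllp=5)
    second = ErrorCounters(damaged_tlp=13, damaged_dllp=6)
    print(correctable_errors_in_window(first, second))
```

With the readings in the usage example, the first register grows by 3 and the second by 1, so the accumulated number of correctable errors for the window is 4.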

Further, the step of retraining the serial high-speed bus link and counting the number of retrainings when the number of correctable errors accumulated over the first time threshold reaches the first threshold specifically includes:

judging whether the number of correctable errors accumulated over the set time threshold reaches the first threshold;

if so, retraining the serial high-speed bus link and counting the number of retrainings;

if not, returning to the step of starting error monitoring to obtain the number of correctable errors. Renegotiation by the serial high-speed bus device is thus repeated up to a set number of times, so that the optimal parameters have ample opportunity to be negotiated.

Further, the step of forcibly reducing the speed of the serial high-speed bus device to the first running speed and counting the number of forced speed reductions when the accumulated number of retrainings of the serial high-speed bus link reaches the second threshold specifically includes:

judging whether the accumulated number of retrainings of the serial high-speed bus link reaches the second threshold;

if yes, forcibly reducing the speed of the serial high-speed bus device to the first running speed and counting the number of forced speed reductions;

if not, returning to the step of starting error monitoring to obtain the number of correctable errors. In other words, once renegotiation has been performed the set number of times, the corresponding device is downgraded to run at the first running speed.
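The method does not say how the forced speed reduction is carried out. For a PCIE link, one plausible mechanism (an assumption, not something stated in this document) is to write the Target Link Speed field of the Link Control 2 register and then set the Retrain Link bit of the Link Control register, both of which are defined by the PCI Express Base Specification. The sketch below only computes the new 16-bit register values; the helper names are hypothetical, and reading and writing configuration space is deliberately left out.

```python
# Field positions as defined by the PCI Express Base Specification.
TARGET_LINK_SPEED_MASK = 0x000F  # Link Control 2 register, bits 3:0
RETRAIN_LINK_BIT = 1 << 5        # Link Control register, bit 5


def with_target_speed(link_control_2: int, speed_code: int) -> int:
    """Return the Link Control 2 value with a new Target Link Speed.

    speed_code is the 4-bit field encoding, for example 1 for 2.5 GT/s
    (PCIE1.0) and 3 for 8 GT/s (PCIE3.0). All other bits are preserved.
    """
    if not 1 <= speed_code <= TARGET_LINK_SPEED_MASK:
        raise ValueError("speed_code must fit the 4-bit field and be non-zero")
    preserved = link_control_2 & ~TARGET_LINK_SPEED_MASK & 0xFFFF
    return preserved | speed_code


def with_retrain(link_control: int) -> int:
    """Return the Link Control value with the Retrain Link bit set."""
    return (link_control | RETRAIN_LINK_BIT) & 0xFFFF


if __name__ == "__main__":
    # Hypothetical current register values, used only for illustration.
    print(hex(with_target_speed(0x0043, 1)))
    print(hex(with_retrain(0x0040)))
```

Keeping the bit manipulation in pure functions separates the policy decision (when to downgrade) from the platform-specific register access, which differs between operating systems and firmware environments.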

Further, the step of fixing the serial high-speed bus device at the first running speed and outputting alarm information to prompt replacement when the accumulated number of forced speed reductions reaches the third threshold specifically includes:

judging whether the accumulated number of forced speed reductions reaches the third threshold;

if so, fixing the serial high-speed bus device at the first running speed and outputting alarm information to prompt replacement;

if not, keeping the serial high-speed bus device running at the first running speed and starting error monitoring to obtain the number of correctable errors;

judging whether, within a second time threshold, the number of correctable errors accumulated over the first time threshold reaches the first threshold; if so, disconnecting and disabling the serial high-speed bus device and outputting alarm information to prompt replacement; if not, retraining the serial high-speed bus link and returning to the step of setting the serial high-speed bus device to run at the highest speed.

In another aspect, the technical scheme of the invention also provides a fault-tolerant device for improving the reliability of serial high-speed bus equipment in a storage system, which comprises a setting module, a monitoring module, a link training statistical module, a speed reduction setting module and a processing output module;

the setting module is used for setting the serial high-speed bus equipment to run at the highest speed;

the monitoring module is used for starting error code monitoring to obtain the number of correctable error codes;

the link training statistical module is used for retraining the serial high-speed bus link and counting the number of retrainings when the number of correctable errors accumulated over a first time threshold reaches a first threshold;

the speed reduction setting module is used for forcibly reducing the speed of the serial high-speed bus device to a first running speed and counting the number of forced speed reductions when the accumulated number of retrainings of the serial high-speed bus link reaches a second threshold;

and the processing output module is used for fixing the serial high-speed bus device at the first running speed and outputting alarm information to prompt replacement when the accumulated number of forced speed reductions reaches a third threshold.

Further, the correctable errors include corrupted link transport layer packets and corrupted data link layer packets.

Furthermore, the monitoring module comprises a reading unit, a timing unit and a calculation processing unit;

the reading unit is used for reading the initial values of the damaged link transport layer packet register and the damaged data link layer packet register of the serial high-speed bus device at the corresponding port number, and is also used for reading the values of these registers again after the timing reaches the first time threshold;

the timing unit is used for starting timing after the initial values of the damaged link transport layer packet register and the damaged data link layer packet register of the serial high-speed bus device at the corresponding port number have been read;

the calculation processing unit is used for calculating, for each register, the difference between the two readings, and for adding the differences of the two registers to obtain the accumulated number of correctable errors.
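As an illustration of how the three units of the monitoring module might cooperate, the sketch below models the reading unit as an injected callable, the timing unit as an injected sleep function, and the calculation processing unit as the final subtraction and sum. The class name, constructor parameters and the tuple layout are all assumptions made for this example; the document does not prescribe a software structure.

```python
import time
from typing import Callable, Tuple

# (damaged link transport layer packet count, damaged data link layer packet count)
Counters = Tuple[int, int]


class MonitoringModule:
    """Hypothetical monitoring module: read, wait one window, read, subtract."""

    def __init__(self,
                 read_counters: Callable[[], Counters],
                 window_seconds: float = 20.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._read_counters = read_counters    # reading unit
        self._window_seconds = window_seconds  # first time threshold
        self._sleep = sleep                    # timing unit

    def measure(self) -> int:
        """Return the correctable errors accumulated over one time window."""
        first = self._read_counters()
        self._sleep(self._window_seconds)
        second = self._read_counters()
        # Calculation processing unit: per-register difference, then sum.
        return (second[0] - first[0]) + (second[1] - first[1])


if __name__ == "__main__":
    readings = iter([(10, 5), (13, 6)])
    module = MonitoringModule(lambda: next(readings), sleep=lambda _s: None)
    print(module.measure())
```

Injecting the reader and the sleep function keeps the example runnable without hardware and without actually waiting for the window to elapse.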

Further, the link training statistical module comprises a first judgment unit and a link training unit;

the first judging unit is used for judging whether the number of correctable errors accumulated over the set time threshold reaches the first threshold;

and the link training unit is used for retraining the serial high-speed bus link and counting the retraining times.

Further, the deceleration setting module comprises a second judging unit and a deceleration setting unit;

the second judgment unit is used for judging whether the accumulated times of retraining the serial high-speed bus link reaches a second threshold value times;

and the speed reduction setting unit is used for forcibly reducing the speed of the serial high-speed bus equipment to a first running speed and counting the accumulated times of forced speed reduction.

Furthermore, the processing output module comprises a third judging unit, an alarm output unit and a processing unit;

the third judging unit is used for judging whether the accumulated number of forced speed reductions reaches the third threshold; it is also used for judging whether, within the second time threshold, the number of correctable errors accumulated over the first time threshold reaches the first threshold;

the alarm output unit is used for fixing the serial high-speed bus device at the first running speed and outputting alarm information to prompt replacement; it is also used for outputting alarm information to prompt replacement after the serial high-speed bus device has been disconnected and disabled;

the monitoring module is also used for starting error code monitoring to acquire the number of correctable error codes when the serial high-speed bus equipment is fixed to run at a first running speed;

and the processing unit is used for disconnecting and disabling the serial high-speed bus equipment.

According to the technical scheme, the invention has the following advantages: by monitoring the errors of the serial high-speed bus device, having the device perform multiple hot resets to renegotiate its equalization parameters, and monitoring errors while running at reduced speed, the method comprehensively judges whether the device has a real fault. This improves the reliability of the storage product, reduces frequent replacement of equipment, and saves resources.

In addition, the invention has reliable design principle, simple structure and very wide application prospect.

Therefore, compared with the prior art, the invention has prominent substantive features and remarkable progress, and the beneficial effects of the implementation are also obvious.

Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. Those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic flow diagram of a method of one embodiment of the invention.

FIG. 2 is a schematic flow diagram of a method of another embodiment of the invention.

FIG. 3 is a schematic block diagram of an apparatus of one embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in FIG. 1, an embodiment of the present invention provides a fault-tolerant method for improving the reliability of a serial high-speed bus device in a storage system, including the following steps:

Step 1: setting the serial high-speed bus device to run at the highest speed;

Step 2: starting error monitoring to obtain the number of correctable errors;

Step 3: when the number of correctable errors accumulated over a first time threshold reaches a first threshold, retraining the serial high-speed bus link and counting the number of retrainings;

Step 4: when the accumulated number of retrainings of the serial high-speed bus link reaches a second threshold, forcibly reducing the speed of the serial high-speed bus device to a first running speed and counting the number of forced speed reductions;

Step 5: when the accumulated number of forced speed reductions reaches a third threshold, fixing the serial high-speed bus device at the first running speed and outputting alarm information to prompt replacement.

It should be noted that, in step 2, in the step of starting error detection to obtain the number of correctable errors, the correctable errors include a damaged link transport layer packet and a damaged data link layer packet.

In some embodiments, the step 2 of enabling error detection to obtain the number of correctable errors comprises:

Step 21: reading the initial values of the damaged link transport layer packet register and the damaged data link layer packet register of the serial high-speed bus device at the corresponding port, and starting timing;

Step 22: after the first time threshold has elapsed, reading the values of the damaged link transport layer packet register and the damaged data link layer packet register of the serial high-speed bus device again;

Step 23: calculating, for each register, the difference between the two readings;

Step 24: adding the differences of the two registers to obtain the accumulated number of correctable errors. Link errors of the serial high-speed bus are thus accumulated over a time window, and once they reach a certain level the serial high-speed bus device performs a hot reset to renegotiate its parameters.

In some embodiments, in step 3, when the cumulative number of correctable errors reaches the first threshold after the first time threshold is set, the steps of retraining the serial high-speed bus link and counting the retraining times specifically include:

In some embodiments, step 4 specifically includes the following steps:

judging whether the accumulated times of retraining the serial high-speed bus link reaches a second threshold value times or not;

In some embodiments, step 5 specifically includes the following steps:

if not, the serial high-speed bus equipment is fixed under the first running speed to run, and error code monitoring is started to obtain the number of correctable error codes;

judging whether, within a second time threshold, the number of correctable errors accumulated over the first time threshold reaches the first threshold; if so, disconnecting and disabling the serial high-speed bus device and outputting alarm information to prompt replacement; if not, retraining the serial high-speed bus link and returning to the step of setting the serial high-speed bus device to run at the highest speed.

It should be noted that the serial high-speed bus of the present invention is described below taking the PCIE bus as an example. As shown in FIG. 2, the specific steps are as follows:

SS1: setting the PCIE device to run at the highest speed;

SS2: starting error monitoring, reading the initial values of the damaged link transport layer packet register and the damaged data link layer packet register of the PCIE device at the corresponding port number, and starting timing;

SS3: after 20s, reading the values of the damaged link transport layer packet register and the damaged data link layer packet register of the PCIE device again; here the first time threshold is 20s;

SS4: calculating, for each register, the difference between the two readings;

SS5: adding the differences of the two registers to obtain the accumulated number of correctable errors;

SS6: judging whether the number of correctable errors accumulated over the 20s reaches 4; if yes, executing SS7, otherwise returning to SS2; here the first threshold is 4;

SS7: retraining the PCIE link and counting the number of retrainings;

SS8: judging whether the accumulated number of retrainings of the PCIE link reaches 12; if yes, executing SS9, otherwise returning to SS2; here the second threshold is 12;

SS9: forcibly reducing the speed of the PCIE device to PCIE1.0 and counting the number of forced speed reductions; here the first running speed is the running speed of PCIE1.0;

SS10: judging whether the accumulated number of forced speed reductions to PCIE1.0 reaches 3; if yes, executing SS11, otherwise executing SS12; here the third threshold is 3;

SS11: fixing the PCIE device at PCIE1.0 and outputting alarm information to prompt replacement;

SS12: keeping the PCIE device running at PCIE1.0 and starting error monitoring to obtain the number of correctable errors;

SS13: judging whether, within 30 minutes, the number of correctable errors accumulated over any 20s reaches 4; if yes, executing SS14, otherwise executing SS15; here the second time threshold is 30 minutes;

SS14: disconnecting and disabling the PCIE device and outputting alarm information to prompt replacement;

SS15: retraining the PCIE link and returning to SS1.

In summary, PCIE link errors are accumulated over a time window, and once they reach a certain level the PCIE device performs a hot reset to renegotiate its parameters; renegotiation is repeated up to a set number of times so that the optimal parameters have ample opportunity to be negotiated; after 12 renegotiations, the corresponding device is downgraded to run at PCIE1.0 speed; when the accumulated number of speed reductions to PCIE1.0 reaches 3, the device is kept at low speed and an alarm prompts replacement; otherwise errors are accumulated for 30 minutes at PCIE1.0, and if the requirement is met the device is allowed to run at the highest negotiated speed again, recovering its performance; if the error rate requirement is not met even at PCIE1.0, the PCIE port is disabled and an alarm prompts replacement. The PCIE bus is taken only as an example; other serial high-speed buses that use the spirit of the present invention are also protected.
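The decision flow SS1 to SS15 can be sketched as a small state machine that is fed one error count per 20s window. This is a minimal sketch under stated assumptions, not the patented implementation: the names `Action` and `LinkPolicy` are hypothetical, the 30-minute observation period is modelled as 90 consecutive 20s windows, and the retraining counter is assumed to reset when the device returns to the highest speed, which the document does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP_MONITORING = "keep monitoring"              # back to SS2, or continue SS12
    RETRAIN = "retrain link"                         # SS7
    DOWNGRADE = "retrain, then force PCIE1.0"        # SS7 followed by SS9
    PIN_LOW_AND_ALARM = "fix at PCIE1.0 and alarm"   # SS11
    RESTORE_FULL_SPEED = "retrain and return to SS1" # SS15
    DISABLE_AND_ALARM = "disable port and alarm"     # SS14


@dataclass
class LinkPolicy:
    error_threshold: int = 4     # first threshold
    retrain_limit: int = 12      # second threshold
    downgrade_limit: int = 3     # third threshold
    probation_windows: int = 90  # assumption: 30 minutes / 20s windows
    retrains: int = 0
    downgrades: int = 0
    probation_left: int = 0      # > 0 while observed at PCIE1.0
    finished: bool = False       # True after a terminal alarm

    def on_window(self, errors: int) -> Action:
        """Consume the error count of one window and return the next action."""
        if self.finished:
            raise RuntimeError("device already pinned or disabled")
        if self.probation_left > 0:                    # SS12 / SS13
            if errors >= self.error_threshold:
                self.finished = True
                return Action.DISABLE_AND_ALARM        # SS14
            self.probation_left -= 1
            if self.probation_left == 0:
                self.retrains = 0                      # assumption: counter resets
                return Action.RESTORE_FULL_SPEED       # SS15
            return Action.KEEP_MONITORING
        if errors < self.error_threshold:              # SS6, "no" branch
            return Action.KEEP_MONITORING
        self.retrains += 1                             # SS7
        if self.retrains < self.retrain_limit:         # SS8, "no" branch
            return Action.RETRAIN
        self.downgrades += 1                           # SS9
        if self.downgrades >= self.downgrade_limit:    # SS10, "yes" branch
            self.finished = True
            return Action.PIN_LOW_AND_ALARM            # SS11
        self.probation_left = self.probation_windows
        return Action.DOWNGRADE


if __name__ == "__main__":
    policy = LinkPolicy()
    for _ in range(12):
        action = policy.on_window(errors=5)
    print(action)
```

In the usage example, twelve consecutive noisy windows trigger eleven plain retrainings and then, on the twelfth, the first forced downgrade to PCIE1.0.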

As shown in FIG. 3, an embodiment of the present invention further provides a fault-tolerant device for improving the reliability of a serial high-speed bus device in a storage system, including a setting module, a monitoring module, a link training statistical module, a speed reduction setting module, and a processing output module;

the link training statistical module is used for retraining the serial high-speed bus link and counting the retraining times when the accumulated number of correctable bit errors reaches a first threshold value after a first time threshold value;

and the processing output module is used for fixing the serial high-speed bus equipment to the first running speed for running when the accumulated times of the forced speed reduction reaches the third threshold times, and outputting alarm information to prompt replacement.

It should be noted that correctable errors include corrupted link-transport-layer packets and corrupted data-link-layer packets.

In some embodiments, the monitoring module comprises a reading unit, a timing unit and a calculation processing unit;

In some embodiments, the link training statistic module includes a first determining unit, a link training unit;

In some embodiments, the deceleration setting module comprises a second judging unit and a deceleration setting unit;

In some embodiments, the processing output module includes a third determining unit, an alarm output unit, and a processing unit;

the third judging unit is used for judging whether the accumulated number of forced speed reductions reaches the third threshold; it is also used for judging whether, within the second time threshold, the number of correctable errors accumulated over the first time threshold reaches the first threshold;

the alarm output unit is used for fixing the serial high-speed bus device at the first running speed and outputting alarm information to prompt replacement; it is also used for outputting alarm information to prompt replacement after the serial high-speed bus device has been disconnected and disabled;

Although the present invention has been described in detail with reference to the drawings in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and any changes or substitutions that a person skilled in the art can easily conceive within the technical scope disclosed by the present invention fall within its scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A fault-tolerant method for improving reliability of serial high-speed bus equipment is characterized by comprising the following steps:

setting the serial high-speed bus equipment to run at the highest speed;

starting error code monitoring to obtain the number of correctable error codes;

2. The fault-tolerant method for improving reliability of a serial high-speed bus device of claim 1, wherein, in the step of starting error monitoring to obtain the number of correctable errors, the correctable errors comprise damaged link transport layer packets and damaged data link layer packets.

3. The fault tolerant method of improving reliability of a serial high speed bus device as claimed in claim 2, wherein the step of starting error monitoring to obtain the number of correctable errors comprises:

4. The fault tolerant method for improving reliability of serial high speed bus device as claimed in claim 3, wherein after the step of reading the values of the damaged link transport layer packet and the damaged data link layer packet register of the serial high speed bus device again after setting the first time threshold, further comprising:

calculating the difference value of the registers read in two times;

the difference values of the two registers are added to obtain the accumulated number of correctable errors.

5. The fault-tolerant method for improving the reliability of a serial high-speed bus device according to claim 4, wherein when the accumulated number of correctable errors reaches a first threshold after a first time threshold is set, the steps of retraining the serial high-speed bus link and counting the retraining times specifically comprise:

if not, returning to the step of starting error monitoring to obtain the number of correctable errors.

6. The fault-tolerant method for improving the reliability of a serial high-speed bus device according to claim 5, wherein when the accumulated number of times for retraining the serial high-speed bus link reaches a second threshold number of times, the step of forcibly slowing down the serial high-speed bus device to the first operating speed and counting the accumulated number of times for forcibly slowing down includes:

7. The fault-tolerant method for improving the reliability of the serial high-speed bus device according to claim 6, wherein when the accumulated number of forced speed reductions reaches a third threshold number, the step of fixing the serial high-speed bus device to the first operating speed for operation and outputting an alarm message to prompt replacement specifically comprises:

8. A fault-tolerant device for improving the reliability of serial high-speed bus equipment is characterized by comprising a setting module, a monitoring module, a link training statistical module, a speed reduction setting module and a processing output module;

9. The fault-tolerant apparatus for improving reliability of a serial high-speed bus device of claim 8, wherein the correctable errors comprise corrupted link-transport-layer packets and corrupted data-link-layer packets.

10. The fault-tolerant apparatus for improving the reliability of a serial high-speed bus device according to claim 9, wherein the monitoring module comprises a reading unit and a timing unit;

and the timing unit is used for reading the initial values of the damaged link transmission layer packet and the damaged data link layer packet register of the serial high-speed bus equipment with the corresponding port number and then starting timing.