CN112256539B

CN112256539B - PCIE link error statistical method, device, terminal and storage medium

Info

Publication number: CN112256539B
Application number: CN202010990038.3A
Authority: CN
Inventors: 李长飞
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2022-07-19
Anticipated expiration: 2040-09-18
Also published as: CN112256539A

Abstract

The invention discloses a PCIE link error statistical method, device, terminal and storage medium. The non-fatal error count of a PCIE link is monitored in real time. When the non-fatal error count is observed to change continuously within a first preset duration and the number of changes exceeds a first count threshold, an alarm is raised and/or the PCIE link is interrupted. When N segments of change in the non-fatal error count are observed within a second preset duration, N exceeds a second segment-number threshold, and in each run of continuous changes the number of changes does not exceed the first count threshold, an alarm is raised and/or the PCIE link is interrupted; the changes occurring within one first preset duration constitute one segment. The invention gathers statistics along two dimensions, the number of errors and the time at which they occur, and raises an alarm or interrupts the link when the errors meet the statistical conditions, thereby avoiding serious system faults caused by excessive errors and greatly improving the stability and reliability of system operation.

Description

PCIE link error statistical method, device, terminal and storage medium

Technical Field

The invention relates to the field of PCIE link monitoring, in particular to a PCIE link error statistical method, a device, a terminal and a storage medium.

Background

In recent years, as user requirements for integration, unification, efficiency, space and energy consumption have continued to rise, PCIE (Peripheral Component Interconnect Express) devices have become widely used in servers and storage systems. Effectively monitoring the health of a PCIE link, and adopting a protection policy according to what the monitoring shows, is therefore important for the stability and reliability of system operation. Most PCIE devices currently provide error data, but how to use that data effectively to determine the health of a link remains a difficulty in the field, and no effective error statistical method is currently available.

Disclosure of Invention

To solve the above problems, the present invention provides a PCIE link error statistical method, device, terminal and storage medium, which gather reasonable statistics on the non-fatal errors on a PCIE link so as to avoid serious system faults caused by too many errors.

The technical scheme of the invention is as follows: a PCIE link error statistical method comprises the following steps:

monitoring the non-fatal error count of the PCIE link in real time;

when the non-fatal error count is observed to change continuously within a first preset duration and the number of changes exceeds a first count threshold, raising an alarm and/or interrupting the PCIE link;

when N segments of change in the non-fatal error count are observed within a second preset duration, N exceeds a second segment-number threshold, and in each run of continuous changes the number of changes does not exceed the first count threshold, raising an alarm and/or interrupting the PCIE link; the changes occurring within one first preset duration constitute one segment.

Further, the method also comprises the following steps:

allocating a plurality of object pools, the number of object pools being equal to the second segment-number threshold;

when a change in the non-fatal error count is observed, recording monitoring information in the corresponding object pool;

if the non-fatal error count changes continuously within the first preset duration, continuing to update the monitoring information in the current object pool;

if the interval between one change in the non-fatal error count and the previous change is greater than the first preset duration, moving to the next object pool to record monitoring information, the object pools being used cyclically in their sorted order with earlier records overwritten;

if all the object pools have been used within the second preset duration, this indicates that N segments of change in the non-fatal error count have been observed within the second preset duration, with N exceeding the second segment-number threshold and with the number of changes in each run of continuous changes not exceeding the first count threshold; an alarm is then raised and/or the PCIE link is interrupted.

Further, the recorded monitoring information includes: the time at which a change in the non-fatal error count was last observed, the latest value of the non-fatal error count, and the number of times the non-fatal error count has changed in the current segment.

Further, the non-fatal error count includes a data link layer packet error count and a transaction layer packet error count.

The technical scheme of the invention also comprises a PCIE link error statistical device, which comprises:

a count monitoring module: monitoring the non-fatal error count of the PCIE link in real time;

a first exception handling module: when the non-fatal error count is observed to change continuously within a first preset duration and the number of changes exceeds a first count threshold, raising an alarm and/or interrupting the PCIE link;

a second exception handling module: when N segments of change in the non-fatal error count are observed within a second preset duration, N exceeds a second segment-number threshold, and in each run of continuous changes the number of changes does not exceed the first count threshold, raising an alarm and/or interrupting the PCIE link; the changes occurring within one first preset duration constitute one segment.

Further, the device also comprises:

an object pool allocation module: allocating a plurality of object pools, the number of object pools being equal to the second segment-number threshold;

a monitoring information recording module: when a change in the non-fatal error count is observed, recording monitoring information in the corresponding object pool; if the non-fatal error count changes continuously within the first preset duration, continuing to update the monitoring information in the current object pool; if the interval between one change in the non-fatal error count and the previous change is greater than the first preset duration, moving to the next object pool to record monitoring information, the object pools being used cyclically in their sorted order with earlier records overwritten;

the second exception handling module monitors whether all the object pools have been used within the second preset duration; if so, this indicates that N segments of change in the non-fatal error count have been observed within the second preset duration, with N exceeding the second segment-number threshold and with the number of changes in each run of continuous changes not exceeding the first count threshold, and an alarm is then raised and/or the PCIE link is interrupted.

The technical scheme of the invention also comprises a terminal, which comprises:

a processor;

a memory for storing instructions for execution by the processor;

wherein the processor is configured to perform any of the methods described above.

The invention also comprises a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method as defined in any one of the above.

The PCIE link error statistical method, device, terminal and storage medium provided by the invention gather effective and reasonable statistics on the non-fatal errors occurring on a link along two dimensions, the number of errors and the time at which they occur. When the errors meet the statistical conditions, an alarm is raised or the link is interrupted, which avoids serious system faults caused by excessive errors and greatly improves the stability and reliability of system operation. The method fills a gap in PCIE link error statistics and does not depend on the particular PCIE device, so it is broadly applicable.

Drawings

FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;

FIG. 2 is a block diagram illustrating an object pool architecture according to an embodiment of the present invention;

FIG. 3 is a schematic structural block diagram of the second embodiment of the present invention.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and specific examples. The examples are illustrative, and the invention is not limited to the following embodiments.

As shown in FIG. 1, this embodiment provides a PCIE link error statistical method, including the following steps:

S1: monitoring the non-fatal error count of the PCIE link in real time;

S2: when the non-fatal error count is observed to change continuously within a first preset duration and the number of changes exceeds a first count threshold, raising an alarm and/or interrupting the PCIE link;

S3: when N segments of change in the non-fatal error count are observed within a second preset duration, N exceeds a second segment-number threshold, and in each run of continuous changes the number of changes does not exceed the first count threshold, raising an alarm and/or interrupting the PCIE link; the changes occurring within one first preset duration constitute one segment.

The method sets a first preset duration, a first count threshold, a second preset duration and a second segment-number threshold. When the count (that is, the non-fatal error count) changes, if it changes continuously within the first preset duration and the number of continuous changes exceeds the first count threshold, an abnormality has occurred and exception handling is required. In addition, if more segments of change than the second segment-number threshold occur within the second preset duration, this also indicates an abnormality that must be handled. Note that a segment may contain a single change or several continuous changes; when the changes are continuous, the number of changes in the segment does not exceed the first count threshold, because otherwise the first condition would already have raised an alarm and/or interrupted the PCIE link before the second preset duration elapsed.
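
The two conditions can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the change events are available as a sorted list of timestamps in seconds, treats a gap longer than the first preset duration as the start of a new segment, and reads "exceeds" as a strict comparison. The function and parameter names (split_into_segments, is_abnormal and so on) are hypothetical.

```python
from typing import List


def split_into_segments(change_times: List[float],
                        first_duration: float) -> List[List[float]]:
    """Group sorted change timestamps into segments.

    A change that follows the previous one by more than first_duration
    starts a new segment (assumed reading of "continuous change").
    """
    segments: List[List[float]] = []
    for t in change_times:
        if segments and t - segments[-1][-1] <= first_duration:
            segments[-1].append(t)
        else:
            segments.append([t])
    return segments


def is_abnormal(change_times: List[float],
                first_duration: float,
                first_count_threshold: int,
                second_duration: float,
                segment_threshold: int) -> bool:
    """Return True if either statistical condition is met."""
    segments = split_into_segments(change_times, first_duration)

    # Condition 1: one segment holds more changes than the first threshold.
    if any(len(seg) > first_count_threshold for seg in segments):
        return True

    # Condition 2: more than segment_threshold segments begin within
    # some window of length second_duration.
    starts = [seg[0] for seg in segments]
    for i, start in enumerate(starts):
        in_window = sum(1 for s in starts[i:] if s - start <= second_duration)
        if in_window > segment_threshold:
            return True
    return False
```

For instance, with a 20 s first duration, a first threshold of 4, a 1 hour second duration and a segment threshold of 51, five changes spaced 5 s apart meet the first condition, while 52 isolated changes spaced 30 s apart meet the second.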

With this method, the non-fatal errors (uncorrectable errors) occurring on the link are counted reasonably and effectively along two dimensions, the number of errors and the time at which they occur. When the errors meet the statistical conditions, an alarm is raised or the link is interrupted, which avoids serious system faults caused by excessive errors.

In this embodiment, the monitoring information for the non-fatal error count is stored and tallied in object pools. A plurality of object pools is allocated first, the number of object pools being equal to the second segment-number threshold, so that the number of segments of count change within the second preset duration can be tallied. The object pools are ordered by serial number and used cyclically with overwriting. For example, the monitoring information for the first segment is stored in the first object pool and that for the second segment in the second object pool; if there are M object pools and the count changes again after segment M, the monitoring information for segment M+1 is stored in the first object pool, overwriting the information previously held there.
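
The cyclic, overwriting use of the object pools amounts to modular indexing. A minimal sketch, assuming segments are numbered from 1 and pools are indexed from 0 (pool_index is a hypothetical helper, not a name from the source):

```python
def pool_index(segment_number: int, pool_count: int) -> int:
    """Map a 1-based segment number to a 0-based pool index, wrapping around."""
    return (segment_number - 1) % pool_count


# With M = 3 pools, segment 4 wraps around and overwrites the first pool.
pools = [None] * 3
for segment in range(1, 5):
    pools[pool_index(segment, 3)] = f"segment {segment}"
# pools is now ['segment 4', 'segment 2', 'segment 3']
```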

When a change in the non-fatal error count is observed, monitoring information is recorded in the corresponding object pool. The monitoring information includes: the time at which a change in the non-fatal error count was last observed, the latest value of the non-fatal error count, and the number of times the non-fatal error count has changed in the current segment.

If the non-fatal error count changes continuously within the first preset duration, the monitoring information in the current object pool continues to be updated. If the interval between one change in the non-fatal error count and the previous change is greater than the first preset duration, recording moves to the next object pool, and the object pools are used cyclically in their sorted order with earlier records overwritten.

On this basis, if all the object pools have been used within the second preset duration, this indicates that N segments of change in the non-fatal error count have been observed within the second preset duration, with N exceeding the second segment-number threshold and with the number of changes in each run of continuous changes not exceeding the first count threshold; an alarm is then raised and/or the PCIE link is interrupted.

A specific implementation is provided below to further understand the present solution.

As shown in FIG. 2, in this specific implementation 51 object pools are set, the first preset duration is 20 seconds, the second preset duration is 1 hour, the first count threshold is 4, and the second segment-number threshold is 51.

In addition, the non-fatal error count includes the data link layer packet error count (Bad DLLP Count) and the transaction layer packet error count (Bad TLP Count).

The system allocates 51 statistical object pools, numbered 0 to 50, which are used cyclically with overwriting, starting from pool 0. Each object pool holds a description of the error, consisting of: the current error count time (i.e., the time at which a change in the non-fatal error count was last observed), the Bad TLP Count, the Bad DLLP Count, and the number of statistical changes (i.e., the number of times the non-fatal error count has changed in this segment).
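
One possible in-memory layout for such a record is sketched below in Python, under the assumption that each object pool holds at most one record at a time. The class, field and function names are illustrative and do not come from the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolRecord:
    last_change_time: float  # time a count change was last observed (seconds)
    bad_tlp_count: int       # latest Bad TLP Count value
    bad_dllp_count: int      # latest Bad DLLP Count value
    change_count: int        # number of count changes in this segment


POOL_COUNT = 51  # pools numbered 0 to 50


def allocate_pools(count: int = POOL_COUNT) -> List[Optional[PoolRecord]]:
    """Create the pools up front; None marks a pool that is not yet used."""
    return [None] * count
```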

The statistical process is as follows:

(1) the system cyclically reads the non-fatal error count values (Bad TLP Count and Bad DLLP Count) on the PCIE link, and records statistics whenever either of the two changes (that is, whenever the count changes);

(2) if the count changes continuously within 20 s, the current object pool is updated;

(3) when the interval between the current count change and the previous count change is greater than 20 s, the statistics move to the next object pool.

When the count changes 4 times in succession within 20 s, or all 51 object pools are used within 1 hour, an abnormality has occurred, and an alarm may be raised and/or the PCIE link may be interrupted.

It should be noted that statistics are gathered only while the link is operating normally, and not while a device is being plugged or unplugged or while the link is changing. In addition, if all 51 object pools are used up within 1 hour, cyclic overwriting can be stopped for ease of statistics, that is, monitoring information is no longer stored, and an alarm is raised and/or the PCIE link is interrupted.
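
The statistical process above could be sketched in Python as follows. This is an illustrative reading rather than the patented code, and it rests on several assumptions: the caller supplies a timestamp and the two counter values on every polling pass (reading the counters from hardware is out of scope), the first reading only establishes a baseline, 4 changes within one segment trigger handling, and "all 51 pools used within 1 hour" is tested by checking that every pool is occupied and the oldest pool was last updated no more than 3600 s ago. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    last_change_time: float
    counts: Tuple[int, int]  # (Bad TLP Count, Bad DLLP Count)
    changes: int             # count changes seen in this segment


class LinkErrorMonitor:
    def __init__(self, first_duration: float = 20.0, change_limit: int = 4,
                 second_duration: float = 3600.0, pool_count: int = 51):
        self.first_duration = first_duration
        self.change_limit = change_limit
        self.second_duration = second_duration
        self.pools: List[Optional[Record]] = [None] * pool_count
        self.current = -1  # index of the pool in use; -1 means none yet
        self.last_counts: Optional[Tuple[int, int]] = None

    def observe(self, now: float, bad_tlp: int, bad_dllp: int) -> bool:
        """Feed one polling pass; return True when handling is required."""
        counts = (bad_tlp, bad_dllp)
        if self.last_counts is None:      # first reading is the baseline
            self.last_counts = counts
            return False
        if counts == self.last_counts:    # neither counter changed
            return False
        self.last_counts = counts

        rec = self.pools[self.current] if self.current >= 0 else None
        if rec is not None and now - rec.last_change_time <= self.first_duration:
            # Step (2): continuous change, update the current pool.
            rec.last_change_time = now
            rec.counts = counts
            rec.changes += 1
        else:
            # Step (3): gap too long, move to the next pool and overwrite it.
            self.current = (self.current + 1) % len(self.pools)
            rec = Record(now, counts, 1)
            self.pools[self.current] = rec

        if rec.changes >= self.change_limit:
            return True
        # The pool after the current one holds the oldest record, if any.
        oldest = self.pools[(self.current + 1) % len(self.pools)]
        return (oldest is not None
                and now - oldest.last_change_time <= self.second_duration)
```

When observe returns True, the caller would raise the alarm and/or interrupt the link and, as noted above, may stop recording further monitoring information.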

Example Two

As shown in FIG. 3, on the basis of the first embodiment, this embodiment provides a PCIE link error statistical device, which includes the following functional modules.

The count monitoring module 101: monitoring the non-fatal error count of the PCIE link in real time;

the first exception handling module 102: when the non-fatal error count is observed to change continuously within a first preset duration and the number of changes exceeds a first count threshold, raising an alarm and/or interrupting the PCIE link;

the second exception handling module 103: when N segments of change in the non-fatal error count are observed within a second preset duration, N exceeds a second segment-number threshold, and in each run of continuous changes the number of changes does not exceed the first count threshold, raising an alarm and/or interrupting the PCIE link; the changes occurring within one first preset duration constitute one segment;

the object pool allocation module 104: allocating a plurality of object pools, the number of object pools being equal to the second segment-number threshold;

the monitoring information recording module 105: when a change in the non-fatal error count is observed, recording monitoring information in the corresponding object pool; if the non-fatal error count changes continuously within the first preset duration, continuing to update the monitoring information in the current object pool; if the interval between one change in the non-fatal error count and the previous change is greater than the first preset duration, moving to the next object pool to record monitoring information, the object pools being used cyclically in their sorted order with earlier records overwritten.

The recorded monitoring information includes: the time at which a change in the non-fatal error count was last observed, the latest value of the non-fatal error count, and the number of times the non-fatal error count has changed in the current segment. The non-fatal error count includes a data link layer packet error count and a transaction layer packet error count.

Example Three

This embodiment provides a terminal that includes a processor and a memory.

The memory is used for storing the instructions executed by the processor. The memory may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk. The executable instructions in the memory, when executed by the processor, enable the terminal to perform some or all of the steps in the above method embodiments.

The processor is the control center of the terminal. It connects the various parts of the whole electronic terminal through various interfaces and lines, and performs the functions of the electronic terminal and/or processes data by running or executing software programs and/or modules stored in the memory and calling data stored in the memory. The processor may be composed of integrated circuits (ICs), for example a single packaged IC, or a plurality of packaged ICs with the same or different functions connected together.

Example Four

This embodiment provides a computer storage medium. The computer storage medium may store a program which, when executed, may perform some or all of the steps in the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM).

The above discloses only preferred embodiments of the present invention, and the present invention is not limited thereto. Any non-inventive changes that those skilled in the art can conceive, and any modifications and refinements made without departing from the principle of the present invention, shall fall within the protection scope of the present invention.

Claims

1. A PCIE link error statistical method is characterized by comprising the following steps:

monitoring the non-fatal error count of the PCIE link in real time;

when N segments of change in the non-fatal error count are observed within a second preset duration, N exceeds a second segment-number threshold, and in each run of continuous changes the number of changes does not exceed a first count threshold, raising an alarm and/or interrupting the PCIE link; the changes occurring within one first preset duration constitute one segment.

2. The PCIE link error statistic method of claim 1, further comprising the steps of:

when monitoring that the non-fatal error count changes, recording monitoring information in a corresponding object pool;

if the time interval between the next non-fatal error counting change and the last non-fatal error counting change is larger than a first preset time length, moving to the next object pool to record monitoring information, and circularly covering and using each object pool according to the sequence of the object pools;

3. The PCIE link error statistical method of claim 2, wherein the recorded monitoring information comprises: the time at which a change in the non-fatal error count was last observed, the latest value of the non-fatal error count, and the number of times the non-fatal error count has changed in the current segment.

4. The PCIE link error statistical method of claim 1, 2 or 3, wherein the non-fatal error count includes a data link layer packet error count and a transaction layer packet error count.

5. A PCIE link error statistic device is characterized in that it includes,

a second exception handling module: when N segments of change in the non-fatal error count are observed within a second preset duration, N exceeds a second segment-number threshold, and in each run of continuous changes the number of changes does not exceed a first count threshold, raising an alarm and/or interrupting the PCIE link; the changes occurring within one first preset duration constitute one segment.

6. The PCIE link error statistics apparatus of claim 5, further comprising,

a monitoring information recording module: when a change in the non-fatal error count is observed, recording monitoring information in the corresponding object pool; if the non-fatal error count changes continuously within the first preset duration, continuing to update the monitoring information in the current object pool; if the interval between one change in the non-fatal error count and the previous change is greater than the first preset duration, moving to the next object pool to record monitoring information, the object pools being used cyclically in their sorted order with earlier records overwritten;

7. The PCIE link error statistical device of claim 6, wherein the recorded monitoring information includes: the time at which a change in the non-fatal error count was last observed, the latest value of the non-fatal error count, and the number of times the non-fatal error count has changed in the current segment.

8. The PCIE link error statistical device of claim 5, 6 or 7, wherein the non-fatal error count comprises a data link layer packet error count and a transaction layer packet error count.

9. A terminal, comprising:

a processor;

a memory for storing instructions for execution by the processor;

wherein the processor is configured to perform the method of any one of claims 1-4.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 4.