WO2016188175A1 - Hardware fault analysis system and method

Info

Publication number: WO2016188175A1
Application number: PCT/CN2016/075547
Authority: WO
Inventors: 文洋; 谈虎; 王亮; 蔡衢; 蒋勇; 蒋彪
Original assignee: 中兴通讯股份有限公司
Priority date: 2015-10-14
Filing date: 2016-03-03
Publication date: 2016-12-01
Also published as: CN106598800A

Abstract

A hardware fault analysis system and method. The hardware fault analysis system comprises: a user configuration module (101), configured to configure addresses of all machines to be monitored, a storage path of a fault log file, a collection period of the fault log file and a fault judgement condition; an information collection module (102), configured to acquire the addresses of the machines to be monitored, the storage path and the collection period, periodically acquire the fault log file of the machines to be monitored corresponding to the addresses according to the collection period and store the fault log file in the storage path; and a current fault prediction module (103), configured to acquire the fault judgement condition and the fault log file in the storage path and perform fault prediction processing on the fault log file according to the fault judgement condition to obtain a prediction result. By means of the system and method, long-term, batch prediction of hardware faults for all machines in a machine room can be achieved.

Description

Hardware failure analysis system and method

Technical field

The present invention relates to the field of computer applications, and in particular, to a hardware failure analysis system and method.

Background

At present, as cloud computing develops further and grows more complex, the data center machine room, which is the foundation of cloud computing, is under increasing pressure. To provide users with reliable, high-quality service, the machines in the machine room must be kept running normally. In the prior art, an error generated by the hardware of a machine is reported to the BIOS (Basic Input Output System) through an SMI (System Management Interrupt); the BIOS performs a series of processing steps and then reports to the operating system kernel through an NMI (Non Maskable Interrupt); the operating system, in the MCE (Machine Check Exception) interrupt handler, reads the CPU's exception information registers and stores their contents in a ring buffer of the /dev/mcelog character device; and the user-mode program mcelog polls the /dev/mcelog character device, parses the register contents, and records them in the MCELOG log file. By analyzing the mcelog exception information, the user-mode program mcelog can implement the PFA (Predictive Failure Analysis) function.

However, the above technique has several drawbacks. First, the user-mode program mcelog runs only on each individual machine and can predict only that machine's faults; it cannot predict the hardware faults of all machines in the machine room in batches. To learn the fault information of every machine in the room, mcelog must perform prediction on each machine and the results must then be viewed machine by machine, which increases working time and workload. Second, the fault information obtained by mcelog is recorded only in the MCELOG log file in the background, so the user cannot perceive it directly and the user experience is poor. Third, when the MCELOG log file is full, old fault information is discarded rather than fully used, which wastes storage resources and MCELOG log file resources; nor is the MCELOG log file used to help keep the machine running normally.

Summary of the invention

The main technical problem to be solved by the embodiments of the present invention is to provide a hardware fault analysis system and method that address the following problems of prior-art hardware fault analysis: the hardware faults of all machines in the machine room cannot be predicted in batches over the long term, the working time is long, and the workload is heavy.

To solve the above technical problem, an embodiment of the present invention provides a hardware fault analysis system, including:

The user configuration module is configured to configure the address of all the machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;

The information collection module is configured to obtain the addresses of the machines to be monitored, the storage path, and the collection period, periodically acquire the fault log file of the machine to be monitored corresponding to each address according to the collection period, and store the fault log file in the storage path;

The current fault prediction module is configured to obtain the fault judgment condition and the fault log file in the storage path, and perform fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.

In the embodiment of the present invention, the fault judgment condition configured by the user configuration module includes a fault time window for each fault and a fault threshold corresponding to each fault; the current fault prediction module is specifically configured to obtain the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path, to count the fault information in the fault log file that falls within the fault time window of each fault, and, when the count value is greater than the fault threshold corresponding to the fault, to predict that the hardware corresponding to the fault is about to fail.

In an embodiment of the present invention, a result presentation module is further included, configured to present at least the prediction result on an interface.

In the embodiment of the present invention, a clearing module is further included, configured to clear at least one prediction result presented by the result presentation module, and to convert the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.

In the embodiment of the present invention, the user configuration module is further configured to configure historical fault information processing parameters; the hardware fault analysis system further includes a historical fault information processing module, configured to process the historical fault information according to the historical fault information processing parameters and obtain the logical relationships between faults.

In the embodiment of the present invention, the historical fault information processing parameters configured by the user configuration module include frequent episode rule mining parameters; the historical fault information processing module is specifically configured to read the frequent episode rule mining parameters, process the historical fault information according to them, and mine frequent episode rules between faults.

In the embodiment of the present invention, the frequent episode rule mining parameters configured by the user configuration module specifically include a sliding time window, a sliding step size, a support threshold, and a confidence threshold; the historical fault information processing module is specifically configured to calculate, according to the sliding time window and the sliding step size, the support and confidence between faults in the historical fault information, and to determine as frequent episode rules those relationships between faults whose support is greater than the support threshold or whose confidence is greater than the confidence threshold.

In the embodiment of the present invention, the historical fault information processing parameters configured by the user configuration module include a statistical condition, and the statistical condition includes a statistical dimension and a statistical time period; the historical fault information processing module is specifically configured to classify, count, and sort the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.

The embodiment of the invention further provides a hardware fault analysis method, including:

Configuring the addresses of all machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;

Obtaining the addresses of the machines to be monitored, the storage path, and the collection period, periodically acquiring the fault log file of the machine to be monitored corresponding to each address according to the collection period, and storing the fault log file in the storage path;

Obtaining the fault judgment condition and the fault log file in the storage path, and performing fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.

In the embodiment of the present invention, configuring the fault judgment condition includes configuring a fault time window for each fault and a fault threshold corresponding to each fault. Obtaining the fault judgment condition and the fault log file in the storage path, and performing fault prediction processing on the fault log file according to the fault judgment condition to obtain the prediction result, includes: obtaining the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path; counting the fault information in the fault log file that falls within the fault time window of each fault; and, when the count value is greater than the fault threshold corresponding to the fault, predicting that the hardware corresponding to the fault is about to fail.

In the embodiment of the present invention, after obtaining the prediction result, the method further includes presenting at least the prediction result on an interface.

In the embodiment of the present invention, after presenting the prediction result, the method further includes: clearing at least one presented prediction result, and converting the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.

In the embodiment of the present invention, the method further includes configuring historical fault information processing parameters, and processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between faults.

In the embodiment of the present invention, configuring the historical fault information processing parameters includes configuring frequent episode rule mining parameters; processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between faults includes: reading the frequent episode rule mining parameters, processing the historical fault information according to them, and mining frequent episode rules between faults.

In the embodiment of the present invention, the configured frequent episode rule mining parameters include a sliding time window, a sliding step size, a support threshold, and a confidence threshold; processing the historical fault information according to the frequent episode rule mining parameters and mining frequent episode rules between faults includes: calculating, according to the sliding time window and the sliding step size, the support and confidence between faults in the historical fault information, and determining as frequent episode rules those relationships between faults whose support is greater than the support threshold or whose confidence is greater than the confidence threshold.

In the embodiment of the present invention, the configured historical fault information processing parameters include a statistical condition, and the statistical condition includes a statistical dimension and a statistical time period; processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between historical faults includes: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.

The embodiments of the present invention provide a hardware fault analysis system and method. In the hardware fault analysis system of the invention, the user configuration module configures the addresses of all machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition; the information collection module obtains the addresses, the storage path, and the collection period, periodically acquires the fault log file of the machine to be monitored corresponding to each address according to the collection period, and stores the fault log file in the storage path; and the current fault prediction module obtains the fault judgment condition and the fault log file in the storage path and performs fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result. Because the system obtains the addresses of all machines to be monitored, it can locate every such machine and acquire their fault log files in batches, so faults can be predicted for all machines to be monitored in the machine room at the same time. Because the fault log files are collected periodically according to the collection period, the system can automatically monitor and predict faults for all machines over the long term and obtain the fault information of every machine. This achieves long-term, batch prediction of hardware faults for all machines in the machine room and provides a reference for hardware replacement: the user can replace hardware that is predicted to fail according to the fault information, which greatly saves time, reduces workload, and helps the monitored machines run normally over the long term.

In the embodiment of the present invention, the hardware fault analysis system further includes a historical fault information processing module, which processes the historical fault information according to the historical fault information processing parameters configured by the user configuration module and obtains the logical relationships between historical faults, thereby making full use of the fault log file.

In the embodiment of the present invention, the processing of the historical fault information by the historical fault information processing module may include: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period configured by the user configuration module to obtain a statistical result; and processing the historical fault information according to the frequent episode rule mining parameters configured by the user configuration module to mine frequent episode rules between faults. These specific processing methods make full use of the historical fault information. The statistical result and the frequent episode rules reflect which hardware faults tend to occur during long-term operation of the machines and how those faults relate to each other, which helps identify hardware deficiencies and defects and enables hardware vendors to improve the hardware accordingly.

Description of the drawings

FIG. 1 is a schematic structural diagram of a hardware fault analysis system according to Embodiment 1 of the present invention;

FIG. 2 is a schematic flowchart of a hardware fault analysis method according to Embodiment 2 of the present invention;

FIG. 3 is a schematic flowchart of another hardware fault analysis method according to Embodiment 3 of the present invention;

FIG. 4 is a schematic flowchart of the process of acquiring the MCELOG log file in FIG. 3;

FIG. 5 is a schematic flowchart of the process of performing fault prediction in FIG. 3;

FIG. 6 is a schematic flowchart of the process of clearing alarm information in FIG. 3;

FIG. 7 is a schematic flowchart of the process of mining frequent episode rules in FIG. 3;

FIG. 8 is a schematic flowchart of the process of performing multi-dimensional statistics in FIG. 3.

Detailed description

The present invention will be further described in detail below with reference to the accompanying drawings.

Embodiment 1:

To solve the problems that the existing hardware fault analysis process cannot predict the hardware faults of all machines in the machine room in batches, that the user cannot directly perceive machine faults, and that the MCELOG log file is not fully used, this embodiment provides a hardware fault analysis system. The system can periodically predict the hardware faults of all machines in the machine room in batches, keep the machines in the machine room running normally over the long term, reduce the working time and workload of fault analysis, make full use of the MCELOG log file, and let the user perceive machine faults directly. The hardware fault analysis system 10 is shown in FIG. 1 and includes:

The user configuration module 101 is configured to configure the addresses of all machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;

The information collection module 102 is configured to obtain the addresses of the machines to be monitored, the storage path, and the collection period, periodically acquire the fault log file of the machine to be monitored corresponding to each address according to the collection period, and store the fault log file in the storage path;

The current fault prediction module 103 is configured to acquire the fault judgment condition and the fault log file in the storage path, and perform fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.

The user configuration module 101 configures the collection period of the fault log file so that the information collection module 102 collects the fault log file periodically, allowing the whole hardware fault analysis system to run automatically and periodically over the long term. Alternatively, the user configuration module 101 may leave the collection period unset, in which case a single fault prediction is performed on all machines to be monitored at the current time to obtain a prediction result.
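The interplay between the configured items and periodic collection can be sketched as follows. This is only an illustrative sketch: the names MonitorConfig, collect_once, run_collection, and the fetch_log callback are hypothetical and not part of the described system, and how a log is actually retrieved from a remote machine is deliberately left to the caller.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class MonitorConfig:
    """Settings held by the user configuration module (hypothetical names)."""
    addresses: List[str]                # addresses of all machines to be monitored
    storage_path: Path                  # where collected fault logs are stored
    collection_period: Optional[float]  # seconds between rounds; None = run once


def collect_once(config: MonitorConfig, fetch_log: Callable[[str], str]) -> List[Path]:
    """Fetch the fault log of every configured machine and store it locally."""
    config.storage_path.mkdir(parents=True, exist_ok=True)
    stored = []
    for address in config.addresses:
        target = config.storage_path / f"{address}.log"
        target.write_text(fetch_log(address), encoding="utf-8")
        stored.append(target)
    return stored


def run_collection(config: MonitorConfig, fetch_log: Callable[[str], str],
                   rounds: int, sleep: Callable[[float], None] = time.sleep) -> int:
    """Run up to `rounds` collection rounds and return how many were performed."""
    performed = 0
    for _ in range(rounds):
        collect_once(config, fetch_log)
        performed += 1
        if config.collection_period is None:
            break  # no period configured: a single collection only
        if performed < rounds:
            sleep(config.collection_period)
    return performed
```

Passing the sleep function as a parameter is a design choice of this sketch that makes the loop easy to exercise without waiting for real time to elapse.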

Preferably, to improve the accuracy of the determination, the fault judgment condition configured by the user configuration module 101 may be set as a fault time window for each fault and a fault threshold corresponding to each fault. Accordingly, the current fault prediction module 103 acquires the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path, and then performs fault prediction on the fault log file according to the fault time windows and fault thresholds of the different faults. Specifically, the current fault prediction module 103 counts the fault information in the fault log file that corresponds to a specific fault and falls within that fault's time window; when the count value is greater than the fault threshold corresponding to the fault, the hardware corresponding to the fault is predicted to be about to fail.

Preferably, the fault time window and the fault threshold used to determine whether hardware is failing are set to different values according to the type of hardware. For example, to determine whether a memory module is failing, the fault time window may be set to 24 hours and the fault threshold to 3; if 3 correctable errors occur on one memory module within 24 hours, the memory module is predicted to be about to fail.
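A minimal sketch of the window-and-threshold check follows. The function name and parameters are hypothetical. The text describes the comparison as "greater than the fault threshold", while the memory example is satisfied by exactly 3 errors, so the sketch exposes the comparison as an `inclusive` flag; that flag is an assumption of the sketch, not something the text specifies.

```python
from typing import Iterable


def predict_failure(timestamps: Iterable[float], now: float,
                    window_seconds: float, threshold: int,
                    inclusive: bool = False) -> bool:
    """Count faults inside the window ending at `now` and compare to the threshold."""
    # Only faults whose timestamp lies in [now - window, now] are counted.
    count = sum(1 for t in timestamps if now - window_seconds <= t <= now)
    # inclusive=True treats "count reaches the threshold" as a predicted failure.
    return count >= threshold if inclusive else count > threshold
```

For the memory example, a call would use a window of 24 * 3600 seconds and a threshold of 3, with the timestamps of the correctable errors recorded for one memory module.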

Of course, the fault judgment condition may be set to other content according to the user's needs or the actual application. For example, it may be set as a specific time period, an interval threshold reflecting the minimum time interval between two identical faults, and a threshold for the number of occurrences of the same fault. If the number of times a specific fault occurs within the time period exceeds the occurrence threshold, and the time interval between adjacent faults is less than the interval threshold, the hardware corresponding to the fault is predicted to be about to fail.
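This alternative condition can be sketched in the same style. The function and parameter names are hypothetical, and the sketch assumes that every pair of adjacent faults in the period must be closer together than the interval threshold; the text does not say whether all pairs or only some must satisfy this.

```python
from typing import Iterable


def predict_by_interval(timestamps: Iterable[float], period_start: float,
                        period_end: float, interval_threshold: float,
                        count_threshold: int) -> bool:
    """Predict failure when faults in the period are both numerous and closely spaced."""
    in_period = sorted(t for t in timestamps if period_start <= t <= period_end)
    if len(in_period) <= count_threshold:
        return False  # the fault did not occur often enough in the period
    gaps = [later - earlier for earlier, later in zip(in_period, in_period[1:])]
    # Assumption: all adjacent gaps must be below the interval threshold.
    return all(gap < interval_threshold for gap in gaps)
```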

Because the process by which the current fault prediction module 103 predicts that hardware is about to fail is a real-time analysis process, this embodiment borrows the leaky bucket algorithm of the mcelog user-mode program to count the fault information, so that results are predicted promptly while occupying as little memory, CPU, and other resources as possible. For faults that occur within the fault time window, the current fault prediction module 103 only needs to accumulate the count value; for faults that fall outside the fault time window, part of the count must be discarded. The discarded count value is calculated as follows. To improve efficiency, the count value is assumed to decay linearly over time, and the current fault prediction module 103 uses the following aging algorithm to calculate the discarded count value:

Attenuation coefficient = fault threshold / fault time window,

Discarded count value = attenuation coefficient * length of time by which the fault time window is exceeded.

It should be understood that the above aging algorithm rests on the assumption that the count value decays linearly over time; under a different assumption, other feasible algorithms can be used to calculate the discarded count value.
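The aging calculation can be illustrated with a small leaky-bucket counter. This is a sketch under stated assumptions rather than the mcelog implementation: the class name and fields are hypothetical, and the amount by which the fault time window is exceeded is interpreted as the elapsed time since the window began minus the window length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeakyBucket:
    threshold: int                        # fault threshold
    window: float                         # fault time window, in seconds
    count: float = 0.0                    # current (aged) count value
    window_start: Optional[float] = None  # time at which the current window began

    def add(self, timestamp: float) -> bool:
        """Record one fault at `timestamp`; return True if failure is predicted."""
        if self.window_start is None:
            self.window_start = timestamp
        elapsed = timestamp - self.window_start
        if elapsed > self.window:
            excess = elapsed - self.window
            # attenuation coefficient * excess, i.e. (threshold / window) * excess,
            # written as a single division to limit rounding error
            discarded = self.threshold * excess / self.window
            self.count = max(0.0, self.count - discarded)
            self.window_start = timestamp
        self.count += 1
        return self.count > self.threshold
```

Only two numbers are kept per monitored fault (the count and the window start), which is what makes this approach cheap in memory and CPU compared with storing every fault timestamp.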

Preferably, the hardware fault analysis system 10 further includes a presentation module 105 configured to present at least the prediction result on an interface, thereby indicating that specific hardware is about to fail; the prediction result may be displayed as a list or in another form.

Preferably, the hardware fault analysis system 10 in this embodiment further includes a clearing module 106 configured to clear at least one prediction result presented by the presentation module 105, preferably by manual clearing. The clearing module 106 may further convert the fault information in the fault log file that corresponds to the cleared prediction result into historical fault information.
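The hand-over from presented prediction results to historical fault information might look like the following sketch. The AlarmStore class and its in-memory storage are hypothetical and chosen only for illustration; the text does not describe how presented results or historical fault information are stored.

```python
from typing import Dict, List


class AlarmStore:
    """Holds presented prediction results and the historical fault information."""

    def __init__(self) -> None:
        self.active: Dict[str, List[str]] = {}  # hardware id -> fault records behind the result
        self.history: List[str] = []            # fault records converted to historical information

    def present(self, hardware: str, fault_records: List[str]) -> None:
        """Register a prediction result together with the fault records that caused it."""
        self.active[hardware] = list(fault_records)

    def clear(self, hardware: str) -> int:
        """Clear one presented result and move its fault records into the history."""
        records = self.active.pop(hardware, [])
        self.history.extend(records)
        return len(records)
```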

To make better use of the historical fault information and give the user more information about hardware faults, which makes the machines easier to maintain, the user configuration module 101 is preferably further configured to configure historical fault information processing parameters. The hardware fault analysis system 10 further includes a historical fault information processing module 104 configured to process the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between faults.

Preferably, the historical fault information processing parameters configured by the user configuration module 101 may include frequent episode rule mining parameters; the historical fault information processing module 104 is specifically configured to read the frequent episode rule mining parameters and process the historical fault information according to them, so as to mine frequent episode rules between faults.

A frequent episode rule expresses a relationship between faults in time: for example, if fault set A occurs within a time period (the sliding time window), another fault set B may also occur within that sliding time window, denoted A=>B.

A frequent episode rule can be measured by two indicators, support and confidence. Support is the probability, over all time windows, that fault set A and fault set B both appear in the same window; confidence is the probability, over all time windows in which fault set A appears, that fault set B also appears.
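The two indicators can be computed directly from their definitions once each time window is represented as the set of faults observed in it. The function below is a minimal sketch with hypothetical names; the probabilities are estimated as simple window frequencies.

```python
from typing import FrozenSet, Iterable, List, Tuple


def support_and_confidence(windows: List[FrozenSet[str]],
                           a: Iterable[str],
                           b: Iterable[str]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule A=>B over the given time windows."""
    set_a, set_b = frozenset(a), frozenset(b)
    total = len(windows)
    with_a = sum(1 for w in windows if set_a <= w)               # windows containing A
    with_both = sum(1 for w in windows if (set_a | set_b) <= w)  # windows containing A and B
    support = with_both / total if total else 0.0
    confidence = with_both / with_a if with_a else 0.0
    return support, confidence
```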

Frequent episode rules are similar to the relationship between a root-cause alarm and its derived alarms in a communication system (alarm association rules). The WINEPI algorithm is widely used for mining alarm association rules. It uses a preset sliding time window, sliding step size, minimum support, minimum confidence, and other parameters to calculate how closely events adjoin within the time window and to find the partial order between events in time.

Therefore, this embodiment can apply the WINEPI algorithm to frequent episode rule mining of the fault log file:

Each record of the MCELOG log is structured data, including the generation time, MCE GSTATUS, MCE BANK, BANK STATUS, and so on, so one MCELOG record can be treated as an event;

The MCE BANK, BANK STATUS, and other information of each MCELOG record serve as the attributes of the event;

The set of MCE BANK values, the set of BANK STATUS values, and so on are defined as the respective attribute domains;

Further, the set of MCELOG records on one machine, ordered in time, can be regarded as the event sequence to be analyzed.
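The mapping from MCELOG records to events, attributes, attribute domains, and an event sequence can be sketched with a small data structure. The MceEvent class and its field names are hypothetical; parsing of the actual MCELOG file format is not shown, and the records are assumed to be already decoded into these fields.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class MceEvent:
    """One MCELOG record treated as an event (field names are hypothetical)."""
    time: float       # generation time of the record
    gstatus: int      # MCE GSTATUS value
    bank: int         # MCE BANK value
    bank_status: int  # BANK STATUS value


def event_sequence(records: Iterable[MceEvent]) -> List[MceEvent]:
    """Order the records of one machine in time to form the event sequence."""
    return sorted(records, key=lambda event: event.time)


def attribute_domains(events: Iterable[MceEvent]) -> Dict[str, Set[int]]:
    """Collect the set of values observed for each attribute."""
    events = list(events)
    return {
        "bank": {event.bank for event in events},
        "bank_status": {event.bank_status for event in events},
    }
```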

Preferably, the frequent episode rule mining parameters configured by the user configuration module 101 may include a sliding time window, a sliding step size, a support threshold, and a confidence threshold. The historical fault information processing module 104 is specifically configured to calculate, according to the sliding time window and the sliding step size, the support and confidence between faults in the historical fault information, and to determine as frequent episode rules those relationships between faults whose support is greater than the support threshold or whose confidence is greater than the confidence threshold.
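A heavily simplified, WINEPI-style sketch of this mining step is given below. It is not the full WINEPI algorithm: it considers only rules between single fault types, starts the first window at the earliest event, and ignores ordering inside a window. All names are hypothetical. The sketch requires both thresholds to be exceeded, which is the usual practice for association rules; the text's "or" variant can be obtained by replacing `and` with `or` in the final condition.

```python
from itertools import permutations
from typing import FrozenSet, List, Tuple

Event = Tuple[float, str]  # (timestamp, fault type)


def sliding_windows(events: List[Event], width: float, step: float) -> List[FrozenSet[str]]:
    """Slide a window of `width` over the events in increments of `step`."""
    if not events:
        return []
    times = [t for t, _ in events]
    start, end = min(times), max(times)
    windows = []
    while start <= end:
        windows.append(frozenset(k for t, k in events if start <= t < start + width))
        start += step
    return windows


def mine_rules(events: List[Event], width: float, step: float,
               support_threshold: float,
               confidence_threshold: float) -> List[Tuple[str, str, float, float]]:
    """Return rules (A, B, support, confidence) meaning A=>B within one window."""
    windows = sliding_windows(events, width, step)
    kinds = sorted({k for _, k in events})
    rules = []
    for a, b in permutations(kinds, 2):
        with_a = sum(1 for w in windows if a in w)
        both = sum(1 for w in windows if a in w and b in w)
        if with_a == 0:
            continue
        support = both / len(windows)
        confidence = both / with_a
        if support > support_threshold and confidence > confidence_threshold:
            rules.append((a, b, support, confidence))
    return rules
```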

Of course, it should be understood that other feasible algorithms may also be used to mine the frequent episode rules.

Preferably, the historical fault information processing parameters configured by the user configuration module 101 may further include a statistical condition, which may include a statistical dimension and a statistical time period; the historical fault information processing module 104 may be further configured to classify, count, and sort the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.

The statistical dimension and the statistical time period may be set according to the user's actual needs. Preferably, the statistical dimensions may include faulty hardware and fault type. The historical fault information processing module 104 may use a TopN algorithm to sort the historical fault information and obtain the top-ranked faulty hardware and fault types, namely the TopN faulty hardware and the TopN fault types, which helps hardware vendors address the TopN faults. The statistical time period may be set, according to actual conditions, to a period in which machine faults occur more frequently.
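The classify, count, and sort steps for one statistical dimension can be sketched with a counter. The record layout (dictionaries with "time", "hardware", and "fault_type" keys) and the function name are hypothetical choices made for this illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_n(records: List[Dict[str, object]], dimension: str,
          period_start: float, period_end: float, n: int) -> List[Tuple[object, int]]:
    """Count records per value of `dimension` inside the period and return the top n."""
    counts = Counter(
        record[dimension]
        for record in records
        if period_start <= record["time"] <= period_end
    )
    # most_common sorts by descending count
    return counts.most_common(n)
```

Calling the function once with "hardware" and once with "fault_type" as the dimension yields the TopN faulty hardware and the TopN fault types for the same statistical time period.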

Preferably, so that the user can intuitively understand which fault types occur easily, which hardware is prone to failure, and what the frequent episode rules are, the presentation module 105 may also be configured to present the statistical result and the frequent episode rules on the interface. Preferably, the presentation module 105 presents the statistical result by multiple dimensions (faulty hardware, fault type); more preferably, it presents the TopN faulty hardware and the TopN fault types in the statistical result. Preferably, the presentation module 105 presents the statistical result and the frequent episode rules on the interface as lists, pie charts, histograms, and other forms.

The beneficial effects of this embodiment are as follows. The hardware fault analysis system of this embodiment periodically obtains the fault log files of the machines to be monitored in batches and then, from the fault log files, predicts the faults that may occur on those machines and identifies which hardware may be about to fail, so that multiple machines can be monitored uniformly over the long term and kept in good working order. On this basis, the system is further provided with a presentation module that presents the prediction result to the user intuitively and warns of possible failures, helping the user understand the state of the machines quickly and accurately and discover hidden problems. This embodiment can also process the historical fault information to obtain more meaningful information: multi-dimensional statistics on the historical fault information reveal the fault types that different types of hardware are prone to, and frequent episode rule mining reveals the relationships between hardware faults. This information can help hardware manufacturers analyze defects and find directions for improving the hardware.

Embodiment 2:

Referring to FIG. 2, the embodiment provides a hardware failure analysis method, including:

S201: Configure an address of all the machines to be monitored, a storage path of the fault log file, a collection period of the fault log file, and a fault judgment condition.

S202: Obtain the addresses of the machines to be monitored, the storage path, and the collection period, periodically obtain the fault log file of the machine to be monitored corresponding to each address according to the collection period, and store the fault log file in the storage path.

S203: Acquire a fault log file in the fault judgment condition and the storage path, perform fault prediction processing on the fault log file according to the fault judgment condition, and obtain a prediction result.

In the embodiment of the present invention, to obtain an accurate prediction result, the fault judgment condition configured in S201 includes a fault time window for each fault and a fault threshold corresponding to each fault. Correspondingly, S203 includes acquiring the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path, and counting the fault information in the fault log file that falls within the fault time window of each fault; when the count value is greater than the fault threshold corresponding to the fault, the hardware corresponding to the fault is predicted to be about to fail.

Because the process in S203 of predicting that hardware is about to fail is a real-time analysis process, this embodiment borrows the leaky bucket algorithm of the mcelog user-mode program to count the fault information, so that results are predicted promptly while using as little memory, CPU, and other resources as possible. For faults that occur within the fault time window, only the count value needs to be accumulated; for faults that fall outside the fault time window, part of the count is discarded. This embodiment calculates the discarded count value with the same aging algorithm as in Embodiment 1: attenuation coefficient = fault threshold / fault time window, and discarded count value = attenuation coefficient * length of time by which the fault time window is exceeded.

It should be understood that, in this embodiment, other algorithms may be selected according to the actual situation of the fault and the needs of the user to calculate the count value that needs to be discarded.
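The counting scheme described above can be sketched as a leaky bucket with an aging step. The aging formula itself is not reproduced in this text, so the linear rule below (discard `threshold` counts for each full time window that has elapsed) is an assumption chosen for illustration, as are all names; it is not presented as the MCELOG implementation or as the formula of this embodiment.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Fault counter for one kind of fault; names and aging rule are assumptions."""
    window_seconds: float    # fault time window
    threshold: int           # fault threshold
    count: float = 0.0       # accumulated count value
    last_aged: float = 0.0   # time of the last aging step

    def _age(self, now: float) -> None:
        elapsed = now - self.last_aged
        if elapsed >= self.window_seconds:
            # Assumed linear aging: drop `threshold` counts per elapsed window.
            discard = (elapsed / self.window_seconds) * self.threshold
            self.count = max(0.0, self.count - discard)
            self.last_aged = now

    def add(self, now: float, faults: int = 1) -> bool:
        """Record faults seen at time `now`; return True if failure is predicted."""
        self._age(now)
        self.count += faults
        return self.count > self.threshold


bucket = LeakyBucket(window_seconds=3600, threshold=3)
print([bucket.add(t) for t in (0, 60, 120, 180)])  # [False, False, False, True]
```

Only two numbers per fault kind are kept (the count and a timestamp), which is what makes this approach cheap in memory and CPU compared with storing every fault event.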

The fault information used in the fault prediction process in S203 is generally the new fault information in the currently collected fault log file, that is, fault information that has not previously been used to predict failing hardware.

Preferably, the hardware fault analysis method may further include presenting at least the prediction result on an interface, thereby indicating which specific hardware is about to fail. The prediction result may be displayed as a list or in another form.

Preferably, the hardware fault analysis method in this embodiment further includes clearing at least one of the prediction results presented above, preferably manually. When the user selects a prediction result to be cleared, the fault information in the fault log file corresponding to that prediction result is converted into historical fault information.

To make better use of the historical fault information and to provide the user with more information about hardware faults, which makes it easier for the user to maintain the machines, the hardware fault analysis method preferably further includes configuring historical fault information processing parameters and processing the historical fault information according to those parameters to obtain the logical relationship between the faults.

Frequent episode rules between faults are similar to the relationship between a root-cause alarm and its derived alarms in a communication system (alarm association rules). In the mining of alarm association rules, the WINEPI algorithm is currently widely used. The algorithm uses a preset sliding time window, sliding step size, minimum support, minimum confidence, and other parameters to measure how closely events occur together within the time window and to find the partial order between events in time.

Preferably, the configured frequent episode rule mining parameters include a sliding time window, a sliding step size, a support threshold, and a confidence threshold. Processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationship between the faults then includes: counting the support and confidence between the faults in the historical fault information according to the sliding time window and the sliding step size, and determining the frequent episode rules between faults that are greater than the support threshold or the confidence threshold. Of course, it should be understood that other feasible algorithms may also be used to mine the frequent episode rules.
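A window-based count of support and confidence can be sketched as follows. This is a deliberately simplified illustration restricted to two-fault rules of the form "a is followed by b"; it is not a full WINEPI implementation. The function name, the event representation, the window alignment (starting at the first event), and the choice to require both thresholds to be exceeded are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, fault type)


def mine_rules(events: List[Event], window: float, step: float,
               min_support: float, min_confidence: float
               ) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return {(a, b): (support, confidence)} for rules "a is followed by b"."""
    if not events:
        return {}
    events = sorted(events)
    first, last = events[0][0], events[-1][0]
    single: Counter = Counter()   # number of windows containing fault a
    pair: Counter = Counter()     # number of windows where a occurs before b
    n_windows = 0
    start = first
    while start <= last:
        n_windows += 1
        first_seen: Dict[str, float] = {}
        pairs_here = set()
        for t, fault in events:
            if not (start <= t < start + window):
                continue
            for earlier, t0 in first_seen.items():
                if earlier != fault and t0 < t:   # strict temporal order
                    pairs_here.add((earlier, fault))
            first_seen.setdefault(fault, t)
        single.update(first_seen.keys())
        pair.update(pairs_here)
        start += step
    rules = {}
    for (a, b), n in pair.items():
        support = n / n_windows      # share of windows showing a then b
        confidence = n / single[a]   # share of a-windows that also show b after a
        if support > min_support and confidence > min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules


history = [(0, "A"), (10, "B"), (100, "A"), (110, "B"), (200, "C")]
print(mine_rules(history, window=50, step=50, min_support=0.3, min_confidence=0.8))
# {('A', 'B'): (0.4, 1.0)}
```

In the example, fault A is followed by fault B in two of the five windows (support 0.4), and every window containing A also contains a later B (confidence 1.0).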

Preferably, the configured historical fault information processing parameters further include a statistical condition, which may include a statistical dimension and a statistical time period. Processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationship between the faults then includes: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.

The statistical dimension and the statistical time period may be set according to the user's actual needs. Preferably, the statistical dimension may include fault hardware and fault type, and the statistical time period may be set, according to actual conditions, to a period in which machine faults occur more frequently. The sorting of historical fault information may use a TopN algorithm to obtain the most frequent fault hardware and fault types, that is, the TopN fault hardware and the TopN fault types, which helps hardware manufacturers address the most frequent faults.
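The classify, count, and sort steps can be sketched with a counter per statistical dimension. The record layout `(timestamp, fault hardware, fault type)`, the function name `top_n`, and the sample values are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# A historical fault record: (timestamp, fault hardware, fault type)
Record = Tuple[float, str, str]


def top_n(records: Iterable[Record], start: float, end: float, n: int
          ) -> Dict[str, List[Tuple[str, int]]]:
    """Count faults in [start, end) per dimension and keep the n most frequent."""
    by_hardware: Counter = Counter()
    by_type: Counter = Counter()
    for ts, hardware, fault_type in records:
        if start <= ts < end:              # statistical time period
            by_hardware[hardware] += 1     # dimension: fault hardware
            by_type[fault_type] += 1       # dimension: fault type
    return {
        "hardware": by_hardware.most_common(n),
        "type": by_type.most_common(n),
    }


records = [
    (1, "DIMM0", "corrected"),
    (2, "DIMM0", "corrected"),
    (3, "CPU1", "uncorrected"),
    (4, "DIMM0", "corrected"),
    (99, "CPU1", "corrected"),   # outside the period used below
]
print(top_n(records, start=0, end=10, n=1))
# {'hardware': [('DIMM0', 3)], 'type': [('corrected', 3)]}
```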

Preferably, so that the user can intuitively understand which fault types occur readily, which hardware fails, and the frequent episode rules, this embodiment may also present the statistical results and the frequent episode rules on the interface, preferably presenting the statistical results along multiple dimensions (fault hardware, fault type). Preferably, the statistical results presented may be the TopN fault hardware and the TopN fault types described above. There are also multiple ways of presenting them; for example, the statistical results and frequent episode rules may be presented on the interface as lists, pie charts, bar charts, and other forms.

The beneficial effects of this embodiment are as follows. This embodiment provides a hardware fault analysis method. Unlike the prior art, which can predict faults only for a single machine to be monitored, this embodiment performs long-term fault prediction on the machines to be monitored and obtains their long-term running state. To let the user quickly and intuitively understand the possible failures of the machines to be monitored, the prediction result can be presented on the interface once it is obtained. The user can also delete a presented prediction result, which improves the user experience. The fault information corresponding to a deleted prediction result is not discarded but is converted into historical fault information. This embodiment also processes the historical fault information to obtain at least multi-dimensional statistical results and the frequent episode rules of the faulty components. Based on these, hardware manufacturers can analyze the hardware from multiple angles, find its shortcomings, make improvements, and obtain better-quality hardware.

Embodiment 3:

Referring to FIG. 3, this embodiment further provides a hardware failure analysis method based on MCELOG. Before the method of this embodiment predicts hardware failures of the machines, the configuration items used in the fault prediction process must first be configured. The specific content and meaning of the configuration items are shown in Table 1, the configuration information table of the hardware fault prediction method. Table 2 lists the display items, including the content and meaning of each item to be displayed on the interface.

Table 1

Table 2

After configuring the configuration items in Table 1, you can perform batch hardware failure analysis on all machines. Referring to Figure 3, the hardware failure analysis process is as follows:

S301: Read configuration items 11, 12, and 13 to obtain an MCELOG log file.

S302: Read configuration item 14, perform fault prediction processing on the MCELOG log file obtained in S301, and generate alarm information.

S303: Present display item 21 of Table 2, that is, the alarm information generated in S302, on the interface.

S304: Clear at least one piece of alarm information presented on the interface.

S305: Read configuration item 15, select the fault information to be mined, and perform frequent episode rule mining.

S306: Read configuration item 16, select the fault information to be mined, and perform multi-dimensional statistics on the fault information.

S307: Present display item 22 and display item 23 of Table 2 on the interface.

The above S301 is described in detail with reference to FIG. 4, and S301 includes the following steps:

S3011: Read and parse the address of all hosts to be monitored, the storage path of the MCELOG log file, and the collection period of the fault log file.

S3012: The host is found according to the address of the host to be monitored, and the MCELOG log file is obtained from the host to be monitored according to the collection period, and is stored in the configured storage path.

S3013: Determine whether the MCELOG log files of all hosts to be monitored have been obtained for the current period; if yes, proceed to S3014; otherwise, return to S3012 and continue obtaining the MCELOG log files of the hosts to be monitored.

S3014: End.
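Steps S3011 to S3014 can be sketched as one collection pass per period. The source does not say how a log file is transferred from a host, so the transport is passed in as a hypothetical `fetch_log` callable; the file naming scheme and the bounded retry count (`max_attempts`) are likewise assumptions added for illustration.

```python
import os
from typing import Callable, Dict, Iterable

FetchLog = Callable[[str], bytes]  # hypothetical transport: address -> log contents


def collect_once(addresses: Iterable[str], storage_path: str,
                 fetch_log: FetchLog, max_attempts: int = 3) -> Dict[str, str]:
    """Fetch each host's log for the current period; return {address: saved file}."""
    os.makedirs(storage_path, exist_ok=True)
    saved: Dict[str, str] = {}
    pending = list(addresses)
    for _ in range(max_attempts):
        still_pending = []
        for address in pending:
            try:
                data = fetch_log(address)          # S3012: obtain the log from the host
            except OSError:
                still_pending.append(address)      # retry this host on the next pass
                continue
            path = os.path.join(storage_path, f"{address}.mcelog")  # assumed naming
            with open(path, "wb") as handle:
                handle.write(data)
            saved[address] = path
        pending = still_pending
        if not pending:                            # S3013: all hosts collected
            break
    return saved
```

A scheduler would call `collect_once` once per collection period, for example in a loop that sleeps for the configured period between passes.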

The above S302 is described in detail with reference to FIG. 5, and S302 includes the following steps:

S3021: Read and parse a fault time window and a fault threshold corresponding to a specific hardware;

S3022: Read the newly acquired MCELOG log file in S301.

S3023: Determine whether the fault information in the MCELOG log file falls within the fault time window; if yes, proceed to S3024; otherwise, proceed to S3025.

S3024: Increase the count value.

S3025: Reduce the count value according to the aging algorithm, and jump to S3026.

S3026: Determine whether the count value reaches the fault threshold; if yes, proceed to S3027; otherwise, jump to S3022.

S3027: Determine that the hardware is about to fail, and generate alarm information.

S3028: End.

The above S304 is described in detail with reference to FIG. 6, and S304 includes the following steps:

S3041: The user selects alarm information to be cleared.

S3042: Convert the fault information in the MCELOG log file corresponding to the alarm information to be cleared into historical fault information.

S3043: Re-collect the DMIDECODE information and the MCELOG log file to obtain the current fault information.

S3044: Clear the selected alarm information on the interface.

S3045: End.
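The bookkeeping in S3041 to S3044 can be sketched as moving fault records from a current set to a history list when an alarm is cleared. `FaultStore` and its fields are hypothetical, in-memory stand-ins; the re-collection of S3043 is outside the scope of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FaultStore:
    """In-memory stand-in for alarm and fault storage (hypothetical)."""
    # alarm id -> fault records that led to the alarm
    current: Dict[str, List[str]] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def clear_alarm(self, alarm_id: str) -> None:
        """Remove the alarm and keep its fault records as historical information."""
        # S3042: faults behind the cleared alarm become historical fault information.
        # S3044: popping the entry removes the alarm from what the interface shows.
        self.history.extend(self.current.pop(alarm_id))


store = FaultStore(current={"alarm-1": ["fault-a", "fault-b"], "alarm-2": ["fault-c"]})
store.clear_alarm("alarm-1")
print(store.history)          # ['fault-a', 'fault-b']
print(sorted(store.current))  # ['alarm-2']
```

Because the records are moved rather than deleted, they remain available to the later mining and statistics steps.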

The above S305 is described in detail with reference to FIG. 7, and S305 includes the following steps:

S3051: The user selects the historical fault information to be mined;

S3052: Read and parse the sliding time window, the sliding step size, the support threshold, and the confidence threshold;

S3053: Read historical fault information to be mined, run WINEPI algorithm, and mine frequent episode rules between faults;

S3054: End.

The above S306 is described in detail with reference to FIG. 8, and S306 includes the following steps:

S3061: Read and parse a statistical time period and a statistical dimension, where the statistical dimension includes fault hardware and a fault type;

S3062: Acquire historical fault information, classify, count, and sort historical fault information according to the statistical dimension and the statistical time period;

S3063: End.

The above is a further detailed description of the present invention in connection with specific embodiments, and the specific implementation of the present invention is not limited to this description. It will be apparent to those skilled in the art that modifications may be made without departing from the spirit and scope of the invention.

Industrial applicability

According to the technical solution provided by the embodiments of the present invention, the processing of historical fault information by the historical fault information processing module may include: classifying, counting, and sorting the historical fault information according to the statistical dimension and statistical time period configured by the user configuration module to obtain a statistical result; and processing the historical fault information according to the frequent episode rule mining parameters configured by the user configuration module to mine the frequent episode rules between faults. This processing makes full use of the historical fault information and yields both the statistical result and the frequent episode rules between hardware faults. The statistical result and the frequent episode rules reflect the faults that hardware experiences during long-term operation on the machines and the relationships between those faults, which helps identify deficiencies and defects in the hardware so that hardware vendors can improve it.

Claims

A hardware failure analysis system, comprising:

a user configuration module, configured to configure the addresses of all machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;

an information collection module, configured to acquire the addresses of the machines to be monitored, the storage path, and the collection period, to periodically acquire, according to the collection period, the fault log file of the machine to be monitored corresponding to each address, and to store the fault log file in the storage path; and

a current fault prediction module, configured to acquire the fault judgment condition and the fault log file in the storage path, and to perform fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.
The hardware fault analysis system of claim 1, wherein the fault judgment condition configured by the user configuration module comprises a fault time window of each fault and a fault threshold corresponding to each fault; and the current fault prediction module is configured to acquire the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path, to count the fault information in the fault log file within the fault time window of each fault, and, when the count value is greater than the fault threshold corresponding to a fault, to predict that the hardware corresponding to that fault is about to fail.
The hardware failure analysis system of claim 1 or 2, further comprising: a result presentation module configured to present at least the prediction result at the interface.
The hardware failure analysis system according to claim 3, further comprising: a clearing module configured to clear at least one of the prediction results presented by the result presentation module, and to convert the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.
The hardware failure analysis system according to claim 4, wherein the user configuration module is further configured to configure historical fault information processing parameters; and the hardware failure analysis system further includes a historical fault information processing module configured to process the historical fault information according to the historical fault information processing parameters to obtain a logical relationship between the faults.
The hardware fault analysis system of claim 5, wherein the historical fault information processing parameters configured by the user configuration module comprise frequent episode rule mining parameters; and the historical fault information processing module is specifically configured to read the frequent episode rule mining parameters, process the historical fault information according to the frequent episode rule mining parameters, and mine frequent episode rules between the faults.
The hardware failure analysis system according to claim 6, wherein the frequent episode rule mining parameters configured by the user configuration module comprise: a sliding time window, a sliding step size, a support threshold, and a confidence threshold; and the historical fault information processing module is configured to count the support and the confidence between the faults in the historical fault information according to the sliding time window and the sliding step size, and to determine the frequent episode rules between faults that are greater than the support threshold or the confidence threshold.
The hardware fault analysis system according to claim 5, wherein the historical fault information processing parameters configured by the user configuration module comprise a statistical condition, the statistical condition including a statistical dimension and a statistical time period; and the historical fault information processing module is configured to classify, count, and sort the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.
A hardware failure analysis method, comprising:

configuring the addresses of all machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;

acquiring the addresses of the machines to be monitored, the storage path, and the collection period, periodically acquiring, according to the collection period, the fault log file of the machine to be monitored corresponding to each address, and storing the acquired fault log file in the storage path; and

acquiring the fault judgment condition and the fault log file in the storage path, performing fault prediction processing on the fault log file according to the fault judgment condition, and obtaining a prediction result.
The hardware failure analysis method according to claim 9, wherein configuring the fault judgment condition comprises: configuring a fault time window of each fault and a fault threshold corresponding to each fault; and acquiring the fault judgment condition and the fault log file in the storage path, performing fault prediction processing on the fault log file according to the fault judgment condition, and obtaining a prediction result comprises: acquiring the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path; counting the fault information in the fault log file within the fault time window of each fault; and, when the count value is greater than the fault threshold corresponding to a fault, predicting that the hardware corresponding to that fault is about to fail.
The hardware failure analysis method according to claim 9 or 10, further comprising, after obtaining the prediction result, presenting at least the prediction result on an interface.
The hardware failure analysis method according to claim 11, further comprising, after presenting the prediction result: clearing at least one of the presented prediction results, and converting the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.
The hardware fault analysis method according to claim 12, further comprising configuring a historical fault information processing parameter, and processing the historical fault information according to the historical fault information processing parameter to obtain a logical relationship between the faults.
The hardware failure analysis method according to claim 13, wherein configuring the historical fault information processing parameters comprises configuring frequent episode rule mining parameters; and processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationship between the faults comprises: reading the frequent episode rule mining parameters, processing the historical fault information according to the frequent episode rule mining parameters, and mining frequent episode rules between the faults.
The hardware failure analysis method according to claim 14, wherein the configured frequent episode rule mining parameters include: a sliding time window, a sliding step size, a support threshold, and a confidence threshold; and processing the historical fault information according to the frequent episode rule mining parameters and mining the frequent episode rules between the faults comprises: counting the support and the confidence between the faults in the historical fault information according to the sliding time window and the sliding step size, and determining the frequent episode rules between faults that are greater than the support threshold or the confidence threshold.
The hardware failure analysis method according to claim 13, wherein the configured historical fault information processing parameters include a statistical condition, the statistical condition including a statistical dimension and a statistical time period; and processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationship between the faults comprises: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.