WO2016188175A1 - Hardware fault analysis system and method - Google Patents

Hardware fault analysis system and method

Info

Publication number
WO2016188175A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
historical
log file
hardware
information
Prior art date
Application number
PCT/CN2016/075547
Other languages
French (fr)
Chinese (zh)
Inventor
文洋
谈虎
王亮
蔡衢
蒋勇
蒋彪
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2016188175A1 publication Critical patent/WO2016188175A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour

Definitions

  • the present invention relates to the field of computer applications, and in particular, to a hardware failure analysis system and method.
  • the data center computer room is the foundation of cloud computing, and the pressure is increasing.
  • the user is provided with reliable and good service.
  • an error generated by the machine's hardware is reported to the BIOS (Basic Input Output System) through an SMI (System Management Interrupt); the BIOS performs a series of processing steps and then reports the error to the operating system kernel through an NMI (Non Maskable Interrupt); the operating system then handles the error in its MCE (machine check exception) interrupt handler.
  • BIOS Basic Input Output System
  • SMI System Management Interrupt
  • NMI Non Maskable Interrupt
  • MCE machine check exception
  • in the MCE interrupt handler, the contents of the CPU's exception information registers are read and stored in a ring buffer of the /dev/mcelog character device; the user-mode program mcelog polls the /dev/mcelog character device, parses the register contents, and records them in the MCELOG log file. By analyzing this exception information, the user-mode program mcelog can implement the PFA (Predictive Failure Analysis) function.
  • PFA Predictive Failure Analysis
  • the user-mode program mcelog in the above technology can only run on each individual machine and can only predict faults of that machine; it cannot predict hardware faults for all machines in the equipment room in batches. To learn the fault information of every machine in the equipment room, mcelog must therefore run on each machine and the fault information must be viewed machine by machine, which increases working time and workload. Furthermore, the fault information obtained by mcelog is only recorded in the MCELOG log file in the background, so the user cannot perceive it directly and the user experience is poor. When the MCELOG log file is full, old fault information is discarded without being fully used, which wastes storage resources and the information in the MCELOG log file; the MCELOG log file is also not used to support the normal operation of the machine.
  • the main technical problem to be solved by the embodiments of the present invention is to provide a hardware fault analysis system and method that address the following problems of hardware fault analysis in the prior art: hardware faults of all machines in the equipment room cannot be predicted in batches over a long period, the working time is long, and the workload is heavy.
  • an embodiment of the present invention provides a hardware fault analysis system, including:
  • the user configuration module is configured to configure the address of all the machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;
  • the information collection module is configured to obtain the addresses of the machines to be monitored, the storage path, and the collection period, to periodically acquire, according to the collection period, the fault log file of the machine to be monitored corresponding to each address, and to store the fault log file in the storage path;
  • the current fault prediction module is configured to obtain the fault judgment condition and the fault log file in the storage path, and to perform fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.
  • the fault judgment condition configured by the user configuration module includes a fault time window for each fault and a fault threshold corresponding to each fault; the current fault prediction module is specifically configured to obtain the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path, and to count the fault information in the fault log file that falls within the fault time window of each fault; when the count value is greater than the fault threshold corresponding to the fault, the hardware corresponding to the fault is predicted to be about to fail.
  • a result presentation module is further included, configured to present at least the prediction result on the interface.
  • a clearing module is further included, configured to clear at least one prediction result presented by the result presentation module and to convert the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.
  • the user configuration module is further configured to configure a historical fault information processing parameter;
  • the hardware fault analysis system further includes a historical fault information processing module, configured to process the historical fault information according to the historical fault information processing parameter to obtain the logical relationships between the faults.
  • the historical fault information processing parameter configured by the user configuration module includes frequent episode rule mining parameters; the historical fault information processing module is specifically configured to read the frequent episode rule mining parameters, process the historical fault information according to them, and mine the frequent episode rules between the faults.
  • the frequent episode rule mining parameters configured by the user configuration module specifically include a sliding time window, a sliding step size, a support threshold, and a confidence threshold; the historical fault information processing module is specifically configured to compute, according to the sliding time window and the sliding step size, the support and confidence between the faults in the historical fault information, and to determine the frequent episode rules between faults that exceed the support threshold or the confidence threshold.
  • the historical fault information processing parameter configured by the user configuration module includes a statistical condition, and the statistical condition includes a statistical dimension and a statistical time period; the historical fault information processing module is specifically configured to classify, count, and sort the historical fault information according to the statistical dimension and the statistical time period to obtain statistical results.
  • the embodiment of the invention further provides a hardware fault analysis method, including:
  • configuring the fault judgment condition includes configuring a fault time window for each fault and a fault threshold corresponding to each fault; acquiring the fault judgment condition and the fault log file in the storage path, performing fault prediction processing on the fault log file according to the fault judgment condition, and obtaining the prediction result includes: obtaining the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path; and counting the fault information in the fault log file that falls within the fault time window of each fault; when the count value is greater than the fault threshold corresponding to the fault, the hardware corresponding to the fault is predicted to be about to fail.
  • the method further includes presenting at least a prediction result on the interface.
  • the method further includes: clearing the presented at least one prediction result, and converting the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.
  • the historical fault information processing parameter is configured, and the historical fault information is processed according to the historical fault information processing parameter to obtain a logical relationship between the faults.
  • configuring the historical fault information processing parameters includes configuring frequent episode rule mining parameters; processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between the faults includes: reading the frequent episode rule mining parameters, processing the historical fault information according to them, and mining the frequent episode rules between the faults.
  • the configured frequent episode rule mining parameters include a sliding time window, a sliding step size, a support threshold, and a confidence threshold; processing the historical fault information according to the frequent episode rule mining parameters and mining the frequent episode rules between the faults includes: computing the support and confidence between the faults in the historical fault information according to the sliding time window and the sliding step size, and determining the frequent episode rules between faults that exceed the support threshold or the confidence threshold.
  • the configured historical fault information processing parameters include a statistical condition, and the statistical condition includes a statistical dimension and a statistical time period; processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between the historical faults includes: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period to obtain statistical results.
  • the embodiments of the present invention provide a hardware fault analysis system and method. In the hardware fault analysis system, the user configuration module configures the addresses of all the machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition; the information collection module acquires the addresses, the storage path, and the collection period, periodically acquires the fault log file of the machine to be monitored corresponding to each address according to the collection period, and stores the fault log file in the storage path; the current fault prediction module obtains the fault judgment condition and the fault log file in the storage path, performs fault prediction processing on the fault log file according to the fault judgment condition, and obtains a prediction result.
  • because the system obtains the addresses of all the machines to be monitored, it can locate all of them and obtain their fault log files in batches, so faults can be predicted for all the machines to be monitored in the equipment room at the same time. Because the fault log files are collected periodically according to the collection period, the system can automatically perform fault prediction on the monitored machines over a long period and obtain the fault information of all of them, achieving long-term, batch prediction of hardware faults for all machines in the equipment room. The prediction provides a reference for hardware replacement, so that the user can replace the hardware predicted to fail according to the fault information, which greatly saves time, reduces the workload, and helps keep the monitored machines running normally over the long term.
  • the hardware fault analysis system of the present invention further includes a historical fault information processing module, which can process the historical fault information according to the historical fault information processing parameters configured by the user configuration module and obtain the logical relationships between the historical faults, so that the fault log file is fully used.
  • the processing of the historical fault information by the historical fault information processing module may include: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period configured by the user configuration module to obtain statistical results; and processing the historical fault information according to the frequent episode rule mining parameters configured by the user configuration module to mine the frequent episode rules between the faults. These processing methods make full use of the historical fault information. The statistical results and the frequent episode rules reflect the faults that the hardware is prone to during long-term operation of the machine and the relationships between them, which helps identify deficiencies and defects in the hardware and enables hardware vendors to improve the hardware accordingly.
  • FIG. 1 is a schematic structural diagram of a hardware failure analysis system according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic flowchart of a hardware fault analysis method according to Embodiment 2 of the present invention.
  • FIG. 3 is a schematic flowchart of another hardware failure analysis method according to Embodiment 3 of the present invention.
  • FIG. 4 is a schematic flowchart of a process of acquiring an MCELOG log file in FIG. 3;
  • FIG. 5 is a schematic flowchart of a process of performing fault prediction in FIG. 3;
  • FIG. 6 is a schematic flowchart of a process of performing clearing alarm information in FIG. 3;
  • FIG. 7 is a schematic flow chart of a process of performing a mining frequent plot rule in FIG. 3;
  • FIG. 8 is a schematic flow chart of the process of performing multi-dimensional statistics in FIG.
  • Embodiment 1:
  • the embodiment provides a hardware fault analysis system.
  • the hardware fault analysis system can periodically predict, in batches, hardware faults of all machines in the equipment room, so that the machines in the equipment room run normally over the long term, the working time and workload of fault analysis are reduced, the MCELOG log file is fully used, and the user can directly perceive machine faults.
  • the hardware fault analysis system 10 is shown in Figure 1, including:
  • the user configuration module 101 is configured to configure the addresses of all the machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;
  • the information collection module 102 is configured to obtain the addresses of the machines to be monitored, the storage path, and the collection period, to periodically acquire, according to the collection period, the fault log file of the machine to be monitored corresponding to each address, and to store the fault log file in the storage path;
  • the current fault prediction module 103 is configured to acquire the fault judgment condition and the fault log file in the storage path, and to perform fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.
  • the user configuration module 101 configures the collection period of the fault log file so that the information collection module 102 collects the fault log file periodically and the whole hardware fault analysis system runs periodically and automatically over a long period. Alternatively, the user configuration module 101 may leave the collection period unset, in which case fault prediction is performed only once, at the current time, on all the machines to be monitored to obtain the prediction result.
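  • The configuration and collection behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the names MonitorConfig, collect_once, run_collection, and the fetch_log callback are hypothetical, the file naming is invented, and the patent does not specify how a fault log file is transported from a monitored machine.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional
import time


@dataclass
class MonitorConfig:
    """Hypothetical container for the items set by the user configuration module."""
    addresses: list[str]                  # addresses of the machines to be monitored
    storage_path: Path                    # where collected fault log files are stored
    collection_period_s: Optional[float]  # None means: collect and predict only once


def collect_once(cfg: MonitorConfig, fetch_log: Callable[[str], bytes]) -> list[Path]:
    """Fetch one fault log file per address and store it under the storage path."""
    cfg.storage_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for addr in cfg.addresses:
        target = cfg.storage_path / f"{addr}.mcelog"  # assumed naming scheme
        target.write_bytes(fetch_log(addr))           # transport is left to the caller
        saved.append(target)
    return saved


def run_collection(cfg, fetch_log, rounds=None, sleep=time.sleep) -> int:
    """Collect periodically; stop after `rounds` rounds, or after one if no period is set."""
    done = 0
    while True:
        collect_once(cfg, fetch_log)
        done += 1
        if cfg.collection_period_s is None or (rounds is not None and done >= rounds):
            return done
        sleep(cfg.collection_period_s)
```

  • Passing the transport as a callback keeps the sketch independent of any particular remote-copy mechanism, which the source text leaves open.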
  • the fault judgment condition configured by the user configuration module 101 may be set as a fault time window for each fault and a fault threshold corresponding to each fault; accordingly, the current fault prediction module 103 acquires the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path, and then performs fault prediction on the fault log file according to the fault time windows and fault thresholds of the different faults. Specifically, the current fault prediction module 103 counts the fault information in the fault log file that corresponds to a specific fault and falls within that fault's time window; when the count value is greater than the fault threshold corresponding to the fault, the hardware corresponding to the fault is predicted to be about to fail.
  • the fault time window and the fault threshold used to determine whether hardware is failing are set to different values according to the type of hardware. For example, to determine whether a memory module is failing, the fault time window may be set to 24 hours and the fault threshold to 3; if 3 correctable errors occur on one memory module within 24 hours, the memory module is predicted to be about to fail.
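  • The window-and-threshold check can be illustrated with a minimal sketch. The function names are hypothetical. Note that the general rule in the text says the count must be greater than the threshold, while the memory module example predicts failure at exactly 3 errors with a threshold of 3; the sketch follows the example by default and exposes a strict flag for the literal rule.

```python
from datetime import datetime, timedelta


def count_in_window(error_times, now, window):
    """Count fault records whose timestamp lies within the last `window` before `now`."""
    return sum(1 for t in error_times if now - window <= t <= now)


def predict_failure(error_times, now, window, threshold, strict=False):
    """Return True if the hardware is predicted to be about to fail.

    strict=False: count >= threshold (matches the memory module example).
    strict=True:  count >  threshold (matches the general wording).
    """
    count = count_in_window(error_times, now, window)
    return count > threshold if strict else count >= threshold


# Example: three correctable errors in the last 24 hours, one older error.
now = datetime(2016, 3, 1, 12, 0)
times = [now - timedelta(hours=h) for h in (1, 5, 20, 30)]
about_to_fail = predict_failure(times, now, timedelta(hours=24), threshold=3)
```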
  • the fault judgment condition may also be set to other content according to the user's needs or the actual application. For example, it may be set to a specific time period, an interval threshold reflecting the minimum time interval between two identical faults, and a count threshold for the number of occurrences of the same fault; if the number of times a specific fault occurs within the time period exceeds the count threshold, and the time interval between adjacent occurrences is less than the interval threshold, the hardware corresponding to the fault is predicted to be about to fail.
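  • A sketch of this alternative condition follows. The text does not say whether every gap between adjacent occurrences must be below the interval threshold or only some of them; the sketch assumes every gap, and the function name and the use of plain numeric timestamps are illustrative choices.

```python
def predict_by_interval(times, period_start, period_end, count_threshold, interval_threshold):
    """Alternative fault judgment condition (assumed reading: all adjacent gaps are short).

    times: timestamps (any comparable numeric unit) of one specific fault.
    """
    in_period = sorted(t for t in times if period_start <= t <= period_end)
    if len(in_period) <= count_threshold:  # occurrences must exceed the count threshold
        return False
    gaps = [later - earlier for earlier, later in zip(in_period, in_period[1:])]
    return all(gap < interval_threshold for gap in gaps)
```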
  • this embodiment borrows the leaky bucket algorithm of the user-mode program mcelog to count the fault information. For faults that occur within the fault time window, the current fault prediction module 103 only needs to perform the counting described above and accumulate the count value; for faults that fall outside the fault time window, part of the count must be discarded. To keep the computation efficient, the count value is assumed to decay linearly over time, and the current fault prediction module 103 uses the following aging algorithm to calculate the discarded count value:
  • Attenuation coefficient = fault threshold / fault time window
  • Discarded count value = attenuation coefficient × the amount by which the fault exceeds the time window.
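  • The aging calculation can be written out as a short sketch. It reads "the amount by which the fault exceeds the time window" as the time elapsed beyond the window, which is an interpretation of the translated text rather than a confirmed detail; the function names are hypothetical, and all times share one unit (hours in the example).

```python
def attenuation_coefficient(threshold: float, window: float) -> float:
    """Count discarded per unit of time: fault threshold / fault time window."""
    return threshold / window


def discarded_count(threshold: float, window: float, overrun: float) -> float:
    """Count to discard for `overrun` time units beyond the fault time window."""
    return attenuation_coefficient(threshold, window) * max(0.0, overrun)


def aged_count(count: float, threshold: float, window: float, age: float) -> float:
    """Apply linear decay to an accumulated count whose oldest fault is `age` old."""
    overrun = age - window
    if overrun <= 0:
        return count  # everything is still inside the window, nothing is discarded
    return max(0.0, count - discarded_count(threshold, window, overrun))


# Example: threshold 3, window 24 h. A count of 2 that is 32 h old loses
# (3 / 24) * 8 = 1, leaving a count of 1.
remaining = aged_count(2.0, threshold=3, window=24, age=32)
```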
  • the hardware failure analysis system 10 further includes a presentation module 105 configured to present at least a prediction result on the interface, thereby indicating that a specific hardware is about to fail, and the prediction result may be displayed in a list or other form.
  • the hardware fault analysis system 10 in this embodiment further includes a clearing module 106 configured to clear at least one prediction result presented by the presentation module 105; the prediction result may be cleared manually. The clearing module 106 may further convert the fault information in the fault log file corresponding to the cleared prediction result into historical fault information. This avoids presenting the user with too much hardware fault information and makes it easier for the user to maintain the machines.
  • the user configuration module 101 is further configured to configure the historical fault information processing parameter.
  • the hardware fault analysis system 10 further includes a historical fault information processing module 104 configured to process historical fault information according to historical fault information processing parameters to obtain a logical relationship between the faults.
  • the historical fault information processing parameter configured by the user configuration module 101 may include frequent episode rule mining parameters; the historical fault information processing module 104 is specifically configured to read the frequent episode rule mining parameters, process the historical fault information according to them, and mine the frequent episode rules between the faults.
  • a frequent episode rule expresses the relationship between faults in time: for example, if fault set A occurs within a time period (the sliding time window), another fault set B may also occur within that sliding time window.
  • a frequent episode rule can be measured by two indicators, support and confidence. Support is the probability that fault set A and fault set B both appear in the same time window, taken over all time windows; confidence is the probability that fault set B also appears, taken over all time windows in which fault set A appears.
  • the above-mentioned frequent episode rules are similar to the relationship between the root cause alarm and the derived alarm in the communication system (alarm association rule).
  • the WINEPI algorithm is widely used for this purpose. It uses a preset sliding time window, sliding step size, minimum support, minimum confidence, and other parameters to calculate how closely events are adjacent within the time window and to find the partial order relationship between events in time.
  • this embodiment can apply the WINEPI algorithm to frequent episode rule mining on the fault log file:
  • each record of the MCELOG log is structured data, including the generation time, MCE GSTATUS, MCE BANK, BANK STATUS, etc., so an MCELOG record can be used as an event;
  • the sets of MCE BANK values, BANK STATUS values, and the like are defined as the respective attribute domains;
  • the set of MCELOG records on one machine, ordered in time, can be regarded as the event sequence to be analyzed.
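  • The mapping from MCELOG records to an event sequence could be represented as below. This is a sketch only: the MceEvent class, its field names, and the dictionary input format are assumptions made for illustration, and the code does not parse real mcelog output.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MceEvent:
    """One MCELOG record treated as an event (field names are hypothetical)."""
    time: datetime    # generation time of the record
    gstatus: int      # MCE GSTATUS
    bank: int         # MCE BANK
    bank_status: int  # BANK STATUS

    @property
    def event_type(self) -> tuple[int, int]:
        # Assumed choice: identify the kind of fault by (bank, bank status).
        return (self.bank, self.bank_status)


def to_event_sequence(records: list[dict]) -> list[MceEvent]:
    """Turn already-parsed records of one machine into a time-ordered event sequence."""
    events = [MceEvent(r["time"], r["gstatus"], r["bank"], r["bank_status"]) for r in records]
    return sorted(events, key=lambda e: e.time)


def attribute_domains(events: list[MceEvent]) -> dict[str, set[int]]:
    """Collect the observed values of each attribute as its attribute domain."""
    return {
        "bank": {e.bank for e in events},
        "bank_status": {e.bank_status for e in events},
    }
```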
  • the frequent episode rule mining parameters configured by the user configuration module 101 may include a sliding time window, a sliding step size, a support threshold, and a confidence threshold. The historical fault information processing module 104 is specifically configured to compute, according to the sliding time window and the sliding step size, the support and confidence between the faults in the historical fault information, and to determine the frequent episode rules between faults that exceed the support threshold or the confidence threshold.
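  • Support and confidence over sliding windows can be sketched as follows. This is a simplified co-occurrence version restricted to rules between two single fault types, not a full WINEPI implementation; the function names are hypothetical, and windows here start at the first event, which is a simplification. By default the filter follows the text literally (support or confidence above its threshold); conventional association-rule mining usually requires both, which the require_both flag selects.

```python
from itertools import permutations


def sliding_windows(events, width, step):
    """events: time-sorted list of (time, fault_type). Returns the set of types per window."""
    if not events:
        return []
    t, end = events[0][0], events[-1][0]
    windows = []
    while t <= end:
        windows.append({typ for ts, typ in events if t <= ts < t + width})
        t += step
    return windows


def mine_rules(events, width, step, min_support, min_confidence, require_both=False):
    """Return (a, b, support, confidence) for rules 'a in window => b in same window'."""
    windows = sliding_windows(events, width, step)
    total = len(windows)
    types = sorted({typ for _, typ in events})
    rules = []
    for a, b in permutations(types, 2):
        n_a = sum(1 for w in windows if a in w)
        n_ab = sum(1 for w in windows if a in w and b in w)
        if n_a == 0:
            continue
        support = n_ab / total    # share of all windows containing both a and b
        confidence = n_ab / n_a   # share of windows with a that also contain b
        ok_s, ok_c = support > min_support, confidence > min_confidence
        if (ok_s and ok_c) if require_both else (ok_s or ok_c):
            rules.append((a, b, support, confidence))
    return rules


events = [(0, "A"), (1, "B"), (10, "A"), (11, "B"), (20, "A"), (30, "C")]
rules = mine_rules(events, width=5, step=5, min_support=0.25, min_confidence=0.6)
```

  • In this example, fault B follows fault A in two of the three windows that contain A, so the rule A => B has a confidence of 2/3.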
  • the historical fault information processing parameters configured by the user configuration module 101 may further include a statistical condition, which may include a statistical dimension and a statistical time period; the historical fault information processing module 104 may be further configured to classify, count, and sort the historical fault information according to the statistical dimension and the statistical time period to obtain statistical results.
  • the statistical dimension and the statistical time period may be set according to the actual needs of the user.
  • the statistical dimension may include the faulty hardware and the fault type. The historical fault information processing module 104 may use a TopN algorithm to sort the historical fault information and obtain the top-ranked faulty hardware and fault types, namely the TopN faulty hardware and the TopN fault types, which makes it easier for hardware vendors to address the TopN faults. The statistical time period can be set, according to actual conditions, to a period in which machine faults occur more frequently.
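  • The classify, count, and sort step with a TopN cut can be sketched with the standard library. The record layout (dictionaries with time, hardware, and fault_type keys) and the function name top_n are assumptions made for illustration.

```python
from collections import Counter


def top_n(records, dimension, start, end, n):
    """Count historical fault records per value of `dimension` within [start, end]
    and return the n most frequent values with their counts."""
    counts = Counter(r[dimension] for r in records if start <= r["time"] <= end)
    return counts.most_common(n)


history = [
    {"time": 1, "hardware": "DIMM0", "fault_type": "correctable memory error"},
    {"time": 2, "hardware": "DIMM0", "fault_type": "correctable memory error"},
    {"time": 3, "hardware": "CPU1", "fault_type": "cache error"},
    {"time": 50, "hardware": "DIMM3", "fault_type": "correctable memory error"},
]
top_hardware = top_n(history, "hardware", start=0, end=10, n=1)
top_types = top_n(history, "fault_type", start=0, end=100, n=1)
```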
  • the presentation module 105 may also be configured to present the statistical results and the frequent episode rules on the interface. Preferably, the presentation module 105 presents the statistical results by multiple dimensions (faulty hardware, fault type); more preferably, it presents the TopN faulty hardware and fault types in the statistical results. Preferably, the presentation module 105 presents the statistical results and the frequent episode rules on the interface as lists, pie charts, histograms, or other forms.
  • the beneficial effects of this embodiment are as follows:
  • this embodiment provides a hardware fault analysis system that can periodically obtain, in batches, the fault log files of the machines to be monitored and then predict from those files the faults that may occur on the machines and which hardware may fail, so that multiple machines can be monitored uniformly over a long period and kept in good working condition. On this basis, the system of this embodiment further includes a presentation module, which presents the prediction result to the user intuitively, warns the user of possible failures, and helps the user quickly and accurately understand the state of the monitored machines and find hidden problems. This embodiment can also process the historical fault information to obtain more meaningful information: multi-dimensional statistics of the historical fault information reveal the fault types that different types of hardware on the machine are prone to, and frequent episode rule mining of the historical fault information reveals the relationships between faults. This information can help improve the hardware.
  • Embodiment 2:
  • the embodiment provides a hardware failure analysis method, including:
  • S201 Configure an address of all the machines to be monitored, a storage path of the fault log file, a collection period of the fault log file, and a fault judgment condition.
  • S202 Obtain the addresses of the machines to be monitored, the storage path, and the collection period; periodically obtain, according to the collection period, the fault log file of the machine to be monitored corresponding to each address, and store the fault log file in the storage path.
  • S203 Acquire the fault judgment condition and the fault log file in the storage path, perform fault prediction processing on the fault log file according to the fault judgment condition, and obtain a prediction result.
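  • the fault judgment condition configured in S201 includes a fault time window for each fault and a fault threshold corresponding to each fault; correspondingly, S203 includes acquiring the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path, and counting the fault information in the fault log file that falls within the fault time window of each fault; when the count value is greater than the fault threshold corresponding to the fault, the hardware corresponding to the fault is predicted to be about to fail.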
  • the fault time window and the fault threshold used to determine whether hardware is failing are set to different values according to the type of hardware. For example, to determine whether a memory module is failing, the fault time window may be set to 24 hours and the fault threshold to 3; if 3 correctable errors occur on one memory module within 24 hours, the memory module is predicted to be about to fail.
  • the process in S203 of predicting which hardware is about to fail is a real-time analysis process.
  • this embodiment borrows the leaky bucket algorithm of the user-mode program mcelog to count the fault information. For faults that occur within the fault time window, only counting is required and the count value is accumulated; for faults that fall outside the fault time window, part of the count is discarded. This embodiment uses the following aging algorithm to calculate the discarded count value:
  • Attenuation coefficient = fault threshold / fault time window
  • Discarded count value = attenuation coefficient × the amount by which the fault exceeds the time window.
  • the fault information used in the fault prediction process in S203 is generally the new fault information in the currently collected fault log file, that is, fault information that has not previously been used to predict faulty hardware.
  • the hardware fault analysis method may further include presenting at least the prediction result on the interface, thereby indicating that a specific piece of hardware is about to fail; the prediction result may be displayed as a list or in another form.
  • the hardware fault analysis method in this embodiment further includes clearing at least one of the prediction results presented above, preferably manually; when the user selects a prediction result to be cleared, the fault information in the fault log file corresponding to that prediction result is converted into historical fault information.
  • the hardware fault analysis method further includes configuring the historical fault information processing parameters and processing the historical fault information according to them to obtain the logical relationships between the faults.
  • the above-mentioned frequent episode rules are similar to the relationship between the root cause alarm and the derived alarm in the communication system (alarm association rule).
  • the WINEPI algorithm is widely used at present. The algorithm uses the preset sliding time window, sliding step size, minimum support degree, minimum confidence and other parameters to calculate the neighboring degree of the event in the time window and find the partial order relationship between the events in time.
  • This embodiment can apply the WINEPI algorithm to frequent episode rule mining on the fault log file:
  • Each record of the MCELOG log is structured data, including the generation time, MCE GSTATUS, MCE BANK, BANK STATUS, and so on, so one MCELOG record can be treated as one event;
  • the sets of values of MCE BANK, BANK STATUS, and the like are defined as the respective attribute domains;
  • the MCELOG records on one machine, ordered by time, can be regarded as the event sequence to be analyzed.
  • The frequent episode rule mining parameters configured above specifically include a sliding time window, a sliding step size, a support threshold, and a confidence threshold. Processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between the faults includes: counting the support and confidence between the historical faults in the historical fault information according to the sliding time window and the sliding step size, and determining the frequent episode rules between faults whose values are greater than the support threshold or the confidence threshold.
  • The frequent episode rules may also be mined with other feasible algorithms.
  • The configured historical fault information processing parameters further include a statistical condition, which may include a statistical dimension and a statistical time period. Processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between the historical faults includes: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.
  • The statistical dimension and the statistical time period may be set according to the actual needs of the user.
  • The statistical dimension may include faulty hardware and fault type, and the statistical time period may be set, according to actual conditions, to a period in which machine faults occur more frequently.
  • The historical fault information may be sorted with a TopN algorithm to obtain the TopN faulty hardware and the TopN fault types, which helps hardware manufacturers address the TopN faults.
  • This embodiment may also present the statistical results and the frequent episode rules on the interface; preferably, the statistical results are presented in multiple dimensions.
  • The statistical results presented may be the TopN faulty hardware and the TopN fault types described above. They can be presented in several ways, for example as a list, a pie chart, a bar chart, or another form.
  • This embodiment provides a hardware fault analysis method. Unlike the prior art, in which fault prediction can be performed only on a single machine using that machine's own faults, this method performs long-term fault prediction on the machines to be monitored and obtains their long-term running state.
  • The test result can be presented on the interface.
  • The user can also be allowed to delete a presented test result, which improves the user experience.
  • The fault information corresponding to a deleted test result is not discarded but is converted into historical fault information.
  • This embodiment also processes the historical fault information to obtain at least multi-dimensional statistical results and the frequent episode rules of the faulty components.
  • From the multi-dimensional statistical results and the frequent episode rules of the faulty components, the hardware manufacturer can analyze the hardware from multiple angles, find its shortcomings, and improve it to obtain better-quality hardware.
  • Embodiment 3:
  • This embodiment further provides a hardware failure analysis method based on MCELOG.
  • Before the hardware failure analysis method of this embodiment predicts hardware failures of the machines, configuration is required first.
  • The specific content and meaning of the configuration items are shown in Table 1.
  • Table 1 is the configuration information table of the hardware fault prediction method.
  • Table 2 below lists the display items, including the content and meaning of each item to be displayed on the interface.
  • S302 Read configuration item 14, perform fault prediction processing on the MCELOG log file obtained in S301, and generate alarm information;
  • S303 Present display item 21 in Table 2, that is, the alarm information from S302, on the interface;
  • S304 Clear at least one piece of alarm information presented on the interface;
  • S305 Read configuration item 15, select the fault information to be mined, and perform frequent episode rule mining;
  • S306 Read configuration item 16, select the fault information to be mined, and perform multi-dimensional statistics on the fault information.
  • S301 includes the following steps:
  • S3011 Read and parse the addresses of all hosts to be monitored, the storage path of the MCELOG log file, and the collection period of the fault log file;
  • S3012 Locate each host by its address, obtain the MCELOG log file from the host according to the collection period, and store it in the configured storage path;
  • S3013 Determine whether the MCELOG log files of all hosts to be monitored have been obtained for the current period; if so, go to S3014; otherwise, return to S3012 and continue obtaining the MCELOG log files of the hosts to be monitored;
  • S302 includes the following steps:
  • S3021 Read and parse the fault time window and the fault threshold corresponding to a specific piece of hardware;
  • S3023 Determine whether the fault information in the MCELOG log file falls within the fault time window; if so, go to S3024; otherwise, go to S3025;
  • S3025 Reduce the count value according to the aging algorithm, then go to S3026;
  • S3026 Determine whether the count value has reached the fault threshold; if so, go to S3027; otherwise, return to S3022;
  • S304 includes the following steps:
  • S305 includes the following steps:
  • S3051 The user selects the historical fault information to be mined;
  • S3053 Read the historical fault information to be mined, run the WINEPI algorithm, and mine the frequent episode rules between faults;
  • S306 is described in detail with reference to FIG. 8, and S306 includes the following steps:
  • S3061 Read and parse a statistical time period and a statistical dimension, where the statistical dimension includes fault hardware and a fault type;
  • S3062 Acquire historical fault information, classify, count, and sort historical fault information according to the statistical dimension and the statistical time period;
  • The processing of the historical fault information by the historical fault information processing module may include: classifying, counting, and sorting the historical fault information according to the statistical dimension and statistical time period configured by the user configuration module to obtain a statistical result; and processing the historical fault information according to the frequent episode rule mining parameters configured by the user configuration module to mine the frequent episode rules between faults. This processing makes full use of the historical fault information and yields the statistical result and the frequent episode rules between hardware faults. The statistical result and the frequent episode rules reflect which faults the hardware is prone to during long-term operation on the machine and how those faults relate to one another, which exposes the hardware's deficiencies and defects and helps hardware vendors improve the hardware accordingly.
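
The leaky bucket counting with aging described in the excerpts above can be illustrated with a minimal Python sketch. The class name `AgingCounter`, the use of seconds as the time unit, the decision to slide the window forward after aging, and the decision to still count the out-of-window fault are all assumptions made for illustration; the source only gives the two aging formulas and the step order S3023 to S3026.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgingCounter:
    """Leaky-bucket style fault counter for one piece of hardware (illustrative)."""
    window: float       # fault time window, in seconds
    threshold: float    # fault judgment threshold, in faults
    count: float = 0.0
    window_start: Optional[float] = None  # start of the current time window

    def record(self, t: float) -> bool:
        """Record a fault at time t; return True if the threshold is reached."""
        if self.window_start is None:
            self.window_start = t
        overshoot = t - (self.window_start + self.window)
        if overshoot > 0:
            # discarded count = attenuation coefficient * overshoot,
            # where attenuation coefficient = threshold / window
            discarded = self.threshold * overshoot / self.window
            self.count = max(0.0, self.count - discarded)
            # Assumption: slide the window forward so that it ends at t.
            self.window_start += overshoot
        # Assumption: the new fault itself is still counted after aging.
        self.count += 1
        return self.count >= self.threshold


counter = AgingCounter(window=24 * 3600, threshold=3)
flags = [counter.record(t) for t in (0, 3600, 7200)]  # three faults within 2 hours
```

With a 24-hour window and a threshold of 3, a fault arriving 8 hours after the window has closed discards 3 × 8 / 24 = 1 from the count before being counted itself.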

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A hardware fault analysis system and method. The hardware fault analysis system comprises: a user configuration module (101), configured to configure addresses of all machines to be monitored, a storage path of a fault log file, a collection period of the fault log file and a fault judgement condition; an information collection module (102), configured to acquire the addresses of the machines to be monitored, the storage path and the collection period, periodically acquire the fault log file of the machines to be monitored corresponding to the addresses according to the collection period and store the fault log file in the storage path; and a current fault prediction module (103), configured to acquire the fault judgement condition and the fault log file in the storage path and perform fault prediction processing on the fault log file according to the fault judgement condition to obtain a prediction result. By means of the system and method, long-term, batch prediction of hardware faults for all machines in a machine room can be achieved.

Description

Hardware fault analysis system and method
Technical Field
The present invention relates to the field of computer applications, and in particular to a hardware fault analysis system and method.
Background
At present, as cloud computing develops further and grows more complex, the data center machine room, as the foundation of cloud computing, is under ever-increasing pressure. To keep the machines in the machine room running normally and to provide users with reliable, good service, the prior art reports errors generated by a machine's hardware to the BIOS (Basic Input Output System) through an SMI (System Management Interrupt); after a series of processing steps, the BIOS reports them to the operating system kernel through an NMI (Non-Maskable Interrupt). The operating system handles them in the MCE (machine check exception) interrupt handler, reads information such as the CPU's exception information registers, and saves it to the ring buffer of the /dev/mcelog character device. The user-mode program mcelog polls the /dev/mcelog character device, parses the register contents, and records them in the MCELOG log file. By analyzing the mcelog exception information, the user-mode program mcelog can implement PFA (Predictive Failure Analysis).
However, this technique has many defects. The user-mode program MCELOG can run only on each individual machine and can predict only that machine's faults; it cannot predict the hardware faults of all machines in the machine room in batches. To learn the fault information of all machines in the machine room, one must run the user-mode program MCELOG on every machine to make predictions and then view the fault information on every machine, which undoubtedly increases working time and workload. Second, the fault information parsed by the user-mode program MCELOG is only recorded in the MCELOG log file in the background; the user cannot perceive it directly, so the user experience is poor. Moreover, once the MCELOG log file is full, old fault information is discarded without being fully used, which wastes storage resources and MCELOG log file resources, and the MCELOG log file is not used to help keep the machine running normally.
Summary of the Invention
The main technical problem to be solved by the embodiments of the present invention is to provide a hardware fault analysis system and method that solve the problems in prior-art hardware fault analysis: the hardware faults of all machines in a machine room cannot be predicted in batches over the long term, the working time is long, and the workload is heavy.
To solve the above technical problem, an embodiment of the present invention provides a hardware fault analysis system, including:
a user configuration module, configured to configure the addresses of all machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;
an information collection module, configured to obtain the addresses of the machines to be monitored, the storage path, and the collection period, to periodically acquire, according to the collection period, the fault log file of the machine to be monitored corresponding to each address, and to store the fault log file in the storage path;
a current fault prediction module, configured to obtain the fault judgment condition and the fault log file in the storage path, and to perform fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.
In an embodiment of the present invention, the fault judgment condition configured by the user configuration module includes a fault time window for each fault and a fault threshold corresponding to each fault; the current fault prediction module is specifically configured to obtain the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path, and to count the fault information in the fault log file that falls within the fault time window of each fault; when the count value is greater than the fault threshold corresponding to a fault, the hardware corresponding to that fault is predicted to be about to fail.
In an embodiment of the present invention, the system further includes a result presentation module, configured to present at least the prediction result on an interface.
In an embodiment of the present invention, the system further includes a clearing module, configured to clear at least one prediction result presented by the result presentation module and to convert the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.
In an embodiment of the present invention, the user configuration module is further configured to configure historical fault information processing parameters; the hardware fault analysis system further includes a historical fault information processing module, configured to process the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between the faults.
In an embodiment of the present invention, the historical fault information processing parameters configured by the user configuration module include frequent episode rule mining parameters; the historical fault information processing module is specifically configured to read the frequent episode rule mining parameters and to process the historical fault information according to them to mine the frequent episode rules between the faults.
In an embodiment of the present invention, the frequent episode rule mining parameters configured by the user configuration module specifically include a sliding time window, a sliding step size, a support threshold, and a confidence threshold; the historical fault information processing module is specifically configured to count the support and confidence between the faults in the historical fault information according to the sliding time window and the sliding step size, and to determine the frequent episode rules between faults whose values are greater than the support threshold or the confidence threshold.
In an embodiment of the present invention, the historical fault information processing parameters configured by the user configuration module include a statistical condition, and the statistical condition includes a statistical dimension and a statistical time period; the historical fault information processing module is specifically configured to classify, count, and sort the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.
An embodiment of the present invention further provides a hardware fault analysis method, including:
configuring the addresses of all machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;
obtaining the addresses of the machines to be monitored, the storage path, and the collection period, periodically acquiring, according to the collection period, the fault log file of the machine to be monitored corresponding to each address, and storing the fault log file in the storage path;
obtaining the fault judgment condition and the fault log file in the storage path, and performing fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.
In an embodiment of the present invention, configuring the fault judgment condition includes configuring a fault time window for each fault and a fault threshold corresponding to each fault. Obtaining the fault judgment condition and the fault log file in the storage path and performing fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result includes: obtaining the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path; and counting the fault information in the fault log file that falls within the fault time window of each fault; when the count value is greater than the fault threshold corresponding to a fault, the hardware corresponding to that fault is predicted to be about to fail.
In an embodiment of the present invention, after the test result is obtained, the method further includes presenting at least the prediction result on an interface.
In an embodiment of the present invention, after the prediction result is presented, the method further includes clearing at least one presented prediction result and converting the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.
In an embodiment of the present invention, the method further includes configuring historical fault information processing parameters and processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between the faults.
In an embodiment of the present invention, configuring the historical fault information processing parameters includes configuring frequent episode rule mining parameters. Processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between the faults includes: reading the frequent episode rule mining parameters and processing the historical fault information according to them to mine the frequent episode rules between the faults.
In an embodiment of the present invention, the configured frequent episode rule mining parameters include a sliding time window, a sliding step size, a support threshold, and a confidence threshold. Processing the historical fault information according to the frequent episode rule mining parameters to mine the frequent episode rules between the faults includes: counting the support and confidence between the faults in the historical fault information according to the sliding time window and the sliding step size, and determining the frequent episode rules between faults whose values are greater than the support threshold or the confidence threshold.
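
The sliding-window counting of support and confidence can be sketched as follows. This is a simplified, WINEPI-style illustration restricted to two-event rules of the form "A is followed by B", not the full algorithm and not necessarily the patent's implementation. The function names, the event representation as `(time, fault_type)` pairs, and the window boundaries are assumptions. The sketch also requires both thresholds to be met, which is the conventional choice, whereas the text above says "support threshold or confidence threshold".

```python
from itertools import permutations


def sliding_windows(events, width, step):
    """Yield, for each window, the fault types it contains in time order.

    events: list of (time, fault_type) pairs sorted by time.
    """
    if not events:
        return
    first, last = events[0][0], events[-1][0]
    start = first - width + step  # earliest window that can still hold the first event
    while start <= last:
        yield [kind for t, kind in events if start <= t < start + width]
        start += step


def mine_rules(events, width, step, min_support, min_confidence):
    """Return (a, b, support, confidence) for rules 'a is followed by b'."""
    wins = list(sliding_windows(events, width, step))
    total = len(wins)
    kinds = sorted({kind for _, kind in events})
    rules = []
    for a, b in permutations(kinds, 2):
        with_a = sum(1 for w in wins if a in w)
        # windows in which some occurrence of a precedes some occurrence of b
        a_then_b = sum(1 for w in wins if a in w and b in w[w.index(a) + 1:])
        if with_a == 0:
            continue
        support = a_then_b / total
        confidence = a_then_b / with_a
        # Assumption: both thresholds must be met.
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical fault types "A" and "B"; times are in arbitrary units.
history = [(0, "A"), (1, "B"), (10, "A"), (11, "B"), (20, "A"), (21, "B")]
rules = mine_rules(history, width=5, step=1, min_support=0.4, min_confidence=0.7)
```

In this toy history every "A" is followed one time unit later by a "B", so the only rule that passes both thresholds is "A is followed by B".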
In an embodiment of the present invention, the configured historical fault information processing parameters include a statistical condition, and the statistical condition includes a statistical dimension and a statistical time period. Processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between the historical faults includes: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.
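
The classify, count, and sort step can be sketched with the standard library's `collections.Counter`. The record layout (dictionaries with `time`, `hardware`, and `fault_type` keys), the function name `top_n`, and the sample values are hypothetical; the source names only the two statistical dimensions (faulty hardware and fault type) and the statistical time period.

```python
from collections import Counter


def top_n(records, dimension, period, n):
    """Rank the values of one statistical dimension within a time period.

    records:   iterable of dicts with a "time" key plus one key per dimension
    dimension: key to classify by, e.g. "hardware" or "fault_type"
    period:    (start, end) pair; records with start <= time < end are counted
    """
    start, end = period
    counts = Counter(r[dimension] for r in records if start <= r["time"] < end)
    return counts.most_common(n)  # sorted by descending count


# Hypothetical historical fault records.
history = [
    {"time": 1, "hardware": "DIMM_A1", "fault_type": "corrected memory error"},
    {"time": 2, "hardware": "DIMM_A1", "fault_type": "corrected memory error"},
    {"time": 3, "hardware": "CPU0", "fault_type": "cache error"},
    {"time": 99, "hardware": "CPU0", "fault_type": "cache error"},
]
worst_hardware = top_n(history, "hardware", period=(0, 10), n=1)
```

Calling the same function with `"fault_type"` as the dimension gives the ranking by fault type, so one helper covers both dimensions.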
An embodiment of the present invention thus provides a hardware fault analysis system and method. With the hardware fault analysis system of the present invention, the user configuration module configures the addresses of all machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition; the information collection module obtains the addresses of the machines to be monitored, the storage path, and the collection period, periodically acquires, according to the collection period, the fault log file of the machine to be monitored corresponding to each address, and stores the fault log file in the storage path; and the current fault prediction module obtains the fault judgment condition and the fault log file in the storage path and performs fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result. By obtaining the addresses of all machines to be monitored, the system locates all of them and obtains their fault log files in batches, so fault prediction can be performed on all machines to be monitored in the machine room at once. Periodically collecting the fault log files according to the collection period lets the system automatically perform long-term fault prediction on the monitored machines at the same time and obtain the fault information of all machines. This achieves long-term, batch prediction of hardware faults for all machines in the machine room and provides a reference for hardware replacement: the user can replace the predicted faulty hardware in one pass based on the fault information, which greatly saves time, reduces workload, and keeps the monitored machines running normally over the long term.
In an embodiment of the present invention, the hardware fault analysis system further includes a historical fault information processing module, which can process the historical fault information according to the historical fault information processing parameters configured by the user configuration module to obtain the logical relationships between the historical faults, so that the fault log file is fully used.
In an embodiment of the present invention, the processing of the historical fault information by the historical fault information processing module may include: classifying, counting, and sorting the historical fault information according to the statistical dimension and statistical time period configured by the user configuration module to obtain a statistical result; and processing the historical fault information according to the frequent episode rule mining parameters configured by the user configuration module to mine the frequent episode rules between the faults. This processing makes full use of the historical fault information and yields the statistical result and the frequent episode rules between hardware faults. The statistical result and the frequent episode rules reflect which faults the hardware is prone to during long-term operation on the machine and how those faults relate to one another, which exposes the hardware's deficiencies and defects and helps hardware vendors improve the hardware accordingly.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a hardware fault analysis system according to Embodiment 1 of the present invention;
FIG. 2 is a schematic flowchart of a hardware fault analysis method according to Embodiment 2 of the present invention;
FIG. 3 is a schematic flowchart of another hardware fault analysis method according to Embodiment 3 of the present invention;
FIG. 4 is a schematic flowchart of the process of acquiring the MCELOG log file in FIG. 3;
FIG. 5 is a schematic flowchart of the process of performing fault prediction in FIG. 3;
FIG. 6 is a schematic flowchart of the process of clearing alarm information in FIG. 3;
FIG. 7 is a schematic flowchart of the process of mining frequent episode rules in FIG. 3;
FIG. 8 is a schematic flowchart of the process of performing multi-dimensional statistics in FIG. 3.
Detailed Description
The present invention is described in further detail below through specific embodiments with reference to the accompanying drawings.
Embodiment 1:
To solve the problems in existing hardware fault analysis, namely that the hardware faults of all machines in a machine room cannot be predicted in batches, that users cannot directly perceive machine faults, and that the MCELOG log file is not fully used, this embodiment provides a hardware fault analysis system. The system can periodically predict the hardware faults of all machines in the machine room in batches so that the machines run normally over the long term, while reducing the working time and workload of fault analysis, making full use of the MCELOG log file, and letting users perceive machine faults directly. Referring to FIG. 1, the hardware fault analysis system 10 includes:
a user configuration module 101, configured to configure the addresses of all machines to be monitored, the storage path of the fault log file, the collection period of the fault log file, and the fault judgment condition;
an information collection module 102, configured to obtain the addresses of the machines to be monitored, the storage path, and the collection period, to periodically acquire, according to the collection period, the fault log file of the machine to be monitored corresponding to each address, and to store the fault log file in the storage path;
a current fault prediction module 103, configured to obtain the fault judgment condition and the fault log file in the storage path, and to perform fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.
The user configuration module 101 configures the collection period of the fault log file so that the information collection module can collect the fault log file periodically and the whole hardware fault analysis system can run automatically, periodically, and over the long term. The user configuration module 101 may also leave the collection period unset, in which case fault prediction is performed once, at the current moment, on all machines to be monitored to obtain a prediction result.
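
The interplay between the configured addresses, storage path, and optional collection period can be sketched as below. The source does not say how a log file is transferred from a monitored machine, so the transfer is abstracted as a caller-supplied `fetch` function. `fetch`, the function names, the `rounds` parameter, and the `<host>.mcelog` file naming are all assumptions for illustration.

```python
import time
from pathlib import Path


def collect_once(hosts, storage_path, fetch):
    """Fetch the fault log of every host and store it under storage_path."""
    storage = Path(storage_path)
    storage.mkdir(parents=True, exist_ok=True)
    saved = []
    for host in hosts:
        target = storage / f"{host}.mcelog"  # hypothetical naming scheme
        target.write_text(fetch(host), encoding="utf-8")
        saved.append(target)
    return saved


def collect_periodically(hosts, storage_path, fetch, period_s=None, rounds=1):
    """Collect logs every period_s seconds, or only once if no period is set."""
    saved = []
    for i in range(rounds):
        saved = collect_once(hosts, storage_path, fetch)
        if period_s is None:
            break  # no collection period configured: a single collection
        if i + 1 < rounds:
            time.sleep(period_s)
    return saved
```

Passing `period_s=None` corresponds to the case in which no collection period is configured and the machines are checked only once.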
优选地,为了提高判断的准确性,用户配置模块101配置的故障判断条件可以设置为各故障的故障时间窗和各故障对应的故障门限值;相应地,当前故障预测模块103就要获取各故障的故障时间窗、各故障对应的故障门限值和存放路径中的故障日志文件;然后根据不同故障的故障时间窗和故障门限值对故障日志文件进行故障预测,具体的故障预测过程是前故障预测模块13对在某个具体故障的故障时间窗内,存在的与该具体故障对应的故障日志文件中的故障信息进行计数统计,当计数值大于该故障对应的故障门限值时,预测该故障对应的硬件即将失效。Preferably, in order to improve the accuracy of the determination, the fault determination condition configured by the user configuration module 101 may be set as the fault time window of each fault and the fault threshold corresponding to each fault; accordingly, the current fault prediction module 103 acquires each The fault time window of the fault, the fault threshold corresponding to each fault, and the fault log file in the storage path; then fault fault file is predicted according to the fault time window and the fault threshold of different faults, and the specific fault prediction process is The front fault prediction module 13 counts and counts the fault information in the fault log file corresponding to the specific fault in the fault time window of a specific fault, and when the count value is greater than the fault threshold corresponding to the fault, It is predicted that the hardware corresponding to the fault will soon be invalid.
优选地,上述的配置的用于判断硬件是否失效的各故障的故障时间窗和各故障对应的故障门限值可以根据各硬件的种类设置成不同的值,例如,将判断内存条是否失效的故障时间窗设置为24小时、故障门限值设置为3,则如果在24小时内,一根内存条上产生3次可纠正的错误,就预测该内存条即将失效。Preferably, the fault time window of each fault for determining whether the hardware fails or the fault threshold corresponding to each fault is set to a different value according to the type of each hardware, for example, determining whether the memory stick is invalid. If the fault time window is set to 24 hours and the fault threshold is set to 3, if there are 3 correctable errors on one memory stick within 24 hours, it is predicted that the memory stick is about to expire.
当然,上述的故障判断条件可以根据用户的需求或者实际的应用情况来设置为其他内容,比如可以将故障判断条件设置为一个具体的时间段、反映两个相同的故障的最小时间间隔的间隔门限值和反映相同的故障的发生次数的次数门限值,如果某个具体的故障在该时间段内发生的次数超过了次数门限值,且在时间上相邻的故障间的时间间隔小于间隔门限值,则预测该故障对应的硬件即将失效。Certainly, the foregoing fault determination condition may be set to other content according to the user's needs or actual application conditions, for example, the fault determination condition may be set to a specific time period, and the interval gate reflecting the minimum time interval of two identical faults. The limit value and the threshold value of the number of occurrences of the same fault, if the number of times a specific fault occurs within the time period exceeds the threshold value, and the time interval between adjacent faults is less than If the interval threshold is used, the hardware corresponding to the fault is predicted to be invalid.
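The alternative condition in the preceding paragraph can be illustrated with a minimal Python sketch. The function name, the parameter names, and the use of plain numeric timestamps are assumptions made for illustration only. The sketch also assumes that the interval condition must hold for every pair of temporally adjacent faults in the period, which the description does not state explicitly.

```python
def predicts_failure(timestamps, period_start, period_end,
                     count_threshold, interval_threshold):
    """Return True if the alternative fault condition is met.

    timestamps: occurrence times of one specific fault (any numeric unit).
    Assumption: every gap between adjacent faults must be below the
    interval threshold; the source text leaves this open.
    """
    # Keep only the faults that fall inside the configured time period.
    in_period = sorted(t for t in timestamps
                       if period_start <= t <= period_end)
    # The occurrence count must exceed the count threshold.
    if len(in_period) <= count_threshold:
        return False
    # Gaps between temporally adjacent faults.
    gaps = [later - earlier
            for earlier, later in zip(in_period, in_period[1:])]
    return all(gap < interval_threshold for gap in gaps)
```

For instance, `predicts_failure([0, 10, 20, 30], 0, 100, 3, 15)` satisfies both conditions, whereas moving the last fault to time 60 breaks the interval condition.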
Because the prediction performed by the current fault prediction module 103 is a real-time analysis process, and in order to produce results promptly while occupying as little memory, CPU, and other resources as possible, this embodiment borrows the leaky bucket algorithm from the MCELOG user-space program to count fault information. For faults that occur within the fault time window, the current fault prediction module 103 only needs to perform the counting described above and accumulate the count. For faults beyond the fault time window, part of the count must be discarded. The discarded count is calculated as follows. For efficiency, we assume that the count decays linearly over time, and the current fault prediction module 103 uses the following aging algorithm to calculate the discarded count:
decay coefficient = fault threshold / fault time window,
discarded count = decay coefficient * extent to which the fault exceeds the time window.
It should be understood that the above aging algorithm rests on our assumption that the count decays linearly over time; under a different assumption, other feasible algorithms may be used to calculate the discarded count.
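The counting and aging described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment or of the MCELOG program: the class and field names are hypothetical, the "extent to which the fault exceeds the time window" is taken to be the elapsed time past the end of the window, the window is restarted at the aging fault, and the new fault is counted after aging.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeakyBucket:
    """Fault counter for one hardware component with linear aging."""
    threshold: float                      # fault threshold, e.g. 3
    window: float                         # fault time window in seconds
    count: float = 0.0
    window_start: Optional[float] = None  # start of the current window

    def record(self, timestamp: float) -> bool:
        """Account for one fault; return True if failure is predicted."""
        if self.window_start is None:
            self.window_start = timestamp
        # Time by which this fault lies past the end of the window.
        excess = timestamp - (self.window_start + self.window)
        if excess > 0:
            decay = self.threshold / self.window        # decay coefficient
            discarded = decay * excess                  # discarded count
            self.count = max(0.0, self.count - discarded)
            self.window_start = timestamp  # assumption: restart the window
        self.count += 1                    # assumption: count after aging
        return self.count > self.threshold
```

With a threshold of 3 and a 24-hour window, two faults in the first two hours followed by a fault 36 hours after the first give an excess of 12 hours, a discarded count of 1.5, and a resulting count of 1.5, so no failure is predicted.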
Preferably, the hardware fault analysis system 10 may further include a presentation module 105, configured to present at least the prediction result on an interface to indicate that a specific piece of hardware is about to fail. The prediction result may be displayed as a list or in another form.
Preferably, the hardware fault analysis system 10 of this embodiment further includes a clearing module 106, configured to clear at least one test result presented by the presentation module 105. Preferably, test results may be cleared manually. The clearing module 106 also converts the fault information in the fault log file corresponding to a cleared test result into historical fault information.
To make better use of the historical fault information, provide the user with more information about hardware faults, and make machine maintenance easier, the user configuration module 101 is preferably further configured to configure historical fault information processing parameters. The hardware fault analysis system 10 further includes a historical fault information processing module 104, configured to process the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between faults.
Preferably, the historical fault information processing parameters configured by the user configuration module 101 may include frequent episode rule mining parameters. The historical fault information processing module 104 is specifically configured to read the frequent episode rule mining parameters and process the historical fault information according to them, mining the frequent episode rules between faults.
A frequent episode rule can be understood as expressing a temporal relationship between faults. For example, if fault set A occurs within a time period (a sliding time window), another fault set B may also occur within that sliding time window; this is written A => B.
A frequent episode rule can be measured by two indicators, support and confidence. Support is the probability, over all time windows, that fault set A and fault set B occur together. Confidence is the probability, over all time windows, that fault set B occurs given that fault set A occurs.
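The two indicators can be made concrete with a short Python sketch. It assumes that each time window has already been reduced to the set of fault identifiers observed in it; the function name and the string identifiers are hypothetical and serve only as illustration.

```python
def support_and_confidence(windows, fault_set_a, fault_set_b):
    """Compute (support, confidence) of the rule A => B.

    windows: iterable of sets, one per time window, each holding the
    fault identifiers seen in that window.
    """
    a = frozenset(fault_set_a)
    b = frozenset(fault_set_b)
    windows = list(windows)
    total = len(windows)
    # Windows in which every fault of A occurs.
    with_a = sum(1 for w in windows if a <= w)
    # Windows in which every fault of A and of B occurs.
    with_both = sum(1 for w in windows if (a | b) <= w)
    support = with_both / total if total else 0.0
    confidence = with_both / with_a if with_a else 0.0
    return support, confidence
```

For four windows `{"A", "B"}`, `{"A"}`, `{"B"}`, and `{"A", "B", "C"}`, the rule A => B has a support of 2/4 and a confidence of 2/3.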
Frequent episode rules resemble the relationship between root-cause alarms and derived alarms in communication systems (alarm correlation rules). The WINEPI algorithm is currently widely used for mining alarm correlation rules. Using preset parameters such as a sliding time window, a sliding step, a minimum support, and a minimum confidence, it computes how close events are within time windows and discovers the temporal partial order between events.
Therefore, this embodiment can apply the WINEPI algorithm to frequent episode rule mining on fault log files:
each record in an MCELOG log is structured data containing the generation time, MCE GSTATUS, MCE BANK, BANK STATUS, and other information, so one MCELOG record can be treated as one event;
the MCE BANK, BANK STATUS, and other information of each MCELOG record serve as the attributes of the event;
the set of MCE BANK values, the set of BANK STATUS values, and so on are each defined as an attribute domain;
further, the set of MCELOG records on one machine can be regarded as a time-ordered event sequence, which serves as the event sequence to be analyzed.
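The event model above can be sketched in Python as follows. The record type, its field names, and the window helper are hypothetical; parsing of the MCELOG text format is omitted. The windowing is also simplified: windows start at the first event, whereas WINEPI as usually described also includes windows that begin before the first event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MceEvent:
    """One MCELOG record treated as one event (illustrative fields)."""
    time: float     # generation time, in seconds
    gstatus: int    # MCE GSTATUS
    bank: int       # MCE BANK
    status: int     # BANK STATUS


def sliding_windows(events, width, step):
    """Split a machine's events into overlapping time windows.

    Returns a list of lists; each inner list holds the events whose
    time falls in [start, start + width).
    """
    ordered = sorted(events, key=lambda e: e.time)  # time-ordered sequence
    if not ordered:
        return []
    windows = []
    start = ordered[0].time
    last = ordered[-1].time
    while start <= last:
        windows.append([e for e in ordered
                        if start <= e.time < start + width])
        start += step
    return windows
```

Each window can then be reduced to a set of attribute values, for example `{e.bank for e in window}`, before support and confidence are counted.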
Preferably, the frequent episode rule mining parameters configured by the user configuration module 101 may include a sliding time window, a sliding step, a support threshold, and a confidence threshold. The historical fault information processing module 104 is specifically configured to compute, according to the sliding time window and the sliding step, the support and confidence of the faults in the historical fault information, and to determine the frequent episode rules between faults whose support or confidence is greater than the support threshold or the confidence threshold.
Of course, it should be understood that other feasible algorithms may also be used to mine the frequent episode rules.
Preferably, the historical fault information processing parameters configured by the user configuration module 101 may further include a statistical condition, which may include a statistical dimension and a statistical time period. The historical fault information processing module 104 may further be configured to classify, count, and sort the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.
The statistical dimension and the statistical time period may be set according to the user's actual needs. Preferably, the statistical dimension may include faulty hardware and fault type. The historical fault information processing module 104 may sort the historical fault information using a TopN algorithm to obtain the top-ranked faulty hardware and fault types, that is, the TopN faulty hardware and the TopN fault types, which helps hardware vendors address the TopN faults. The statistical time period may be set, according to the actual situation, to a period in which machine faults occur more frequently.
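The classification, counting, and sorting by one statistical dimension can be sketched with the Python standard library. The record layout (dictionaries with `time`, `hardware`, and `fault_type` keys) and the function name are assumptions made for illustration.

```python
from collections import Counter


def top_n(records, dimension, period_start, period_end, n):
    """Return the n most frequent values of one statistical dimension.

    records: iterable of dicts with a "time" key plus one key per
    dimension, e.g. "hardware" or "fault_type" (hypothetical layout).
    """
    # Classify and count only the records inside the statistical period.
    counts = Counter(r[dimension] for r in records
                     if period_start <= r["time"] <= period_end)
    # Sort by descending count and keep the first n entries.
    return counts.most_common(n)
```

Calling the function once with `dimension="hardware"` and once with `dimension="fault_type"` yields the TopN faulty hardware and the TopN fault types respectively.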
Preferably, so that the user can intuitively see the fault types that tend to occur, the hardware that tends to fail, and the frequent episode rules, the presentation module 105 may further be configured to present the statistical result and the frequent episode rules on the interface. Preferably, the presentation module 105 presents the statistical result on the interface along multiple dimensions (faulty hardware, fault type); more preferably, it presents the TopN faulty hardware and fault types in the statistical result. Preferably, the presentation module 105 presents the statistical result and the frequent episode rules in various forms such as lists, pie charts, and bar charts.
The beneficial effects of this embodiment are as follows. This embodiment provides a hardware fault analysis system that can periodically obtain, in batches, the fault log files of the machines to be monitored, and then predict from those files the faults that may occur on the machines and which hardware may fail, so that multiple machines can be monitored uniformly over the long term and kept in good working condition. On this basis, the system further provides a presentation module that presents the prediction result to the user intuitively and warns of possible faults, so that the user can quickly and accurately learn the state of a monitored machine and find its hidden risks. This embodiment can also process historical fault information to obtain further information of practical value: multi-dimensional statistics on historical fault information reveal the fault types that different kinds of hardware are prone to on a machine, and frequent episode rule mining on historical fault information reveals the association between two faulty components. This information can help hardware vendors analyze hardware defects and find directions for improving the hardware.
Embodiment 2:
Referring to FIG. 2, this embodiment provides a hardware fault analysis method, including:
S201: configuring the addresses of all machines to be monitored, the storage path of the fault log files, the collection period of the fault log files, and the fault determination condition;
S202: obtaining the addresses of the machines to be monitored, the storage path, and the collection period; periodically acquiring, according to the collection period, the fault log file of each machine to be monitored that corresponds to an address; and storing the fault log files in the storage path;
S203: obtaining the fault determination condition and the fault log files in the storage path, and performing fault prediction processing on the fault log files according to the fault determination condition to obtain a prediction result.
In this embodiment of the present invention, to obtain an accurate prediction result, the fault determination condition configured in S201 includes a fault time window for each fault and a fault threshold corresponding to each fault. Accordingly, S203 includes obtaining the fault time window of each fault, the corresponding fault threshold, and the fault log files in the storage path, and counting the fault information in the fault log files that falls within the fault time window of each fault; when the count is greater than the fault threshold corresponding to a fault, the hardware corresponding to that fault is predicted to be about to fail.
Preferably, the fault time window and fault threshold configured for determining whether hardware is failing may be set to different values for different types of hardware. For example, if the fault time window for a memory module is set to 24 hours and the fault threshold is set to 3, then when 3 correctable errors occur on one memory module within 24 hours, that memory module is predicted to be about to fail.
Because the prediction in S203 is a real-time analysis process, and in order to produce results promptly while occupying as little memory, CPU, and other resources as possible, this embodiment borrows the leaky bucket algorithm from the MCELOG user-space program to count fault information. For faults that occur within the fault time window, only counting is needed, and the count is accumulated. For faults beyond the fault time window, part of the count must be discarded. This embodiment uses the following aging algorithm to calculate the discarded count:
decay coefficient = fault threshold / fault time window,
discarded count = decay coefficient * extent to which the fault exceeds the time window.
It should be understood that this embodiment may select other algorithms, according to the actual fault situation and the user's needs, to calculate the count to be discarded.
The fault information used in the fault prediction of S203 is generally the new fault information in the currently collected fault log files, that is, fault information that has not been used before to predict faulty hardware.
Preferably, the hardware fault analysis method may further include presenting at least the prediction result on an interface to indicate that a specific piece of hardware is about to fail. The prediction result may be displayed as a list or in another form.
Preferably, the hardware fault analysis method of this embodiment further includes clearing at least one of the presented test results. Preferably, test results may be cleared manually: after the user selects a test result to be cleared, the fault information in the fault log file corresponding to that test result is converted into historical fault information.
To make better use of the historical fault information, provide the user with more information about hardware faults, and make machine maintenance easier, the hardware fault analysis method preferably further includes configuring historical fault information processing parameters and processing the historical fault information according to those parameters to obtain the logical relationships between faults.
Frequent episode rules resemble the relationship between root-cause alarms and derived alarms in communication systems (alarm correlation rules). The WINEPI algorithm is currently widely used for mining alarm correlation rules. Using preset parameters such as a sliding time window, a sliding step, a minimum support, and a minimum confidence, it computes how close events are within time windows and discovers the temporal partial order between events.
Therefore, this embodiment can apply the WINEPI algorithm to frequent episode rule mining on fault log files:
each record in an MCELOG log is structured data containing the generation time, MCE GSTATUS, MCE BANK, BANK STATUS, and other information, so one MCELOG record can be treated as one event;
the MCE BANK, BANK STATUS, and other information of each MCELOG record serve as the attributes of the event;
the set of MCE BANK values, the set of BANK STATUS values, and so on are each defined as an attribute domain;
further, the set of MCELOG records on one machine can be regarded as a time-ordered event sequence, which serves as the event sequence to be analyzed.
Preferably, the configured frequent episode rule mining parameters specifically include a sliding time window, a sliding step, a support threshold, and a confidence threshold. Processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between faults includes: computing, according to the sliding time window and the sliding step, the support and confidence of the historical faults in the historical fault information, and determining the frequent episode rules between faults whose support or confidence is greater than the support threshold or the confidence threshold. Of course, it should be understood that other feasible algorithms may also be used to mine the frequent episode rules.
Preferably, the configured historical fault information processing parameters further include a statistical condition, which may include a statistical dimension and a statistical time period. Processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between historical faults includes: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.
The statistical dimension and the statistical time period may be set according to the user's actual needs. Preferably, the statistical dimension may include faulty hardware and fault type, and the statistical time period may be set, according to the actual situation, to a period in which machine faults occur more frequently. The historical fault information may be sorted using a TopN algorithm to obtain the top-ranked faulty hardware and fault types, that is, the TopN faulty hardware and the TopN fault types, which helps hardware vendors address the TopN faults.
Preferably, so that the user can intuitively see the fault types that tend to occur, the hardware that fails, and the frequent episode rules, this embodiment may also present the statistical result and the frequent episode rules on the interface. Preferably, the statistical result is presented on the interface along multiple dimensions (faulty hardware, fault type); preferably, the presented statistical result may be the TopN faulty hardware and TopN fault types described above. Various forms of presentation are possible; for example, the statistical result and the frequent episode rules may be presented on the interface as lists, pie charts, bar charts, and so on.
The beneficial effects of this embodiment are as follows. This embodiment provides a hardware fault analysis method. Unlike the prior art, which can only predict faults for a single machine to be monitored, the method of this embodiment can perform long-term fault prediction on multiple machines to be monitored and obtain their long-term running state. So that the user can understand more quickly and intuitively the faults that may occur on a monitored machine, the test results, once obtained, can be presented to the user on the interface. The user may also be allowed to delete presented test results, which improves the user experience; the fault information corresponding to a deleted test result is not discarded but converted into historical fault information. This embodiment further provides a method of processing historical fault information to obtain at least multi-dimensional statistical results and frequent episode rules between faulty components. Based on these, hardware vendors can analyze the hardware from multiple angles, identify its shortcomings, and improve it to obtain better hardware.
Embodiment 3:
Referring to FIG. 3, this embodiment further provides a hardware fault analysis method based on MCELOG. When the hardware faults of machines are predicted with this method, the configuration items needed in the fault prediction process must first be configured. The content and meaning of the configuration items are shown in Table 1, the configuration information table of the hardware fault prediction method. Table 2 is the display item list, which gives the content and meaning of the display items to be shown on the interface.
Figure PCTCN2016075547-appb-000001
Table 1
Figure PCTCN2016075547-appb-000002
Table 2
After the configuration items in Table 1 have been configured, batch hardware fault analysis can be performed on all machines. Referring to FIG. 3, the flow of the hardware fault analysis is as follows:
S301: read configuration items 11, 12, and 13, and obtain the MCELOG log files;
S302: read configuration item 14, perform fault prediction processing on the MCELOG log files obtained in S301, and generate alarm information;
S303: present display item 21 in Table 2 on the interface, that is, the alarm information from S302;
S304: clear at least one piece of alarm information presented on the interface;
S305: read configuration item 15, select the fault information to be mined, and perform frequent episode rule mining;
S306: read configuration item 16, select the fault information to be mined, and compute multi-dimensional statistics on that fault information;
S307: present display item 22 and display item 23 in Table 2 on the interface.
S301 is described in detail below with reference to FIG. 4. S301 includes the following steps:
S3011: read and parse the addresses of all hosts to be monitored, the storage path of the MCELOG log files, and the collection period of the fault log files;
S3012: locate each host by its address, obtain the MCELOG log file from each host to be monitored at the times set by the collection period, and store the files together under the configured storage path;
S3013: determine whether the MCELOG log files of all hosts to be monitored have been obtained for this period; if so, proceed to S3014; otherwise, return to S3012 and continue obtaining the MCELOG log files of the hosts to be monitored;
S3014: end.
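Steps S3012 and S3013 can be sketched as one collection round in Python. The description does not specify how a log file is transferred from a host, so the sketch takes a caller-supplied `fetch` function as a hypothetical stand-in for that transport. The file naming, the retry limit, and the use of `OSError` to signal an unreachable host are likewise assumptions.

```python
from pathlib import Path


def collect_logs(hosts, fetch, storage_dir, max_rounds=3):
    """Collect one MCELOG file per host for the current period.

    fetch(host) -> bytes is a hypothetical transport callback that
    raises OSError when a host cannot be reached.
    Returns the hosts whose logs are still missing after max_rounds.
    """
    storage = Path(storage_dir)
    storage.mkdir(parents=True, exist_ok=True)
    pending = list(hosts)
    for _ in range(max_rounds):
        still_pending = []
        for host in pending:                      # S3012: fetch and store
            try:
                data = fetch(host)
            except OSError:
                still_pending.append(host)
                continue
            (storage / f"{host}.mcelog").write_bytes(data)
        pending = still_pending
        if not pending:                           # S3013: all hosts done
            break
    return pending
```

A scheduler would call this function once per collection period. Bounding the retries is a design choice of the sketch, made so that one unreachable host cannot block the round indefinitely.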
S302 is described in detail below with reference to FIG. 5. S302 includes the following steps:
S3021: read and parse the fault time window and fault threshold corresponding to a specific piece of hardware;
S3022: read the MCELOG log file newly obtained in S301;
S3023: determine whether the fault information in the MCELOG log file falls within the fault time window; if so, proceed to S3024; otherwise, proceed to S3025;
S3024: increase the count;
S3025: reduce the count according to the aging algorithm, and jump to S3026;
S3026: determine whether the count has reached the fault threshold; if so, proceed to S3027; otherwise, jump to S3022;
S3027: determine that the hardware is about to fail, and generate alarm information;
S3028: end.
S304 is described in detail below with reference to FIG. 6. S304 includes the following steps:
S3041: the user selects the alarm information to be cleared;
S3042: convert the fault information in the MCELOG log file corresponding to the alarm information to be cleared into historical fault information;
S3043: re-collect the DMIDECODE information and the MCELOG log files to obtain the current fault information;
S3044: clear the selected alarm information from the interface;
S3045: end.
S305 is described in detail below with reference to FIG. 7. S305 includes the following steps:
S3051: the user selects the historical fault information to be mined;
S3052: read and parse the sliding time window, sliding step, support threshold, and confidence threshold;
S3053: read the historical fault information to be mined, run the WINEPI algorithm, and mine the frequent episode rules between faults;
S3054: end.
S306 is described in detail below with reference to FIG. 8. S306 includes the following steps:
S3061: Read and parse the statistical time period and the statistical dimension, where the statistical dimension includes faulty hardware and fault type;
S3062: Obtain the historical fault information, and classify, count, and sort it according to the statistical dimension and the statistical time period;
S3063: End.
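A minimal Python sketch of S3061 and S3062 follows. It assumes each historical fault record carries a timestamp, a hardware identifier, and a fault type; the `FaultRecord` type, the `summarize` function, and the field names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class FaultRecord(NamedTuple):
    timestamp: float
    hardware: str    # faulty hardware, e.g. a DIMM or CPU identifier
    fault_type: str


def summarize(records: Iterable[FaultRecord], start: float, end: float,
              dimension: str) -> List[Tuple[str, int]]:
    """Classify, count, and sort records in [start, end] by the chosen dimension.

    `dimension` is either "hardware" or "fault_type".
    """
    counts = Counter(getattr(r, dimension)
                     for r in records
                     if start <= r.timestamp <= end)
    # most_common() sorts by count, highest first
    return counts.most_common()
```

For example, calling `summarize(records, start, end, "hardware")` ranks hardware components by how many faults they logged in the period, which is the kind of result a vendor could use to spot weak components.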
The above is a further detailed description of the present invention in connection with specific embodiments, and the specific implementation of the present invention shall not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the present invention belongs may make a number of simple deductions or substitutions without departing from the concept of the present invention, and all of these shall be regarded as falling within the scope of protection of the present invention.
Industrial Applicability
In the above technical solution provided by the embodiments of the present invention, the processing of the historical fault information by the historical fault information processing module may include: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period configured by the user configuration module, to obtain a statistical result; and processing the historical fault information according to the frequent episode rule mining parameters configured by the user configuration module, to mine the frequent episode rules between faults. This processing makes full use of the historical fault information to obtain the statistical result and the frequent episode rules between hardware faults. The statistical result and the frequent episode rules can reflect which faults are prone to occur when the hardware runs on a machine over a long period, as well as the relationships between those faults. They thereby identify the deficiencies and defects of the hardware and help hardware vendors improve the hardware on the basis of the statistical result and the frequent episode rules.

Claims (16)

  1. A hardware fault analysis system, comprising:
    a user configuration module, configured to configure the addresses of all machines to be monitored, a storage path for a fault log file, a collection period for the fault log file, and a fault judgment condition;
    an information collection module, configured to obtain the address of the machine to be monitored, the storage path, and the collection period, to periodically obtain, according to the collection period, the fault log file of the machine to be monitored corresponding to the address, and to store the fault log file in the storage path;
    a current fault prediction module, configured to obtain the fault judgment condition and the fault log file in the storage path, and to perform fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.
  2. The hardware fault analysis system according to claim 1, wherein the fault judgment condition configured by the user configuration module comprises a fault time window of each fault and a fault threshold corresponding to each fault; and the current fault prediction module is configured to obtain the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path, to count the fault information in the fault log file that falls within the fault time window of each fault, and, when the count value is greater than the fault threshold corresponding to a fault, to predict that the hardware corresponding to that fault is about to fail.
  3. The hardware fault analysis system according to claim 1 or 2, further comprising: a result presentation module, configured to present at least the prediction result on an interface.
  4. The hardware fault analysis system according to claim 3, further comprising: a clearing module, configured to clear at least one prediction result presented by the result presentation module, and to convert the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.
  5. The hardware fault analysis system according to claim 4, wherein the user configuration module is further configured to configure historical fault information processing parameters; and the hardware fault analysis system further comprises a historical fault information processing module, configured to process the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between faults.
  6. The hardware fault analysis system according to claim 5, wherein the historical fault information processing parameters configured by the user configuration module comprise frequent episode rule mining parameters; and the historical fault information processing module is specifically configured to read the frequent episode rule mining parameters, and to process the historical fault information according to the frequent episode rule mining parameters to mine the frequent episode rules between faults.
  7. The hardware fault analysis system according to claim 6, wherein the frequent episode rule mining parameters configured by the user configuration module comprise: a sliding time window, a sliding step, a support threshold, and a confidence threshold; and the historical fault information processing module is configured to count the support and the confidence between faults in the historical fault information according to the sliding time window and the sliding step, and to determine the frequent episode rules between faults that exceed the support threshold or the confidence threshold.
  8. The hardware fault analysis system according to claim 5, wherein the historical fault information processing parameters configured by the user configuration module comprise a statistical condition, the statistical condition comprising a statistical dimension and a statistical time period; and the historical fault information processing module is configured to classify, count, and sort the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.
  9. A hardware fault analysis method, comprising:
    configuring the addresses of all machines to be monitored, a storage path for a fault log file, a collection period for the fault log file, and a fault judgment condition;
    obtaining the address of the machine to be monitored, the storage path, and the collection period, periodically obtaining, according to the collection period, the fault log file of the machine to be monitored corresponding to the address, and storing the fault log file in the storage path;
    obtaining the fault judgment condition and the fault log file in the storage path, and performing fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result.
  10. The hardware fault analysis method according to claim 9, wherein configuring the fault judgment condition comprises: configuring a fault time window of each fault and a fault threshold corresponding to each fault; and obtaining the fault judgment condition and the fault log file in the storage path, and performing fault prediction processing on the fault log file according to the fault judgment condition to obtain a prediction result, comprises: obtaining the fault time window of each fault, the fault threshold corresponding to each fault, and the fault log file in the storage path; counting the fault information in the fault log file that falls within the fault time window of each fault; and, when the count value is greater than the fault threshold corresponding to a fault, predicting that the hardware corresponding to that fault is about to fail.
  11. The hardware fault analysis method according to claim 9 or 10, further comprising, after the test result is obtained, presenting at least the prediction result on an interface.
  12. The hardware fault analysis method according to claim 11, further comprising, after the prediction result is presented, clearing at least one presented prediction result, and converting the fault information in the fault log file corresponding to the cleared prediction result into historical fault information.
  13. The hardware fault analysis method according to claim 12, further comprising configuring historical fault information processing parameters, and processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between faults.
  14. The hardware fault analysis method according to claim 13, wherein configuring the historical fault information processing parameters comprises configuring frequent episode rule mining parameters; and processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between faults comprises: reading the frequent episode rule mining parameters, and processing the historical fault information according to the frequent episode rule mining parameters to mine the frequent episode rules between faults.
  15. The hardware fault analysis method according to claim 14, wherein the configured frequent episode rule mining parameters comprise: a sliding time window, a sliding step, a support threshold, and a confidence threshold; and processing the historical fault information according to the frequent episode rule mining parameters to mine the frequent episode rules between faults comprises: counting the support and the confidence between faults in the historical fault information according to the sliding time window and the sliding step, and determining the frequent episode rules between faults that exceed the support threshold or the confidence threshold.
  16. The hardware fault analysis method according to claim 13, wherein the configured historical fault information processing parameters comprise a statistical condition, the statistical condition comprising a statistical dimension and a statistical time period; and processing the historical fault information according to the historical fault information processing parameters to obtain the logical relationships between the historical faults comprises: classifying, counting, and sorting the historical fault information according to the statistical dimension and the statistical time period to obtain a statistical result.
PCT/CN2016/075547 2015-10-14 2016-03-03 Hardware fault analysis system and method WO2016188175A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510662922.3A CN106598800A (en) 2015-10-14 2015-10-14 Hardware fault analysis system and method
CN201510662922.3 2015-10-14

Publications (1)

Publication Number Publication Date
WO2016188175A1 true WO2016188175A1 (en) 2016-12-01

Family

ID=57392413

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/075547 WO2016188175A1 (en) 2015-10-14 2016-03-03 Hardware fault analysis system and method

Country Status (2)

Country Link
CN (1) CN106598800A (en)
WO (1) WO2016188175A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766208B (en) * 2017-10-27 2021-01-05 深圳市中润四方信息技术有限公司 Method, system and device for monitoring business system
CN110008056A (en) * 2019-03-28 2019-07-12 联想(北京)有限公司 EMS memory management process, device, electronic equipment and computer readable storage medium
CN111913054B (en) * 2019-05-10 2021-09-21 株洲中车时代电气股份有限公司 Method and system for diagnosing over-temperature fault of chopping wave and transmission control device
CN110968447A (en) * 2019-12-02 2020-04-07 安徽三实信息技术服务有限公司 Server host inspection system
CN111858116B (en) 2020-06-19 2024-02-13 浪潮电子信息产业股份有限公司 Information recording method, device, equipment and readable storage medium
CN113900902A (en) * 2021-10-21 2022-01-07 挂号网(杭州)科技有限公司 Log processing method and device, electronic equipment and storage medium
CN116431454B (en) * 2023-04-17 2023-11-14 云上遵义大数据有限公司 Big data computer performance control system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102546227A (en) * 2011-05-19 2012-07-04 广东迅通科技股份有限公司 Method and device for monitoring working status of different system devices
CN103117879A (en) * 2013-01-30 2013-05-22 昆明理工大学 Network monitoring system for computer hardware processing parameters
CN104270268A (en) * 2014-09-28 2015-01-07 曙光信息产业股份有限公司 Network performance analysis and fault diagnosis method of distributed system
JP2015146152A (en) * 2014-02-04 2015-08-13 三菱電機株式会社 Monitoring control apparatus
CN104866411A (en) * 2015-06-08 2015-08-26 北京奇虎科技有限公司 Monitoring and analyzing method and device for solid state disks

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008257411A (en) * 2007-04-04 2008-10-23 Hitachi Ltd Disk control system
CN101667197A (en) * 2009-09-18 2010-03-10 浙江大学 Mining method of data stream association rules based on sliding window
CN101888309B (en) * 2010-06-30 2012-07-04 中国科学院计算技术研究所 Online log analysis method
WO2014043623A1 (en) * 2012-09-17 2014-03-20 Siemens Corporation Log-based predictive maintenance
CN103812719B (en) * 2012-11-12 2018-05-18 华为技术有限公司 The failure prediction method and device of group system
CN103198000A (en) * 2013-04-02 2013-07-10 浪潮电子信息产业股份有限公司 Method for positioning faulted memory in linux system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107545129A (en) * 2016-06-27 2018-01-05 西门子(深圳)磁共振有限公司 A kind of trouble shooting method and apparatus of Medical Devices
CN107545129B (en) * 2016-06-27 2021-06-22 西门子(深圳)磁共振有限公司 Fault checking method and device for medical equipment
CN110647446A (en) * 2018-06-26 2020-01-03 中兴通讯股份有限公司 Log fault association and prediction method, device, equipment and storage medium
CN110647446B (en) * 2018-06-26 2023-02-21 中兴通讯股份有限公司 Log fault association and prediction method, device, equipment and storage medium
CN111352792A (en) * 2018-12-21 2020-06-30 海能达通信股份有限公司 Log management method, server and device with storage function
CN111352792B (en) * 2018-12-21 2023-10-24 海能达通信股份有限公司 Log management method, server and device with storage function
CN111258624A (en) * 2020-01-13 2020-06-09 上海交通大学 Method and system for predicting Issue solution time in open source software development
CN111258624B (en) * 2020-01-13 2023-04-28 上海交通大学 Issue solving time prediction method and system in open source software development
CN111949488A (en) * 2020-08-14 2020-11-17 山东英信计算机技术有限公司 Hard disk fault prediction method and system, electronic equipment and storage medium
WO2023061209A1 (en) * 2021-10-12 2023-04-20 中兴通讯股份有限公司 Method for predicting memory fault, and electronic device and computer-readable storage medium
CN115396287A (en) * 2022-08-29 2022-11-25 武汉烽火技术服务有限公司 Fault analysis method and device
CN115396287B (en) * 2022-08-29 2023-05-12 武汉烽火技术服务有限公司 Fault analysis method and device

Also Published As

Publication number Publication date
CN106598800A (en) 2017-04-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16799064

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16799064

Country of ref document: EP

Kind code of ref document: A1