WO2023071039A1 - Fault diagnosis method, apparatus, device and readable storage medium - Google Patents

Fault diagnosis method, apparatus, device and readable storage medium

Info

Publication number
WO2023071039A1
WO2023071039A1 PCT/CN2022/083577 CN2022083577W WO2023071039A1 WO 2023071039 A1 WO2023071039 A1 WO 2023071039A1 CN 2022083577 W CN2022083577 W CN 2022083577W WO 2023071039 A1 WO2023071039 A1 WO 2023071039A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault diagnosis
log file
raid card
fault
rule base
Prior art date
Application number
PCT/CN2022/083577
Other languages
English (en)
French (fr)
Inventor
孔涛
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023071039A1

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08 Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/10 Test algorithms, e.g. memory scan [MScan] algorithms; Test patterns, e.g. checkerboard patterns
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C29/00 Checking stores for correct operation; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04 Detection or location of defective memory elements, e.g. cell construction details, timing of test signals

Definitions

  • The present application relates to the technical field of servers, and more specifically to a fault diagnosis method, apparatus, device and readable storage medium.
  • RAID (Redundant Array of Independent Disks, i.e. disk array)
  • hard disk types there are more management modes for hard disks in RAID. Faults or hard drive failures can lead to data loss, downtime, and other issues with the server.
  • the purpose of this application is to provide a fault diagnosis method, device, equipment and readable storage medium for fault diagnosis of RAID and hard disks, so as to process faults in time, thereby reducing server data loss and probability of downtime.
  • a fault diagnosis method comprising:
  • the library provides a fault handling scheme; the fault diagnosis rule library is created by analyzing the historical faults of the RAID card and the hard disk it manages in advance;
  • the fault diagnosis rule base provides a fault handling plan, including:
  • Each log file of the RAID card is compared with the fault diagnosis rule base, and judges whether there is a target file matching the current log file of the RAID card in the fault diagnosis rule base;
  • using the fault diagnosis method to diagnose the current log file to provide the fault handling solution including:
  • if the latest information includes neither the abnormal information nor the normal information, provide a fault handling plan stating that the hard disk in the first slot number has failed, and suggest submitting a work order to escalate the fault to second-line support.
  • using the fault diagnosis method to diagnose the current log file to provide the fault handling solution including:
  • using the fault diagnosis method to diagnose the current log file to provide the fault handling solution including:
  • diagnosis keyword in the fault diagnosis rule base does not contain any one of hwErrors, mediumErrors, and smartWarning, then obtain the number of the value greater than 0 corresponding to the last keyword;
  • Construct a log training set according to the state information; use the Relief filter selection algorithm to select a sample from the log training set, find a guessed neighbor (near-hit) sample among the samples of the same class as the sample, and randomly select a guessed wrong neighbor (near-miss) sample among the samples of a different class; if the distance between the sample and the guessed neighbor sample on a feature is smaller than the distance between the sample and the guessed wrong neighbor sample on the same feature, increase the weight of the feature; if it is not smaller, reduce the weight of the feature; train the feature a preset number of times and obtain the average weight of the feature after the preset number of trainings;
  • the analysis result of the feature set is received, and the analysis result of the feature set is added to the fault diagnosis rule base.
  • outputting a log file for determining that the RAID card and/or managed hard disk is faulty and the fault handling plan including:
  • a fault diagnosis device comprising:
  • Obtaining module is used for obtaining each log file of the RAID card in the server to be monitored;
  • a determining module, configured to determine whether the RAID card and the hard disks it manages are faulty according to the fault diagnosis rule base and the log files of the RAID card, and to use the fault diagnosis rule base to provide a fault handling plan when the RAID card and/or a managed hard disk is determined to be faulty; the fault diagnosis rule base is created in advance by analyzing the historical faults of the RAID card and the hard disks it manages;
  • An output module configured to output a log file for determining that the RAID card and/or the managed hard disk is faulty, and the fault handling plan.
  • a fault diagnosis device comprising:
  • a processor configured to implement the steps of the fault diagnosis method described in any one of the above when executing the computer program.
  • a readable storage medium wherein a computer program is stored in the readable storage medium, and when the computer program is executed by a processor, the steps of the fault diagnosis method described in any one of the above are realized.
  • The application provides a fault diagnosis method, apparatus, device and readable storage medium. The method includes: obtaining each log file of the RAID card in the server to be monitored; determining, according to the fault diagnosis rule base and each log file of the RAID card, whether the RAID card and the hard disks it manages are faulty, and using the fault diagnosis rule base to give a fault handling plan when the RAID card and/or a managed hard disk is determined to be faulty, the fault diagnosis rule base being created in advance by analyzing the historical faults of the RAID card and the hard disks it manages; and outputting the log file in which the fault was determined together with the fault handling plan.
  • With the above technical solution disclosed in the present application, the fault diagnosis rule base, created in advance by analyzing the historical faults of the RAID card and the hard disks it manages, is used together with the acquired log files of the RAID card in the server to be monitored to determine whether the RAID card and/or the hard disks it manages are faulty, and to provide a fault handling plan when a fault is determined. This realizes fault diagnosis of the RAID card and hard disks, and the output log file and fault handling plan allow faults to be handled in time, reducing the probability of server data loss and downtime.
  • Fig. 1 is a flow chart of a fault diagnosis method provided by the embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a fault diagnosis device provided in an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a fault diagnosis device provided in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a readable storage medium provided by an embodiment of the present application.
  • The core of this application is to provide a fault diagnosis method, apparatus, device and readable storage medium for fault diagnosis of RAID cards and hard disks, so that faults are dealt with in time, thereby reducing the probability of data loss and downtime of servers.
  • FIG. 1 shows a flow chart of a fault diagnosis method provided in an embodiment of the present application.
  • a fault diagnosis method provided in an embodiment of the present application may include:
  • The monitoring platform can remotely monitor the server to be monitored over the SSH (Secure Shell) protocol and obtain each log file of the RAID card in that server. The log files mentioned here include but are not limited to storcliAdpalilog.txt (hard disk status information file in txt format), storcliPDList.txt (logical disk list information file in txt format), Controller_1_Config.txt (controller configuration information file in txt format), and Controller_1_Device_log.txt (controller device log information).
  • SSH (Secure Shell protocol)
  • The information of the server to be monitored can be registered with the monitoring platform in advance. The server information mentioned here can specifically include the IP address of the server, the user name, and the password.
  • The monitoring platform can log in to the server to be monitored using its registration information, remotely copy the tool used to capture the RAID card log files (such as the storcli64 tool) to a specified directory on that server, and grant the tool execute permission. The tool is then used to collect the log files of the RAID card in the server to be monitored, and the collected log files are returned to the monitoring platform, so that the monitoring platform obtains each log file of the RAID card in the server to be monitored.
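The collection round described above (copy the tool, make it executable, run it, pull the log back) can be sketched as a list of commands. This is a minimal sketch, not the application's implementation: it assumes the OpenSSH `scp` and `ssh` clients with key-based login (the source registers a password instead), and the function name, directory names, output file name, and tool arguments are all hypothetical. The function only builds the command lines and does not execute anything.

```python
import os
import shlex
from typing import List


def build_collection_commands(host: str, user: str, local_tool: str,
                              remote_dir: str, tool_args: List[str],
                              remote_log: str, local_log_dir: str) -> List[List[str]]:
    """Return the argv lists for one log collection round (nothing is run here)."""
    target = f"{user}@{host}"
    remote_tool = f"{remote_dir}/{os.path.basename(local_tool)}"
    remote_out = f"{remote_dir}/{remote_log}"
    # Quote every part of the remote command so odd characters cannot break it.
    run_tool = " ".join(shlex.quote(p) for p in [remote_tool, *tool_args])
    run_tool += " > " + shlex.quote(remote_out)
    return [
        ["scp", local_tool, f"{target}:{remote_tool}"],    # copy the tool to the server
        ["ssh", target, "chmod", "+x", remote_tool],        # grant execute permission
        ["ssh", target, run_tool],                          # collect the log remotely
        ["scp", f"{target}:{remote_out}", local_log_dir],   # return the log to the platform
    ]


# Hypothetical values, for illustration only.
commands = build_collection_commands(
    "10.0.0.5", "monitor", "./storcli64", "/opt/diag",
    ["/c0", "show", "all"], "Controller_0.txt", "./logs")
for argv in commands:
    print(" ".join(argv))
```

Each returned list could be passed to `subprocess.run(argv, check=True)` by a caller that actually wants to perform the collection.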
  • S12: Determine whether the RAID card and the hard disks it manages are faulty according to the fault diagnosis rule base and each log file of the RAID card, and use the fault diagnosis rule base to provide a fault handling plan when it is determined that the RAID card and/or a managed hard disk is faulty;
  • the fault diagnosis rule base is created by analyzing the historical faults of the RAID card and the hard disk it manages in advance.
  • After the log files of the RAID card in the server to be monitored are obtained, whether the RAID card and the hard disks managed by the RAID card are faulty can be determined according to the pre-created fault diagnosis rule base and the log files of the RAID card.
  • When it is determined that there is a fault in the RAID card and/or a hard disk managed by the RAID card, the fault diagnosis rule base can be used to give a fault handling plan.
  • the fault diagnosis rule base is created by collecting in advance the historical faults of the RAID cards and the hard disks they manage in the servers monitored by the monitoring platform, and analyzing and extracting the historical faults of the RAID cards and the hard disks they manage.
  • The application can remotely monitor the server to be monitored, obtain the log files of its RAID card, and use the pre-created fault diagnosis rule base together with the obtained log files to diagnose faults of the RAID card in the server to be monitored and of the hard disks managed by the RAID card, so as to find faults in time and provide a fault handling plan.
  • After step S12, a log file in which the RAID card and/or a managed hard disk was determined to be faulty, that is, a log file containing fault information, can be output together with the fault handling plan, so that relevant personnel can use the output log file and fault handling plan to deal with the fault in time and keep the server to be monitored running normally, thereby reducing the probability of data loss and downtime of the server and improving the stability and reliability of server operation.
  • this application has good versatility, and can perform fault diagnosis on RAID cards of different manufacturers and hard disks under different RAID management modes, so as to effectively reduce the risk of data loss and the probability of server downtime caused by hard disk failures.
  • Determining, according to the fault diagnosis rule base and the log files of the RAID card, whether the RAID card and the hard disks it manages are faulty, and using the fault diagnosis rule base to give a fault handling plan when the RAID card and/or a managed hard disk is determined to be faulty, may include:
  • Each log file of the RAID card is compared with the fault diagnosis rule base to determine whether there is a target file matching the current log file of the RAID card in the fault diagnosis rule base;
  • the pre-created fault diagnosis rule base contains rule information one by one, and the format of each rule information is as follows:
  • the monitoring platform will match the log file that needs to be diagnosed according to the file name of the file;
  • Fault diagnosis method: used to give a fault handling plan. Some rule information has a fault diagnosis method item and some does not.
  • the monitoring platform can compare and match each log file of the RAID card with each rule information in the fault diagnosis rule base.
  • The next log file of the RAID card can then be used as the current log file, returning to the step of judging whether there is a target file matching the current log file of the RAID card in the fault diagnosis rule base, until all the acquired log files have been processed.
  • When it is determined that there is a target file matching the current log file of the RAID card in the fault diagnosis rule base, the monitoring platform matches each diagnostic keyword corresponding to the target file against the content of each line in the current log file.
  • Each line that matches the diagnostic keywords corresponding to the target file can be recorded as a faultLine (fault line), and it is determined that the RAID card and/or a managed hard disk is faulty. That is, if at least one line in the current log file matches the diagnostic keywords corresponding to the target file, the current log file is considered to contain fault information. If no line in the current log file matches the diagnostic keywords corresponding to the target file, it is determined that the current log file contains no fault information.
  • faultLine (fault line)
  • the next log file of the RAID card can be used as the current log file to return to the step of judging whether there is a target file matching the current log file of the RAID card in the fault diagnosis rule base, until the processing of all the log files obtained is completed.
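The matching loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `Rule` class and its field names are hypothetical stand-ins for one piece of rule information (file name, diagnostic keywords, optional fault diagnosis method), and a line is treated as a fault line when it contains any one of the diagnostic keywords, which is one possible reading of the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rule:
    """One piece of rule information (field names are hypothetical)."""
    file_name: str                 # log file this rule applies to
    keywords: List[str]            # diagnostic keywords
    method: Optional[str] = None   # optional fault diagnosis method


def find_fault_lines(log_name: str, log_text: str,
                     rules: List[Rule]) -> Tuple[Optional[Rule], List[str]]:
    """Return the matching rule (target file) and the fault lines of one log file."""
    rule = next((r for r in rules if r.file_name == log_name), None)
    if rule is None:
        return None, []            # no target file in the rule base
    fault_lines = [line for line in log_text.splitlines()
                   if any(keyword in line for keyword in rule.keywords)]
    return rule, fault_lines


rules = [Rule("Controller_1_Device_log.txt", ["mediumErrors"], "pmc")]
rule, fault_lines = find_fault_lines(
    "Controller_1_Device_log.txt", "header\nmediumErrors: 3\nfooter", rules)
print(rule.method, fault_lines)
```

A non-empty `fault_lines` corresponds to "the current log file contains fault information"; a rule with `method` set would then be handed to the matching fault diagnosis method.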
  • In the fault diagnosis method provided in the embodiment of the present application, using the fault diagnosis method to diagnose the current log file to provide a fault handling plan may include:
  • When the fault diagnosis method is used to diagnose the current log file to provide a fault handling plan, it can be checked whether the current log file satisfies the first preset regular expression; if so, the first slot number is extracted from the current log file according to the first preset regular expression. For example, the check and extraction can be performed with the regular expression ^.*PD.*\(.*\/s[0-9]{1,}\).*$ (which can serve as the first preset regular expression), and the extracted first slot number is recorded as slot1.
  • the above steps can be regarded as hard disk slot analysis, and the following steps can be regarded as fault information filtering:
  • If the latest information found contains neither abnormal information nor normal information, a fault handling plan is given stating that the hard disk in the first slot number has failed.
  • The fault is submitted to the second-line operation and maintenance personnel through a work order, so that they can handle the hard disk in the first slot number accordingly.
  • the above-mentioned fault diagnosis method of the present application can be regarded as hard disk slot analysis and fault information filtering.
  • the above method can not only diagnose where a fault occurs, but also provide an accurate fault handling plan, so that relevant personnel can perform fault maintenance according to the output fault handling plan.
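The hard disk slot analysis and fault information filtering steps can be sketched together. This is a minimal sketch, not the application's code: the slot pattern below is an assumed line shape (a `PD` entry followed by a location such as `(e0x08/s5)`), the sample log lines and the abnormal and normal marker strings are hypothetical, and later lines are assumed to be newer.

```python
import re
from typing import List, Optional

# Assumed line shape, e.g. "... PD 05(e0x08/s5) ..."; not taken from the source.
SLOT_RE = re.compile(r"PD\s+\w+\(e\w*/s(\d+)\)")


def diagnose_slot(fault_line: str, log_lines: List[str],
                  abnormal: List[str], normal: List[str]) -> Optional[str]:
    """Return a fault handling plan for the slot named in fault_line, or None."""
    match = SLOT_RE.search(fault_line)
    if match is None:
        return None                           # line does not name a slot
    slot = match.group(1)                     # "slot1" in the text above
    marker = f"/s{slot})"
    related = [line for line in log_lines if marker in line]
    latest = related[-1] if related else fault_line
    if any(word in latest for word in abnormal):
        return f"replace the hard disk in slot {slot}"
    if any(word in latest for word in normal):
        return None                           # latest state is healthy: filter it out
    return (f"hard disk in slot {slot} has failed; submit a work order "
            "and escalate to second-line support")


log = ["1:Predictive failure: PD 05(e0x08/s5)",
       "2:State change on PD 05(e0x08/s5) to ONLINE"]
print(diagnose_slot(log[0], log[:1], ["Predictive failure"], ["ONLINE"]))
print(diagnose_slot(log[0], log, ["Predictive failure"], ["ONLINE"]))
```

Looking only at the latest entry for a slot means a disk that reported an error but later returned to a healthy state is filtered out rather than replaced.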
  • In the fault diagnosis method provided in the embodiment of the present application, using the fault diagnosis method to diagnose the current log file to provide a fault handling plan may include:
  • the number of errors corresponding to the second slot number represents the number of failures of the hard disk corresponding to the second slot number
  • the second preset regular expression can specifically be Slot Number:[0-9]{1,}
  • the third preset regular expression can specifically be Other Error Count:[0-9]{1,}
  • the extracted second slot number can be expressed as slot2
  • the corresponding error number can be expressed as M.
  • the above fault diagnosis method can be regarded as link problem detection.
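Link problem detection can be sketched with the two patterns named above. This is a minimal sketch under stated assumptions: optional whitespace after the colon is allowed (an assumption about the log layout), each `Other Error Count` line is attributed to the most recent `Slot Number` line, and the thresholds follow the device embodiment (an error number greater than 1 suggests a link problem, an error number equal to 1 suggests replacing the disk). The function name and the plan strings are hypothetical.

```python
import re
from typing import Dict

SLOT_RE = re.compile(r"Slot Number:\s*([0-9]{1,})")        # second preset regex
ERR_RE = re.compile(r"Other Error Count:\s*([0-9]{1,})")   # third preset regex


def link_problem_plans(text: str) -> Dict[int, str]:
    """Map each slot number (slot2) with errors (M > 0) to a fault handling plan."""
    plans: Dict[int, str] = {}
    slot = None
    for line in text.splitlines():
        slot_match = SLOT_RE.search(line)
        if slot_match:
            slot = int(slot_match.group(1))
            continue
        err_match = ERR_RE.search(line)
        if err_match and slot is not None:
            errors = int(err_match.group(1))
            if errors > 1:
                plans[slot] = "replace RAID card, backplane and SAS cable one by one"
            elif errors == 1:
                plans[slot] = "replace the hard disk in this slot"
    return plans


sample = ("Slot Number: 0\nOther Error Count: 0\n"
          "Slot Number: 1\nOther Error Count: 1\n"
          "Slot Number: 2\nOther Error Count: 7")
print(link_problem_plans(sample))
```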
  • In the fault diagnosis method provided in the embodiment of the present application, using the fault diagnosis method to diagnose the current log file to provide a fault handling plan may include:
  • diagnosis keyword in the fault diagnosis rule base does not contain any one of hwErrors, mediumErrors, and smartWarning, then obtain the number corresponding to the last keyword whose value is greater than 0;
  • The "serialNumber" serial number can be extracted from the current log file; specifically, the serial number is extracted from the content of the line where the value is located and recorded as SN. Then the current log file is traversed in a loop according to \s*Serial number\s*:\s*"+SN+".*$ (the fourth preset regular expression). If a line in the current log file matches the fourth preset regular expression, the search proceeds forward from that line for content matching \s*Reported Location.*:\s*Enclosure.*Slot.*$ (the fifth preset regular expression), and the third slot number is extracted from the content matching the fifth preset regular expression and recorded as slot3.
  • A fault handling plan is given for replacing the hard disk corresponding to the third slot number and the serial number, that is, replacing the hard disk in the third slot number whose serial number is SN.
  • If N is equal to 1, provide a fault handling plan for replacing the hard disk corresponding to the third slot number and serial number;
  • If N is greater than 1, it is unlikely that multiple hard disks are faulty at the same time. A link problem is considered instead, and it is recommended to replace the RAID card, backplane, and SAS cables one by one in order.
  • the above fault diagnosis method can be regarded as PMC card fault diagnosis.
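The decision part of the PMC card fault diagnosis can be sketched as a small function. This is a minimal sketch: it covers only the branching on the diagnostic keywords and on N (the number of values greater than 0 for the last keyword), not the log parsing, and the function name, parameter names, and plan strings are hypothetical.

```python
from typing import List, Optional

DIRECT_KEYWORDS = {"hwErrors", "mediumErrors", "smartWarning"}


def pmc_plan(keywords: List[str], values: List[int],
             slot: str, serial: str) -> Optional[str]:
    """Choose a fault handling plan from the rule keywords and the extracted values."""
    replace_disk = f"replace the hard disk in slot {slot} with serial number {serial}"
    if DIRECT_KEYWORDS.intersection(keywords):
        return replace_disk                   # disk-level error keyword: replace the disk
    n = sum(1 for value in values if value > 0)   # N in the text above
    if n == 1:
        return replace_disk
    if n > 1:
        # Several disks failing together is unlikely, so suspect the link.
        return "replace RAID card, backplane and SAS cable one by one"
    return None                               # every value is 0: nothing to report


print(pmc_plan(["mediumErrors"], [3], "4", "SN123"))
print(pmc_plan(["linkFailures"], [2, 5, 0], "4", "SN123"))
```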
  • Construct a log training set according to the state information; use the Relief filter selection algorithm to select a sample from the log training set, find a guessed neighbor (near-hit) sample among the samples of the same class as the sample, and randomly select a guessed wrong neighbor (near-miss) sample among the samples of a different class; if the distance between the sample and the guessed neighbor sample on a feature is smaller than the distance between the sample and the guessed wrong neighbor sample on the same feature, increase the weight of the feature; otherwise, reduce the weight of the feature; train the feature a preset number of times and obtain the average weight of the feature after the preset number of trainings;
  • the analysis result of the feature set is received, and the analysis result of the feature set is added to the fault diagnosis rule base.
  • The fault diagnosis rule base can also be continuously updated by using the log files that do not contain fault information, so as to improve the accuracy of fault diagnosis based on the fault diagnosis rule base.
  • These log files are log files without fault information (abbreviated as fault-free log files). The status information can then be extracted from these fault-free log files.
  • A fault-free log file line is composed of a fixed part and a variable part, for example: 169:21-01-11,21:23:59Info:VD 02/2 is now OPTIMAL, where "Info:VD" is the fixed part and the rest is the variable part.
  • The information in front of the fixed part "Info:VD" is the log ID, time and similar information, which is of no value for log fault diagnosis.
  • the information after "Info: VD” It represents the status information, where the status information indicates the health status of the RAID card or hard disk. Therefore, Event1:[0-9] ⁇ 1, ⁇ :[0-9] ⁇ 2 ⁇ -[0 can be defined in the process of log fault diagnosis.
  • the method of *Info:VD.* extracts the information that matches Event1, and the extracted information is divided according to "Info:VD" to obtain status information, for example: " is now OPTIMAL".
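The split on the fixed part can be sketched as follows. This is a minimal sketch: the function name is hypothetical, and dropping the virtual disk identifier that follows "Info:VD" is an assumption made so that the result matches the example status " is now OPTIMAL" given above.

```python
import re
from typing import Optional


def extract_status(line: str, fixed: str = "Info:VD") -> Optional[str]:
    """Return the status information after the fixed part, or None if it is absent."""
    if fixed not in line:
        return None
    tail = line.split(fixed, 1)[1]         # drop log ID, time and the fixed part
    # Assumption: the first token after the fixed part is the VD identifier.
    return re.sub(r"^\s*\S+", "", tail)


print(repr(extract_status("169:21-01-11,21:23:59Info:VD 02/2 is now OPTIMAL")))
```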
  • the basic model event in this step can extract all the model features of the RAID card.
  • the model features mentioned here include health features and fault features.
  • These model features need to be screened, and the extracted fault features are used to improve the fault diagnosis rule base.
  • the screening uses the Relief filter selection algorithm for feature selection.
  • The state information extracted from the fault-free log files is taken as the log training set, recorded as RD. Using the Relief filter selection algorithm, a sample R is randomly selected from the log training set RD, and the nearest neighbor sample NH is found among the samples of the same class as R; NH is called the guessed neighbor (near-hit).
  • A sample is randomly selected from the samples of a class different from R and recorded as NM, called the guessed wrong neighbor (near-miss). Feature extraction training is then performed according to the following rules: if the distance between R and NH on a certain feature is smaller than the distance between R and NM on the same feature, the feature helps distinguish same-class neighbors from different-class neighbors, and the weight of the feature is increased; otherwise, the weight of the feature is reduced.
  • The above training process is repeated a preset number of times (recorded as m), and the average weight of each feature over the preset number of trainings is finally obtained. The greater the average weight, the stronger the classification ability of the feature; the smaller the average weight, the weaker its classification ability.
  • Features with strong classification ability are added to the feature set, that is, features whose average weight is greater than a preset value are added to the feature set. The preset value can be set according to actual needs; an average weight greater than the preset value indicates that the feature has strong classification ability.
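The training loop and the feature selection described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: samples are tuples of discrete feature values with a class label supplied for each sample (the source does not say how the labels are obtained), every class has at least two samples, the near-hit is the nearest same-class sample, and the near-miss is drawn at random from the other classes as the text describes (textbook Relief uses the nearest one). All function and variable names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def diff(x, y) -> float:
    """Distance between two discrete feature values."""
    return 0.0 if x == y else 1.0


def relief(samples: Sequence[Tuple], labels: Sequence[int],
           m: int, seed: int = 0) -> List[float]:
    """Return the average weight of each feature after m training rounds."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [0.0] * n_features

    def distance(a, b) -> float:
        return sum(diff(u, v) for u, v in zip(a, b))

    for _ in range(m):
        i = rng.randrange(len(samples))
        r, label = samples[i], labels[i]
        hits = [s for j, s in enumerate(samples) if labels[j] == label and j != i]
        misses = [s for j, s in enumerate(samples) if labels[j] != label]
        near_hit = min(hits, key=lambda s: distance(r, s))   # NH: nearest same-class sample
        near_miss = rng.choice(misses)                        # NM: random other-class sample
        for f in range(n_features):
            # The weight grows when the feature separates classes, shrinks otherwise.
            weights[f] += diff(r[f], near_miss[f]) - diff(r[f], near_hit[f])
    return [w / m for w in weights]


def select_features(avg_weights: List[float], preset: float) -> List[int]:
    """Indexes of the features whose average weight exceeds the preset value."""
    return [f for f, w in enumerate(avg_weights) if w > preset]


samples = [("OPTIMAL", "x"), ("OPTIMAL", "x"), ("DEGRADED", "x"), ("DEGRADED", "x")]
labels = [0, 0, 1, 1]
avg = relief(samples, labels, m=10)
print(avg, select_features(avg, preset=0.5))
```

In this toy data the first feature fully determines the class and the second is constant, so only the first feature ends up in the feature set.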
  • the specific algorithm is as follows:
  • R_n represents the sample randomly selected from the log training set RD for the nth time
  • NH_n is the guessed neighbor sample selected for the nth time
  • NM_n is the guessed wrong neighbor sample selected for the nth time
  • is the final
  • the value of n here is a discrete value
  • the calculation method of the formula diff(X, Y) is as follows:
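For orientation, a common way of writing the Relief weight update and the distance for discrete feature values is given below. It uses the symbols defined above (R_n, NH_n, NM_n, m) and is consistent with the training rule described earlier, but it is offered as an assumption and not as the application's own formulas; the symbol \bar{W}^{j} for the average weight of feature j and the superscript j for "value on feature j" are introduced here for illustration.

```latex
\bar{W}^{j} \;=\; \frac{1}{m}\sum_{n=1}^{m}
  \Bigl[\operatorname{diff}\bigl(R_n^{j},\, NM_n^{j}\bigr)
      - \operatorname{diff}\bigl(R_n^{j},\, NH_n^{j}\bigr)\Bigr],
\qquad
\operatorname{diff}(X, Y) \;=\;
\begin{cases}
  0, & X = Y,\\
  1, & X \neq Y.
\end{cases}
```

With this form, a round in which the sample is closer to its guessed neighbor than to its guessed wrong neighbor on feature j contributes a positive term, which matches the rule of increasing the weight in that case.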
  • the feature set can be output, so that relevant personnel can manually analyze the feature set to obtain the analysis results, wherein the analysis results can be included in the fault diagnosis rule base.
  • The monitoring platform can receive the analysis results of the feature set and add them to the fault diagnosis rule base, so as to update and improve the fault diagnosis rule base. Fault diagnosis of the RAID card log files is then performed on the basis of the updated fault diagnosis rule base, thereby improving the accuracy of fault diagnosis.
  • In the fault diagnosis method provided in the embodiment of the present application, outputting the log file in which the RAID card and/or a managed hard disk was determined to be faulty, together with the fault handling plan, may include:
  • outputting the log file in which the RAID card and/or a managed hard disk was determined to be faulty and the fault handling plan to a mobile terminal by email and/or SMS, so that relevant personnel can obtain the relevant information and deal with the fault in time.
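The email branch of this output step can be sketched with Python's standard `email` package. This is a minimal sketch: the function name, addresses, subject wording, and log content are hypothetical, the message is only built and not sent, and the SMS branch is left out because it depends on a gateway the source does not describe.

```python
from email.message import EmailMessage


def build_fault_mail(sender: str, recipient: str, log_name: str,
                     log_bytes: bytes, plan: str) -> EmailMessage:
    """Build a mail carrying the fault handling plan with the faulty log attached."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"RAID fault detected in {log_name}"
    msg.set_content(plan)                               # plan as the mail body
    msg.add_attachment(log_bytes, maintype="application",
                       subtype="octet-stream", filename=log_name)
    return msg


mail = build_fault_mail("monitor@example.com", "ops@example.com",
                        "storcliPDList.txt", b"Slot Number: 2\n",
                        "replace the hard disk in slot 2")
print(mail["Subject"])
```

A caller could hand the returned message to `smtplib.SMTP(...).send_message(mail)` once a mail server is configured.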
  • FIG. 2 shows a schematic structural diagram of a fault diagnosis device provided in the embodiment of the present application, which may include:
  • a determining module 22, configured to determine whether the RAID card and the hard disks it manages are faulty according to the fault diagnosis rule base and each log file of the RAID card, and to use the fault diagnosis rule base to give a fault handling plan when the RAID card and/or a managed hard disk is determined to be faulty; the fault diagnosis rule base is created in advance by analyzing the historical faults of the RAID card and the hard disks it manages;
  • the output module 23 is configured to output a log file and a fault handling plan for determining that the RAID card and/or the managed hard disk is faulty.
  • the determining module 22 may include:
  • a comparison unit is used to compare each log file of the RAID card with the fault diagnosis rule base, and judge whether there is a target file matching the current log file of the RAID card in the fault diagnosis rule base;
  • a matching unit configured to match each diagnostic keyword corresponding to the target file with each row content in the current log file if the target file exists;
  • a determining unit configured to determine that the RAID card and/or the managed hard disk is faulty if there is at least one line of content that can match each diagnostic keyword in the current log file;
  • the diagnosis unit is used to judge whether there is a fault diagnosis method corresponding to the target file in the fault diagnosis rule base, and if there is a fault diagnosis method corresponding to the target file, then use the fault diagnosis method to diagnose the current log file to provide fault handling plan.
  • the diagnosis unit may include:
  • a first extraction subunit, configured to extract the first slot number from the current log file according to the first preset regular expression;
  • a search subunit, configured to find the latest information corresponding to the first slot number in the current log file;
  • a first providing subunit, configured to provide a fault handling plan of replacing the hard disk in the first slot number if the latest information contains abnormal information;
  • a first filtering subunit, configured to filter the latest information out of the current log file if the latest information contains normal information;
  • a second providing subunit, configured to provide, if the latest information contains neither abnormal information nor normal information, a fault handling plan stating that the hard disk in the first slot number has failed and recommending that a work order be submitted to escalate the fault to second-line support.
  • the diagnosis unit may include:
  • a second extraction subunit, configured to extract the second slot number from the current log file according to the second preset regular expression, and to extract the error number corresponding to the second slot number from the current log file according to the third preset regular expression;
  • a third providing subunit, configured to provide a fault handling plan of replacing the RAID card, backplane, and SAS cable one by one in order if the error number is greater than 1;
  • a fourth providing subunit, configured to provide a fault handling plan of replacing the hard disk in the second slot number if the error number is equal to 1.
  • the diagnosis unit may include:
  • a third extraction subunit, configured to extract the last keyword corresponding to the target file from the fault diagnosis rule base, and to extract the value corresponding to the last keyword from the current log file;
  • a second filtering subunit, configured to filter the content of the line where the value is located out of the current log file if the value is equal to 0;
  • a fourth extraction subunit, configured to extract the serial number from the current log file if the value is greater than 0, and to traverse the current log file in a loop according to the fourth preset regular expression; if a line in the current log file matches the fourth preset regular expression, to search forward from that line for content matching the fifth preset regular expression, and to extract the third slot number from the content matching the fifth preset regular expression;
  • a fifth providing subunit, configured to provide a fault handling plan of replacing the hard disk corresponding to the third slot number and serial number if the diagnostic keywords in the fault diagnosis rule base contain any one of hwErrors, mediumErrors, and smartWarning;
  • a sixth providing subunit, configured to provide a fault handling plan of replacing the hard disk corresponding to the third slot number and serial number if the number is equal to 1;
  • a seventh providing subunit, configured to provide, if the number is greater than 1, a fault handling plan that considers a link problem and suggests replacing the RAID card, backplane, and SAS cable one by one in order.
  • an extraction module, configured to extract the status information from the current log file if there is no target file matching the current log file of the RAID card in the fault diagnosis rule base, or if there is such a target file but no line in the current log file matches the diagnostic keywords;
  • a training module, configured to construct a log training set according to the state information, select a sample from the log training set using the Relief filter selection algorithm, find a guessed neighbor (near-hit) sample among the samples of the same class as the sample, and randomly select a guessed wrong neighbor (near-miss) sample among the samples of a different class; if the distance between the sample and the guessed neighbor sample on a feature is smaller than the distance between the sample and the guessed wrong neighbor sample on the same feature, increase the weight of the feature; if it is not smaller, reduce the weight of the feature; train the feature a preset number of times and obtain the average weight of the feature after the preset number of trainings;
  • the first adding module is used to add features whose average weight is greater than a preset value to the feature set, and output the feature set;
  • the second adding module is used to receive the analysis result of the feature set, and add the analysis result of the feature set into the fault diagnosis rule base.
  • the output module 23 may include:
  • the output unit is used to output the log file in which the RAID card and/or a managed hard disk was determined to be faulty, together with the troubleshooting plan, to a mobile terminal by email and/or short message.
  • FIG. 3 shows a schematic structural diagram of a fault diagnosis device provided in the embodiment of the present application, which may include:
  • memory 31 for storing computer programs
  • the processor 32 for executing the computer program stored in the memory 31, can realize the following steps:
  • acquiring the log files of the RAID card in the server to be monitored; determining, according to the fault diagnosis rule base and the log files of the RAID card, whether the RAID card and the hard disks it manages are faulty, and, when the RAID card and/or a managed hard disk is determined to be faulty, using the fault diagnosis rule base to give a troubleshooting plan;
  • the fault diagnosis rule base is created in advance by analyzing historical faults of RAID cards and the hard disks they manage; outputting the log file in which the RAID card and/or a managed hard disk was determined to be faulty, together with the troubleshooting plan.
  • the embodiment of the present application also provides a readable storage medium.
  • FIG. 4 shows a schematic structural diagram of a readable storage medium provided in the embodiment of the present application.
  • a computer program 602 is stored in the readable storage medium 601. When the computer program 602 is executed by the processor, the following steps can be realized:
  • acquiring the log files of the RAID card in the server to be monitored; determining, according to the fault diagnosis rule base and the log files of the RAID card, whether the RAID card and the hard disks it manages are faulty, and, when the RAID card and/or a managed hard disk is determined to be faulty, using the fault diagnosis rule base to give a troubleshooting plan;
  • the fault diagnosis rule base is created in advance by analyzing historical faults of RAID cards and the hard disks they manage; outputting the log file in which the RAID card and/or a managed hard disk was determined to be faulty, together with the troubleshooting plan.
  • the readable storage medium 601 can include any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

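The present application discloses a fault diagnosis method, apparatus, device, and readable storage medium. The method includes: acquiring the log files of a RAID card in a server to be monitored; determining, according to a fault diagnosis rule base and the log files of the RAID card, whether the RAID card and the hard disks it manages are faulty, and, when a fault is determined to exist, using the fault diagnosis rule base to give a troubleshooting plan, where the fault diagnosis rule base is created in advance by analyzing historical faults of RAID cards and the hard disks they manage; and outputting the log file in which the RAID card and/or a managed hard disk was determined to be faulty, together with the troubleshooting plan. With this technical solution, the pre-created fault diagnosis rule base is used to diagnose the RAID card and its managed hard disks in the monitored server and to give a troubleshooting plan, so that faults can be handled in time, which reduces the probability of data loss and downtime on the server.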

Description

一种故障诊断方法、装置、设备及可读存储介质
本申请要求在2021年10月26日提交中国专利局、申请号为202111244269.0、发明名称为“一种故障诊断方法、装置、设备及可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及服务器技术领域,更具体地说,涉及一种故障诊断方法、装置、设备及可读存储介质。
背景技术
目前,随着RAID(Redundant Arrays of Independent Disks,磁盘阵列)和硬盘种类的增多,RAID对硬盘的管理模式也比较多,在RAID管理模式下,服务器是不能直接感知硬盘的状态是否正常,而RAID故障或硬盘故障会导致服务器发生数据丢失、宕机等问题。
综上所述,如何对RAID和硬盘进行故障诊断,以便于及时对故障进行处理,从而降低服务器发生数据丢失和宕机的概率,是目前本领域技术人员亟待解决的技术问题。
发明内容
有鉴于此,本申请的目的是提供一种故障诊断方法、装置、设备及可读存储介质,用于对RAID和硬盘进行故障诊断,以便于及时对故障进行处理,从而降低服务器发生数据丢失和宕机的概率。
为了实现上述目的,本申请提供如下技术方案:
一种故障诊断方法,包括:
获取待监控服务器中的RAID卡的各日志文件;
根据故障诊断规则库及所述RAID卡的各日志文件确定所述RAID卡及其管理的硬盘是否存在故障,并在确定所述RAID卡和/或管理的硬盘存在故障时利用所述故障诊断规则库给出故障处理方案;所述故障诊断规则库为通过预先对RAID卡及其管理的硬盘的历史故障进行分析创建的;
输出确定所述RAID卡和/或管理的硬盘存在故障的日志文件及所述故障处理方案。
可选的,根据故障诊断规则库及所述RAID卡的各日志文件确定所述RAID卡及其管理的硬盘是否存在故障,并在确定所述RAID卡和/或管理的硬盘存在故障时利用所述故障诊断规则库给出故障处理方案,包括:
将所述RAID卡的各日志文件与所述故障诊断规则库对比,判断所述故障诊断规则库中是否存在与所述RAID卡的当前日志文件相匹配的目标文件;
若所述故障诊断规则库中存在所述目标文件,则将所述目标文件对应的各诊断关键字与所述当前日志文件中的每一行内容进行匹配;
若所述当前日志文件中存在能够与各所述诊断关键字相匹配的至少一行内容,则确定所述RAID卡和/或管理的硬盘存在故障;
判断所述故障诊断规则库中是否存在与所述目标文件对应的故障诊断方法,若所述故障诊断规则库中存在与所述目标文件对应的故障诊断方法,则利用所述故障诊断方法对所述当前日志文件进行诊断,以给出所述故障处理方案。
可选的,利用所述故障诊断方法对所述当前日志文件进行诊断,以给出所述故障处理方案,包括:
按照第一预设正则表达式从所述当前日志文件中提取第一槽位号;
从所述当前日志文件中查找出所述第一槽位号对应的最新信息;
若所述最新信息包含异常信息,则给出更换所述第一槽位号上的硬盘的故障处理方案;
若所述最新信息包含正常信息,则从所述当前日志文件中过滤所述最新信息;
若所述最新信息不包含所述异常信息和所述正常信息,则给出所述第一槽位号上的硬盘故障、建议提交工单并将对应的故障问题提升至二线的故障处理方案。
可选的,利用所述故障诊断方法对所述当前日志文件进行诊断,以给出所述故障处理方案,包括:
按照第二预设正则表达式从所述当前日志文件中提取第二槽位号,按照第三预设正则表达式从所述当前日志文件中提取所述第二槽位号对应的错误数;
若所述错误数大于1,则给出建议按照RAID卡、背板、SAS线缆顺序逐个更换的故障处理方案;
若所述错误数等于1,则给出更换所述第二槽位号上的硬盘的故障处理方案。
可选的,利用所述故障诊断方法对所述当前日志文件进行诊断,以给出所述故障处理方案,包括:
从所述故障诊断规则库中提取与所述目标文件对应的最后一个关键字,并从所述当前日志文件中提取与所述最后一个关键字对应的数值;
若所述数值等于0,则从所述当前日志文件中过滤所述数值所在行的内容;
若所述数值大于0,则从所述当前日志文件中提取序列号,并按照第四预设正则表达式循环遍历所述当前日志文件,若所述当前日志文件中存在能匹配到所述第四预设正则表达式的一行内容,则从能匹配到所述第四预设正则表达式的一行内容向前查找能匹配到第五预设正则表达式的内容,并从匹配到所述第五预设正则表达式的内容中提取第三槽位号;
若所述故障诊断规则库中的所述诊断关键字中包含hwErrors、mediumErrors、smartWarning中的任意一个,则给出更换所述第三槽位号、所述序列号对应的硬盘的故障处理方案;
若所述故障诊断规则库中的所述诊断关键字中不包含hwErrors、mediumErrors、smartWarning中的任意一个,则获取所述最后一个关键字对应的所述数值大于0的个数;
若所述个数等于1,则给出更换所述第三槽位号、所述序列号对应的硬盘的故障处理方案;
若所述个数大于1,则给出考虑链路问题,建议按照RAID卡、背板、SAS线缆顺序逐个更换的故障处理方案。
可选的,还包括:
若所述故障诊断规则库中不存在与所述RAID卡的当前日志文件相匹配的目标文件,或所述故障诊断规则库中存在与所述RAID卡的当前日志文件相匹配的目标文件且若所述当前日志文件中不存在能够与各所述诊断关键字相匹配的至少一行内容,则从所述当前日志文件中提取状态信息;
根据所述状态信息构建日志训练集,利用Relief过滤式选择算法从所述日志训练集中选择样本,从和所述样本同类的样本中寻找猜中近邻样本,从和所述样本不同类的样本中随机选择一个猜错近邻样本,若所述样本和所述猜中近邻样本在特征上的距离小于所述样本和所述猜错近邻样本在同样特征上的距离,则增加所述特征的权重,若所述样本和所述猜中近邻样本在所述特征上的距离不小于所述样本和所述猜错近邻样本在同样特征上的距离,则减小所述特征的权重,对所述特征经过预设次数训练,并获取所述特征经过所述预设次数训练后的平均权重;
将所述平均权重大于预设值的特征加入特征集,并输出所述特征集;
接收对所述特征集的分析结果,并将所述特征集的分析结果加入所述故障诊断规则库中。
可选的,输出确定所述RAID卡和/或管理的硬盘存在故障的日志文件及所述故障处理方案,包括:
将确定所述RAID卡和/或管理的硬盘存在故障的日志文件及所述故障处理方案通过邮件和/或短信输出至移动终端。
一种故障诊断装置,包括:
获取模块,用于获取待监控服务器中的RAID卡的各日志文件;
确定模块,用于根据故障诊断规则库及所述RAID卡的各日志文件确定所述RAID卡及其管理的硬盘是否存在故障,并在确定所述RAID卡和/或管理的硬盘存在故障时利用所述故障诊断规则库给出故障处理方案;所述故障诊断规则库为通过预先对RAID卡及其管理的硬盘的历史故障进行分析创建的;
输出模块,用于输出确定所述RAID卡和/或管理的硬盘存在故障的日志文件及所述故障处理方案。
一种故障诊断设备,包括:
存储器,用于存储计算机程序;
处理器,用于执行所述计算机程序时实现如上述任一项所述的故障诊断方法的步骤。
一种可读存储介质,所述可读存储介质中存储有计算机程序,所述计算机程序被处理器执行时实现如上述任一项所述的故障诊断方法的步骤。
本申请提供了一种故障诊断方法、装置、设备及可读存储介质,其中,该方法包括:获取待监控服务器中的RAID卡的各日志文件;根据故障诊断规则库及RAID卡的各日志文件确定RAID卡及其管理的硬盘是否存在故障,并在确定RAID卡和/或管理的硬盘存在故障时利用故障诊断规则库给出故障处理方案;故障诊断规则库为通过预先对RAID卡及其管理的硬盘的历史故障进行分析创建的;输出确定RAID卡和/或管理的硬盘存在故障的日志文件及故障处理方案。
本申请公开的上述技术方案,根据通过预先对RAID卡及其管理的硬盘的历史故障进行分析所创建的故障诊断规则库以及所获取到的待监控服务器中的RAID卡的各日志文件来确定待监控服务器中的RAID卡和/或RAID卡所管理的硬盘是否存在故障,并在确定存在故障时利用故障诊断规则库给出故障处理方案,以实现对RAID卡和硬盘的故障诊断,且通过输出确定RAID卡和/或管理的硬盘存在故障的日志文件及故障处理方案而便于相关人员及时获知故障并结合故障处理方案来对故障进行处理,以使得出现故障的RAID卡和/或其管理的硬盘能够及时恢复正常,从而降低服务器发生数据丢失和宕机的概率,提高服务器运行的稳定性和可靠性。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请实施例提供的一种故障诊断方法的流程图;
图2为本申请实施例提供的一种故障诊断装置的结构示意图;
图3为本申请实施例提供的一种故障诊断设备的结构示意图;
图4为本申请实施例提供的一种可读存储介质的结构示意图。
具体实施方式
本申请的核心是提供一种故障诊断方法、装置、设备及可读存储介质,用于对RAID和硬盘进行故障诊断,以便于及时对故障进行处理,从而降低服务器发生数据丢失和宕机的概率。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
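Referring to FIG. 1, which shows a flowchart of a fault diagnosis method provided in an embodiment of the present application, the method may include:
S11: Acquire the log files of the RAID card in the server to be monitored.
In the present application, a monitoring platform may remotely monitor the server to be monitored over SSH (Secure Shell) and acquire the log files of the RAID card in that server. The log files mentioned here include, but are not limited to, storcliAdpalilog.txt (a hard disk status information file in txt format), storcliPDList.txt (a logical disk list information file in txt format), Controller_1_Config.txt (a controller configuration information file in txt format), and Controller_1_Device_log.txt (controller device log information in txt format).
To make log acquisition both authorized and efficient, the information of each server that needs to be monitored (that is, the server to be monitored) may be registered with the monitoring platform in advance; this server information may specifically include the server's IP address, user name, and password. After the server is registered successfully, the monitoring platform may log in to it using the registration information, remotely copy a tool for capturing RAID card log files (for example, the storcli64 tool) to a specified directory on the server, and grant the tool execute permission. The tool then collects the log files of the RAID card on the server and sends them back to the monitoring platform, which thereby acquires the log files of the RAID card in the server to be monitored.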
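S12: Determine, according to the fault diagnosis rule base and the log files of the RAID card, whether the RAID card and the hard disks it manages are faulty, and, when the RAID card and/or a managed hard disk is determined to be faulty, use the fault diagnosis rule base to give a troubleshooting plan. The fault diagnosis rule base is created in advance by analyzing historical faults of RAID cards and the hard disks they manage.
After the log files of the RAID card in the server to be monitored are acquired, the pre-created fault diagnosis rule base and the log files can be used to determine whether the RAID card and the hard disks it manages are faulty. When the rule base indicates that the RAID card and/or a managed hard disk is faulty, the rule base can be used to give a troubleshooting plan.
The fault diagnosis rule base is created by collecting, in advance, the historical faults of the RAID cards and managed hard disks in the servers monitored by the monitoring platform, and then analyzing those faults and extracting rules from them.
As the above process shows, the present application can monitor the server remotely, acquire the log files of its RAID card, and use the pre-created fault diagnosis rule base together with those log files to diagnose the RAID card and the hard disks it manages, so that faults are discovered in time and a troubleshooting plan is given.
S13: Output the log file in which the RAID card and/or a managed hard disk was determined to be faulty, together with the troubleshooting plan.
Building on step S12, the log file in which the RAID card and/or a managed hard disk was determined to be faulty, that is, the log file containing the fault information, can be output along with the troubleshooting plan. The relevant personnel can then learn about the fault from the output log file and handle it according to the troubleshooting plan, so that the faulty component returns to normal in time, that is, the RAID card and the hard disks it manages run normally in the monitored server. This reduces the probability of data loss and downtime and improves the stability and reliability of the server. In addition, the present application is fairly general: it can diagnose RAID cards from different vendors and hard disks under different RAID management modes, which effectively lowers the risk of data loss and the probability of downtime caused by hard disk faults.
In the technical solution disclosed above, whether the RAID card in the monitored server and/or the hard disks it manages are faulty is determined from the fault diagnosis rule base, which is created in advance by analyzing historical faults of RAID cards and their managed hard disks, and from the acquired log files of the RAID card. When a fault is determined to exist, the rule base is used to give a troubleshooting plan, which achieves fault diagnosis for the RAID card and hard disks. Outputting the faulty log file and the troubleshooting plan lets the relevant personnel learn of the fault in time and handle it according to the plan, so that the faulty RAID card and/or managed hard disks return to normal in time. This reduces the probability of data loss and downtime and improves the stability and reliability of the server.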
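In the fault diagnosis method provided in an embodiment of the present application, determining, according to the fault diagnosis rule base and the log files of the RAID card, whether the RAID card and the hard disks it manages are faulty, and using the fault diagnosis rule base to give a troubleshooting plan when the RAID card and/or a managed hard disk is determined to be faulty, may include:
comparing the log files of the RAID card with the fault diagnosis rule base, and judging whether the rule base contains a target file that matches the current log file of the RAID card;
if the target file exists, matching each diagnostic keyword corresponding to the target file against every line of the current log file;
if the current log file contains at least one line that matches the diagnostic keywords, determining that the RAID card and/or a managed hard disk is faulty;
judging whether the rule base contains a fault diagnosis method corresponding to the target file and, if it does, using that fault diagnosis method to diagnose the current log file so as to give the troubleshooting plan.
In the present application, the pre-created fault diagnosis rule base contains individual rule entries, each in the following format:
File (specifically, a file name): the monitoring platform uses the file name to select the log file that needs to be diagnosed;
Diagnostic keywords: the error content reported in the log file after a RAID card or hard disk fault occurs;
Fault diagnosis method: used to give the troubleshooting plan; some rule entries have this item and some do not.
When performing this determination, the monitoring platform compares and matches the log files of the RAID card against the rule entries in the fault diagnosis rule base. Specifically, each log file of the RAID card is taken in turn as the current log file, and the platform judges whether the rule base contains a target file that matches it: based on the rule entries, it checks whether the rule base contains a file whose file name matches the current log file, and if so, that file is regarded as the target file. For example, the storcliAdpalilog.txt file corresponds to the file named "storcliAdpalilog" in the rule base, so the file named "storcliAdpalilog" is determined to be the target file matching the current log file. If the rule base contains no file whose file name matches the current log file, it is determined that no matching target file exists and the current log file is regarded as containing no fault information. The next log file of the RAID card is then taken as the current log file, and the process returns to the step of judging whether the rule base contains a matching target file, until all acquired log files have been processed.
When the rule base is determined to contain a target file matching the current log file, the monitoring platform matches every diagnostic keyword corresponding to the target file against each line of the current log file. Before matching, the diagnostic keywords corresponding to the target file are split on "," (that is, adjacent keywords are separated by a comma); the split keywords are then matched against the current log file line by line in a loop, to determine whether any line of the current log file matches the diagnostic keywords corresponding to the target file.
If the current log file contains at least one line that matches the diagnostic keywords corresponding to the target file, each such line is recorded as a faultLine (fault line), and the RAID card and/or a managed hard disk is determined to be faulty; in other words, the current log file is regarded as containing fault information. If no line of the current log file matches the diagnostic keywords, the current log file is determined to contain no fault information; the next log file of the RAID card is then taken as the current log file and the process returns to the step of judging whether the rule base contains a matching target file, until all acquired log files have been processed.
After the RAID card and/or a managed hard disk is determined to be faulty, the platform judges whether the rule base contains a fault diagnosis method corresponding to the target file. If it does, that method is used to diagnose the current log file and give the troubleshooting plan; if it does not, a prompt such as "no troubleshooting plan available yet" can be given.
The above process makes it possible to judge accurately whether the RAID card and the hard disks it manages are faulty and to give a corresponding troubleshooting plan, so that the relevant personnel can handle the fault accurately according to that plan.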
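The rule lookup and keyword matching described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Rule` class, the function names, the status strings, and the sample log line are all hypothetical, and the sketch assumes that a line counts as a fault line only when it contains every comma-separated keyword of the rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    file: str                     # file name used to select the log file
    keywords: str                 # comma-separated diagnostic keywords
    method: Optional[str] = None  # diagnosis method name; may be absent


def find_rule(rules, log_name):
    """Return the rule whose file name matches the log file name, if any."""
    stem = log_name.rsplit(".", 1)[0]  # "storcliAdpalilog.txt" -> "storcliAdpalilog"
    for rule in rules:
        if rule.file == stem:
            return rule
    return None


def fault_lines(rule, log_text):
    """Return the lines containing every diagnostic keyword (assumed semantics)."""
    keys = [k.strip() for k in rule.keywords.split(",") if k.strip()]
    return [ln for ln in log_text.splitlines() if all(k in ln for k in keys)]


def diagnose(rules, log_name, log_text):
    rule = find_rule(rules, log_name)
    if rule is None:
        return "no matching rule", []
    lines = fault_lines(rule, log_text)
    if not lines:
        return "no fault found", []
    if rule.method is None:
        return "fault found, no troubleshooting plan available yet", lines
    return f"fault found, run diagnosis method '{rule.method}'", lines
```

A caller would loop over the acquired log files, call `diagnose` on each, and pass the returned fault lines to the named diagnosis method.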
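In the fault diagnosis method provided in an embodiment of the present application, using the fault diagnosis method to diagnose the current log file so as to give the troubleshooting plan may include:
extracting a first slot number from the current log file according to a first preset regular expression;
finding, in the current log file, the latest information corresponding to the first slot number;
if the latest information contains abnormal information, giving a troubleshooting plan of replacing the hard disk at the first slot number;
if the latest information contains normal information, filtering the latest information out of the current log file;
if the latest information contains neither the abnormal information nor the normal information, giving a troubleshooting plan stating that the hard disk at the first slot number is faulty and recommending that a work order be submitted and the fault escalated to second-line support.
In the present application, when the fault diagnosis method is used to diagnose the current log file, it can first be checked whether the current log file satisfies the first preset regular expression; if it does, the first slot number is extracted according to that expression. For example, the regular expression ^.*PD.*\\(.*\\/s[0-9]{1,}\\).*$ (which can serve as the first preset regular expression) can be used for the check and for extracting the first slot number, recorded as slot1. The preceding step can be regarded as hard disk slot parsing, and the following steps as fault information filtering:
After the first slot number slot1 is extracted, the current log file is searched for the latest entry for slot1 that satisfies the first preset regular expression, that is, the latest information corresponding to the first slot number is found;
if the latest information contains abnormal information such as "to FAILED" or "to UNCONFIGURED_BAD", a troubleshooting plan of replacing the hard disk at the first slot number is given;
if the latest information contains normal information such as "to ONLINE", the latest information is filtered out of the current log file, that is, this piece of fault information is discarded;
if the latest information contains neither abnormal nor normal information, a troubleshooting plan is given stating that the hard disk at the first slot number is faulty and recommending that a work order be submitted and the fault escalated to second-line support, so that the fault is handed to second-line operations staff through the work order and they can deal with the hard disk at the first slot number.
The fault diagnosis method above can be regarded as hard disk slot parsing plus fault information filtering. It not only identifies where the fault occurred but also gives an accurate troubleshooting plan, so that the relevant personnel can carry out maintenance according to the output plan.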
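The slot parsing and fault filtering steps above can be sketched as follows. The regular expression is a Python rendering of the first preset regular expression with a capture group added for the slot number; the function name, the plan strings, and the sample log lines used to exercise it are hypothetical.

```python
import re

# Python form of the first preset regex; group 1 (an addition) captures the slot.
SLOT_RE = re.compile(r"^.*PD.*\(.*/s([0-9]+)\).*$")
ABNORMAL = ("to FAILED", "to UNCONFIGURED_BAD")
NORMAL = ("to ONLINE",)


def diagnose_slots(log_lines):
    """Map each slot number to a troubleshooting plan based on its latest entry."""
    latest = {}
    for line in log_lines:  # later lines overwrite earlier ones
        m = SLOT_RE.match(line)
        if m:
            latest[m.group(1)] = line

    plans = {}
    for slot, line in latest.items():
        if any(tok in line for tok in ABNORMAL):
            plans[slot] = "replace the hard disk in this slot"
        elif any(tok in line for tok in NORMAL):
            continue  # the disk is back online: filter this entry out
        else:
            plans[slot] = "hard disk fault: submit a work order and escalate to second-line support"
    return plans
```

Only the most recent entry per slot decides the outcome, so a disk that failed and later returned to ONLINE produces no plan.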
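In the fault diagnosis method provided in an embodiment of the present application, using the fault diagnosis method to diagnose the current log file so as to give the troubleshooting plan may include:
extracting a second slot number from the current log file according to a second preset regular expression, and extracting the error count corresponding to the second slot number from the current log file according to a third preset regular expression;
if the error count is greater than 1, giving a troubleshooting plan that recommends replacing the RAID card, the backplane, and the SAS cable one by one, in that order;
if the error count is equal to 1, giving a troubleshooting plan of replacing the hard disk at the second slot number.
In the present application, besides the approach described above, the diagnosis of the current log file can also be implemented as follows:
The current log file is traversed line by line; the second slot number is extracted according to the second preset regular expression, and the error count corresponding to the second slot number is extracted according to the third preset regular expression. The error count corresponding to the second slot number represents the number of faults of the hard disk at that slot. The second preset regular expression may specifically be Slot Number:[0-9]{1,}, and the third preset regular expression may specifically be Other Error Count:[0-9]{1,}; the extracted second slot number can be denoted slot2 and its error count M.
If the error count M for a second slot number slot2 is greater than 1, it is unlikely that several hard disks have failed at the same time; the most likely cause is a fault in the RAID card that manages the hard disks, in the link, or in the hard disk backplane. A troubleshooting plan recommending replacement of the RAID card, the backplane, and the SAS cable one by one, in that order, can then be given;
if the error count M for a second slot number slot2 is equal to 1, the hard disk itself is quite likely to be faulty, and a troubleshooting plan of replacing the hard disk at the second slot number can be given.
The fault diagnosis method above can be regarded as link problem detection.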
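The link problem detection above can be sketched as follows. The two patterns follow the second and third preset regular expressions, with optional whitespace after the colon and a capture group added as assumptions; the function name and plan strings are hypothetical, and the sketch assumes each "Other Error Count" line belongs to the most recent "Slot Number" line before it.

```python
import re

# Second and third preset regexes; "\s*" and the capture groups are additions.
SLOT_RE = re.compile(r"Slot Number:\s*([0-9]+)")
ERR_RE = re.compile(r"Other Error Count:\s*([0-9]+)")


def check_link(log_lines):
    """Map slot numbers with a non-zero error count to a troubleshooting plan."""
    plans = {}
    slot = None
    for line in log_lines:
        m = SLOT_RE.search(line)
        if m:
            slot = m.group(1)  # remember the slot the next count belongs to
            continue
        m = ERR_RE.search(line)
        if m and slot is not None:
            count = int(m.group(1))
            if count > 1:
                plans[slot] = "replace the RAID card, backplane, and SAS cable one by one, in that order"
            elif count == 1:
                plans[slot] = "replace the hard disk in this slot"
    return plans
```

Slots whose error count is 0 are simply absent from the result.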
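In the fault diagnosis method provided in an embodiment of the present application, using the fault diagnosis method to diagnose the current log file so as to give the troubleshooting plan may include:
extracting, from the fault diagnosis rule base, the last keyword corresponding to the target file, and extracting from the current log file the value corresponding to that last keyword;
if the value is equal to 0, filtering the line containing the value out of the current log file;
if the value is greater than 0, extracting the serial number from the current log file and traversing the current log file in a loop according to a fourth preset regular expression; if the current log file contains a line that matches the fourth preset regular expression, searching the lines before that line for content that matches a fifth preset regular expression, and extracting a third slot number from the content that matches the fifth preset regular expression;
if the diagnostic keywords in the rule base contain any one of hwErrors, mediumErrors, and smartWarning, giving a troubleshooting plan of replacing the hard disk corresponding to the third slot number and the serial number;
if the diagnostic keywords in the rule base contain none of hwErrors, mediumErrors, and smartWarning, obtaining the number of values of the last keyword that are greater than 0;
if that number is equal to 1, giving a troubleshooting plan of replacing the hard disk corresponding to the third slot number and the serial number;
if that number is greater than 1, giving a troubleshooting plan that treats the fault as a link problem and recommends replacing the RAID card, the backplane, and the SAS cable one by one, in that order.
In the present application, besides the two approaches described above, the diagnosis of the current log file can also be implemented as follows:
The diagnostic keywords corresponding to the target file in the rule base are split on ",", and the last keyword after splitting is taken and recorded as key_last. The value corresponding to key_last is then extracted from the current log file and recorded as val.
If val is equal to 0, the line containing the value is filtered out of the current log file, that is, the fault information for which val equals 0 is discarded;
if val is greater than 0, the "serialNumber=" serial number is extracted from the current log file, specifically from the line containing the value, and recorded as SN. The current log file is then traversed in a loop according to ^\\s*Serial number\\s*:\\s*"+SN+".*$ (the fourth preset regular expression). If the current log file contains a line that matches the fourth preset regular expression, the lines before that line are searched for content that matches ^\\s*Reported Location.*:\\s*Enclosure.*Slot.*$ (the fifth preset regular expression), and the third slot number is extracted from the matching content and recorded as slot3.
If the diagnostic keywords corresponding to the target file in the rule base contain any one of hwErrors (hardware error count), mediumErrors (medium error count), and smartWarning (SMART warning count), a troubleshooting plan of replacing the hard disk corresponding to the third slot number and the serial number is given, that is, replace the hard disk at the third slot number whose serial number is SN.
If those diagnostic keywords contain none of hwErrors, mediumErrors, and smartWarning, the number of val values greater than 0 for the last keyword key_last is obtained and recorded as N; this number represents the number of faulty hard disks.
If N is equal to 1, a troubleshooting plan of replacing the hard disk corresponding to the third slot number and the serial number is given;
if N is greater than 1, it is unlikely that several hard disks have failed at the same time, so a troubleshooting plan is given that treats the fault as a link problem and recommends replacing the RAID card, the backplane, and the SAS cable one by one, in that order.
The fault diagnosis method above can be regarded as PMC card fault diagnosis.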
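The PMC card diagnosis above can be sketched as follows. This is an illustration under several assumptions: the values are assumed to appear as `key=value` pairs on the same line as `serialNumber=`, the fifth preset regular expression is given an extra capture group for the slot digits, "searching before" is taken to mean scanning the preceding lines, and all function names, plan strings, and sample lines are hypothetical.

```python
import re

HARD_KEYS = {"hwErrors", "mediumErrors", "smartWarning"}
# Fifth preset regex; the "(\d+)" capture for the slot is an addition.
LOCATION_RE = re.compile(r"^\s*Reported Location.*:\s*Enclosure.*Slot\s*(\d+).*$")


def find_slot(lines, sn):
    """Find the 'Serial number' line for sn, then look back for its slot."""
    serial_re = re.compile(r"^\s*Serial number\s*:\s*" + re.escape(sn) + r".*$")
    for i, line in enumerate(lines):
        if serial_re.match(line):
            for prev in reversed(lines[:i]):  # nearest preceding location line
                m = LOCATION_RE.match(prev)
                if m:
                    return m.group(1)
    return None


def diagnose_pmc(keywords, lines):
    keys = [k.strip() for k in keywords.split(",")]
    key_last = keys[-1]
    value_re = re.compile(re.escape(key_last) + r"=(\d+)")
    faulty = []  # (slot, serial number) for every line with val > 0
    for line in lines:
        m = value_re.search(line)
        if not m or int(m.group(1)) == 0:
            continue  # val == 0: this line is filtered out
        sn = re.search(r"serialNumber=(\S+)", line)
        if sn:
            faulty.append((find_slot(lines, sn.group(1)), sn.group(1)))
    if not faulty:
        return []
    if HARD_KEYS & set(keys) or len(faulty) == 1:
        return [f"replace hard disk in slot {s}, serial number {n}" for s, n in faulty]
    return ["suspect the link: replace the RAID card, backplane, and SAS cable one by one, in that order"]
```

When the keywords name a disk-level counter, every affected disk gets its own replacement plan; otherwise several affected disks are treated as one link problem.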
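The fault diagnosis method provided in an embodiment of the present application may further include:
if the rule base contains no target file matching the current log file of the RAID card, or the rule base contains such a target file but the current log file has no line that matches the diagnostic keywords, extracting state information from the current log file;
constructing a log training set from the state information; using the Relief filter-type selection algorithm to select a sample from the log training set, finding a near-hit sample among the samples of the same class as the sample, and randomly selecting a near-miss sample from the samples of a different class; if the distance between the sample and the near-hit sample on a feature is smaller than the distance between the sample and the near-miss sample on the same feature, increasing the weight of the feature; if the distance between the sample and the near-hit sample on the feature is not smaller than the distance between the sample and the near-miss sample on the same feature, decreasing the weight of the feature; training the feature a preset number of times and obtaining its average weight after that number of training rounds;
adding the features whose average weight is greater than a preset value to a feature set, and outputting the feature set;
receiving an analysis result for the feature set, and adding the analysis result to the fault diagnosis rule base.
In the present application, log files judged to contain no fault information may nevertheless contain fault information that the rule base does not yet cover. Such log files can therefore be used to keep updating the fault diagnosis rule base, which improves the accuracy of diagnosis based on it.
Specifically, if the rule base contains no target file matching the current log file of the RAID card, or it contains such a target file but the current log file has no line that matches the diagnostic keywords, the log file is a log file without fault information (a fault-free log file for short), and state information can be extracted from it. An entry in a fault-free log file consists of a fixed part and a variable part. For example, in 169:21-01-11,21:23:59Info:VD 02/2 is now OPTIMAL, "Info:VD" is the fixed part and the rest is variable. The information before "Info:VD" is the log ID, the time, and so on, which has no value for log fault diagnosis; the information after "Info:VD" is the state information, which indicates the health state of the RAID card or hard disk. During log fault diagnosis, Event1:[0-9]{1,}:[0-9]{2}-[0-9]{2}-[0-9]{2},*Info:VD.* can therefore be defined to extract the entries that conform to Event1, and each extracted entry is split on "Info:VD" to obtain the state information, for example "is now OPTIMAL". The basic model event in this step extracts all model features of the RAID card, including both health features and fault features. These model features then need to be screened so that the fault features can be extracted and used to improve the fault diagnosis rule base. The Relief filter-type selection algorithm is used for this feature selection.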
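The state information extraction above can be sketched as follows. Two points are assumptions rather than part of the source: the pattern uses `,.*` before `Info:VD`, because the literal `,*` form would not match the sample entry (the time field sits between the comma and `Info:VD`), and a leading volume identifier such as `02/2` is stripped so that the result equals the "is now OPTIMAL" example. The function name is hypothetical.

```python
import re

# Event1 pattern, adapted: ",.*" instead of ",*" so the time field is skipped.
EVENT1 = re.compile(r"[0-9]+:[0-9]{2}-[0-9]{2}-[0-9]{2},.*Info:VD.*")


def extract_state(line):
    """Return the state information of an Event1 entry, or None if no match."""
    if not EVENT1.search(line):
        return None
    tail = line.split("Info:VD", 1)[1].strip()
    # Drop a leading volume identifier such as "02/2" (assumed format).
    return re.sub(r"^[0-9A-Fa-f]+/[0-9]+\s+", "", tail)
```

Applying `extract_state` to every line of a fault-free log file and keeping the non-`None` results yields the raw material for the log training set.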
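On this basis, the state information extracted from the fault-free log files is used as the log training set, denoted RD. The Relief filter-type selection algorithm randomly selects a sample R from RD, finds the nearest-neighbor sample NH among the samples of the same class as R (the near-hit), and randomly selects a sample NM from the samples of a different class from R (the near-miss). Feature extraction training then follows this rule: if the distance between R and NH on a feature is smaller than the distance between R and NM on the same feature, the feature helps to distinguish same-class from different-class nearest neighbors, so its weight is increased; if the distance between R and NH on the feature is not smaller than the distance between R and NM on the same feature, the feature works against that distinction, so its weight is decreased. This training is repeated a preset number of times (denoted m), which finally yields the average weight of the feature over those m rounds. The larger the average weight, the stronger the feature's classification ability; the smaller the weight, the weaker.
Features with strong classification ability are added to the feature set, that is, features whose average weight is greater than a preset value; the preset value can be set as needed, and an average weight above it indicates strong classification ability. The specific algorithm is as follows: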
Figure PCTCN2022083577-appb-000001
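Here, R_n denotes the sample randomly selected from the log training set RD in the n-th round, NH_n is the near-hit sample selected in the n-th round, NM_n is the near-miss sample selected in the n-th round, and δ is the resulting average weight of the feature. Because the values involved here are discrete, diff(X, Y) is computed as follows: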
Figure PCTCN2022083577-appb-000002
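The two formula images are not reproduced in this text. For orientation only, the textbook Relief statistic for a feature j, averaged over m rounds and written with the symbols defined above, has the form below, together with the usual difference function for discrete values. This is an assumption about what the figures show, not a transcription of them.

```latex
\delta^{j} = \frac{1}{m}\sum_{n=1}^{m}
  \Big( -\operatorname{diff}\big(R_n^{j},\, NH_n^{j}\big)^{2}
        + \operatorname{diff}\big(R_n^{j},\, NM_n^{j}\big)^{2} \Big),
\qquad
\operatorname{diff}(X, Y) =
\begin{cases}
  0, & X = Y,\\
  1, & X \neq Y.
\end{cases}
```

Here the superscript j selects the value of feature j in the corresponding sample.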
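After the features whose average weight is greater than the preset value are added to the feature set, the feature set can be output so that the relevant personnel can analyze it manually and produce an analysis result, which can use the same format as the rule entries in the fault diagnosis rule base. The monitoring platform then receives the analysis result and adds it to the rule base, updating and improving the rule base so that RAID card log files can be diagnosed against the updated rule base, which improves diagnostic accuracy.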
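The Relief-based feature selection described above can be sketched as follows for discrete features. This is a minimal sketch, not the applicant's implementation: it uses the textbook update (subtract the squared near-hit difference, add the squared near-miss difference), so a tie leaves the weight unchanged rather than lowering it; the near-hit is the closest same-class sample by Hamming distance, the near-miss is drawn at random as in the text, and all names and the sample data are hypothetical.

```python
import random


def diff(x, y):
    """Difference of two discrete feature values: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1


def relief_weights(samples, labels, m, rng):
    """Average Relief weight of each feature over m training rounds."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    for _ in range(m):
        i = rng.randrange(len(samples))
        r, cls = samples[i], labels[i]
        same = [s for k, s in enumerate(samples) if k != i and labels[k] == cls]
        other = [s for k, s in enumerate(samples) if labels[k] != cls]
        if not same or not other:
            continue  # a round needs both a near-hit and a near-miss
        near_hit = min(same, key=lambda s: sum(diff(a, b) for a, b in zip(r, s)))
        near_miss = rng.choice(other)
        for j in range(n_features):
            weights[j] += -diff(r[j], near_hit[j]) ** 2 + diff(r[j], near_miss[j]) ** 2
    return [w / m for w in weights]


def select_features(weights, threshold):
    """Indices of the features whose average weight exceeds the preset value."""
    return [j for j, w in enumerate(weights) if w > threshold]
```

For example, with samples whose first feature separates the classes and whose second feature is constant, only the first feature is selected:

```python
samples = [("bad", "x"), ("bad", "x"), ("ok", "x"), ("ok", "x")]
labels = ["fault", "fault", "healthy", "healthy"]
weights = relief_weights(samples, labels, m=10, rng=random.Random(0))
selected = select_features(weights, threshold=0.5)
```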
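In the fault diagnosis method provided in an embodiment of the present application, outputting the log file in which the RAID card and/or a managed hard disk was determined to be faulty, together with the troubleshooting plan, may include:
outputting the log file in which the RAID card and/or a managed hard disk was determined to be faulty, together with the troubleshooting plan, to a mobile terminal by email and/or short message.
In the present application, the log file in which the RAID card and/or a managed hard disk was determined to be faulty and the troubleshooting plan are sent to a mobile terminal by email and/or short message, so that the relevant personnel receive the information promptly and can handle the fault in time.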
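The email output above can be sketched with Python's standard `email` package. The function name, the addresses, and the subject line are hypothetical, and the sketch only builds the message: the troubleshooting plan becomes the body and the faulty log file becomes an attachment.

```python
from email.message import EmailMessage


def build_alert(sender, recipient, log_name, log_text, plan):
    """Build an alert email carrying the plan as body and the log as attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"RAID fault detected in {log_name}"
    msg.set_content(plan)
    msg.add_attachment(log_text.encode("utf-8"), maintype="text",
                       subtype="plain", filename=log_name)
    return msg
```

Delivery could then go through `smtplib.SMTP(host).send_message(msg)`, where the mail host is site-specific; short-message delivery depends on the gateway in use and is not sketched here.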
本申请实施例还提供了一种故障诊断装置,参见图2,其示出了本申请实施例提供的一种故障诊断装置的结构示意图,可以包括:
获取模块21,用于获取待监控服务器中的RAID卡的各日志文件;
确定模块22,用于根据故障诊断规则库及RAID卡的各日志文件确定RAID卡及其管理的硬盘是否存在故障,并在确定RAID卡和/或管理的硬盘存在故障时利用故障诊断规则库给出故障处理方案;故障诊断规则库为通过预先对RAID卡及其管理的硬盘的历史故障进行分析创建的;
输出模块23,用于输出确定RAID卡和/或管理的硬盘存在故障的日志文件及故障处理方案。
本申请实施例提供的一种故障诊断装置,确定模块22可以包括:
对比单元,用于将RAID卡的各日志文件与故障诊断规则库对比,判断故障诊断规则库中是否存在与RAID卡的当前日志文件相匹配的目标文件;
匹配单元,用于若存在目标文件,则将目标文件对应的各诊断关键字与当前日志文件中的每一行内容进行匹配;
确定单元,用于若当前日志文件中存在能够与各诊断关键字相匹配的至少一行内容,则确定RAID卡和/或管理的硬盘存在故障;
诊断单元,用于判断故障诊断规则库中是否存在与目标文件对应的故障诊断方法,若存在与目标文件对应的故障诊断方法,则利用故障诊断方法对当前日志文件进行诊断,以给出故障处理方案。
本申请实施例提供的一种故障诊断装置,诊断单元可以包括:
第一提取子单元,用于按照第一预设正则表达式从当前日志文件中提取第一槽位号;
查找子单元,用于从当前日志文件中查找出第一槽位号对应的最新信息;
第一给出子单元,用于若最新信息包含异常信息,则给出更换第一槽位号上的硬盘的故障处理方案;
第一过滤子单元,用于若最新信息包含正常信息,则从当前日志文件中过滤最新信息;
第二给出子单元,用于若最新信息不包含异常信息和正常信息,则给出第一槽位号上的硬盘故障、建议提交工单并将对应的故障问题提升至二线的故障处理方案。
本申请实施例提供的一种故障诊断装置,诊断单元可以包括:
第二提取子单元,用于按照第二预设正则表达式从当前日志文件中提取第二槽位号,按照第三预设正则表达式从当前日志文件中提取第二槽位号对应的错误数;
第三给出子单元,用于若错误数大于1,则给出建议按照RAID卡、背板、SAS线缆顺序逐个更换的故障处理方案;
第四给出子单元,用于若错误数等于1,则给出更换第二槽位号上的硬盘的故障处理方案。
本申请实施例提供的一种故障诊断装置,诊断单元可以包括:
第三提取子单元,用于从故障诊断规则库中提取与目标文件对应的最后一个关键字,并从当前日志文件中提取与最后一个关键字对应的数值;
第二过滤子单元,用于若数值等于0,则从当前日志文件中过滤数值所在行的内容;
第四提取子单元,用于若数值大于0,则从当前日志文件中提取序列号,并按照第四预设正则表达式循环遍历当前日志文件,若当前日志文件中存在能匹配到第四预设正则表达式的一行内容,则从能匹配到第四预设正则表达式的一行内容向前查找能匹配到第五预设正则表达式的内容,并从匹配到第五预设正则表达式的内容中提取第三槽位号;
第五给出子单元,用于若故障诊断规则库中的诊断关键字中包含hwErrors、mediumErrors、smartWarning中的任意一个,则给出更换第三槽位号、序列号对应的硬盘的故障处理方案;
获取子单元,用于若故障诊断规则库中的诊断关键字中不包含hwErrors、mediumErrors、smartWarning中的任意一个,则获取最后一个关键字对应的数值大于0的个数;
第六给出子单元,用于若个数等于1,则给出更换第三槽位号、序列号对应的硬盘的故障处理方案;
第七给出子单元,用于若个数大于1,则给出考虑链路问题,建议按照RAID卡、背板、SAS线缆顺序逐个更换的故障处理方案。
本申请实施例提供的一种故障诊断装置,还可以包括:
提取模块,用于若故障诊断规则库中不存在与RAID卡的当前日志文件相匹配的目标文件,或故障诊断规则库中存在与RAID卡的当前日志文件相匹配的目标文件且若当前日志文件中不存在能够与各诊断关键字相匹配的至少一行内容,则从当前日志文件中提取状态信息;
训练模块,用于根据状态信息构建日志训练集,利用Relief过滤式选择算法从日志训练集中选择样本,从和样本同类的样本中寻找猜中近邻样本,从和样本不同类的样本中随机选择一个猜错近邻样本,若样本和猜中近邻样本在特征上的距离小于样本和猜错近邻样本在同样特征上的距离,则增加特征的权重,若样本和猜中近邻样本在特征上的距离不小于样本和猜错近邻样本在同样特征上的距离,则减小特征的权重,对特征经过预设次数训练,并获取特征经过预设次数训练后的平均权重;
第一加入模块,用于将平均权重大于预设值的特征加入特征集,并输出特征集;
第二加入模块,用于接收对特征集的分析结果,并将特征集的分析结果加入故障诊断规则库中。
本申请实施例提供的一种故障诊断装置,输出模块23可以包括:
输出单元,用于将确定RAID卡和/或管理的硬盘存在故障的日志文件及故障处理方案通过邮件和/或短信输出至移动终端。
本申请实施例还提供了一种故障诊断设备,参见图3,其示出了本申请实施例提供的一种故障诊断设备的结构示意图,可以包括:
存储器31,用于存储计算机程序;
处理器32,用于执行存储器31存储的计算机程序时可实现如下步骤:
获取待监控服务器中的RAID卡的各日志文件;根据故障诊断规则库及RAID卡的各日志文件确定RAID卡及其管理的硬盘是否存在故障,并在确定RAID卡和/或管理的硬盘存在故障时利用故障诊断规则库给出故障处理方案;故障诊断规则库为通过预先对RAID卡及其管理的硬盘的历史故障进行分析创建的;输出确定RAID卡和/或管理的硬盘存在故障的日志文件及故障处理方案。
本申请实施例还提供了一种可读存储介质,参见图4,其示出了本申请实施例提供的一种可读存储介质的结构示意图,可读存储介质601中存储有计算机程序602,计算机程序602被处理器执行时可实现如下步骤:
获取待监控服务器中的RAID卡的各日志文件;根据故障诊断规则库及RAID卡的各日志文件确定RAID卡及其管理的硬盘是否存在故障,并在确定RAID卡和/或管理的硬盘存在故障时利用故障诊断规则库给出故障处理方案;故障诊断规则库为通过预先对RAID卡及其管理的硬盘的历史故障进行分析创建的;输出确定RAID卡和/或管理的硬盘存在故障的日志文件及故障处理方案。
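The readable storage medium 601 may include any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
For descriptions of the relevant parts of the fault diagnosis apparatus, device, and computer-readable storage medium provided in the present application, refer to the detailed description of the corresponding parts of the fault diagnosis method provided in the embodiments of the present application; they are not repeated here.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it. In addition, the parts of the above technical solutions whose implementation principles are the same as those of corresponding prior-art solutions are not described in detail, to avoid redundancy.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present application. The present application is therefore not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.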

Claims (12)

  1. 一种故障诊断方法,其特征在于,包括:
    获取待监控服务器中的RAID卡的各日志文件;
    根据故障诊断规则库及所述RAID卡的各日志文件确定所述RAID卡及其管理的硬盘是否存在故障,并在确定所述RAID卡和/或管理的硬盘存在故障时利用所述故障诊断规则库给出故障处理方案;所述故障诊断规则库为通过预先对RAID卡及其管理的硬盘的历史故障进行分析创建的;
    输出确定所述RAID卡和/或管理的硬盘存在故障的日志文件及所述故障处理方案。
  2. 根据权利要求1所述的故障诊断方法,其特征在于,根据故障诊断规则库及所述RAID卡的各日志文件确定所述RAID卡及其管理的硬盘是否存在故障,并在确定所述RAID卡和/或管理的硬盘存在故障时利用所述故障诊断规则库给出故障处理方案,包括:
    将所述RAID卡的各日志文件与所述故障诊断规则库对比,判断所述故障诊断规则库中是否存在与所述RAID卡的当前日志文件相匹配的目标文件;
    若所述故障诊断规则库中存在所述目标文件,则将所述目标文件对应的各诊断关键字与所述当前日志文件中的每一行内容进行匹配;
    若所述当前日志文件中存在能够与各所述诊断关键字相匹配的至少一行内容,则确定所述RAID卡和/或管理的硬盘存在故障;
    判断所述故障诊断规则库中是否存在与所述目标文件对应的故障诊断方法,若所述故障诊断规则库中存在与所述目标文件对应的故障诊断方法,则利用所述故障诊断方法对所述当前日志文件进行诊断,以给出所述故障处理方案。
  3. 根据权利要求2所述的故障诊断方法,其特征在于,利用所述故障诊断方法对所述当前日志文件进行诊断,以给出所述故障处理方案,包括:
    按照第一预设正则表达式从所述当前日志文件中提取第一槽位号;
    从所述当前日志文件中查找出所述第一槽位号对应的最新信息;
    若所述最新信息包含异常信息,则给出更换所述第一槽位号上的硬盘的故障处理方案;
    若所述最新信息包含正常信息,则从所述当前日志文件中过滤所述最新信息;
    若所述最新信息不包含所述异常信息和所述正常信息,则给出所述第一槽位号上的硬盘故障、建议提交工单并将对应的故障问题提升至二线的故障处理方案。
  4. 根据权利要求2所述的故障诊断方法,其特征在于,利用所述故障诊断方法对所述当前日志文件进行诊断,以给出所述故障处理方案,包括:
    按照第二预设正则表达式从所述当前日志文件中提取第二槽位号,按照第三预设正则表达式从所述当前日志文件中提取所述第二槽位号对应的错误数;
    若所述错误数大于1,则给出建议按照RAID卡、背板、SAS线缆顺序逐个更换的故障处理方案;
    若所述错误数等于1,则给出更换所述第二槽位号上的硬盘的故障处理方案。
  5. 根据权利要求2所述的故障诊断方法,其特征在于,利用所述故障诊断方法对所述当前日志文件进行诊断,以给出所述故障处理方案,包括:
    从所述故障诊断规则库中提取与所述目标文件对应的最后一个关键字,并从所述当前日志文件中提取与所述最后一个关键字对应的数值;
    若所述数值等于0,则从所述当前日志文件中过滤所述数值所在行的内容;
    若所述数值大于0,则从所述当前日志文件中提取序列号,并按照第四预设正则表达式循环遍历所述当前日志文件,若所述当前日志文件中存在能匹配到所述第四预设正则表达式的一行内容,则从能匹配到所述第四预设正则表达式的一行内容向前查找能匹配到第五预设正则表达式的内容,并从匹配到所述第五预设正则表达式的内容中提取第三槽位号;
    若所述故障诊断规则库中的所述诊断关键字中包含hwErrors、mediumErrors、smartWarning中的任意一个,则给出更换所述第三槽位号、所述序列号对应的硬盘的故障处理方案;
    若所述故障诊断规则库中的所述诊断关键字中不包含hwErrors、mediumErrors、smartWarning中的任意一个,则获取所述最后一个关键字对应的所述数值大于0的个数;
    若所述个数等于1,则给出更换所述第三槽位号、所述序列号对应的硬盘的故障处理方案;
    若所述个数大于1,则给出考虑链路问题,建议按照RAID卡、背板、SAS线缆顺序逐个更换的故障处理方案。
  6. 根据权利要求2所述的故障诊断方法,其特征在于,还包括:
    若所述故障诊断规则库中不存在与所述RAID卡的当前日志文件相匹配的目标文件,或所述故障诊断规则库中存在与所述RAID卡的当前日志文件相匹配的目标文件且若所述当前日志文件中不存在能够与各所述诊断关键字相匹配的至少一行内容,则从所述当前日志文件中提取状态信息;
    根据所述状态信息构建日志训练集,利用Relief过滤式选择算法从所述日志训练集中选择样本,从和所述样本同类的样本中寻找猜中近邻样本,从和所述样本不同类的样本中随机选择一个猜错近邻样本,若所述样本和所述猜中近邻样本在特征上的距离小于所述样本和所述猜错近邻样本在同样特征上的距离,则增加所述特征的权重,若所述样本和所述猜中近邻样本在所述特征上的距离不小于所述样本和所述猜错近邻样本在同样特征上的距离,则减小所述特征的权重,对所述特征经过预设次数训练,并获取所述特征经过所述预设次数训练后的平均权重;
    将所述平均权重大于预设值的特征加入特征集,并输出所述特征集;
    接收对所述特征集的分析结果,并将所述特征集的分析结果加入所述故障诊断规则库中。
  7. 根据权利要求1至6任一项所述的故障诊断方法,其特征在于,输出确定所述RAID卡和/或管理的硬盘存在故障的日志文件及所述故障处理方案,包括:
    将确定所述RAID卡和/或管理的硬盘存在故障的日志文件及所述故障处理方案通过邮件和/或短信输出至移动终端。
  8. 一种故障诊断装置,其特征在于,包括:
    获取模块,用于获取待监控服务器中的RAID卡的各日志文件;
    确定模块,用于根据故障诊断规则库及所述RAID卡的各日志文件确定所述RAID卡及其管理的硬盘是否存在故障,并在确定所述RAID卡和/或管理的硬盘存在故障时利用所述故障诊断规则库给出故障处理方案;所述故障诊断规则库为通过预先对RAID卡及其管理的硬盘的历史故障进行分析创建的;
    输出模块,用于输出确定所述RAID卡和/或管理的硬盘存在故障的日志文件及所述故障处理方案。
  9. 根据权利要求8所述的故障诊断装置,其特征在于,所述确定模块,包括:
    对比单元,用于将所述RAID卡的各日志文件与所述故障诊断规则库对比,判断所述故障诊断规则库中是否存在与所述RAID卡的当前日志文件相匹配的目标文件;
    匹配单元,用于若所述故障诊断规则库中存在所述目标文件,则将所述目标文件对应的各诊断关键字与所述当前日志文件中的每一行内容进行匹配;
    确定单元,用于若所述当前日志文件中存在能够与各所述诊断关键字相匹配的至少一行内容,则确定所述RAID卡和/或管理的硬盘存在故障;
    诊断单元,用于判断所述故障诊断规则库中是否存在与所述目标文件对应的故障诊断方法,若所述故障诊断规则库中存在与所述目标文件对应的故障诊断方法,则利用所述故障诊断方法对所述当前日志文件进行诊断,以给出所述故障处理方案。
  10. 根据权利要求9所述的故障诊断装置,其特征在于,所述诊断单元包括:
    第三提取子单元,用于从所述故障诊断规则库中提取与所述目标文件对应的最后一个关键字,并从所述当前日志文件中提取与所述最后一个关键字对应的数值;
    第二过滤子单元,用于若所述数值等于0,则从所述当前日志文件中过滤所述数值所在行的内容;
    第四提取子单元,用于若所述数值大于0,则从所述当前日志文件中提取序列号,并按照第四预设正则表达式循环遍历所述当前日志文件,若所述当前日志文件中存在能匹配到所述第四预设正则表达式的一行内容,则从能匹配到所述第四预设正则表达式的一行内容向前查找能匹配到第五预设正则表达式的内容,并从匹配到所述第五预设正则表达式的内容中提取第三槽位号;
    第五给出子单元,用于若所述故障诊断规则库中的所述诊断关键字中包含hwErrors、mediumErrors、smartWarning中的任意一个,则给出更换所述第三槽位号、所述序列号对应的硬盘的故障处理方案;
    获取子单元,用于若所述故障诊断规则库中的所述诊断关键字中不包含hwErrors、mediumErrors、smartWarning中的任意一个,则获取所述最后一个关键字对应的所述数值大于0的个数;
    第六给出子单元,用于若所述个数等于1,则给出更换所述第三槽位号、所述序列号对应的硬盘的故障处理方案;
    第七给出子单元,用于若所述个数大于1,则给出考虑链路问题,建议按照RAID卡、背板、SAS线缆顺序逐个更换的故障处理方案。
  11. 一种故障诊断设备,其特征在于,包括:
    存储器,用于存储计算机程序;
    处理器,用于执行所述计算机程序时实现如权利要求1至7任一项所述的故障诊断方法的步骤。
  12. 一种可读存储介质,其特征在于,所述可读存储介质中存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1至7任一项所述的故障诊断方法的步骤。
PCT/CN2022/083577 2021-10-26 2022-03-29 一种故障诊断方法、装置、设备及可读存储介质 WO2023071039A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111244269.0 2021-10-26
CN202111244269.0A CN113689911B (zh) 2021-10-26 2021-10-26 一种故障诊断方法、装置、设备及可读存储介质

Publications (1)

Publication Number Publication Date
WO2023071039A1 true WO2023071039A1 (zh) 2023-05-04

Family

ID=78587904

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/083577 WO2023071039A1 (zh) 2021-10-26 2022-03-29 一种故障诊断方法、装置、设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN113689911B (zh)
WO (1) WO2023071039A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117170984A (zh) * 2023-11-02 2023-12-05 麒麟软件有限公司 一种linux系统待机状态的异常检测方法及系统

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689911B (zh) * 2021-10-26 2022-02-18 苏州浪潮智能科技有限公司 一种故障诊断方法、装置、设备及可读存储介质
CN114546692A (zh) * 2022-01-19 2022-05-27 北京得瑞领新科技有限公司 固态硬盘的故障分类方法、装置、存储介质及计算机设备
CN115729761B (zh) * 2022-11-23 2023-10-20 中国人民解放军陆军装甲兵学院 一种硬盘故障预测方法、系统、设备及介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6553509B1 (en) * 1999-07-28 2003-04-22 Hewlett Packard Development Company, L.P. Log record parsing for a distributed log on a disk array data storage system
CN109614469A (zh) * 2018-12-03 2019-04-12 郑州云海信息技术有限公司 一种日志分析方法和装置
CN111061584A (zh) * 2019-11-21 2020-04-24 浪潮电子信息产业股份有限公司 一种故障诊断方法、装置、设备及可读存储介质
CN111242225A (zh) * 2020-01-16 2020-06-05 南京邮电大学 一种基于卷积神经网络的故障检测与诊断方法
CN111949488A (zh) * 2020-08-14 2020-11-17 山东英信计算机技术有限公司 一种硬盘故障预测方法、系统及电子设备和存储介质
CN113409876A (zh) * 2021-07-15 2021-09-17 中国建设银行股份有限公司 一种故障硬盘的定位方法及系统
CN113689911A (zh) * 2021-10-26 2021-11-23 苏州浪潮智能科技有限公司 一种故障诊断方法、装置、设备及可读存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117170984A (zh) * 2023-11-02 2023-12-05 麒麟软件有限公司 一种linux系统待机状态的异常检测方法及系统
CN117170984B (zh) * 2023-11-02 2024-01-30 麒麟软件有限公司 一种linux系统待机状态的异常检测方法及系统

Also Published As

Publication number Publication date
CN113689911A (zh) 2021-11-23
CN113689911B (zh) 2022-02-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22884956

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18564696

Country of ref document: US