WO2021169270A1 - Server failure warning method, device, computer equipment and storage medium - Google Patents


Publication number
WO2021169270A1
WO2021169270A1 PCT/CN2020/117575 CN2020117575W WO2021169270A1 WO 2021169270 A1 WO2021169270 A1 WO 2021169270A1 CN 2020117575 W CN2020117575 W CN 2020117575W WO 2021169270 A1 WO2021169270 A1 WO 2021169270A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
level
early warning
component
failure rate
Prior art date
Application number
PCT/CN2020/117575
Other languages
English (en)
French (fr)
Inventor
张建浓
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021169270A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0787 Storage of error reports, e.g. persistent data storage, storage using memory protection

Definitions

  • This application relates to the technical field of server operation and maintenance, and in particular to a server failure warning method, device, computer equipment and storage medium.
  • Server hardware failure monitoring is an important part of server operation and maintenance. Server hardware failures can affect machine performance and send out error messages, and at worst, cause machine downtime, seriously affecting business operation and availability. All server vendors provide hardware monitoring and hardware log services.
  • BMC: Baseboard Management Controller.
  • Existing server hardware failure monitoring can detect failures.
  • A hardware problem can be repaired by replacing the hardware or upgrading the firmware, so as to avoid further deterioration of the failure and more serious consequences (such as downtime or data loss).
  • The log can also be used to identify and solve the problem.
  • The current BMC monitors a single server.
  • IDC: Internet Data Center.
  • How to find common problems in time to improve server availability and reduce the occurrence of major problems has become a difficult problem in server operation and maintenance.
  • the traditional server management platform cannot provide early warning and cannot find common problems, resulting in frequent failures that affect availability.
  • The approach of repairing each failure individually as it occurs is not only inefficient, but also keeps operation and maintenance costs high. The inventor realized that quickly and accurately identifying the common problems of batch machines and improving machine availability has become an urgent problem to be solved.
  • the embodiments of the present application provide a server failure warning method, device, computer equipment, and storage medium to solve the problem of quickly and accurately obtaining common problems of batch machines and improving machine availability.
  • A server failure warning method, including:
  • The server failure warning request includes a periodic task and a timing period.
  • The periodic task includes reading the log information of the server system event log library;
  • If the current system time meets the timing period, the periodic task is activated and the log information corresponding to the timing period is obtained;
  • If the model early warning level or the component early warning level reaches the preset report level, the periodic failure causes of each online model in the timing period are extracted based on the model maintenance record table;
  • A server failure warning device, including:
  • The server failure warning request includes a periodic task and a timing period, where the periodic task includes reading log information of the server system event log library;
  • A monitoring data obtaining module, used to monitor the server hardware status through IPMI commands, obtain hardware monitoring data, and add the hardware monitoring data to the log information;
  • A periodic task activation module, used to activate the periodic task if the current system time meets the timing period, and to obtain the log information corresponding to the timing period;
  • An early warning level obtaining module, used to obtain the model early warning level or the component early warning level based on the log information corresponding to the timing period;
  • A failure cause extraction module, used to extract the periodic failure causes of each online model in the timing period based on the model maintenance record table if the model early warning level or the component early warning level reaches the preset report level;
  • A cause sorting table forming module, used to count the number of failures corresponding to each periodic failure cause and arrange the counts in descending order to form a failure cause sorting table;
  • An analysis report forming module, used to add the failure cause sorting table to the preset periodic fault analysis template to form a periodic fault analysis report.
  • a computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, wherein the processor implements the following steps when the processor executes the computer-readable instructions:
  • The server failure warning request includes a periodic task and a timing period, where the periodic task includes reading log information of the server system event log library;
  • If the model early warning level or the component early warning level reaches the preset report level, the periodic failure causes of each online model in the timing period are extracted based on the model maintenance record table;
  • The failure cause sorting table is added to the preset periodic fault analysis template to form a periodic fault analysis report.
  • One or more readable storage media storing computer readable instructions.
  • The computer readable storage medium stores computer readable instructions.
  • When executed, the computer readable instructions perform the following steps:
  • The server failure warning request includes a periodic task and a timing period, where the periodic task includes reading log information of the server system event log library;
  • If the model early warning level or the component early warning level reaches the preset report level, the periodic failure causes of each online model in the timing period are extracted based on the model maintenance record table;
  • The failure cause sorting table is added to the preset periodic fault analysis template to form a periodic fault analysis report.
  • The above server failure warning method, device, computer equipment and storage medium obtain the current model failure rate or the current component failure rate in each timing period, map it to a model early warning level or component early warning level, and respond flexibly to the different model or component problems that affect safe operation, which helps ensure normal and stable operation of the machines. At the same time, the server can produce periodic fault analysis reports based on the preset report level, so that maintenance personnel can identify the common problems of a model or component type from the reports and take maintenance or upgrade measures in a timely manner, reducing the current model failure rate or current component failure rate caused by common problems of batch machines and improving machine availability.
  • FIG. 1 is a schematic diagram of an application environment of a server failure warning method in an embodiment of the present application.
  • FIG. 2 is a flowchart of a server failure warning method in an embodiment of the present application.
  • FIG. 3 is another flowchart of a server failure warning method in an embodiment of the present application.
  • FIG. 4 is another flowchart of a server failure warning method in an embodiment of the present application.
  • FIG. 5 is another flowchart of a server failure warning method in an embodiment of the present application.
  • FIG. 6 is another flowchart of a server failure warning method in an embodiment of the present application.
  • FIG. 7 is another flowchart of a server failure warning method in an embodiment of the present application.
  • FIG. 8 is another flowchart of a server failure warning method in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a server failure warning device in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
  • The server failure warning method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1, in which the client communicates with the server through the network.
  • The client, also called the user side, is the program that corresponds to the server and provides local services for the user.
  • the client can be installed on, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices and other computer devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • a server failure warning method is provided.
  • Taking the application of the method to the server in FIG. 1 as an example, the method specifically includes the following steps:
  • The server failure warning request includes a periodic task and a timing period, where the periodic task includes reading log information of the server system event log library.
  • The server failure warning request is a request sent by the client to warn of server hardware failures.
  • A periodic task is a task that the server executes when the current system time meets the preset timing period. The timing period is the interval at which the periodic task is activated.
  • Log information is the information that records server operation, such as software and hardware operation information.
  • S20. Monitor the hardware status of the server through IPMI commands, obtain hardware monitoring data, and add the hardware monitoring data to the log information.
  • IPMI: Intelligent Platform Management Interface.
  • BMC: Baseboard Management Controller, as shown in FIG. 1.
  • The core of IPMI is the BMC, an embedded management microcontroller.
  • Hardware monitoring data is the data that records the operating status of each component in the server, including machine ID, component ID, and faults generated during operation.
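
As a rough illustration of adding hardware monitoring data to the log information, the sketch below models one monitoring record with the three fields named above (machine ID, component ID, fault). The class name, field names, the `source` tag, and the sample values are assumptions made for illustration; the application does not specify a data layout.

```python
from dataclasses import dataclass, asdict


@dataclass
class HardwareRecord:
    """One hardware monitoring record (hypothetical layout)."""
    machine_id: str
    component_id: str
    fault: str


def add_to_log(log_info: list, record: HardwareRecord) -> list:
    """Append a monitoring record to the log information, tagged with its source."""
    log_info.append({"source": "ipmi", **asdict(record)})
    return log_info


log_info = []
add_to_log(log_info, HardwareRecord("m-001", "psu-2", "power supply failure detected"))
```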
  • The server can activate the periodic task by itself and execute it without manual activation.
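
One possible way to express "the current system time meets the timing period" is sketched below. It assumes the timing period is a fixed interval measured from the last run and that each log entry carries a `time` field; both are illustrative assumptions, as are the function names.

```python
from datetime import datetime, timedelta


def is_task_due(now: datetime, last_run: datetime, period: timedelta) -> bool:
    """Return True once a full timing period has elapsed since the last run."""
    return now - last_run >= period


def run_if_due(now, last_run, period, log_info):
    """Activate the periodic task when due and return the log entries of that period.

    Returns None when the timing period has not yet elapsed.
    """
    if not is_task_due(now, last_run, period):
        return None
    # Keep only the entries recorded since the last run (hypothetical "time" field).
    return [entry for entry in log_info if last_run <= entry["time"] < now]
```

A scheduler loop or a cron-style trigger could call `run_if_due` at regular intervals; the application does not say which mechanism is used.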
  • The model early warning level is one of several security levels preset by the server based on the current model failure rate. It is used to initiate different early warning response behaviors for different security levels.
  • Each early warning response behavior is a graded early warning response, for example:
  • Model early warning levels: first-level, second-level, and third-level early warning levels.
  • Graded early warning response: for the first-level early warning level, maintenance is carried out immediately.
  • For the second-level early warning level, maintenance can be carried out at the preset second-level periodic response time, for example at 8 o'clock every day.
  • For the third-level early warning level, maintenance can be carried out at the preset third-level periodic response time, for example at 8 pm every Friday.
  • The component early warning level is one of several security levels preset by the server based on the current component failure rate. It is used to initiate different early warning response behaviors for different security levels. Each early warning response behavior is a graded early warning response, for example:
  • Component early warning levels: first-level, second-level, and third-level early warning levels.
  • Graded early warning response: for the first-level early warning level, maintenance is carried out immediately.
  • For the second-level early warning level, maintenance can be carried out at the preset second-level periodic response time, for example at 8 o'clock every day.
  • For the third-level early warning level, maintenance can be carried out at the preset third-level periodic response time, for example at 8 pm every Friday.
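
The graded response schedule in the examples above can be sketched as a function that returns the next maintenance time for a given level. This is only an illustration: the function name is hypothetical, and "8 o'clock every day" is assumed to mean 08:00, which the text does not state.

```python
from datetime import datetime, time, timedelta


def next_maintenance(level: int, now: datetime) -> datetime:
    """Return the next maintenance time for an early warning level (1, 2 or 3)."""
    if level == 1:
        # First level: maintenance is carried out immediately.
        return now
    if level == 2:
        # Second level: next daily slot (assumed to be 08:00).
        candidate = datetime.combine(now.date(), time(8, 0))
        return candidate if candidate >= now else candidate + timedelta(days=1)
    if level == 3:
        # Third level: next Friday at 20:00 (Monday is weekday 0, Friday is 4).
        days_ahead = (4 - now.weekday()) % 7
        candidate = datetime.combine(now.date() + timedelta(days=days_ahead), time(20, 0))
        return candidate if candidate >= now else candidate + timedelta(days=7)
    raise ValueError(f"unknown early warning level: {level}")
```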
  • If the model early warning level or the component early warning level reaches the preset report level, the periodic failure causes of each online model in the timing period are extracted based on the model maintenance record table.
  • The preset report level is the level set by the server at which a periodic fault analysis report needs to be formed.
  • For example, the first-level early warning level and the second-level early warning level can be set as the preset report levels.
  • Periodic failure causes are the failure causes of the same model or the same component type within the current timing period. Furthermore, periodic failure causes can be sorted by cause similarity, so that maintenance personnel can promptly identify the common problems of the same model or the same component type in the timing period. Understandably, analyzing the periodic failure causes makes it easier for maintenance personnel to carry out system upgrades and other maintenance and upgrade measures that target common problems.
  • In step S50, when the server detects that the model early warning level or the component early warning level meets the preset report level, the periodic failure causes should be formed promptly, so that maintenance personnel can identify the common problems of the model or component type from them.
  • The server can label each error with an error type and count the number of occurrences of each error type label, thereby forming a failure cause sorting table (for example, sorted in descending order of the number of errors).
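
Counting error type labels and sorting them in descending order can be sketched with the standard library `collections.Counter`. The function name and the sample labels are illustrative assumptions, not taken from the application.

```python
from collections import Counter


def build_cause_ranking(causes):
    """Count each failure cause label and return (cause, count) pairs, most frequent first."""
    return Counter(causes).most_common()


ranking = build_cause_ranking(["fan", "disk", "fan", "memory", "fan", "disk"])
```

`most_common()` keeps causes with equal counts in the order they were first seen, so the ranking is stable across runs on the same input.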
  • The preset periodic fault analysis template is set according to the actual application scenario; it is a template, convenient for maintenance personnel to read, into which the error information is added.
  • the server adds various information obtained in the foregoing to the preset periodic fault analysis template to form a periodic fault analysis report.
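
A minimal sketch of filling a preset template with the failure cause sorting table is given below, using the standard library `string.Template`. The template text, field names, and function name are assumptions; the application leaves the template layout to the actual application scenario.

```python
from string import Template

# Hypothetical report template; the real layout is scenario-specific.
REPORT_TEMPLATE = Template(
    "Periodic fault analysis report\n"
    "Period: $period\n"
    "Failure causes (descending):\n"
    "$rows"
)


def render_report(period: str, ranking) -> str:
    """Insert the sorted (cause, count) pairs into the report template."""
    rows = "\n".join(
        f"{rank}. {cause}: {count}"
        for rank, (cause, count) in enumerate(ranking, start=1)
    )
    return REPORT_TEMPLATE.substitute(period=period, rows=rows)
```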
  • The server obtains the current model failure rate or the current component failure rate in each timing period, maps it to the corresponding model early warning level or component early warning level, and responds flexibly to the different problems that affect safe operation, which helps ensure normal and stable operation of the machines.
  • The server can produce periodic fault analysis reports based on the preset report level, so that maintenance personnel can identify the common problems of a model or component type from the reports and take maintenance or upgrade measures in time, reducing the current model failure rate or current component failure rate caused by common problems of batch machines and improving machine availability.
  • In an embodiment, step S40, that is, obtaining the model early warning level or the component early warning level, specifically includes the following steps:
  • The online model data table is a record table of the machines that have been online in a timing period. For example, if machine A has been online in the timing period, the online status of machine A in the online model data table is updated to the logged-in state. Understandably, at the end of each timing period, the server automatically resets the online status of each machine in the online model data table to the not-logged-in state, so that it can count the machines that come online in the new timing period. That count is taken as the number of online models; in other words, the number of online models is the total number of machines whose online status in the online model data table is logged-in during the current timing period.
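
The bookkeeping described above can be sketched with a dictionary that maps each machine to its online status. The status strings, machine names, and function names are illustrative assumptions.

```python
LOGGED_IN = "logged_in"
NOT_LOGGED_IN = "not_logged_in"


def count_online(table: dict) -> int:
    """Number of online models: machines whose status is logged-in in this period."""
    return sum(1 for status in table.values() if status == LOGGED_IN)


def reset_for_new_period(table: dict) -> dict:
    """At the end of a timing period, mark every machine as not logged in."""
    return {machine: NOT_LOGGED_IN for machine in table}


online_table = {"A": LOGGED_IN, "B": NOT_LOGGED_IN, "C": LOGGED_IN}
```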
  • The model maintenance record table is a record table of component problems in each machine and the specific cause of each problem.
  • The recorded content includes the machine ID, model, component ID, component type, and failure cause, so that the server can subsequently continue maintenance based on the table.
  • The server can filter the model maintenance record table by model and count the total number of faulty machines of that model in the current timing period as the number of problem models.
  • The server can filter the model maintenance record table by component type and count the total number of faulty components of that component type in the current timing period as the number of problem components.
  • In step S41, the server can promptly obtain the number of online models, the number of problem models corresponding to the same model, and the number of problem components corresponding to the same component type from the online model data table and the model maintenance record table, which avoids manual filtering and calculation and is convenient and accurate.
  • The current model failure rate is the number of problem models as a percentage of the number of online models.
  • The current component failure rate is the number of problem components as a percentage of the number of online models.
  • Let the number of online models in the timing period be $N$, the number of problem models be $n$, and the number of problem components be $m$.
  • Then the current model failure rate is $\frac{n}{N} \times 100\%$ and the current component failure rate is $\frac{m}{N} \times 100\%$.
  • In step S42, the server can quickly obtain the current model failure rate and the current component failure rate according to the preset formula, providing the data basis for subsequently determining the model early warning level and the component early warning level.
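
The two rates defined above are both percentages of the number of online models, so one helper covers both. This is a sketch under that definition; the function name and the zero-division guard are additions for illustration.

```python
def failure_rate(problem_count: int, online_count: int) -> float:
    """Return problem_count as a percentage of online_count."""
    if online_count <= 0:
        raise ValueError("the number of online models must be positive")
    return 100.0 * problem_count / online_count


# With N = 200 online models, n = 5 problem models and m = 12 problem components:
model_rate = failure_rate(5, 200)       # current model failure rate
component_rate = failure_rate(12, 200)  # current component failure rate
```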
  • In step S43, the server maps different current model failure rates to different model early warning levels and graded early warning responses. Flexible graded responses can then be used to maintain the machines: critical machine problems are maintained promptly, while minor machine problems are maintained centrally at a preset time, which effectively guarantees normal operation of the machines and reduces the running time lost to maintenance.
  • The component early warning level is one of several security levels preset by the server based on the current component failure rate, and is used to initiate different early warning response behaviors for different security levels.
  • Each early warning response behavior is a graded early warning response, for example:
  • Component early warning levels: first-level, second-level, and third-level early warning levels.
  • Graded early warning response: for the first-level early warning level, maintenance is carried out immediately.
  • For the second-level early warning level, maintenance can be carried out at the preset second-level periodic response time, for example at 8 o'clock every day.
  • For the third-level early warning level, maintenance can be carried out at the preset third-level periodic response time, for example at 8 pm every Friday.
  • In step S44, the server maps different current component failure rates to different component early warning levels and graded early warning responses.
  • Flexible graded responses can then be used to maintain machine components: critical component problems are maintained promptly, while minor component problems are maintained centrally at a preset time, which effectively guarantees normal operation of the machines and reduces the running time lost to maintenance.
  • The server can promptly obtain the number of online models, the number of problem models corresponding to the same model, and the number of problem components corresponding to the same component type from the online model data table and the model maintenance record table, which avoids manual filtering and calculation and is convenient and accurate.
  • the server can quickly obtain the current model failure rate and the current component failure rate according to the preset formula, and prepare a data basis for the subsequent determination of the model early warning level based on the current model failure rate and the current component failure rate.
  • The server maps different current model failure rates to different model early warning levels and graded early warning responses. Flexible graded responses can then be used to maintain the machines: critical machine problems are maintained promptly, while minor machine problems are maintained centrally at a preset time, which effectively guarantees normal operation of the machines and reduces the running time lost to maintenance.
  • The server maps different current component failure rates to different component early warning levels and graded early warning responses. Flexible graded responses can then be used to maintain machine components: critical component problems are maintained promptly, while minor component problems are maintained centrally at a preset time, which effectively guarantees normal operation of the machines and reduces the running time lost to maintenance.
  • In an embodiment, step S43, that is, obtaining the model early warning level based on the number of online models corresponding to the timing period and the current model failure rate, and performing a graded early warning response based on the model early warning level, specifically includes the following steps:
  • The preset comparison quantity is a quantity preset according to the actual application scenario and is not specifically limited here; the same applies to the preset first failure rate.
  • The first-level model early warning is an indication of warning urgency set according to the actual application scenario.
  • The first-level early warning response is the response method corresponding to the first-level model early warning, which can include various corresponding response measures.
  • The previous-period model failure rate refers to the failure rate of the current online model in the previous cycle.
  • If the previous-period model failure rate is greater than the preset first failure rate, a first-level model early warning is obtained, and a first-level early warning response is performed based on the first-level model early warning.
  • If the previous-period model failure rate is not greater than the preset first failure rate, a second-level model early warning is obtained, and a second-level early warning response is performed based on the second-level model early warning.
  • The second-level model early warning is an early warning whose urgency is higher or lower than that of the first-level model early warning.
  • In this embodiment, it is specifically an early warning one level higher than the first-level model early warning.
  • Accordingly, the second-level early warning response corresponding to the second-level model early warning can be obtained; that is, the urgency of the second-level early warning response should be greater than that of the first-level early warning response.
  • The server maps different current model failure rates to different model early warning levels and graded early warning responses.
  • Flexible graded responses can then be used to maintain the machines: critical machine problems are maintained promptly, while minor machine problems are maintained centrally at a preset time, which effectively guarantees normal operation of the machines and reduces the running time lost to maintenance.
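
The threshold comparison stated in the steps above can be sketched as follows. Only the comparison against the preset first failure rate is modeled; the steps that use the preset comparison quantity are not recoverable from this text, so they are deliberately left out. The function name, return strings, and sample threshold are assumptions.

```python
def model_warning_level(model_failure_rate: float, first_failure_rate: float) -> str:
    """Map a model failure rate (percent) to a warning level by one threshold.

    A rate strictly greater than the preset first failure rate gives a
    first-level model early warning; otherwise a second-level one.
    """
    if model_failure_rate > first_failure_rate:
        return "first-level"
    return "second-level"
```

The same shape would apply to component warnings, with the preset second failure rate in place of the first.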
  • In an embodiment, step S44, that is, obtaining the component early warning level based on the number of online models corresponding to the timing period and the current component failure rate, and performing a graded early warning response based on the component early warning level, specifically includes the following steps:
  • The first-level component early warning is an indication of warning urgency set according to the actual application scenario.
  • The first-level early warning response is the response method corresponding to the first-level component early warning, which can include various corresponding response measures.
  • The previous-period component failure rate refers to the failure rate of the current component in the previous cycle.
  • If the previous-period component failure rate is not greater than the preset second failure rate, a second-level component early warning is obtained, and a second-level early warning response is performed based on the second-level component early warning.
  • The server maps different current component failure rates to different component early warning levels and graded early warning responses.
  • Flexible graded responses can then be used to maintain machine components: critical component problems are maintained promptly, while minor component problems are maintained centrally at a preset time, which effectively guarantees normal operation of the machines and reduces the running time lost to maintenance.
  • In an embodiment, before step S10, that is, before obtaining the server failure warning request, the server failure warning method further includes the following steps:
  • The fault report request includes the fault report date and the fault report information.
  • The fault report information includes the machine ID, the component ID and the cause of the fault.
  • The fault report date is the date on which the machine or component failed and the failure was reported to the server.
  • The fault report information is information such as the specific failure cause.
  • The machine ID and component ID are the unique identifiers used by the server to distinguish each machine or component.
  • The cause of the fault is the specific reason the failure occurred.
  • each machine ID corresponds to a model
  • each part ID also corresponds to a part type.
  • Obtaining the model corresponding to the machine ID and the component type corresponding to the component ID facilitates subsequent classification of faults in the model or component based on each model or component type.
  • The current model maintenance information includes all information related to the fault, such as the machine ID, model, component ID, and component type.
  • the model maintenance record sheet is a record sheet used to record and maintain each model or component, which is helpful for maintenance personnel to find and locate various problems based on the sheet.
  • Based on the fault report request, the server can record the problem model among the online models, the problem component corresponding to the problem model, and the failure cause corresponding to the problem component, so that in each timing period the server can subsequently count the problem models and problem components of the current cycle, obtain the periodic fault analysis report, and find common problems.
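
Turning a fault report request into a row of the model maintenance record table might look like the sketch below. The lookup dictionaries, IDs, field names, and function name are hypothetical; the application only states that each machine ID corresponds to a model and each component ID to a component type.

```python
# Hypothetical lookup tables: machine ID -> model, component ID -> component type.
MACHINE_MODELS = {"m-001": "model-X", "m-002": "model-X"}
COMPONENT_TYPES = {"c-17": "power supply", "c-42": "fan"}


def record_fault(table, report_date, machine_id, component_id, cause):
    """Resolve the model and component type, then append one maintenance record."""
    entry = {
        "date": report_date,
        "machine_id": machine_id,
        "model": MACHINE_MODELS[machine_id],
        "component_id": component_id,
        "component_type": COMPONENT_TYPES[component_id],
        "cause": cause,
    }
    table.append(entry)
    return entry
```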
  • In an embodiment, before step S10, that is, before obtaining the server failure warning request, the server failure warning method further includes the following steps:
  • The number of online models whose login status in the online model data table is logged-in is determined as the number of online models.
  • Not every model is necessarily online during the timing period. Only the models that come online during the timing period are recorded in the online model data table for the current timing period, and the login status corresponding to each such online model is updated to the logged-in state.
  • In the model maintenance record table, the sum of the number of machines of the problem model corresponding to each online model is determined as the number of problem models, and the sum of the number of components of the problem component type corresponding to each online model is determined as the number of problem components.
  • The server can promptly obtain the number of online models from the online model data table and the number of problem models and problem components from the model maintenance record table, which avoids manual statistics, increases the degree of automation, and is accurate and efficient.
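
Counting problem models and problem components from the model maintenance record table can be sketched as below. It assumes each record is a dictionary with `machine_id`, `model`, `component_id`, and `component_type` keys, and that a machine or component reported several times counts once; the text does not settle either point.

```python
def count_problem_models(records, model: str) -> int:
    """Number of distinct faulty machines of the given model."""
    return len({r["machine_id"] for r in records if r["model"] == model})


def count_problem_parts(records, component_type: str) -> int:
    """Number of distinct faulty components of the given component type."""
    return len({r["component_id"] for r in records if r["component_type"] == component_type})
```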
  • In an embodiment, step S50, that is, if the model early warning level or the component early warning level reaches the preset report level, performing periodic failure analysis on the online models in the model maintenance record table to obtain the periodic fault analysis report, includes the following steps:
  • If the model early warning level is the preset report level,
  • the model failure cause of the problem model corresponding to the online model is obtained based on the model maintenance record table.
  • The preset report level is the model early warning level at which a report needs to be generated. Because failures of different models or components differ in urgency, a report does not need to be generated for every failure. Only the models that belong to the preset report level are added to the model maintenance record table, in order to bring them to the attention of the maintenance personnel.
  • If the component early warning level is the preset report level,
  • the component failure cause of the problem component type corresponding to the online model is obtained based on the model maintenance record table.
  • The server can sort the periodic failure causes by cause similarity, so that maintenance personnel can promptly identify the common problems of the same model or the same component type in the timing period from the periodic fault analysis report and carry out system upgrades and other maintenance and upgrade measures that target those common problems.
  • The server obtains the current model failure rate or the current component failure rate in each timing period, maps it to the corresponding model early warning level or component early warning level, and responds flexibly to the different problems that affect safe operation, which helps ensure normal and stable operation of the machines.
  • The server can produce periodic fault analysis reports based on the preset report level, so that maintenance personnel can identify the common problems of a model or component type from the reports and take maintenance or upgrade measures in time, reducing the current model failure rate or current component failure rate caused by common problems of batch machines and improving machine availability.
  • The server maps different current model failure rates to different model early warning levels and graded early warning responses. Flexible graded responses can then be used to maintain the machines: critical machine problems are maintained promptly, while minor machine problems are maintained centrally at a preset time, which effectively guarantees normal operation of the machines and reduces the running time lost to maintenance.
  • The server maps different current component failure rates to different component early warning levels and graded early warning responses. Flexible graded responses can then be used to maintain machine components: critical component problems are maintained promptly, while minor component problems are maintained centrally at a preset time, which effectively guarantees normal operation of the machines and reduces the running time lost to maintenance.
  • Based on the fault report request, the server can record the problem model among the online models, the problem component corresponding to the problem model, and the failure cause corresponding to the problem component, so that in each timing period the server can subsequently count the problem models and problem components of the current cycle, obtain the periodic fault analysis report, and find common problems.
  • The server can promptly obtain the number of online models from the online model data table and the number of problem models and problem components from the model maintenance record table, which avoids manual statistics, increases the degree of automation, and is accurate and efficient.
  • The server can sort the periodic failure causes by cause similarity, so that maintenance personnel can promptly identify the common problems of the same model or the same component type in the timing period from the periodic fault analysis report and carry out system upgrades and other maintenance and upgrade measures that target those common problems.
  • a server failure early warning device is provided, which corresponds one-to-one to the server failure early warning method in the above-mentioned embodiment.
  • the server failure early warning device includes an early warning request acquisition module 10, a monitoring data acquisition module 20, a regular task activation module 30, an early warning level acquisition module 40, a failure cause extraction module 50, a cause ranking table formation module 60, and an analysis report formation module 70.
  • the detailed description of each functional module is as follows:
  • the early warning request acquisition module 10 is configured to obtain a server failure early warning request.
  • the server failure early warning request includes a regular task and a timing period.
  • the regular task includes reading log information from the server's system event log library.
  • the monitoring data acquisition module 20 is configured to monitor the hardware status of the server through IPMI commands, obtain hardware monitoring data, and add the hardware monitoring data to the log information.
  • the regular task activation module 30 is configured to activate the regular task if the current system time satisfies the timing period, and to obtain the log information corresponding to the timing period.
  • the early warning level acquisition module 40 is configured to obtain the model early warning level or the component early warning level based on the log information corresponding to the timing period.
  • the failure cause extraction module 50 is configured to extract, based on the model maintenance record table, the periodic failure causes of each online model in the timing period if the model early warning level or the component early warning level reaches the preset report level.
  • the cause ranking table formation module 60 is configured to count the number of failure occurrences corresponding to each periodic failure cause, and to sort all the occurrence counts in descending order to form a failure cause ranking table.
  • the analysis report formation module 70 is configured to add the failure cause ranking table to the preset periodic failure analysis template to form a periodic failure analysis report.
  • the early warning level acquisition module 40 includes:
  • the record table statistics unit, which is configured to periodically compile statistics from the online model data table and the model maintenance record table, and to obtain the number of online models, the number of problem models, and the number of problem components corresponding to each online model in a timing period.
  • the component failure rate acquisition unit, which is configured to obtain the current model failure rate and the current component failure rate in a timing period based on the number of online models, the number of problem models, and the number of problem components.
  • the model level acquisition unit, which is configured to obtain the model early warning level based on the number of online models corresponding to the timing period and the current model failure rate, and to perform a level early warning response based on the model early warning level.
  • the component level acquisition unit, which is configured to obtain the component early warning level based on the number of online models corresponding to the timing period and the current component failure rate, and to perform a level early warning response based on the component early warning level.
  • the model level acquisition unit includes:
  • the first-level early warning acquisition unit, which is configured to obtain a first-level model early warning if the number of online models in the timing period is greater than the preset comparison number and the current model failure rate is greater than the preset first failure rate, and to perform a first-level early warning response based on the first-level model early warning.
  • the model failure rate acquisition unit, which is configured to obtain the previous-period model failure rate if the number of online models in the timing period is not greater than the preset comparison number and the current model failure rate is greater than the preset first failure rate.
  • the first-level response unit, which is configured to obtain a first-level model early warning if the previous-period model failure rate is greater than the preset first failure rate, and to perform a first-level early warning response based on the first-level model early warning.
  • the component level acquisition unit includes:
  • the component early warning acquisition unit, which is configured to obtain a first-level component early warning if the number of online models in the timing period is greater than the preset comparison number and the current component failure rate is greater than the preset second failure rate, and to perform a first-level early warning response based on the first-level component early warning.
  • the component failure rate acquisition unit, which is configured to obtain the previous-period component failure rate if the number of online models in the timing period is not greater than the preset comparison number and the current component failure rate is greater than the preset second failure rate.
  • the early warning response unit, which is configured to obtain a first-level component early warning if the previous-period component failure rate is greater than the preset second failure rate, and to perform a first-level early warning response based on the first-level component early warning.
  • the server failure early warning device further includes:
  • the report request acquisition module, which is configured to obtain a failure report request.
  • the failure report request includes the failure report date and the failure report information.
  • the failure report information includes the machine ID, the component ID, and the failure cause.
  • the component type acquisition module, which is configured to obtain the model corresponding to the machine ID and the component type corresponding to the component ID.
  • the maintenance information formation module, which is configured to associate and save the failure report date, machine ID, model, component ID, component type, and failure cause to form the current model maintenance information, and to add the current model maintenance information to the model maintenance record table.
  • the record table statistics unit includes:
  • the model count determination unit, which is configured to count, within the timing period corresponding to the current system time, the online models whose login status in the online model data table is "logged in", and to determine that count as the number of online models.
  • the failure cause extraction module includes:
  • the model cause acquisition unit, which is configured to obtain, based on the model maintenance record table, the model failure cause corresponding to the problem model of the online model if the model early warning level is the preset report level.
  • the component cause acquisition unit, which is configured to obtain, based on the model maintenance record table, the component failure cause corresponding to the problem component type of the online model if the component early warning level is the preset report level.
  • each module in the foregoing server fault early warning device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store data related to the server failure early warning method.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instruction is executed by the processor to realize a server failure warning method.
  • a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor.
  • when executing the computer-readable instructions, the processor implements the server failure early warning method of the above-mentioned embodiment, for example, steps S10 to S70 shown in FIG. 2.
  • alternatively, when executing the computer-readable instructions, the processor implements the functions of the modules/units of the server failure early warning device in the foregoing embodiment, for example, the functions of modules 10 to 70 shown in FIG. 9. To avoid repetition, they are not described again here.
  • the readable storage medium in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • a computer-readable storage medium is provided, on which computer-readable instructions are stored.
  • when the computer-readable instructions are executed by a processor, the server failure early warning method of the foregoing embodiment is implemented, for example, steps S10 to S70 shown in FIG. 2.
  • alternatively, when the computer-readable instructions are executed by the processor, the functions of the modules/units of the server failure early warning device in the foregoing device embodiment are implemented, for example, the functions of modules 10 to 70 shown in FIG. 9. To avoid repetition, they are not described again here.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

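A server failure early warning method and apparatus, a computer device, and a storage medium. The server failure early warning method includes: obtaining a server failure early warning request, the request including a regular task and a timing period, wherein the regular task includes reading log information from the server's system event log library (S10); monitoring the server hardware status through IPMI commands, obtaining hardware monitoring data, and adding the hardware monitoring data to the log information (S20); if the current system time satisfies the timing period, activating the regular task and obtaining the log information corresponding to the timing period (S30); obtaining a model early warning level or a component early warning level based on the log information corresponding to the timing period (S40); if the model early warning level or the component early warning level reaches a preset report level, extracting, based on a model maintenance record table, the periodic failure causes of each online model within the timing period (S50); counting the number of failure occurrences corresponding to each periodic failure cause and sorting all the occurrence counts in descending order to form a failure cause ranking table (S60); and adding the failure cause ranking table to a preset periodic failure analysis template to form a periodic failure analysis report (S70). The method allows maintenance or upgrade measures to be taken in time, reducing the current model failure rate or current component failure rate caused by problems common to batches of machines.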

Description

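Server failure early warning method and apparatus, computer device, and storage medium
This application is based on, and claims priority to, Chinese invention application No. 202010122319.7, filed on February 27, 2020 and entitled "Server failure early warning method, apparatus, computer device, and storage medium".
Technical Field
This application relates to the technical field of server operation and maintenance, and in particular to a server failure early warning method and apparatus, a computer device, and a storage medium.
Background
Server hardware failure monitoring is an important part of server operation and maintenance. At best, a server hardware failure degrades machine performance and produces error messages; at worst, it causes machine downtime and seriously affects business operation and availability. All server vendors provide hardware monitoring and hardware logging services. The BMC (Baseboard Management Controller) monitors each component of the server in real time. When a hardware failure occurs, the BMC detects the failed hardware component, records a log entry, raises an alarm, and notifies the user. Existing server hardware failure monitoring can detect failures. For a minor failure, the hardware problem can be repaired by replacing hardware or upgrading firmware, which prevents the failure from worsening into something more serious (for example, downtime or data loss). For a serious failure, the logs can be used to locate the problem so that it can be resolved.
Current BMCs each monitor a single server. An IDC (Internet Data Center) may contain thousands of servers, and the same latent defect may be present in all of them. How to find common problems in time, so as to improve server availability and reduce the occurrence of major problems, has become a difficult issue in server operation and maintenance. Traditional server management platforms cannot give early warning and cannot find common problems, so failures frequently affect availability. Repairing machines one at a time as they fail is inefficient and keeps operation and maintenance costs high. The inventor realized that quickly and accurately identifying problems common to batches of machines, so as to improve machine availability, is a problem in urgent need of a solution.
Summary
The embodiments of this application provide a server failure early warning method and apparatus, a computer device, and a storage medium, to solve the problem of quickly and accurately identifying problems common to batches of machines so as to improve machine availability.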
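A server failure early warning method, including:
obtaining a server failure early warning request, the server failure early warning request including a regular task and a timing period, wherein the regular task includes reading log information from the server's system event log library;
monitoring the server hardware status through IPMI commands, obtaining hardware monitoring data, and adding the hardware monitoring data to the log information;
if the current system time satisfies the timing period, activating the regular task and obtaining the log information corresponding to the timing period;
obtaining a model early warning level or a component early warning level based on the log information corresponding to the timing period;
if the model early warning level or the component early warning level reaches a preset report level, extracting, based on a model maintenance record table, the periodic failure causes of each online model within the timing period;
counting the number of failure occurrences corresponding to each periodic failure cause, and sorting all the failure occurrence counts in descending order to form a failure cause ranking table;
adding the failure cause ranking table to a preset periodic failure analysis template to form a periodic failure analysis report.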
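A server failure early warning apparatus, including:
an early warning request acquisition module, configured to obtain a server failure early warning request, the server failure early warning request including a regular task and a timing period, wherein the regular task includes reading log information from the server's system event log library;
a monitoring data acquisition module, configured to monitor the server hardware status through IPMI commands, obtain hardware monitoring data, and add the hardware monitoring data to the log information;
a regular task activation module, configured to activate the regular task if the current system time satisfies the timing period, and to obtain the log information corresponding to the timing period;
an early warning level acquisition module, configured to obtain a model early warning level or a component early warning level based on the log information corresponding to the timing period;
a failure cause extraction module, configured to extract, based on a model maintenance record table, the periodic failure causes of each online model within the timing period if the model early warning level or the component early warning level reaches a preset report level;
a cause ranking table formation module, configured to count the number of failure occurrences corresponding to each periodic failure cause, and to sort all the failure occurrence counts in descending order to form a failure cause ranking table;
an analysis report formation module, configured to add the failure cause ranking table to a preset periodic failure analysis template to form a periodic failure analysis report.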
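A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
obtaining a server failure early warning request, the server failure early warning request including a regular task and a timing period, wherein the regular task includes reading log information from the server's system event log library;
monitoring the server hardware status through IPMI commands, obtaining hardware monitoring data, and adding the hardware monitoring data to the log information;
if the current system time satisfies the timing period, activating the regular task and obtaining the log information corresponding to the timing period;
obtaining a model early warning level or a component early warning level based on the log information corresponding to the timing period;
if the model early warning level or the component early warning level reaches a preset report level, extracting, based on a model maintenance record table, the periodic failure causes of each online model within the timing period;
counting the number of failure occurrences corresponding to each periodic failure cause, and sorting all the failure occurrence counts in descending order to form a failure cause ranking table;
adding the failure cause ranking table to a preset periodic failure analysis template to form a periodic failure analysis report.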
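One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a server failure early warning request, the server failure early warning request including a regular task and a timing period, wherein the regular task includes reading log information from the server's system event log library;
monitoring the server hardware status through IPMI commands, obtaining hardware monitoring data, and adding the hardware monitoring data to the log information;
if the current system time satisfies the timing period, activating the regular task and obtaining the log information corresponding to the timing period;
obtaining a model early warning level or a component early warning level based on the log information corresponding to the timing period;
if the model early warning level or the component early warning level reaches a preset report level, extracting, based on a model maintenance record table, the periodic failure causes of each online model within the timing period;
counting the number of failure occurrences corresponding to each periodic failure cause, and sorting all the failure occurrence counts in descending order to form a failure cause ranking table;
adding the failure cause ranking table to a preset periodic failure analysis template to form a periodic failure analysis report.
With the above server failure early warning method, apparatus, computer device, and storage medium, the current model failure rate or current component failure rate obtained within the timing period is mapped to different early warning levels, so that different component problems affecting safe operation are handled flexibly and the normal, stable operation of the machines is safeguarded. In addition, the server can obtain a periodic failure analysis report based on the preset report level, which helps maintenance personnel identify problems common to a model or component type from the report and take maintenance or upgrade measures in time, reducing the current model failure rate or current component failure rate caused by problems common to batches of machines and improving the applicability of the machines.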
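Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the application environment of the server failure early warning method in an embodiment of this application;
FIG. 2 is a flowchart of the server failure early warning method in an embodiment of this application;
FIG. 3 is another flowchart of the server failure early warning method in an embodiment of this application;
FIG. 4 is another flowchart of the server failure early warning method in an embodiment of this application;
FIG. 5 is another flowchart of the server failure early warning method in an embodiment of this application;
FIG. 6 is another flowchart of the server failure early warning method in an embodiment of this application;
FIG. 7 is another flowchart of the server failure early warning method in an embodiment of this application;
FIG. 8 is another flowchart of the server failure early warning method in an embodiment of this application;
FIG. 9 is a schematic diagram of the server failure early warning apparatus in an embodiment of this application;
FIG. 10 is a schematic diagram of the computer device in an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.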
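The server failure early warning method provided by the embodiments of this application can be applied in the application environment shown in FIG. 1. The method is applied in a server failure early warning system that includes a client and a server, where the client communicates with the server over a network. The client, also called the user side, is a program that corresponds to the server and provides local services to the user. The client can be installed on computer devices including, but not limited to, personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as a standalone server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a server failure early warning method is provided. Taking the method as applied to the server in FIG. 1 as an example, it specifically includes the following steps:
S10. Obtain a server failure early warning request, the server failure early warning request including a regular task and a timing period, wherein the regular task includes reading log information from the server's system event log library.
Here, the server failure early warning request is a request sent by the client asking for early warning of server hardware failures. The regular task is a task that the server executes on its own when the current system time satisfies the preset timing period. The timing period is the interval at which the regular task is activated.
Log information is the collection of records describing how the server runs, such as software and hardware operation information.
S20. Monitor the server hardware status through IPMI commands, obtain hardware monitoring data, and add the hardware monitoring data to the log information.
Here, IPMI (Intelligent Platform Management Interface) is a new-generation general interface standard that makes hardware management "intelligent". Users can use IPMI to monitor the physical characteristics of a server, such as temperature, voltage, fan status, power supply, and chassis intrusion. The greatest advantage of IPMI is that it is independent of the CPU, BIOS, and OS, so the server can be monitored whether it is powered on or off, as long as it is connected to a power source. IPMI is a specification; its most important physical component is the BMC (Baseboard Management Controller, see FIG. 1), an embedded management microcontroller that acts as the "brain" of platform management. Through the BMC, IPMI can monitor the data of each sensor and log various events.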
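As an illustration of step S20, the sketch below reads the System Event Log with the `ipmitool sel list` command and parses each line into a record. It is a minimal sketch under stated assumptions: the pipe-separated six-column layout shown in `SAMPLE_LINE` is assumed (actual output can vary by ipmitool version and BMC), and the names `SelEvent`, `parse_sel_line`, and `read_sel` are hypothetical, not part of the patent.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SelEvent:
    record_id: str
    date: str
    time: str
    sensor: str
    description: str
    state: str


# Assumed example of one line of `ipmitool sel list` output.
SAMPLE_LINE = "   1 | 02/27/2020 | 20:00:00 | Temperature #0x30 | Upper Critical going high | Asserted"


def parse_sel_line(line: str) -> Optional[SelEvent]:
    """Split one pipe-separated SEL line; return None if it does not have six fields."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 6:
        return None
    return SelEvent(*fields[:6])


def read_sel() -> List[SelEvent]:
    """Run `ipmitool sel list` locally and parse every line that matches the assumed layout."""
    result = subprocess.run(
        ["ipmitool", "sel", "list"], capture_output=True, text=True, check=True
    )
    events = (parse_sel_line(line) for line in result.stdout.splitlines())
    return [event for event in events if event is not None]
```

The parsed events could then be appended to the log information that the regular task reads; how the patent's server stores them is not specified.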
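Hardware monitoring data is data that records the running state of each component in the server, including the machine ID, the component ID, and the failures that occur during operation.
S30. If the current system time satisfies the timing period, activate the regular task and obtain the log information corresponding to the timing period.
Specifically, when the current system time satisfies the timing period, the server activates the regular task on its own and executes the work that the task defines, with no manual start required.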
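One way to read "the current system time satisfies the timing period" is that a full period has elapsed since the task last ran. The sketch below shows that check and the selection of log entries for the period. The function names and the representation of a log entry as a `(timestamp, message)` tuple are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

LogEntry = Tuple[datetime, str]  # (timestamp, message); assumed representation


def period_elapsed(last_run: datetime, now: datetime, period: timedelta) -> bool:
    """True when at least one full timing period has passed since the last run."""
    return now - last_run >= period


def entries_in_period(entries: List[LogEntry], start: datetime, end: datetime) -> List[LogEntry]:
    """Keep the log entries whose timestamp falls in the half-open interval [start, end)."""
    return [entry for entry in entries if start <= entry[0] < end]
```

A scheduler loop would call `period_elapsed` on each tick and, when it returns true, pass the period's boundaries to `entries_in_period` before moving on to step S40.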
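S40. Obtain a model early warning level or a component early warning level based on the log information corresponding to the timing period.
Here, model early warning levels are safety levels preset by the server and divided according to the current model failure rate. Different safety levels trigger different early warning response behaviors, and each such behavior is a level early warning response. For example:
Model early warning levels: first-level, second-level, and third-level early warning.
Level early warning responses: for a first-level early warning, maintenance is carried out immediately.
For a second-level early warning, maintenance can be carried out at the preset second-level periodic response time, 8 p.m. every day.
For a third-level early warning, maintenance can be carried out at the preset third-level periodic response time, 8 p.m. every Friday.
Component early warning levels are safety levels preset by the server and divided according to the current component failure rate. Different safety levels trigger different early warning response behaviors, and each such behavior is a level early warning response. For example:
Component early warning levels: first-level, second-level, and third-level early warning.
Level early warning responses: for a first-level early warning, maintenance is carried out immediately.
For a second-level early warning, maintenance can be carried out at the preset second-level periodic response time, 8 p.m. every day.
For a third-level early warning, maintenance can be carried out at the preset third-level periodic response time, 8 p.m. every Friday.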
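The example response schedule above (immediately, daily at 8 p.m., Fridays at 8 p.m.) can be turned into a small scheduling helper. This is a sketch of that example only; the function name `next_maintenance` is hypothetical, and treating a warning raised exactly at 8 p.m. as still eligible for that slot is an assumption.

```python
from datetime import datetime, timedelta


def next_maintenance(level: int, now: datetime) -> datetime:
    """Return when maintenance should happen for a warning of the given level."""
    if level == 1:
        return now  # first level: maintain immediately
    slot = now.replace(hour=20, minute=0, second=0, microsecond=0)
    if level == 2:
        # second level: today's 8 p.m. slot, or tomorrow's if it has passed
        return slot if now <= slot else slot + timedelta(days=1)
    if level == 3:
        # third level: the coming Friday's 8 p.m. slot (Friday is weekday 4)
        slot += timedelta(days=(4 - now.weekday()) % 7)
        return slot if now <= slot else slot + timedelta(days=7)
    raise ValueError(f"unknown early warning level: {level}")
```

For instance, a third-level warning raised on a Thursday morning is scheduled for the next day's evening slot, while one raised on Friday after 8 p.m. waits a full week.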
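S50. If the model early warning level or the component early warning level reaches the preset report level, extract, based on the model maintenance record table, the periodic failure causes of each online model within the timing period.
Here, the preset report level is the level, set by the server, at which a periodic failure analysis report needs to be produced. In this embodiment, for example, the first-level and second-level early warning levels can be set as preset report levels.
Periodic failure causes are the failure causes that occurred for the same model or the same component type within the current timing period. Further, the failure causes within the periodic failure causes can be sorted by cause similarity, so that maintenance personnel can promptly see the problems common to the same model or the same component type within the timing period. Understandably, the periodic failure causes help maintenance personnel carry out maintenance and upgrade measures, such as system upgrades, based on those common problems.
In step S50, when the server detects that the model early warning level or the component early warning level meets the preset report level, it should promptly form the periodic failure causes so that maintenance personnel can identify the problems common to a model or component type from them.
S60. Count the number of failure occurrences corresponding to each periodic failure cause, and sort all the failure occurrence counts in descending order to form a failure cause ranking table.
Specifically, the server can tag each error with an error type, count the occurrences of each error type tag, and thereby form the failure cause ranking table (for example, sorted by error count in descending order).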
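The counting and descending sort of step S60 can be sketched with the standard library's `collections.Counter`. The function name and the choice to break ties alphabetically are assumptions; the source only requires descending order by occurrence count.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_failure_causes(causes: Iterable[str]) -> List[Tuple[str, int]]:
    """Count each error type tag and return (cause, count) pairs, most frequent first."""
    counts = Counter(causes)
    # Sort by count descending; ties are broken alphabetically (an assumption).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Each element of the input is one error type tag taken from a record of the current timing period.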
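S70. Add the failure cause ranking table to the preset periodic failure analysis template to form a periodic failure analysis report.
Here, the preset periodic failure analysis template is a template, set according to the actual application scenario, into which error information is added in a form suitable for maintenance personnel to read. The server forms the periodic failure analysis report by adding the information obtained above to the preset periodic failure analysis template.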
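A minimal sketch of step S70 using `string.Template` follows. The template wording, the `$period` and `$ranking` placeholders, and the function name `build_report` are all hypothetical; the patent does not specify what the preset template looks like.

```python
from string import Template
from typing import List, Tuple

# Hypothetical preset template; the real layout would be set per deployment.
REPORT_TEMPLATE = Template(
    "Periodic failure analysis report\n"
    "Period: $period\n"
    "Failure causes (most frequent first):\n"
    "$ranking\n"
)


def build_report(period: str, ranking: List[Tuple[str, int]]) -> str:
    """Fill the template with the period label and the failure cause ranking table."""
    rows = "\n".join(
        f"{position}. {cause}: {count}"
        for position, (cause, count) in enumerate(ranking, start=1)
    )
    return REPORT_TEMPLATE.substitute(period=period, ranking=rows)
```

Keeping the template separate from the ranking logic lets the report layout change without touching the statistics code.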
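In the server failure early warning method provided by this embodiment, the server obtains the current model failure rate or current component failure rate within the timing period and maps it to different early warning levels, so that different component problems affecting safe operation are handled flexibly and the normal, stable operation of the machines is safeguarded. In addition, the server can obtain a periodic failure analysis report based on the preset report level, which helps maintenance personnel identify problems common to a model or component type from the report and take maintenance or upgrade measures in time, reducing the current model failure rate or current component failure rate caused by problems common to batches of machines and improving the applicability of the machines.
In an embodiment, as shown in FIG. 3, step S40, that is, obtaining the model early warning level or the component early warning level, specifically includes the following steps:
S41. Obtain the number of online models, the number of problem models, and the number of problem components corresponding to each online model within the timing period.
Here, the online model data table is a status table recording the machines that have been online during the timing period. For example, if machine A has been online during the timing period, the online status of machine A in the online model data table is updated to "logged in". Understandably, at the end of each timing period the server automatically resets the online status of every machine in the online model data table to "not logged in", so that in the new timing period it can count the machines that have been online and determine that count as the number of online models. In other words, the number of online models is defined as the total number of machines whose online status in the online model data table is "logged in" during the current timing period.
The model maintenance record table records the problems that occur in machine components and their specific causes. Each entry also includes the machine ID, model, component ID, component type, and failure cause, so that the server can later use the table to obtain the number of problem models for a given model or the number of problem components for a given component type. For example, the server can filter the model maintenance record table by model and take the total number of failures of that model within the current timing period as the number of problem models. Likewise, the server can filter the table by component type and take the total number of failures of that component type within the current timing period as the number of problem components.
In step S41, the server can promptly obtain, from the online model data table and the model maintenance record table, the number of online models, the number of problem models for a given model, and the number of problem components for a given component type, which avoids manual filtering and calculation and is convenient and accurate.
S42. Obtain the current model failure rate and the current component failure rate within the timing period based on the number of online models, the number of problem models, and the number of problem components.
Here, the current model failure rate is the number of problem models as a percentage of the number of online models, and the current component failure rate is the number of problem components as a percentage of the number of online models.
Specifically, let N be the number of online models within the timing period, n the number of problem models, and m the number of problem components.
With a monthly timing period, the overall failure rate and the component failure rate of the machine model are:
current model failure rate = n/N × 100%
current component failure rate = m/N × 100%
In step S42, the server can quickly obtain the current model failure rate and the current component failure rate from the preset formulas, providing the data needed to determine the early warning levels from these rates.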
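The two formulas of step S42 translate directly into code. In this sketch the function name is hypothetical, and rejecting a period with no online machines is an added assumption, since the source does not say how N = 0 should be handled.

```python
from typing import Tuple


def failure_rates(online: int, problem_models: int, problem_components: int) -> Tuple[float, float]:
    """Return (current model failure rate, current component failure rate) in percent.

    online             -- N, number of online models in the timing period
    problem_models     -- n, number of problem models
    problem_components -- m, number of problem components
    """
    if online <= 0:
        raise ValueError("the number of online models must be positive")
    model_rate = 100.0 * problem_models / online
    component_rate = 100.0 * problem_components / online
    return model_rate, component_rate
```

For example, with N = 200 online machines, n = 10 problem machines, and m = 6 problem components, the rates are 5% and 3%.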
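S43. Obtain the model early warning level based on the number of online models corresponding to the timing period and the current model failure rate, and perform a level early warning response based on the model early warning level.
In step S43, the server sets different model early warning levels, and different level early warning responses, for different current model failure rates. Flexible level early warning responses can then be used to maintain the machines: critical machine problems are maintained promptly, while minor machine problems are maintained together at a preset time, which effectively safeguards normal machine operation and reduces running time lost to maintenance.
S44. Obtain the component early warning level based on the number of online models corresponding to the timing period and the current component failure rate, and perform a level early warning response based on the component early warning level.
Here, component early warning levels are safety levels preset by the server and divided according to the current component failure rate. Different safety levels trigger different early warning response behaviors, and each such behavior is a level early warning response. For example:
Component early warning levels: first-level, second-level, and third-level early warning.
Level early warning responses: for a first-level early warning, maintenance is carried out immediately.
For a second-level early warning, maintenance can be carried out at the preset second-level periodic response time, 8 p.m. every day.
For a third-level early warning, maintenance can be carried out at the preset third-level periodic response time, 8 p.m. every Friday.
In step S44, the server sets different early warning levels, and different level early warning responses, for different current component failure rates. Flexible level early warning responses can then be used to maintain machine components: critical component problems are maintained promptly, while minor component problems are maintained together at a preset time, which effectively safeguards normal machine operation and reduces running time lost to maintenance.
In steps S41 to S44, the server can promptly obtain, from the online model data table and the model maintenance record table, the number of online models, the number of problem models for a given model, and the number of problem components for a given component type, which avoids manual filtering and calculation and is convenient and accurate. The server can quickly obtain the current model failure rate and the current component failure rate from the preset formulas, providing the data needed to determine the early warning levels from these rates. The server sets different early warning levels and different level early warning responses for different current model failure rates and current component failure rates, so that critical machine or component problems are maintained promptly while minor ones are maintained together at a preset time, which effectively safeguards normal machine operation and reduces running time lost to maintenance.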
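In an embodiment, as shown in FIG. 4, step S43, that is, obtaining the model early warning level based on the number of online models corresponding to the timing period and the current model failure rate and performing a level early warning response based on the model early warning level, specifically includes the following steps:
S431. If the number of online models within the timing period is greater than the preset comparison number and the current model failure rate is greater than the preset first failure rate, obtain a first-level model early warning and perform a first-level early warning response based on it.
Here, the preset comparison number is a number set in advance according to the actual application scenario and is not specifically limited here; the same applies to the preset first failure rate.
A first-level model early warning is an indication of warning urgency set according to the actual application scenario. In this embodiment, a warning with a larger level number may be set as a more urgent event. Accordingly, the first-level early warning response is the response mode corresponding to the first-level model early warning, and may specifically include various corresponding response measures.
S432. If the number of online models within the timing period is not greater than the preset comparison number and the current model failure rate is greater than the preset first failure rate, obtain the previous-period model failure rate.
Here, the previous-period model failure rate is the model failure rate of the current online model in the preceding period.
S433. If the previous-period model failure rate is greater than the preset first failure rate, obtain a first-level model early warning and perform a first-level early warning response based on it.
S434. If the previous-period model failure rate is not greater than the preset first failure rate, obtain a second-level model early warning and perform a second-level early warning response based on it.
Here, a second-level model early warning is a warning that is either more urgent or less urgent than a first-level model early warning. In this embodiment, it is specifically a warning more urgent than the first-level model early warning. The second-level early warning response corresponding to the second-level model early warning follows in the same way; that is, the urgency of the second-level early warning response should be greater than that of the first-level early warning response.
In steps S431 to S434, the server sets different model early warning levels, and different level early warning responses, for different current model failure rates. Flexible level early warning responses can then be used to maintain the machines: critical machine problems are maintained promptly, while minor machine problems are maintained together at a preset time, which effectively safeguards normal machine operation and reduces running time lost to maintenance.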
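The branching in steps S431 to S434 can be sketched as a single decision function. The function and parameter names are hypothetical. Returning `None` when the current failure rate does not exceed the threshold is an assumption: these steps do not say what happens in that case.

```python
from typing import Optional


def model_warning_level(
    online_count: int,
    current_rate: float,
    previous_rate: float,
    comparison_count: int,
    first_failure_rate: float,
) -> Optional[int]:
    """Return 1 or 2 for a first- or second-level model early warning, else None."""
    if current_rate <= first_failure_rate:
        return None  # not covered by S431 to S434 (assumption: no warning here)
    if online_count > comparison_count:
        return 1  # S431: enough machines online and the rate is above the threshold
    # S432: few machines online, so consult the previous period's failure rate
    if previous_rate > first_failure_rate:
        return 1  # S433: the rate was also above the threshold last period
    return 2  # S434: the rate is above the threshold only in the current period
```

Checking the previous period when few machines are online guards against a small sample producing a high rate by chance. The component steps S441 to S444 have the same shape, with the preset second failure rate as the threshold.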
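In an embodiment, as shown in FIG. 5, step S44, that is, obtaining the component early warning level based on the number of online models corresponding to the timing period and the current component failure rate and performing a level early warning response based on the component early warning level, specifically includes the following steps:
S441. If the number of online models within the timing period is greater than the preset comparison number and the current component failure rate is greater than the preset second failure rate, obtain a first-level component early warning and perform a first-level early warning response based on it.
Here, a first-level component early warning is an indication of warning urgency set according to the actual application scenario. In this embodiment, a warning with a larger level number may be set as a more urgent event. Accordingly, the first-level early warning response is the response mode corresponding to the first-level component early warning, and may specifically include various corresponding response measures.
S442. If the number of online models within the timing period is not greater than the preset comparison number and the current component failure rate is greater than the preset second failure rate, obtain the previous-period component failure rate.
Here, the previous-period component failure rate is the component failure rate of the current component in the preceding period.
S443. If the previous-period component failure rate is greater than the preset second failure rate, obtain a first-level component early warning and perform a first-level early warning response based on it.
S444. If the previous-period component failure rate is not greater than the preset second failure rate, obtain a second-level component early warning and perform a second-level early warning response based on it.
In steps S441 to S444, the server sets different early warning levels, and different level early warning responses, for different current component failure rates. Flexible level early warning responses can then be used to maintain machine components: critical component problems are maintained promptly, while minor component problems are maintained together at a preset time, which effectively safeguards normal machine operation and reduces running time lost to maintenance.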
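In an embodiment, as shown in FIG. 6, before step S10, that is, before obtaining the server failure early warning request, the server failure early warning method further includes the following steps:
S111. Obtain a failure report request. The failure report request includes a failure report date and failure report information, and the failure report information includes a machine ID, a component ID, and a failure cause.
Here, the failure report date is the date on which the machine or component failed and the failure was reported to the server. The failure report information carries the specifics of the failure, such as its cause. The machine ID and component ID are unique identifiers that the server uses to distinguish each machine or component. The failure cause is the specific reason the failure occurred.
S112. Obtain the model corresponding to the machine ID and the component type corresponding to the component ID.
Here, each machine ID corresponds to one model, and each component ID corresponds to one component type. Obtaining the model for the machine ID and the component type for the component ID makes it possible to later compile per-model or per-component-type statistics on the failures that occur.
S113. Associate and save the failure report date, machine ID, model, component ID, component type, and failure cause to form the current model maintenance information, and add the current model maintenance information to the model maintenance record table.
Here, the current model maintenance information includes all the information related to the failure, such as the machine ID, model, component ID, and component type.
The model maintenance record table is used to record and maintain information on each model or component, and helps maintenance personnel look up and locate problems.
In steps S111 to S113, the server can record, based on the failure report request, the problem model among the online models, the problem component corresponding to that problem model, and the failure cause corresponding to that problem component, so that the server can later count the problem models and problem components of the current timing period, obtain the periodic failure analysis report, and find common problems.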
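Steps S111 to S113 amount to enriching a failure report with the model and component type and appending it to a table. The sketch below uses in-memory dictionaries and a list as stand-ins. The lookup tables, the sample IDs, and the names `MaintenanceRecord` and `record_failure` are hypothetical; a real deployment would presumably use a database.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List

# Hypothetical lookup tables for step S112.
MODEL_BY_MACHINE: Dict[str, str] = {"m-001": "model-A", "m-002": "model-A"}
TYPE_BY_COMPONENT: Dict[str, str] = {"c-101": "power supply", "c-102": "fan"}


@dataclass(frozen=True)
class MaintenanceRecord:
    report_date: date
    machine_id: str
    model: str
    component_id: str
    component_type: str
    cause: str


def record_failure(
    table: List[MaintenanceRecord],
    report_date: date,
    machine_id: str,
    component_id: str,
    cause: str,
) -> MaintenanceRecord:
    """Resolve model and component type (S112), then append the record to the table (S113)."""
    record = MaintenanceRecord(
        report_date=report_date,
        machine_id=machine_id,
        model=MODEL_BY_MACHINE[machine_id],
        component_id=component_id,
        component_type=TYPE_BY_COMPONENT[component_id],
        cause=cause,
    )
    table.append(record)
    return record
```

Storing the resolved model and component type with each record is what later allows filtering by model or component type without repeating the lookups.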
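In an embodiment, as shown in FIG. 7, before step S10, that is, before obtaining the server failure early warning request, the server failure early warning method further includes the following steps:
S121. Within the timing period corresponding to the current system time, count the online models whose login status in the online model data table is "logged in", and determine that count as the number of online models.
Specifically, not every model is necessarily online during the timing period. Only models that are online within the timing period are recorded in the online model data table for the current timing period, with their login status updated to "logged in".
S122. Within the timing period corresponding to the current system time, sum the number of machines in the model maintenance record table corresponding to the problem model of each online model and determine the sum as the number of problem models; and sum the number of components corresponding to the problem component type of each online model and determine the sum as the number of problem components.
In steps S121 to S122, the server can promptly obtain the number of online models from the online model data table, and the number of problem models and the number of problem components from the model maintenance record table, which avoids manual counting, increases the degree of automation, and is accurate and efficient.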
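The counting in steps S121 and S122 can be sketched over plain lists of dictionaries. The row keys (`model`, `logged_in`, `date`, `component_type`) and the function names are assumptions. The sketch counts one problem per maintenance record; whether repeated failures of the same machine should count once or several times is not specified in the source.

```python
from datetime import date
from typing import Dict, List, Optional


def count_online(online_table: List[Dict], model: str) -> int:
    """S121: number of machines of this model whose status is 'logged in'."""
    return sum(1 for row in online_table if row["model"] == model and row["logged_in"])


def count_problems(
    records: List[Dict],
    start: date,
    end: date,
    model: Optional[str] = None,
    component_type: Optional[str] = None,
) -> int:
    """S122: number of maintenance records in [start, end] matching the given filters."""
    total = 0
    for record in records:
        if not (start <= record["date"] <= end):
            continue
        if model is not None and record["model"] != model:
            continue
        if component_type is not None and record["component_type"] != component_type:
            continue
        total += 1
    return total


def reset_logins(online_table: List[Dict]) -> None:
    """At the end of a timing period, mark every machine as not logged in."""
    for row in online_table:
        row["logged_in"] = False
```

Calling `count_problems` with only `model` set gives the number of problem models, and with only `component_type` set gives the number of problem components.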
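In an embodiment, as shown in FIG. 8, step S50, that is, performing periodic failure analysis on the online models in the model maintenance record table and obtaining the periodic failure analysis report if the model early warning level or the component early warning level reaches the preset report level, specifically includes the following steps:
S51. If the model early warning level is the preset report level, obtain, based on the model maintenance record table, the model failure cause corresponding to the problem model of the online model.
Here, the preset report level is the level at which a report needs to be generated for the model early warning level. Because failures of different models or components differ in urgency, there is no need to generate a report for every failure. Only the models that belong to the preset report level are added to the model maintenance record table, so as to draw the attention of the maintenance personnel who consult the table.
Or,
S52. If the component early warning level is the preset report level, obtain, based on the model maintenance record table, the component failure cause corresponding to the problem component type of the online model.
In steps S51 to S52, the server can sort the failure causes within the periodic failure causes by cause similarity, so that maintenance personnel can promptly obtain, from the periodic failure analysis report, the problems common to the same model or the same component type within the timing period, which helps them carry out maintenance and upgrade measures, such as system upgrades, based on those common problems.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
In the server failure early warning method provided by this embodiment, the server obtains the current model failure rate or current component failure rate within the timing period and maps it to different early warning levels, so that different component problems affecting safe operation are handled flexibly and the normal, stable operation of the machines is safeguarded. In addition, the server can obtain a periodic failure analysis report based on the preset report level, which helps maintenance personnel identify problems common to a model or component type from the report and take maintenance or upgrade measures in time, reducing the current model failure rate or current component failure rate caused by problems common to batches of machines and improving the applicability of the machines.
Further, the server sets different model early warning levels, and different level early warning responses, for different current model failure rates. Flexible level early warning responses can then be used to maintain the machines: critical machine problems are maintained promptly, while minor machine problems are maintained together at a preset time, which effectively safeguards normal machine operation and reduces running time lost to maintenance.
Further, the server sets different early warning levels, and different level early warning responses, for different current component failure rates. Flexible level early warning responses can then be used to maintain machine components: critical component problems are maintained promptly, while minor component problems are maintained together at a preset time, which effectively safeguards normal machine operation and reduces running time lost to maintenance.
Further, the server can record, based on the failure report request, the problem model among the online models, the problem component corresponding to that problem model, and the failure cause corresponding to that problem component, so that the server can later count the problem models and problem components of the current timing period, obtain the periodic failure analysis report, and find common problems.
Further, the server can promptly obtain the number of online models from the online model data table, and the number of problem models and the number of problem components from the model maintenance record table, which avoids manual counting, increases the degree of automation, and is accurate and efficient.
Further, the server can sort the failure causes within the periodic failure causes by cause similarity, so that maintenance personnel can promptly obtain, from the periodic failure analysis report, the problems common to the same model or the same component type within the timing period, which helps them carry out maintenance and upgrade measures, such as system upgrades, based on those common problems.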
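In an embodiment, a server failure early warning apparatus is provided, which corresponds one-to-one to the server failure early warning method in the above embodiments. As shown in FIG. 9, the apparatus includes an early warning request acquisition module 10, a monitoring data acquisition module 20, a regular task activation module 30, an early warning level acquisition module 40, a failure cause extraction module 50, a cause ranking table formation module 60, and an analysis report formation module 70. The functional modules are described in detail as follows:
The early warning request acquisition module 10 is configured to obtain a server failure early warning request, the server failure early warning request including a regular task and a timing period, wherein the regular task includes reading log information from the server's system event log library.
The monitoring data acquisition module 20 is configured to monitor the server hardware status through IPMI commands, obtain hardware monitoring data, and add the hardware monitoring data to the log information.
The regular task activation module 30 is configured to activate the regular task if the current system time satisfies the timing period, and to obtain the log information corresponding to the timing period.
The early warning level acquisition module 40 is configured to obtain a model early warning level or a component early warning level based on the log information corresponding to the timing period.
The failure cause extraction module 50 is configured to extract, based on the model maintenance record table, the periodic failure causes of each online model within the timing period if the model early warning level or the component early warning level reaches the preset report level.
The cause ranking table formation module 60 is configured to count the number of failure occurrences corresponding to each periodic failure cause, and to sort all the failure occurrence counts in descending order to form a failure cause ranking table.
The analysis report formation module 70 is configured to add the failure cause ranking table to the preset periodic failure analysis template to form a periodic failure analysis report.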
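Preferably, the early warning level acquisition module 40 includes:
a record table statistics unit, configured to periodically compile statistics from the online model data table and the model maintenance record table, and to obtain the number of online models, the number of problem models, and the number of problem components corresponding to each online model within the timing period;
a component failure rate acquisition unit, configured to obtain the current model failure rate and the current component failure rate within the timing period based on the number of online models, the number of problem models, and the number of problem components;
a model level acquisition unit, configured to obtain the model early warning level based on the number of online models corresponding to the timing period and the current model failure rate, and to perform a level early warning response based on the model early warning level;
a component level acquisition unit, configured to obtain the component early warning level based on the number of online models corresponding to the timing period and the current component failure rate, and to perform a level early warning response based on the component early warning level.
Preferably, the model level acquisition unit includes:
a first-level early warning acquisition unit, configured to obtain a first-level model early warning if the number of online models within the timing period is greater than the preset comparison number and the current model failure rate is greater than the preset first failure rate, and to perform a first-level early warning response based on the first-level model early warning;
a model failure rate acquisition unit, configured to obtain the previous-period model failure rate if the number of online models within the timing period is not greater than the preset comparison number and the current model failure rate is greater than the preset first failure rate;
a first-level response unit, configured to obtain a first-level model early warning if the previous-period model failure rate is greater than the preset first failure rate, and to perform a first-level early warning response based on the first-level model early warning;
a second-level response unit, configured to obtain a second-level model early warning if the previous-period model failure rate is not greater than the preset first failure rate, and to perform a second-level early warning response based on the second-level model early warning.
Preferably, the component level acquisition unit includes:
a component early warning acquisition unit, configured to obtain a first-level component early warning if the number of online models within the timing period is greater than the preset comparison number and the current component failure rate is greater than the preset second failure rate, and to perform a first-level early warning response based on the first-level component early warning;
a component failure rate acquisition unit, configured to obtain the previous-period component failure rate if the number of online models within the timing period is not greater than the preset comparison number and the current component failure rate is greater than the preset second failure rate;
an early warning response unit, configured to obtain a first-level component early warning if the previous-period component failure rate is greater than the preset second failure rate, and to perform a first-level early warning response based on the first-level component early warning;
a second-level response unit, configured to obtain a second-level component early warning if the previous-period component failure rate is not greater than the preset second failure rate, and to perform a second-level early warning response based on the second-level component early warning.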
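Preferably, the server failure early warning apparatus further includes:
a report request acquisition module, configured to obtain a failure report request, the failure report request including a failure report date and failure report information, and the failure report information including a machine ID, a component ID, and a failure cause;
a component type acquisition module, configured to obtain the model corresponding to the machine ID and the component type corresponding to the component ID;
a maintenance information formation module, configured to associate and save the failure report date, machine ID, model, component ID, component type, and failure cause to form the current model maintenance information, and to add the current model maintenance information to the model maintenance record table.
Preferably, the record table statistics unit includes:
a model count determination unit, configured to count, within the timing period corresponding to the current system time, the online models whose login status in the online model data table is "logged in", and to determine that count as the number of online models;
a component count determination unit, configured to sum, within the timing period corresponding to the current system time, the number of machines in the model maintenance record table corresponding to the problem model of each online model and determine the sum as the number of problem models, and to sum the number of components corresponding to the problem component type of each online model and determine the sum as the number of problem components.
Preferably, the failure cause extraction module includes:
a model cause acquisition unit, configured to obtain, based on the model maintenance record table, the model failure cause corresponding to the problem model of the online model if the model early warning level is the preset report level;
or,
a component cause acquisition unit, configured to obtain, based on the model maintenance record table, the component failure cause corresponding to the problem component type of the online model if the component early warning level is the preset report level.
For the specific limitations of the server failure early warning apparatus, refer to the limitations of the server failure early warning method above; they are not repeated here. Each module in the apparatus can be implemented in whole or in part by software, hardware, or a combination of the two. The modules can be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and execute the corresponding operations.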
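In an embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database stores data related to the server failure early warning method. The network interface is used to communicate with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a server failure early warning method.
In an embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the server failure early warning method of the above embodiments, for example, steps S10 to S70 shown in FIG. 2. Alternatively, when executing the computer-readable instructions, the processor implements the functions of the modules/units of the server failure early warning apparatus in the above embodiments, for example, the functions of modules 10 to 70 shown in FIG. 9. To avoid repetition, details are not described again here. The readable storage medium in this embodiment includes non-volatile readable storage media and volatile readable storage media.
In an embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When executed by a processor, the computer-readable instructions implement the server failure early warning method of the above embodiments, for example, steps S10 to S70 shown in FIG. 2. Alternatively, when executed by a processor, the computer-readable instructions implement the functions of the modules/units of the server failure early warning apparatus in the above apparatus embodiment, for example, the functions of modules 10 to 70 shown in FIG. 9. To avoid repetition, details are not described again here.
A person of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments can be completed by computer-readable instructions directing the relevant hardware. The computer-readable instructions can be stored in a non-volatile readable storage medium or in a volatile readable storage medium and, when executed, may include the processes of the method embodiments above. Any reference to memory, storage, a database, or another medium used in the embodiments of this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
A person skilled in the art can clearly understand that, for convenience and brevity of description, only the above division into functional units and modules is used as an example. In practice, the functions can be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some of their technical features replaced with equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.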

Claims (20)

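  1. A server failure early warning method, comprising:
    obtaining a server failure early warning request, the server failure early warning request comprising a regular task and a timing period, wherein the regular task comprises reading log information from the server's system event log library;
    monitoring the server hardware status through IPMI commands, obtaining hardware monitoring data, and adding the hardware monitoring data to the log information;
    if the current system time satisfies the timing period, activating the regular task and obtaining the log information corresponding to the timing period;
    obtaining a model early warning level or a component early warning level based on the log information corresponding to the timing period;
    if the model early warning level or the component early warning level reaches a preset report level, extracting, based on a model maintenance record table, the periodic failure causes of each online model within the timing period;
    counting the number of failure occurrences corresponding to each periodic failure cause, and sorting all the failure occurrence counts in descending order to form a failure cause ranking table;
    adding the failure cause ranking table to a preset periodic failure analysis template to form a periodic failure analysis report.
  2. The server failure early warning method according to claim 1, wherein obtaining the model early warning level or the component early warning level comprises:
    obtaining the number of online models, the number of problem models, and the number of problem components corresponding to each online model within the timing period;
    obtaining a current model failure rate and a current component failure rate within the timing period based on the number of online models, the number of problem models, and the number of problem components;
    obtaining the model early warning level based on the number of online models corresponding to the timing period and the current model failure rate, and performing a level early warning response based on the model early warning level;
    obtaining the component early warning level based on the number of online models corresponding to the timing period and the current component failure rate, and performing a level early warning response based on the component early warning level.
  3. The server failure early warning method according to claim 2, wherein obtaining the model early warning level based on the number of online models corresponding to the timing period and the current model failure rate, and performing a level early warning response based on the model early warning level, comprises:
    if the number of online models within the timing period is greater than a preset comparison number and the current model failure rate is greater than a preset first failure rate, obtaining a first-level model early warning and performing a first-level early warning response based on the first-level model early warning;
    if the number of online models within the timing period is not greater than the preset comparison number and the current model failure rate is greater than the preset first failure rate, obtaining a previous-period model failure rate;
    if the previous-period model failure rate is greater than the preset first failure rate, obtaining a first-level model early warning and performing a first-level early warning response based on the first-level model early warning;
    if the previous-period model failure rate is not greater than the preset first failure rate, obtaining a second-level model early warning and performing a second-level early warning response based on the second-level model early warning.
  4. The server failure early warning method according to claim 2, wherein obtaining the component early warning level based on the number of online models corresponding to the timing period and the current component failure rate, and performing a level early warning response based on the component early warning level, comprises:
    if the number of online models within the timing period is greater than the preset comparison number and the current component failure rate is greater than a preset second failure rate, obtaining a first-level component early warning and performing a first-level early warning response based on the first-level component early warning;
    if the number of online models within the timing period is not greater than the preset comparison number and the current component failure rate is greater than the preset second failure rate, obtaining a previous-period component failure rate;
    if the previous-period component failure rate is greater than the preset second failure rate, obtaining a first-level component early warning and performing a first-level early warning response based on the first-level component early warning;
    if the previous-period component failure rate is not greater than the preset second failure rate, obtaining a second-level component early warning and performing a second-level early warning response based on the second-level component early warning.
  5. The server failure early warning method according to claim 2, wherein, before obtaining the server failure early warning request, the server failure early warning method further comprises:
    obtaining a failure report request, the failure report request comprising a failure report date and failure report information, the failure report information comprising a machine ID, a component ID, and a failure cause;
    obtaining the model corresponding to the machine ID and the component type corresponding to the component ID;
    associating and saving the failure report date, the machine ID, the model, the component ID, the component type, and the failure cause to form current model maintenance information, and adding the current model maintenance information to the model maintenance record table.
  6. The server failure early warning method according to claim 1, wherein, before obtaining the server failure early warning request, the server failure early warning method further comprises:
    counting, within the timing period corresponding to the current system time, the online models whose login status in the online model data table is "logged in", and determining that count as the number of online models;
    summing, within the timing period corresponding to the current system time, the number of machines in the model maintenance record table corresponding to the problem model of each online model and determining the sum as the number of problem models, and summing the number of components corresponding to the problem component type of each online model and determining the sum as the number of problem components.
  7. The server failure early warning method according to claim 1, wherein extracting, based on the model maintenance record table, the periodic failure causes of each online model within the timing period comprises:
    if the model early warning level is the preset report level, obtaining, based on the model maintenance record table, the model failure cause corresponding to the problem model of the online model;
    or,
    if the component early warning level is the preset report level, obtaining, based on the model maintenance record table, the component failure cause corresponding to the problem component type of the online model.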
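  8. A server failure early warning apparatus, comprising:
    an early warning request acquisition module, configured to obtain a server failure early warning request, the server failure early warning request comprising a regular task and a timing period, wherein the regular task comprises reading log information from the server's system event log library;
    a monitoring data acquisition module, configured to monitor the server hardware status through IPMI commands, obtain hardware monitoring data, and add the hardware monitoring data to the log information;
    a regular task activation module, configured to activate the regular task if the current system time satisfies the timing period, and to obtain the log information corresponding to the timing period;
    an early warning level acquisition module, configured to obtain a model early warning level or a component early warning level based on the log information corresponding to the timing period;
    a failure cause extraction module, configured to extract, based on a model maintenance record table, the periodic failure causes of each online model within the timing period if the model early warning level or the component early warning level reaches a preset report level;
    a cause ranking table formation module, configured to count the number of failure occurrences corresponding to each periodic failure cause, and to sort all the failure occurrence counts in descending order to form a failure cause ranking table;
    an analysis report formation module, configured to add the failure cause ranking table to a preset periodic failure analysis template to form a periodic failure analysis report.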
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其中,所述处理器执行所述计算机可读指令时实现如下步骤:
    获取服务器故障预警请求,所述服务器故障预警请求包括定期任务和定时周期,其中,所述定期任务包括读取服务器系统事件日志库的日志信息;
    通过IPMI命令对服务器硬件状态进行监测,获取硬件监测数据,将所述硬件监测数 据添加到所述日志信息中;
    若系统当前时间满足所述定时周期,则激活所述定期任务,获取所述定时周期对应的所述日志信息;
    基于所述定时周期对应的所述日志信息,获取机型预警等级或部件预警等级;
    若所述机型预警等级或所述部件预警等级达到预设报告等级,则基于机型维护记录表,提取每一所述在线机型在所述定时周期内的周期故障原因;
    统计每一所述定时周期故障原因对应的故障发生次数,按降序排列所有所述故障发生次数,形成故障原因排序表;
    将所述故障原因排序表添加到预设周期故障分析模板中,形成周期故障分析报告。
  10. 如权利要求9所述的计算机设备,其中,所述获取机型预警等级或部件预警等级,包括:
    获取每一在线机型在所述定时周期内对应的在线机型数量、问题机型数量和问题部件数量;
    基于所述在线机型数量、所述问题机型数量和所述问题部件数量,获取所述定时周期内的当前机型故障率和当前部件故障率;
    基于所述定时周期对应的所述在线机型数量和所述当前机型故障率,获取机型预警等级,基于所述机型预警等级进行等级预警响应;
    基于所述定时周期对应的所述在线机型数量和所述当前部件故障率,获取部件预警等级,基于所述部件预警等级进行等级预警响应。
  11. 如权利要求10所述的计算机设备,其中,所述基于所述定时周期对应的所述在线机型数量和所述当前机型故障率,获取机型预警等级,基于所述机型预警等级进行等级预警响应,包括:
    若所述定时周期内的所述在线机型数量大于预设对比数量,且所述当前机型故障率大于预设第一故障率,则获取一级机型预警,基于所述一级机型预警进行一级预警响应;
    若所述定时周期内的所述在线机型数量不大于所述预设对比数量,且所述当前机型故障率大于所述预设第一故障率,则获取前期机型故障率;
    若所述前期机型故障率大于所述预设第一故障率,则获取一级机型预警,基于所述一级机型预警进行一级预警响应;
    若所述前期机型故障率不大于所述预设第一故障率,则获取二级机型预警,基于所述二级机型预警进行二级预警响应。
  12. 如权利要求10所述的计算机设备,其中,所述基于所述定时周期对应的所述在线机型数量和所述当前部件故障率,获取部件预警等级,基于所述部件预警等级进行等级预警响应,包括:
    若所述定时周期内的所述在线机型数量大于所述预设对比数量,且所述当前部件故障率大于预设第二故障率,则获取一级部件预警,基于所述一级部件预警进行一级预警响应;
    若所述定时周期内的所述机型数量不大于所述预设对比数量,且所述当前部件故障率大于所述预设第二故障率,则获取前期部件故障率;
    若所述前期部件故障率大于所述预设第二故障率,则获取一级部件预警,基于所述一级部件预警进行一级预警响应
    若所述前期部件故障率未大于所述预设第二故障率,则获取二级部件预警,基于所述二级部件预警进行二级预警响应。
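  13. The computer device according to claim 10, wherein, before the acquiring a server failure warning request, the processor, when executing the computer-readable instructions, further implements the following steps:
    acquiring a failure report request, the failure report request comprising a failure report date and failure report information, the failure report information comprising a machine ID, a component ID, and a failure cause;
    acquiring the model corresponding to the machine ID and the component type corresponding to the component ID;
    storing the failure report date, the machine ID, the model, the component ID, the component type, and the failure cause in association with one another to form current model maintenance information, and adding the current model maintenance information to the model maintenance record table.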
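One way to picture the maintenance record described here is a small immutable record type plus a function that resolves the model and component type before appending. The class name, field names, and the two lookup dictionaries are hypothetical; the claim only says that the six values are stored in association and added to the model maintenance record table.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MaintenanceRecord:
    """One row of the model maintenance record table (illustrative layout)."""
    report_date: date
    machine_id: str
    model: str
    component_id: str
    component_type: str
    cause: str


def record_failure(table, report_date, machine_id, component_id, cause,
                   machine_models, component_types):
    """Resolve model and component type, then append a record to the table."""
    record = MaintenanceRecord(
        report_date=report_date,
        machine_id=machine_id,
        model=machine_models[machine_id],            # machine ID -> model
        component_id=component_id,
        component_type=component_types[component_id],  # component ID -> type
        cause=cause,
    )
    table.append(record)
    return record
```

Resolving the model and component type at write time means later per-period statistics can group records by model without another lookup.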
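  14. The computer device according to claim 9, wherein, before the acquiring a server failure warning request, the processor, when executing the computer-readable instructions, further implements the following steps:
    counting, within the timing period corresponding to the current system time, the number of machines of each online model whose login status in an online model data table is logged in, and determining that number as the number of online models;
    summing, within the timing period corresponding to the current system time, the numbers of machines recorded in the model maintenance record table as problem machines of each online model, and determining the sum as the number of problem models, and summing the numbers of components recorded as problem component types of each online model, and determining the sum as the number of problem components.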
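The counting steps can be sketched with plain dictionaries. The row layouts (`model`, `status`, `date`, `machine_id`, `component_id` keys), the `"logged_in"` status value, and the choice to count each distinct machine or component once per period are assumptions made for illustration; the claim does not say how repeated reports for the same machine are handled.

```python
from collections import defaultdict


def count_online_models(online_table):
    """Count logged-in machines per model from the online model data table."""
    counts = defaultdict(int)
    for row in online_table:
        if row["status"] == "logged_in":
            counts[row["model"]] += 1
    return dict(counts)


def count_problems(maintenance_table, period_start, period_end):
    """Return (problem machines per model, problem components per model).

    Dates are ISO strings such as "2020-02-03", which compare correctly as text.
    Each distinct machine ID or component ID is counted once (an assumption).
    """
    machines = defaultdict(set)
    components = defaultdict(set)
    for rec in maintenance_table:
        if period_start <= rec["date"] <= period_end:
            machines[rec["model"]].add(rec["machine_id"])
            components[rec["model"]].add(rec["component_id"])
    problem_models = {model: len(ids) for model, ids in machines.items()}
    problem_components = {model: len(ids) for model, ids in components.items()}
    return problem_models, problem_components
```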
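  15. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring a server failure warning request, the server failure warning request comprising a periodic task and a timing period, wherein the periodic task comprises reading log information from a server system event log library;
    monitoring the server hardware status through IPMI commands, acquiring hardware monitoring data, and adding the hardware monitoring data to the log information;
    if the current system time satisfies the timing period, activating the periodic task and acquiring the log information corresponding to the timing period;
    acquiring a model warning level or a component warning level based on the log information corresponding to the timing period;
    if the model warning level or the component warning level reaches a preset report level, extracting, based on a model maintenance record table, the periodic failure causes of each online model within the timing period;
    counting the number of failure occurrences corresponding to each of the periodic failure causes, and sorting all the numbers of failure occurrences in descending order to form a failure cause ranking table;
    adding the failure cause ranking table to a preset periodic failure analysis template to form a periodic failure analysis report.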
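  16. The readable storage medium according to claim 15, wherein the acquiring a model warning level or a component warning level comprises:
    acquiring, for each online model, the number of online models, the number of problem models, and the number of problem components corresponding to the timing period;
    acquiring a current model failure rate and a current component failure rate within the timing period based on the number of online models, the number of problem models, and the number of problem components;
    acquiring a model warning level based on the number of online models corresponding to the timing period and the current model failure rate, and performing a level warning response based on the model warning level;
    acquiring a component warning level based on the number of online models corresponding to the timing period and the current component failure rate, and performing a level warning response based on the component warning level.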
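  17. The readable storage medium according to claim 16, wherein the acquiring a model warning level based on the number of online models corresponding to the timing period and the current model failure rate, and performing a level warning response based on the model warning level, comprises:
    if the number of online models within the timing period is greater than a preset comparison number and the current model failure rate is greater than a preset first failure rate, acquiring a first-level model warning and performing a first-level warning response based on the first-level model warning;
    if the number of online models within the timing period is not greater than the preset comparison number and the current model failure rate is greater than the preset first failure rate, acquiring a previous-period model failure rate;
    if the previous-period model failure rate is greater than the preset first failure rate, acquiring a first-level model warning and performing a first-level warning response based on the first-level model warning;
    if the previous-period model failure rate is not greater than the preset first failure rate, acquiring a second-level model warning and performing a second-level warning response based on the second-level model warning.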
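  18. The readable storage medium according to claim 16, wherein the acquiring a component warning level based on the number of online models corresponding to the timing period and the current component failure rate, and performing a level warning response based on the component warning level, comprises:
    if the number of online models within the timing period is greater than the preset comparison number and the current component failure rate is greater than a preset second failure rate, acquiring a first-level component warning and performing a first-level warning response based on the first-level component warning;
    if the number of online models within the timing period is not greater than the preset comparison number and the current component failure rate is greater than the preset second failure rate, acquiring a previous-period component failure rate;
    if the previous-period component failure rate is greater than the preset second failure rate, acquiring a first-level component warning and performing a first-level warning response based on the first-level component warning;
    if the previous-period component failure rate is not greater than the preset second failure rate, acquiring a second-level component warning and performing a second-level warning response based on the second-level component warning.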
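  19. The readable storage medium according to claim 16, wherein, before the acquiring a server failure warning request, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    acquiring a failure report request, the failure report request comprising a failure report date and failure report information, the failure report information comprising a machine ID, a component ID, and a failure cause;
    acquiring the model corresponding to the machine ID and the component type corresponding to the component ID;
    storing the failure report date, the machine ID, the model, the component ID, the component type, and the failure cause in association with one another to form current model maintenance information, and adding the current model maintenance information to the model maintenance record table.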
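  20. The readable storage medium according to claim 15, wherein, before the acquiring a server failure warning request, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    counting, within the timing period corresponding to the current system time, the number of machines of each online model whose login status in an online model data table is logged in, and determining that number as the number of online models;
    summing, within the timing period corresponding to the current system time, the numbers of machines recorded in the model maintenance record table as problem machines of each online model, and determining the sum as the number of problem models, and summing the numbers of components recorded as problem component types of each online model, and determining the sum as the number of problem components.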
PCT/CN2020/117575 2020-02-27 2020-09-25 Server failure warning method, device, computer equipment and storage medium WO2021169270A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010122319.7A CN111444031A (zh) 2020-02-27 2020-02-27 Server failure warning method, device, computer equipment and storage medium
CN202010122319.7 2020-02-27

Publications (1)

Publication Number Publication Date
WO2021169270A1 (zh) 2021-09-02

Family

ID=71627068

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117575 WO2021169270A1 (zh) 2020-02-27 2020-09-25 Server failure warning method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111444031A (zh)
WO (1) WO2021169270A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
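CN114475731A (zh) * 2021-12-29 2022-05-13 卡斯柯信号有限公司 Signal equipment fault knowledge base system and implementation method thereof
CN115130702A (zh) * 2022-09-02 2022-09-30 山东汇泓纺织科技有限公司 Textile machine fault prediction system based on big data analysis
CN115242611A (zh) * 2022-07-21 2022-10-25 北京天一恩华科技股份有限公司 Network fault alarm level management method, apparatus, device and storage medium
CN115271669A (zh) * 2022-08-01 2022-11-01 成都龙祥思远科技有限公司 Maintenance method and system for an ERP server
CN115277353A (zh) * 2022-07-21 2022-11-01 西安航天发动机有限公司 Active and passive remote fault warning method for an intelligent cabinet machine
CN115860586A (zh) * 2023-03-01 2023-03-28 英迪格(天津)电气有限公司 Analysis system for railway power transformation and distribution faults
CN116090702A (zh) * 2023-01-18 2023-05-09 盐城市久泰商品混凝土有限公司 Internet of Things-based intelligent ERP data supervision system and method
CN117076253A (zh) * 2023-08-30 2023-11-17 广州逸芸信息科技有限公司 Multi-dimensional intelligent operation and maintenance system for data center services and facilities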

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
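CN111444031A (zh) * 2020-02-27 2020-07-24 平安科技(深圳)有限公司 Server failure warning method, device, computer equipment and storage medium
CN112100456A (zh) * 2020-09-16 2020-12-18 广东电网有限责任公司电力科学研究院 Method, apparatus and terminal device for judging common defects or faults of primary equipment
CN112504332B (zh) * 2020-10-16 2022-04-01 安徽中科中涣防务装备技术有限公司 Composite sensing detection and intelligent control method, system and apparatus
CN113127299A (zh) * 2021-03-30 2021-07-16 山东英信计算机技术有限公司 Server operation and maintenance method, apparatus, system and computer-readable storage medium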

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130013967A1 (en) * 2006-12-22 2013-01-10 Commvault Systems, Inc. Systems and methods for remote monitoring in a computer network
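CN108023782A (zh) * 2017-12-29 2018-05-11 华东师范大学 Equipment failure warning method based on maintenance records
CN108376107A (zh) * 2018-03-01 2018-08-07 郑州云海信息技术有限公司 Server fault detection method, apparatus, device and storage medium
CN108415789A (zh) * 2018-01-24 2018-08-17 西安交通大学 Node failure prediction system and method for large-scale hybrid heterogeneous storage systems
CN109189640A (zh) * 2018-08-24 2019-01-11 平安科技(深圳)有限公司 Server monitoring method, apparatus, computer device and storage medium
CN111444031A (zh) * 2020-02-27 2020-07-24 平安科技(深圳)有限公司 Server failure warning method, device, computer equipment and storage medium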

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105721180B (zh) * 2014-12-02 2019-06-07 中兴通讯股份有限公司 Method and server for implementing fault location
CN109376882A (zh) * 2018-12-29 2019-02-22 华润电力技术研究院有限公司 Maintenance strategy formulation method, terminal and computer storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130013967A1 (en) * 2006-12-22 2013-01-10 Commvault Systems, Inc. Systems and methods for remote monitoring in a computer network
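CN108023782A (zh) * 2017-12-29 2018-05-11 华东师范大学 Equipment failure warning method based on maintenance records
CN108415789A (zh) * 2018-01-24 2018-08-17 西安交通大学 Node failure prediction system and method for large-scale hybrid heterogeneous storage systems
CN108376107A (zh) * 2018-03-01 2018-08-07 郑州云海信息技术有限公司 Server fault detection method, apparatus, device and storage medium
CN109189640A (zh) * 2018-08-24 2019-01-11 平安科技(深圳)有限公司 Server monitoring method, apparatus, computer device and storage medium
CN111444031A (zh) * 2020-02-27 2020-07-24 平安科技(深圳)有限公司 Server failure warning method, device, computer equipment and storage medium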

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
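CN114475731A (zh) * 2021-12-29 2022-05-13 卡斯柯信号有限公司 Signal equipment fault knowledge base system and implementation method thereof
CN115277353B (zh) * 2022-07-21 2023-07-28 西安航天发动机有限公司 Active and passive remote fault warning method for an intelligent cabinet machine
CN115242611B (zh) * 2022-07-21 2023-10-03 北京天一恩华科技股份有限公司 Network fault alarm level management method, apparatus, device and storage medium
CN115242611A (zh) * 2022-07-21 2022-10-25 北京天一恩华科技股份有限公司 Network fault alarm level management method, apparatus, device and storage medium
CN115277353A (zh) * 2022-07-21 2022-11-01 西安航天发动机有限公司 Active and passive remote fault warning method for an intelligent cabinet machine
CN115271669A (zh) * 2022-08-01 2022-11-01 成都龙祥思远科技有限公司 Maintenance method and system for an ERP server
CN115130702B (zh) * 2022-09-02 2022-12-02 山东汇泓纺织科技有限公司 Textile machine fault prediction system based on big data analysis
CN115130702A (zh) * 2022-09-02 2022-09-30 山东汇泓纺织科技有限公司 Textile machine fault prediction system based on big data analysis
CN116090702A (zh) * 2023-01-18 2023-05-09 盐城市久泰商品混凝土有限公司 Internet of Things-based intelligent ERP data supervision system and method
CN116090702B (zh) * 2023-01-18 2024-05-14 江苏盛泉环保科技发展有限公司 Internet of Things-based intelligent ERP data supervision system and method
CN115860586A (zh) * 2023-03-01 2023-03-28 英迪格(天津)电气有限公司 Analysis system for railway power transformation and distribution faults
CN117076253A (zh) * 2023-08-30 2023-11-17 广州逸芸信息科技有限公司 Multi-dimensional intelligent operation and maintenance system for data center services and facilities
CN117076253B (zh) * 2023-08-30 2024-05-28 广州逸芸信息科技有限公司 Multi-dimensional intelligent operation and maintenance system for data center services and facilities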

Also Published As

Publication number Publication date
CN111444031A (zh) 2020-07-24

Similar Documents

Publication Publication Date Title
WO2021169270A1 (zh) Server failure warning method, device, computer equipment and storage medium
Wang et al. What can we learn from four years of data center hardware failures?
US11681595B2 (en) Techniques and system for optimization driven by dynamic resilience
US10761926B2 (en) Server hardware fault analysis and recovery
US10282248B1 (en) Technology system auto-recovery and optimality engine and techniques
US10635557B2 (en) System and method for automated detection of anomalies in the values of configuration item parameters
US20160042285A1 (en) System and method for analyzing and prioritizing changes and differences to configuration parameters in information technology systems
WO2022089202A1 (zh) Fault recognition model training method, fault recognition method, apparatus and electronic device
US11329869B2 (en) Self-monitoring
WO2021248754A1 (zh) System testing method and apparatus, storage medium and electronic device
WO2018233170A1 (zh) Logging method, apparatus, computer device and storage medium
US20210064458A1 (en) Automated detection and classification of dynamic service outages
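CN109901969B (zh) Design method and apparatus for a centralized monitoring and management platform
CN110063042A (zh) Database fault response method and terminal thereof
CN112988439A (zh) Server fault discovery method, apparatus, electronic device and storage medium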
Amvrosiadis et al. Getting back up: Understanding how enterprise data backups fail
Li et al. Going through the life cycle of faults in clouds: Guidelines on fault handling
JP7436737B1 (ja) Server management system supporting multiple vendors
Sun et al. R2C: Robust rolling-upgrade in clouds
CN115952227A (zh) Data acquisition system and method, electronic device and storage medium
Yim Evaluation metrics of service-level reliability monitoring rules of a big data service
Lal et al. Error and failure analysis of a unix server
US9953266B2 (en) Management of building energy systems through quantification of reliability
CN114625607A (zh) 一种软件的监测方法、装置以及电子设备
CA3060095A1 (en) Techniques and system for optimization driven by dynamic resilience

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20920967; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20920967; Country of ref document: EP; Kind code of ref document: A1)