CN102662788A

CN102662788A - Computer system fault diagnosis decision and processing method

Info

Publication number: CN102662788A
Application number: CN201210129006XA
Authority: CN
Inventors: 乔英良
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2012-04-28
Filing date: 2012-04-28
Publication date: 2012-09-12

Abstract

The invention provides a computer system fault diagnosis, decision, and processing method that uses a fault management system. Based on the configuration, running state, and fault symptoms of the managed computer, the system intelligently uses the knowledge in a fault management knowledge base to configure and optimize itself, and on that basis diagnoses faults and handles them with appropriate strategies. The fault management system comprises the fault management knowledge base, a state monitoring module, a fault knowledge learning and analysis module, a decision planning module, a fault processing module, and a man-machine interface. When the managed computer's system resources change dynamically, whether because of a fault or for other reasons, the fault management system can intelligently configure and adjust its fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge according to the monitored system state/fault information, thereby meeting the requirements of self-management and self-adjustment.

Description

A computer system fault diagnosis, decision, and processing method

Technical field

The present invention relates to the field of computer fault management, and in particular to a method for computer system fault diagnosis, decision planning, and processing.

Technical background

In fields such as scientific computing, commercial services, and government functions, computer systems such as servers and storage devices act as the nerve center. When they fail, the consequences range from service interruption and equipment failure to threats to national security and to people's lives and property. While users pursue high performance, high capacity, and high density, what they value even more is system reliability and stability, and this demand has driven the emergence of a variety of fault management and fault-tolerance technologies. Today's high-end computer systems, from the underlying hardware up to the application software, mostly adopt several types of fault management technology. For example, Hewlett-Packard's Superdome servers and IBM's Z-series servers comprehensively employ fault management capabilities such as fault detection and correction mechanisms, error recovery functions, and hardware fault isolation.

However, existing fault management methods and strategies are mostly deployed statically. Once a certain type of fault occurs, the system performs fault detection, fault localization, fault isolation, system reconfiguration, and similar tasks according to a fixed pattern. These strategies are all determined when the system is deployed and are difficult to change while the system is running.

In practice, because of the complexity of a high-end computer system's own architecture, together with the complexity of the network environment in which it is deployed and of the applications running on it, the resources available in the system can change considerably as the system runs, and the external environment can change in complex ways as well. A fixed-pattern fault handling strategy therefore struggles to support long-term operation of a fault-tolerant computer in a complex environment. The system's fault handling strategy needs to change dynamically and adaptively so that it tracks changes in system state and in the external environment as closely as possible. It is therefore necessary, in the field of computer fault management and on the basis of existing fault handling theory, to propose a method for computer system fault diagnosis, decision planning, and processing.

Summary of the invention

The present invention proposes a method for computer system fault diagnosis, decision planning, and processing. With this method, a fault management system can, according to the configuration, running state, and fault symptoms of the managed computer, intelligently and autonomously configure and optimize the knowledge in a fault management knowledge base, and on that basis diagnose faults and handle them with appropriate strategies.

The object of the invention is achieved as follows. The method uses a fault management system that can, according to the configuration, running state, and fault symptoms of the managed computer, intelligently use the knowledge in the fault management knowledge base to configure and optimize itself autonomously, and on that basis diagnose faults and handle them with appropriate strategies. The fault management system comprises a fault management knowledge base (1), a state monitoring module (2), a fault knowledge learning and analysis module (3), a decision planning module (4), a fault processing module (5), and a man-machine interface (6), wherein:

The fault management knowledge base (1) contains fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge; the fault management knowledge base is the foundation of the method;

The state monitoring module (2) is responsible for monitoring the system state;

The fault knowledge learning and analysis module (3) performs a combined analysis of the existing knowledge in the fault management knowledge base and the state information collected from the state monitoring module, and reconfigures and updates the knowledge in the fault management knowledge base according to the analysis results; the fault knowledge learning and analysis module is the core of the method;

The decision planning module (4) queries the fault management knowledge base according to the state information collected from the state monitoring module, and decides whether the current system has a fault, which handling strategy should be applied to that fault, and whether an early warning is needed;

The fault processing module (5) is responsible for carrying out the actual fault handling actions, such as fan speed regulation and component isolation, according to the decisions of the decision planning module;

The man-machine interface (6) allows the administrator to manually update the contents of the fault management knowledge base or to carry out specific fault handling actions; it provides the interface through which the administrator interacts with the fault management system, as a useful supplement to the autonomic computing mechanism.

The state monitoring module uses combined out-of-band/in-band monitoring to obtain chip-level, board-level, and system-level state/fault information from the computer system.

The fault knowledge learning and analysis module, based on the large amount of historical state/fault knowledge in the fault management knowledge base, uses cluster analysis algorithms to intelligently analyze future fault trends and the handling strategies to adopt for specific faults, and updates the fault management knowledge base with the new knowledge obtained from the analysis.

The fault processing module combines hardware-level and operating-system-level fault-tolerance mechanisms to handle faults that have occurred or may occur.

The beneficial effects of the invention are as follows. When the managed computer's system resources change dynamically, whether because of a fault or for other reasons, the fault management system can intelligently and dynamically configure and adjust its fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge according to the monitored system state/fault information, thereby meeting the requirements of autonomous management and autonomous adjustment. These advantages of the computer system fault diagnosis, response, and early-warning method based on autonomic computing allow it to remedy problems of traditional fault management systems, which can only use predefined strategies for fault diagnosis and handling: missed faults, incorrect fault handling strategies, and poor adaptability to changes in system configuration or the external environment.

Description of drawings

Fig. 1 is a schematic diagram of the architecture of a traditional computer fault management system based on static fault handling strategies;

Fig. 2 is a schematic diagram of the architecture of a computer fault management system based on the autonomic-computing fault diagnosis, response, and early-warning method.

Embodiment

With reference to the accompanying drawings, a specific example is used to describe how the invention implements a computer fault management system based on the autonomic-computing fault diagnosis, response, and early-warning method.

As described in the summary of the invention, the architecture of the invention (see Fig. 2) mainly comprises a fault management knowledge base (1), a state monitoring module (2), a fault knowledge learning and analysis module (3), a decision planning module (4), a fault processing module (5), and a man-machine interface (6), wherein:

The fault management knowledge base mainly contains fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge, and can be implemented as a data warehouse; it is the foundation of the method. The knowledge base stores the system's historical fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge for use by the fault knowledge learning and analysis module. Fault diagnosis knowledge includes computer state monitoring data, acquired symptoms, and the like, and serves as the basis for diagnosing the state/faults of the managed resources and the external environment. Fault handling strategy knowledge defines the mapping from states to actions or goals, and includes both the fault handling strategies obtained through autonomic computing and the predefined strategies. Fault prediction knowledge covers the reasoning used to infer potential faults from known faults. The administrator can manually update the knowledge in the fault management knowledge base through the man-machine interface, as a supplement to the autonomic computing approach. The basis on which the decision planning module selects a fault handling strategy also comes from the fault management knowledge base.
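The three kinds of knowledge described above can be pictured with a small data structure. The following is only a minimal sketch, assuming an in-memory store in place of the data warehouse; the class name `FaultKnowledgeBase`, its fields, and all sample symptoms and actions are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class FaultKnowledgeBase:
    """Hypothetical in-memory stand-in for the fault management knowledge base."""

    # Fault diagnosis knowledge: symptom name -> limit above which it is abnormal.
    diagnosis: dict = field(default_factory=dict)
    # Fault handling strategy knowledge: state (fault name) -> action.
    strategies: dict = field(default_factory=dict)
    # Fault prediction knowledge: known fault -> potential faults it may lead to.
    predictions: dict = field(default_factory=dict)
    # Where each strategy came from: predefined, learned, or administrator.
    origin: dict = field(default_factory=dict)

    def set_strategy(self, fault, action, origin="predefined"):
        """Store or overwrite the action mapped to a fault, recording its origin."""
        self.strategies[fault] = action
        self.origin[fault] = origin

    def strategy_for(self, fault):
        """Return the action mapped to a fault, or None if nothing is known."""
        return self.strategies.get(fault)


kb = FaultKnowledgeBase()
kb.diagnosis["cpu_temp_c"] = 85.0                      # made-up limit
kb.set_strategy("cpu_overheat", "raise_fan_speed")     # predefined strategy
kb.predictions["cpu_overheat"] = ["thermal_shutdown"]  # made-up inference
# A manual update through the man-machine interface would overwrite the entry.
kb.set_strategy("cpu_overheat", "isolate_component", origin="administrator")
```

Recording the origin of each strategy is one possible way to keep predefined, learned, and manually entered strategies distinguishable; the patent does not prescribe a storage layout.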

The fault knowledge learning and analysis module uses cluster analysis algorithms. It performs data extraction, cleaning, transformation, and migration on the fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge in the fault management knowledge base, intelligently analyzes future fault trends and the handling strategies to adopt for specific faults, and updates the fault management knowledge base with the new knowledge obtained from the analysis. The fault knowledge learning and analysis module is the core of the method. It can be implemented with classical clustering algorithms such as the hierarchical clustering method, the decomposition method, the addition method, the dynamic clustering method, ordered-sample clustering, overlapping clustering, and fuzzy clustering.
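To illustrate the kind of analysis involved, the sketch below applies one of the listed techniques, hierarchical clustering, in its agglomerative single-linkage form to a handful of historical state records. It is an assumption-laden toy: the function name `single_linkage_clusters`, the two features, and the sample readings are all invented for illustration, and the patent does not specify which features or linkage rule to use.

```python
import math


def single_linkage_clusters(points, num_clusters):
    """Agglomerative hierarchical clustering with single linkage.

    Starts with one cluster per point and repeatedly merges the two clusters
    whose closest members are nearest, until num_clusters remain.
    """
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical history: (CPU temperature in degrees C, fan speed in RPM / 100).
history = [(55, 30), (57, 31), (56, 29), (88, 60), (90, 62), (91, 61)]
groups = single_linkage_clusters(history, 2)

# One centroid per group; a learning module could store these as new knowledge,
# for example to separate "normal" states from "pre-fault" states.
centroids = [tuple(sum(v) / len(g) for v in zip(*g)) for g in groups]
```

With this sample data the records separate into a low-temperature group and a high-temperature group. In a real system the features would need scaling and the number of clusters would have to be chosen from the data.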

The state monitoring module is responsible for monitoring the system state. System state information can be obtained in the following two ways: chip-level and board-level state/fault information is obtained through out-of-band hardware fault management interfaces (such as I2C/SMBus, JTAG, and GPIO); operating-system-level state/fault information is obtained through in-band operating system fault management interfaces (for example, an operating system state/fault monitoring agent implemented by calling operating system APIs).
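The two acquisition paths can be sketched as interchangeable readers whose results are merged into one state record. This is a minimal sketch only: the reader functions below are stubs returning made-up values, not real I2C/SMBus or operating system calls, and every name in the block is hypothetical.

```python
def read_out_of_band():
    """Stub for a hardware-level reading (for example over I2C/SMBus)."""
    return {"cpu_temp_c": 91.0, "fan_rpm": 6100}  # made-up values


def read_in_band():
    """Stub for an operating-system-level monitoring agent."""
    return {"load_avg": 3.2, "corrected_mem_errors": 14}  # made-up values


def collect_state(sources):
    """Merge readings from every source, tagging each value with its origin."""
    state = {}
    for name, reader in sources.items():
        for key, value in reader().items():
            state[key] = {"value": value, "source": name}
    return state


state = collect_state({"out_of_band": read_out_of_band, "in_band": read_in_band})
```

Keeping the source tag on each value lets later stages tell hardware-level readings from operating-system-level ones without knowing how they were collected.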

The decision planning module queries the fault management knowledge base according to the state information collected from the state monitoring module, decides whether the current system has a fault, which handling strategy should be applied to that fault, and whether an early warning is needed, and then calls the fault processing module to carry out the corresponding fault handling actions.
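The three decisions described above (is there a fault, which strategy applies, is an early warning needed) can be sketched as a lookup against thresholds and a state-to-action mapping. This is an illustrative assumption, not the patent's decision procedure: the function `decide`, the `warn_ratio` parameter, and all thresholds and action names are hypothetical.

```python
def decide(state, thresholds, strategies, warn_ratio=0.9):
    """Return a list of (symptom, verdict, action) decisions.

    A symptom at or above its limit is treated as a fault; one within
    warn_ratio of the limit triggers an early warning instead.
    """
    decisions = []
    for symptom, limit in thresholds.items():
        value = state.get(symptom)
        if value is None:
            continue  # nothing monitored for this symptom
        if value >= limit:
            action = strategies.get(symptom, "notify_administrator")
            decisions.append((symptom, "fault", action))
        elif value >= warn_ratio * limit:
            decisions.append((symptom, "warning", "early_warning"))
    return decisions


# Hypothetical knowledge base contents and monitored state.
thresholds = {"cpu_temp_c": 85.0, "corrected_mem_errors": 100}
strategies = {"cpu_temp_c": "raise_fan_speed"}
state = {"cpu_temp_c": 91.0, "corrected_mem_errors": 95}

decisions = decide(state, thresholds, strategies)
```

Falling back to `notify_administrator` when no strategy is known reflects the role of the man-machine interface as a supplement; the fallback itself is an assumption of this sketch.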

The fault processing module is responsible for carrying out the corresponding fault handling actions according to the decisions of the decision planning module. This can be done in the following two ways: hardware-level fault handling mechanisms, such as fan speed regulation and component isolation, are implemented through out-of-band hardware fault management interfaces (such as I2C/SMBus, JTAG, and GPIO); operating-system-level fault handling mechanisms, such as process migration and page table remapping, are implemented through in-band operating system fault management interfaces (for example, an operating system state/fault handling agent implemented by calling operating system APIs).
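One way to picture the split between hardware-level and operating-system-level handling is a dispatch table that maps each decided action to a handler. The sketch below is hypothetical throughout: the handlers only record what they would do, and no real fan controller, isolation mechanism, or scheduler API is called.

```python
executed = []  # log of (level, description) for actions that were carried out


def set_fan_speed(percent):
    """Stub for a hardware-level action issued over an out-of-band interface."""
    executed.append(("hardware", f"fan speed -> {percent}%"))


def isolate_component(name):
    """Stub for hardware-level component isolation."""
    executed.append(("hardware", f"isolated {name}"))


def migrate_process(pid, target_cpu):
    """Stub for an operating-system-level action issued by an in-band agent."""
    executed.append(("os", f"migrated pid {pid} to cpu {target_cpu}"))


# Action name -> handler; the names and arguments are made up for illustration.
HANDLERS = {
    "raise_fan_speed": lambda: set_fan_speed(100),
    "isolate_dimm": lambda: isolate_component("DIMM3"),
    "migrate_process": lambda: migrate_process(4242, 1),
}


def process(actions):
    """Run the handler for each action; return the actions with no handler."""
    unhandled = []
    for action in actions:
        handler = HANDLERS.get(action)
        if handler is None:
            unhandled.append(action)
        else:
            handler()
    return unhandled


leftover = process(["raise_fan_speed", "migrate_process", "remap_page_table"])
```

Actions without a registered handler are returned rather than dropped, so that they could be escalated to the administrator.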

Through the man-machine interface, the administrator can manually update the contents of the fault management knowledge base or carry out specific fault handling actions. The interface can be implemented in various ways (such as a Web UI, GUI, or CLI).

Claims

1. A computer system fault diagnosis, decision, and processing method, characterized in that it comprises a fault management system that can, according to the configuration, running state, and fault symptoms of the managed computer, intelligently use the knowledge in a fault management knowledge base to configure and optimize itself autonomously, and on that basis diagnose faults and handle them with appropriate strategies, the fault management system comprising a fault management knowledge base (1), a state monitoring module (2), a fault knowledge learning and analysis module (3), a decision planning module (4), a fault processing module (5), and a man-machine interface (6), wherein:

the state monitoring module (2) is responsible for monitoring the system state;

2. The method according to claim 1, characterized in that the state monitoring module uses combined out-of-band/in-band monitoring to obtain chip-level, board-level, and system-level state/fault information from the computer system.

3. The method according to claim 1, characterized in that the fault knowledge learning and analysis module, based on the large amount of historical state/fault knowledge in the fault management knowledge base, uses cluster analysis algorithms to intelligently analyze future fault trends and the handling strategies to adopt for specific faults, and updates the fault management knowledge base with the new knowledge obtained from the analysis.

4. The method according to claim 1, characterized in that the fault processing module combines hardware-level and operating-system-level fault-tolerance mechanisms to handle faults that have occurred or may occur.