CN102662788A - Computer system fault diagnosis decision and processing method - Google Patents

Computer system fault diagnosis decision and processing method

Info

Publication number
CN102662788A
Authority
CN
China
Prior art keywords
fault
knowledge
module
fault management
knowledge base
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210129006XA
Other languages
Chinese (zh)
Inventor
乔英良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-04-28
Filing date
2012-04-28
Publication date
2012-09-12
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201210129006XA
Publication of CN102662788A

Landscapes

  • Test And Diagnosis Of Digital Computers (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a computer system fault diagnosis, decision, and processing method. The method uses a fault management system that, according to the configuration, operating state, and fault symptoms of the managed computer, intelligently and autonomously configures and optimizes the knowledge in a fault management knowledge base and, on that basis, diagnoses faults and handles them with appropriate strategies. The fault management system comprises the fault management knowledge base, a state monitoring module, a fault knowledge learning and analysis module, a decision planning module, a fault processing module, and a man-machine interface. When the managed computer fails, or when system resources change dynamically for other reasons, the fault management system can dynamically configure and adjust its fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge according to the monitored system state/fault information, thereby meeting the requirements of autonomous management and autonomous adjustment.

Description

A computer system fault diagnosis, decision, and processing method
Technical field
The present invention relates to the field of computer fault management, and specifically to a method for computer system fault diagnosis, decision planning, and handling.
Technical background
In fields such as scientific computing, commercial services, and government functions, computer systems such as servers and storage act as a nerve center. A failure can, at best, interrupt services and damage equipment and, at worst, endanger national security and people's lives and property. While users pursue high performance, high capacity, and high density, they value system reliability and stability even more, and this demand has given rise to a variety of fault management and fault tolerance techniques. Today's high-end computer systems worldwide mostly apply several types of fault management technology, from the underlying hardware up to the top-level application software. For example, the Superdome servers of Hewlett-Packard and the Z-series servers of IBM comprehensively adopt fault management capabilities such as fault detection and correction mechanisms, error recovery, and hardware fault isolation.
However, existing fault management methods and strategies are mostly deployed statically. Once a certain type of fault occurs, the system performs fault detection, fault location, fault isolation, system reconfiguration, and similar work according to a fixed pattern. These strategies are all determined when the system is deployed and are difficult to change while the system is running.
In fact, because of the complexity of the architecture of high-end computer systems, together with the complexity of the network environments in which they are deployed and of the applications running on them, the resources available in the system can change considerably as the system runs, and the external environment can also change in complex ways. A fixed-pattern fault handling strategy can hardly sustain the long-term operation of a fault-tolerant computer in a complex environment. The fault handling strategy therefore needs to change dynamically and adaptively, so as to follow changes in system state and in the external environment as closely as possible. Accordingly, in the field of computer fault management, it is necessary to propose, on the basis of existing fault handling theory, a method for computer system fault diagnosis, decision planning, and handling.
Summary of the invention
The present invention proposes a method for computer system fault diagnosis, decision planning, and handling. With this method, a fault management system can, according to the configuration, operating state, and fault symptoms of the managed computer, intelligently and autonomously configure and optimize the knowledge in a fault management knowledge base, and on that basis diagnose faults and handle them with appropriate strategies.
The object of the invention is achieved as follows. The method comprises a fault management system that, according to the configuration, operating state, and fault symptoms of the managed computer, intelligently uses the knowledge in the fault management knowledge base to configure and optimize itself autonomously, and on that basis diagnoses faults and handles them with appropriate strategies. The fault management system comprises: a fault management knowledge base (1), a state monitoring module (2), a fault knowledge learning and analysis module (3), a decision planning module (4), a fault processing module (5), and a man-machine interface (6), wherein:
The fault management knowledge base (1) contains fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge; the knowledge base is the foundation of the method;
The state monitoring module (2) is responsible for monitoring the system state;
The fault knowledge learning and analysis module (3) comprehensively analyzes the existing knowledge in the fault management knowledge base together with the state information collected from the state monitoring module, and reconfigures and updates the knowledge in the knowledge base according to the analysis results; this module is the core of the method;
The decision planning module (4) queries the fault management knowledge base according to the state information collected from the state monitoring module and decides whether the current system has a fault, which handling strategy should be applied to that fault, and whether an early warning is needed;
The fault processing module (5) carries out the actual fault handling actions, including fan speed regulation and component isolation, according to the decisions of the decision planning module;
The man-machine interface (6) lets the administrator manually update the contents of the fault management knowledge base or carry out specific fault handling actions; it provides the interface through which the administrator interacts with the fault management system, as a useful supplement to the autonomic computing mechanism.
The state monitoring module uses combined out-of-band/in-band monitoring to obtain chip-level, board-level, and system-level state/fault information of the computer system.
Based on the large amount of historical state/fault knowledge in the fault management knowledge base, the fault knowledge learning and analysis module uses cluster analysis algorithms to intelligently analyze the trend of future faults and the handling strategies to be applied to specific faults, and writes the new knowledge obtained from the analysis back into the fault management knowledge base.
The fault processing module handles faults that have occurred, as well as potential faults, in combination with hardware-level and operating-system-level fault tolerance mechanisms.
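The way the modules described above hand data to one another can be sketched as a single monitoring cycle. This is only an illustrative sketch: the patent defines no programming interface, so `KnowledgeBase`, `run_cycle`, `demo_decide`, the symptom and action names, and the 85 degC limit are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Every name and threshold here is hypothetical; the patent defines no API.


@dataclass
class KnowledgeBase:
    """Knowledge base (1): strategies keyed by symptom, plus raw history."""
    strategies: Dict[str, str] = field(default_factory=dict)
    history: List[Dict[str, float]] = field(default_factory=list)


def run_cycle(
    kb: KnowledgeBase,
    monitor: Callable[[], Dict[str, float]],
    decide: Callable[[Dict[str, float], KnowledgeBase], List[str]],
    handle: Callable[[str], str],
) -> List[str]:
    """Run one monitor -> decide -> handle pass and record the sample."""
    state = monitor()                      # state monitoring module (2)
    kb.history.append(state)               # raw material for learning module (3)
    actions = decide(state, kb)            # decision planning module (4)
    return [handle(a) for a in actions]    # fault processing module (5)


def demo_decide(state: Dict[str, float], kb: KnowledgeBase) -> List[str]:
    """Toy decision: an assumed 85 degC limit maps to the 'cpu_overheat' symptom."""
    if state.get("cpu_temp_c", 0.0) > 85.0 and "cpu_overheat" in kb.strategies:
        return [kb.strategies["cpu_overheat"]]
    return []


kb = KnowledgeBase(strategies={"cpu_overheat": "raise_fan_speed"})
results = run_cycle(kb, lambda: {"cpu_temp_c": 91.0}, demo_decide,
                    lambda action: f"executed {action}")
print(results)
```

Because the strategy is looked up in the knowledge base at decision time rather than hard-coded, updating `kb.strategies` (by the learning module or by an administrator) changes the behavior of the next cycle without redeploying anything, which is the contrast with static strategies that the method aims for.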
The beneficial effects of the invention are as follows. When system resources change dynamically because the managed computer has failed or for other reasons, the fault management system can dynamically configure and adjust its fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge according to the monitored system state/fault information, meeting the requirements of autonomous management and autonomous adjustment. These advantages of the autonomic-computing-based fault diagnosis, response, and early-warning method remedy the problems of traditional fault management systems, which can diagnose and handle faults only with predefined strategies: missed faults, incorrect handling strategies, and poor adaptability to changes in system configuration or the external environment.
Description of drawings
Fig. 1 is a schematic diagram of the architecture of a traditional computer fault management system based on static fault handling strategies;
Fig. 2 is a schematic diagram of the architecture of a computer fault management system based on the autonomic-computing fault diagnosis, response, and early-warning method.
Embodiment
With reference to the accompanying drawings, the implementation of a computer fault management system based on the autonomic-computing fault diagnosis, response, and early-warning method is described below through a concrete example.
As stated in the summary of the invention, the architecture of the invention (see Fig. 2) mainly comprises: a fault management knowledge base (1), a state monitoring module (2), a fault knowledge learning and analysis module (3), a decision planning module (4), a fault processing module (5), and a man-machine interface (6), wherein:
The fault management knowledge base mainly contains fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge, and can be implemented as a data warehouse; it is the foundation of the method. The knowledge base stores the system's historical fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge for analysis by the fault knowledge learning and analysis module. Fault diagnosis knowledge includes computer state monitoring data, collected symptoms, and the like, and serves as the basis for diagnosing the state and faults of managed resources and of the external environment. Fault handling strategy knowledge defines a mapping from states to actions or goals, and includes both strategies obtained through autonomic computing and predefined strategies. Fault prediction knowledge covers inferring latent faults from known faults. The administrator can manually update the knowledge in the knowledge base through the man-machine interface, as a supplement to the autonomic computing approach. The basis on which the decision planning module selects a fault handling strategy also comes from the fault management knowledge base.
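The three kinds of knowledge described above could be modeled along the following lines. This is a minimal in-memory sketch rather than the data warehouse the text suggests, and every class, field, and method name (`DiagnosisRecord`, `HandlingStrategy`, `PredictionRule`, `FaultKnowledgeBase`, and so on) is an assumption introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DiagnosisRecord:
    """Fault diagnosis knowledge: a monitored sample plus the symptom seen."""
    component: str
    metrics: Dict[str, float]
    symptom: str


@dataclass
class HandlingStrategy:
    """Fault handling strategy knowledge: a mapping from a state to an action."""
    symptom: str
    action: str
    learned: bool = False  # False: predefined; True: derived autonomously


@dataclass
class PredictionRule:
    """Fault prediction knowledge: a known fault that hints at a latent one."""
    known_fault: str
    latent_fault: str


@dataclass
class FaultKnowledgeBase:
    diagnosis: List[DiagnosisRecord] = field(default_factory=list)
    strategies: Dict[str, HandlingStrategy] = field(default_factory=dict)
    predictions: List[PredictionRule] = field(default_factory=list)

    def upsert_strategy(self, strategy: HandlingStrategy) -> None:
        """Add or overwrite a strategy (learning module or administrator)."""
        self.strategies[strategy.symptom] = strategy

    def action_for(self, symptom: str) -> Optional[str]:
        """Return the action mapped to a symptom, or None if none is known."""
        strategy = self.strategies.get(symptom)
        return strategy.action if strategy else None

    def latent_faults(self, known_fault: str) -> List[str]:
        """Infer latent faults that a known fault is believed to precede."""
        return [p.latent_fault for p in self.predictions
                if p.known_fault == known_fault]


knowledge = FaultKnowledgeBase()
knowledge.upsert_strategy(HandlingStrategy("fan_stall", "isolate_component"))
knowledge.predictions.append(PredictionRule("ecc_corrected_burst", "dimm_failure"))
```

The `learned` flag keeps the distinction the text draws between predefined strategies and those obtained through autonomic computing, so an administrator reviewing the knowledge base can tell the two apart.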
The fault knowledge learning and analysis module uses cluster analysis algorithms. It performs data extraction, cleaning, transformation, and migration on the fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge in the fault management knowledge base, intelligently analyzes the trend of future faults and the handling strategies to be applied to specific faults, and writes the new knowledge obtained from the analysis back into the knowledge base. This module is the core of the method. It can be implemented with classical clustering algorithms such as hierarchical clustering, the decomposition method, the addition method, dynamic clustering, ordered-sample clustering, overlapping clustering, and fuzzy clustering.
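As one illustration of the cluster analysis step, the sketch below groups historical state samples with plain k-means, a common dynamic clustering method. The patent does not prescribe a specific algorithm, so the choice of k-means, the two features (CPU temperature and fan speed), the sample values, and the function names are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int,
           iterations: int = 20) -> Tuple[List[Point], List[int]]:
    """Plain k-means with deterministic seeding (the first k points)."""
    centroids: List[Point] = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each sample to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, labels


def nearest_cluster(sample: Point, centroids: Sequence[Point]) -> int:
    """Classify a fresh sample by its closest learned centroid."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(sample, centroids[c]))


# Hypothetical history of (cpu_temp_c, fan_rpm) samples.
samples = [(45.0, 3000.0), (88.0, 1200.0), (47.0, 3100.0),
           (90.0, 1100.0), (46.0, 2900.0), (92.0, 1000.0)]
centroids, labels = kmeans(samples, k=2)
print(centroids, labels)
```

The learned centroids are the "new knowledge" in this sketch: storing them in the knowledge base lets a later sample be matched to a healthy or degraded group with `nearest_cluster`. In practice the features would probably need normalizing first, since the raw fan speed dominates the Euclidean distance here.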
The state monitoring module is responsible for monitoring the system state. System state information can be obtained in two ways: chip-level and board-level state/fault information is obtained through out-of-band hardware fault management interfaces (such as I2C/SMBus, JTAG, and GPIO); operating-system-level state/fault information is obtained through in-band operating system fault management interfaces (for example, a state/fault monitoring agent implemented by calling operating system APIs).
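The two acquisition paths can be sketched as pluggable collectors merged into one snapshot. Real out-of-band access over I2C/SMBus, JTAG, or GPIO is platform specific, so `read_out_of_band` below is a stand-in that returns fixed values; `StateMonitor`, the channel names, and the metric keys are likewise assumptions made for the example.

```python
import os
import time
from typing import Callable, Dict

Collector = Callable[[], Dict[str, float]]


def read_out_of_band() -> Dict[str, float]:
    """Stand-in for a chip/board-level read over I2C/SMBus, JTAG or GPIO.

    Real hardware access is platform specific, so fixed values are returned.
    """
    return {"board.temp_c": 52.0, "fan0.rpm": 3000.0}


def read_in_band() -> Dict[str, float]:
    """OS-level state obtained through ordinary operating system APIs."""
    return {"os.cpu_count": float(os.cpu_count() or 1)}


class StateMonitor:
    """Merges out-of-band and in-band readings into one snapshot."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Collector] = {}

    def register(self, channel: str, collector: Collector) -> None:
        self._collectors[channel] = collector

    def sample(self) -> Dict[str, float]:
        snapshot: Dict[str, float] = {"timestamp": time.time()}
        for channel, collect in self._collectors.items():
            try:
                snapshot.update(collect())
            except OSError:
                # A channel that cannot be read is itself a status to record.
                snapshot[f"{channel}.unavailable"] = 1.0
        return snapshot


state_monitor = StateMonitor()
state_monitor.register("out_of_band", read_out_of_band)
state_monitor.register("in_band", read_in_band)
print(state_monitor.sample())
```

Keeping the channels separate means the out-of-band path can still report when the operating system is hung, which is one practical reason for combining the two.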
The decision planning module queries the fault management knowledge base according to the state information collected from the state monitoring module, decides whether the current system has a fault, which handling strategy should be applied to that fault, and whether an early warning is needed, and then calls the fault processing module to carry out the corresponding fault handling actions.
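The three questions the decision planning module answers (is there a fault, which strategy applies, is an early warning needed) can be sketched with simple threshold rules. Threshold rules are only one possible representation of the knowledge base contents; `Rule`, `Decision`, `decide`, the metric name, and the 75/90 degC values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    """One hypothetical knowledge base entry for a monitored metric."""
    metric: str
    warn_at: float   # early-warning threshold
    fault_at: float  # fault threshold
    action: str      # handling strategy to apply once a fault is declared


@dataclass
class Decision:
    fault: bool
    early_warning: bool
    action: Optional[str]


def decide(state: Dict[str, float], rules: List[Rule]) -> Decision:
    """Answer: is there a fault, which action applies, is a warning needed?"""
    decision = Decision(fault=False, early_warning=False, action=None)
    for rule in rules:
        value = state.get(rule.metric)
        if value is None:
            continue  # metric not present in this snapshot
        if value >= rule.fault_at and not decision.fault:
            # The first rule that reports a fault selects the action.
            decision.fault, decision.action = True, rule.action
        elif rule.warn_at <= value < rule.fault_at:
            decision.early_warning = True
    return decision


rules = [Rule("cpu.temp_c", warn_at=75.0, fault_at=90.0, action="raise_fan_speed")]
print(decide({"cpu.temp_c": 95.0}, rules))
print(decide({"cpu.temp_c": 80.0}, rules))
```

Because `rules` is passed in rather than compiled into the function, the learning module or an administrator can change thresholds and actions at run time.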
The fault processing module carries out the corresponding fault handling actions according to the decisions of the decision planning module. This can be done in two ways: hardware-level fault handling mechanisms, such as fan speed regulation and component isolation, are implemented through out-of-band hardware fault management interfaces (such as I2C/SMBus, JTAG, and GPIO); operating-system-level fault handling mechanisms, such as process migration and page table remapping, are implemented through in-band operating system fault management interfaces (for example, a fault handling agent implemented by calling operating system APIs).
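A dispatch table is one simple way to route a decided action to a hardware-level or operating-system-level handler. The sketch below assumes placeholder handlers that only return a description of what they would do; `FaultHandler`, the action names, and the escalation message are hypothetical.

```python
from typing import Callable, Dict, List


class FaultHandler:
    """Dispatches decided actions to hardware-level or OS-level handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}
        self.log: List[str] = []

    def register(self, action: str, handler: Callable[[str], str]) -> None:
        self._handlers[action] = handler

    def execute(self, action: str, target: str) -> str:
        handler = self._handlers.get(action)
        if handler is None:
            # Unknown actions are left for the administrator to resolve.
            result = f"no handler for {action}; escalate to administrator"
        else:
            result = handler(target)
        self.log.append(result)
        return result


# Placeholder handlers: real ones would drive I2C/SMBus, JTAG or GPIO
# (hardware level) or call operating system APIs (OS level).
def raise_fan_speed(target: str) -> str:
    return f"fan speed raised for {target}"


def isolate_component(target: str) -> str:
    return f"{target} isolated"


def migrate_process(target: str) -> str:
    return f"processes migrated away from {target}"


handler = FaultHandler()
handler.register("raise_fan_speed", raise_fan_speed)
handler.register("isolate_component", isolate_component)
handler.register("migrate_process", migrate_process)
print(handler.execute("isolate_component", "dimm3"))
```

Keeping a log of executed actions gives the learning module a record of which strategies were actually applied, which it could later correlate with outcomes.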
Through the man-machine interface, the administrator can manually update the contents of the fault management knowledge base or carry out specific fault handling actions. The interface can be implemented in several forms (such as a Web UI, GUI, or CLI).
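Of the interface forms mentioned, a CLI is the easiest to sketch. The example below uses Python's standard `argparse` module to expose the two administrator operations the text names: updating the knowledge base and requesting a specific handling action. The program name `faultctl`, the subcommand names, and the in-memory `strategies` dictionary are assumptions, not part of the patent.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="faultctl",  # hypothetical tool name
        description="Administrator interface to the fault management system")
    sub = parser.add_subparsers(dest="command", required=True)

    update = sub.add_parser("update-strategy",
                            help="manually update the knowledge base")
    update.add_argument("symptom")
    update.add_argument("action")

    run = sub.add_parser("run-action",
                         help="carry out a specific fault handling action")
    run.add_argument("action")
    run.add_argument("target")
    return parser


def main(argv: List[str], strategies: Dict[str, str]) -> str:
    """Parse one command and apply it to the in-memory strategy table."""
    args = build_parser().parse_args(argv)
    if args.command == "update-strategy":
        strategies[args.symptom] = args.action
        return f"strategy for {args.symptom} set to {args.action}"
    return f"requested {args.action} on {args.target}"


strategy_table: Dict[str, str] = {}
print(main(["update-strategy", "fan_stall", "isolate_component"], strategy_table))
print(main(["run-action", "raise_fan_speed", "fan0"], strategy_table))
```

In a real deployment the `run-action` branch would hand the request to the fault processing module instead of returning a string.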

Claims (4)

1. A computer system fault diagnosis, decision, and processing method, characterized in that: the method comprises a fault management system that, according to the configuration, operating state, and fault symptoms of the managed computer, intelligently uses the knowledge in a fault management knowledge base to configure and optimize itself autonomously, and on that basis diagnoses faults and handles them with appropriate strategies; the fault management system comprises: a fault management knowledge base (1), a state monitoring module (2), a fault knowledge learning and analysis module (3), a decision planning module (4), a fault processing module (5), and a man-machine interface (6), wherein:
The fault management knowledge base (1) contains fault diagnosis knowledge, fault handling strategy knowledge, and fault prediction knowledge; the knowledge base is the foundation of the method;
The state monitoring module (2) is responsible for monitoring the system state;
The fault knowledge learning and analysis module (3) comprehensively analyzes the existing knowledge in the fault management knowledge base together with the state information collected from the state monitoring module, and reconfigures and updates the knowledge in the knowledge base according to the analysis results; this module is the core of the method;
The decision planning module (4) queries the fault management knowledge base according to the state information collected from the state monitoring module and decides whether the current system has a fault, which handling strategy should be applied to that fault, and whether an early warning is needed;
The fault processing module (5) carries out the actual fault handling actions, including fan speed regulation and component isolation, according to the decisions of the decision planning module;
The man-machine interface (6) lets the administrator manually update the contents of the fault management knowledge base or carry out specific fault handling actions; it provides the interface through which the administrator interacts with the fault management system, as a useful supplement to the autonomic computing mechanism.
2. The method according to claim 1, characterized in that: the state monitoring module uses combined out-of-band/in-band monitoring to obtain chip-level, board-level, and system-level state/fault information of the computer system.
3. The method according to claim 1, characterized in that: the fault knowledge learning and analysis module, based on the large amount of historical state/fault knowledge in the fault management knowledge base, uses cluster analysis algorithms to intelligently analyze the trend of future faults and the handling strategies to be applied to specific faults, and writes the new knowledge obtained from the analysis back into the fault management knowledge base.
4. The method according to claim 1, characterized in that: the fault processing module handles faults that have occurred, as well as potential faults, in combination with hardware-level and operating-system-level fault tolerance mechanisms.
CN201210129006XA 2012-04-28 2012-04-28 Computer system fault diagnosis decision and processing method Pending CN102662788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210129006XA CN102662788A (en) 2012-04-28 2012-04-28 Computer system fault diagnosis decision and processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210129006XA CN102662788A (en) 2012-04-28 2012-04-28 Computer system fault diagnosis decision and processing method

Publications (1)

Publication Number Publication Date
CN102662788A true CN102662788A (en) 2012-09-12

Family

ID=46772287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210129006XA Pending CN102662788A (en) 2012-04-28 2012-04-28 Computer system fault diagnosis decision and processing method

Country Status (1)

Country Link
CN (1) CN102662788A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120912