CN102662788A - Computer system fault diagnosis decision and processing method - Google Patents
- Publication number
- CN102662788A
- Authority
- CN
- China
- Prior art keywords
- fault
- knowledge
- module
- fault management
- knowledge base
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Test And Diagnosis Of Digital Computers (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention provides a method for fault diagnosis, decision planning and fault handling in a computer system. The method employs a fault management system that, according to the configuration, operating state and fault symptoms of the managed computer, intelligently uses the knowledge in a fault management knowledge base to configure and optimize itself, and on that basis diagnoses faults and handles them with appropriate strategies. The fault management system comprises the fault management knowledge base, a state monitoring module, a fault knowledge learning and analysis module, a decision planning module, a fault processing module and a man-machine interface. When the system resources of the managed computer change dynamically, whether because of a fault or for other reasons, the fault management system can intelligently configure and adjust its fault diagnosis knowledge, fault-handling strategy knowledge and fault prediction knowledge according to the monitored system state and fault information, thereby meeting the requirements of autonomous management and autonomous adjustment.
Description
Technical field
The present invention relates to the field of computer fault management, and in particular to a method for fault diagnosis, decision planning and fault handling in a computer system.
Technical background
In fields such as scientific computing, commercial services and government functions, computer systems such as servers and storage devices act as a nerve center. When they fail, the consequences range from service interruption and equipment damage to threats to national security and to people's lives and property. While users pursue high performance, high capacity and high density, what they value even more is the reliability and stability of the system. Driven by this demand, a variety of fault management and fault-tolerance technologies have emerged. Today's high-end computer systems mostly adopt several types of fault management technology, from the underlying hardware up to the application software at the top layer. For example, the Superdome servers of Hewlett-Packard and the Z-series servers of IBM comprehensively adopt fault management capabilities such as fault detection and correction mechanisms, error recovery functions and hardware fault isolation.
However, existing fault management methods and strategies are mostly deployed statically. Once a certain type of fault occurs, the system performs fault detection, fault location, fault isolation, system reconfiguration and similar work according to a fixed pattern. These strategies are all decided when the system is deployed and are difficult to change while the system is running.
In fact, because of the complexity of the architecture of a high-end computer system itself, together with the complexity of the network environment in which it is deployed and of the applications running on it, the resources available in the system can change considerably as the system runs, and the external environment can also change in complex ways. A fixed-pattern fault-handling strategy therefore struggles to support long-term operation of a fault-tolerant computer in a complex environment. The fault-handling strategy of the system needs to change dynamically and adaptively so that it tracks changes in system state and in the external environment as closely as possible. Accordingly, in the field of computer fault management, it is necessary to propose, on the basis of existing fault-handling theory, a method for fault diagnosis, decision planning and fault handling in a computer system.
Summary of the invention
The present invention proposes a method for fault diagnosis, decision planning and fault handling in a computer system. With this method, a fault management system can, according to the configuration, operating state and fault symptoms of the managed computer, intelligently and autonomously configure and optimize the knowledge in a fault management knowledge base, and on that basis diagnose faults and handle them with appropriate strategies.
The object of the invention is achieved as follows. The method comprises a fault management system that, according to the configuration, operating state and fault symptoms of the managed computer, intelligently uses the knowledge in the fault management knowledge base to configure and optimize itself autonomously, and on that basis diagnoses faults and handles them with appropriate strategies. The fault management system comprises: a fault management knowledge base (1), a state monitoring module (2), a fault knowledge learning and analysis module (3), a decision planning module (4), a fault processing module (5) and a man-machine interface (6), wherein:
The fault management knowledge base (1) contains fault diagnosis knowledge, fault-handling strategy knowledge and fault prediction knowledge; the fault management knowledge base is the foundation of the method;
The state monitoring module (2) is responsible for monitoring the system state;
The fault knowledge learning and analysis module (3) comprehensively analyzes the existing knowledge in the fault management knowledge base together with the state information collected from the state monitoring module, and reconfigures and updates the knowledge in the fault management knowledge base according to the analysis result; this module is the core of the method;
The decision planning module (4), according to the state information collected from the state monitoring module, queries the fault management knowledge base and decides whether the current system has a fault, which handling strategy should be applied to that fault, and whether an early warning is needed;
The fault processing module (5) is responsible for carrying out the actual fault-handling actions, such as fan speed adjustment and component isolation, according to the decisions of the decision planning module;
The man-machine interface (6) lets the administrator manually update the content of the fault management knowledge base or execute specific fault-handling actions; it provides the interface through which the administrator interacts with the fault management system, as a useful supplement to the autonomic computing mechanism.
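As an illustration only, the cooperation of modules (2) to (5) around the knowledge base (1) can be sketched as a single monitoring cycle. The function name `run_cycle`, its callable parameters, the dictionary standing in for the knowledge base and all sample values are assumptions made for this sketch; the original text does not prescribe any of them.

```python
def run_cycle(read_state, analyze, decide, handle, knowledge):
    """Run one hypothetical monitor -> learn -> decide -> handle cycle."""
    state = read_state()                           # state monitoring module (2)
    knowledge.update(analyze(knowledge, state))    # learning and analysis module (3)
    actions = decide(knowledge, state)             # decision planning module (4)
    return [handle(action) for action in actions]  # fault processing module (5)


# Illustrative stand-ins; none of these values come from the patent.
knowledge = {"cpu_temp_limit": 85.0}
results = run_cycle(
    read_state=lambda: {"cpu_temp_c": 90.0},
    analyze=lambda kb, state: {"last_cpu_temp_c": state["cpu_temp_c"]},
    decide=lambda kb, state: (
        ["raise_fan_speed"] if state["cpu_temp_c"] >= kb["cpu_temp_limit"] else []
    ),
    handle=lambda action: f"done:{action}",
    knowledge=knowledge,
)
```

Passing the modules in as callables keeps the cycle independent of how each module is actually realized, which matches the text's separation of monitoring, analysis, decision and handling.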
The state monitoring module uses a combined out-of-band and in-band monitoring mode to obtain chip-level, board-level and system-level state and fault information of the computer system.
The fault knowledge learning and analysis module, drawing on the large body of historical state and fault knowledge in the fault management knowledge base, uses cluster analysis algorithms to intelligently analyze the trend of future faults and the handling strategies to be adopted for specific faults, and writes the new knowledge obtained from the analysis back into the fault management knowledge base.
The fault processing module combines hardware-level and operating-system-level fault-tolerance mechanisms to handle faults that have occurred or are latent.
The beneficial effects of the invention are as follows. When the system resources of the managed computer change dynamically, whether because of a fault or for other reasons, the fault management system can, according to the monitored system state and fault information, intelligently configure and adjust the fault diagnosis knowledge, fault-handling strategy knowledge and fault prediction knowledge at run time, thereby meeting the requirements of autonomous management and autonomous adjustment. With these advantages, the computer system fault diagnosis, response and early-warning method based on autonomic computing remedies problems of traditional fault management systems, which can only diagnose and handle faults with predefined strategies: missed faults, incorrect fault-handling strategies, and poor adaptability to changes in system configuration or external environment.
Description of drawings
Fig. 1 is a schematic diagram of the architecture of a traditional computer fault management system based on static fault-handling strategies;
Fig. 2 is a schematic diagram of the architecture of a computer fault management system based on the autonomic-computing fault diagnosis, response and early-warning method.
Embodiment
With reference to the accompanying drawings, a specific example is used below to describe how the computer fault management system based on the autonomic-computing fault diagnosis, response and early-warning method is implemented.
As described in the summary of the invention, the architecture of the invention (see Fig. 2) mainly comprises: a fault management knowledge base (1), a state monitoring module (2), a fault knowledge learning and analysis module (3), a decision planning module (4), a fault processing module (5) and a man-machine interface (6), wherein:
The fault management knowledge base mainly contains fault diagnosis knowledge, fault-handling strategy knowledge and fault prediction knowledge, and can be implemented as a data warehouse; it is the foundation of the method. The knowledge base stores the system's historical fault diagnosis knowledge, fault-handling strategy knowledge and fault prediction knowledge for use by the fault knowledge learning and analysis module. Fault diagnosis knowledge includes the computer state monitoring data and symptoms that have been collected, and serves as the basis for diagnosing the state and faults of the managed resources and the external environment. Fault-handling strategy knowledge defines mappings from states to actions or goals, and includes both the fault-handling strategies obtained through autonomic computing and the predefined strategies. Fault prediction knowledge includes problem-solving knowledge for inferring latent faults from known faults. The administrator can manually update the knowledge in the fault management knowledge base through the man-machine interface, as a supplement to the autonomic computing approach. The decision planning module also draws the basis for its choice of fault-handling strategy from the fault management knowledge base.
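The three kinds of knowledge described above can be pictured with a small in-memory structure. This is a minimal sketch rather than the data-warehouse implementation the text mentions; the class name `KnowledgeBase`, its field and method names, and the fallback action `notify_admin` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KnowledgeBase:
    """Hypothetical in-memory stand-in for the fault management knowledge base."""

    # Fault diagnosis knowledge: symptom name -> limit above which it counts as a fault.
    diagnosis: Dict[str, float] = field(default_factory=dict)
    # Fault-handling strategy knowledge: mapping from a fault (state) to an action.
    strategies: Dict[str, str] = field(default_factory=dict)
    # Fault prediction knowledge: known fault -> latent faults it may indicate.
    predictions: Dict[str, List[str]] = field(default_factory=dict)

    def update_strategy(self, fault: str, action: str) -> None:
        """Record a strategy, whether predefined or learned at run time."""
        self.strategies[fault] = action

    def action_for(self, fault: str, default: str = "notify_admin") -> str:
        """Return the stored action for a fault, or an assumed fallback."""
        return self.strategies.get(fault, default)
```

A call such as `kb.update_strategy("cpu_overheat", "raise_fan_speed")` could come either from the learning and analysis module or from the administrator through the man-machine interface, since both are described as writers of the knowledge base.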
The fault knowledge learning and analysis module uses cluster analysis algorithms. It extracts, cleans, transforms and migrates the fault diagnosis knowledge, fault-handling strategy knowledge and fault prediction knowledge in the fault management knowledge base, intelligently analyzes the trend of future faults and the handling strategies to be adopted for specific faults, and writes the new knowledge obtained from the analysis back into the fault management knowledge base. This module is the core of the method. It can be implemented with classical clustering algorithms such as hierarchical clustering, the decomposition method, the addition method, dynamic clustering, ordered-sample clustering, overlapping clustering and fuzzy clustering.
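To make the clustering step concrete, the following sketch applies single-linkage agglomerative clustering, one simple member of the hierarchical clustering family named above, to a few historical samples. The text does not specify features, distance measure or stopping rule; the two-dimensional samples (CPU temperature, correctable errors per hour), the Euclidean distance and the `max_gap` threshold are assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist


def single_linkage(points, max_gap):
    """Merge clusters while the closest pair of clusters is within max_gap."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Find the two clusters whose nearest members are closest to each other.
        (i, j), gap = min(
            (
                ((a, b), min(dist(p, q) for p in clusters[a] for q in clusters[b]))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if gap > max_gap:
            break  # remaining clusters are too far apart to merge
        # combinations() guarantees i < j, so popping j leaves index i valid.
        clusters[i].extend(clusters.pop(j))
    return clusters


# Hypothetical history: (cpu_temp_c, correctable_errors_per_hour).
samples = [(45.0, 0.0), (47.0, 1.0), (46.0, 0.0), (88.0, 40.0), (90.0, 42.0)]
groups = single_linkage(samples, max_gap=5.0)
```

With these samples the three cool, low-error readings end up in one group and the two hot, high-error readings in another; a learning module could then attach a handling strategy to each group and store it as new knowledge.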
The state monitoring module is responsible for monitoring the system state. System state information can be obtained in two ways: chip-level and board-level state and fault information of the computer system is obtained through out-of-band hardware fault management interfaces (such as I2C/SMBus, JTAG and GPIO); operating-system-level state and fault information is obtained through in-band operating system fault management interfaces (for example, a state and fault monitoring agent implemented by calling operating system APIs).
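The combination of out-of-band and in-band collection might be organized as below. This sketch does not touch real hardware or operating system interfaces: the class `StateMonitor`, the channel labels and the two reader functions with their fixed readings are placeholders assumed for illustration, standing in for an I2C/SMBus sensor read and an operating system monitoring agent respectively.

```python
from typing import Callable, Dict

Reading = Dict[str, float]


class StateMonitor:
    """Hypothetical monitor that merges readings from several channels."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], Reading]] = {}

    def register(self, channel: str, read: Callable[[], Reading]) -> None:
        self._sources[channel] = read

    def poll(self) -> Reading:
        """Collect one snapshot, prefixing each value with its channel name."""
        state: Reading = {}
        for channel, read in self._sources.items():
            for name, value in read().items():
                state[f"{channel}.{name}"] = value
        return state


def fake_out_of_band_read() -> Reading:
    # Placeholder for a chip-level or board-level sensor read.
    return {"cpu_temp_c": 71.5, "fan_rpm": 4200.0}


def fake_in_band_read() -> Reading:
    # Placeholder for an operating-system-level monitoring agent.
    return {"mem_used_ratio": 0.62}


monitor = StateMonitor()
monitor.register("out_of_band", fake_out_of_band_read)
monitor.register("in_band", fake_in_band_read)
snapshot = monitor.poll()
```

Prefixing each reading with its channel keeps the origin of a value visible to the decision planning module, which may matter when the same quantity is reported both in band and out of band.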
The decision planning module, according to the state information collected from the state monitoring module, queries the fault management knowledge base and decides whether the current system has a fault, which handling strategy should be applied to that fault, and whether an early warning is needed. It then calls the fault processing module to carry out the corresponding fault-handling action.
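The three decisions named above (is there a fault, which strategy applies, is an early warning needed) can be sketched with simple threshold rules. The text does not define the decision logic; the function `plan_decisions`, the `Decision` record, the 90% warning ratio and the fallback action `notify_admin` are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Decision:
    symptom: str
    kind: str              # "fault" or "warning"
    action: Optional[str]  # strategy to execute; None for a pure early warning


def plan_decisions(
    state: Dict[str, float],
    limits: Dict[str, float],
    strategies: Dict[str, str],
    warn_ratio: float = 0.9,
) -> List[Decision]:
    """Compare monitored values with diagnosis limits and pick strategies."""
    decisions: List[Decision] = []
    for symptom, value in state.items():
        limit = limits.get(symptom)
        if limit is None:
            continue  # no diagnosis knowledge for this symptom
        if value >= limit:
            action = strategies.get(symptom, "notify_admin")
            decisions.append(Decision(symptom, "fault", action))
        elif value >= warn_ratio * limit:
            decisions.append(Decision(symptom, "warning", None))
    return decisions
```

For instance, with limits of 85 for `cpu_temp_c` and 60 for `disk_temp_c`, a state of 92 and 55 respectively would yield one fault decision and one early warning, while symptoms without a stored limit are ignored.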
The fault processing module is responsible for carrying out the corresponding fault-handling actions according to the decisions of the decision planning module. This can be done in two ways: hardware-level fault-handling mechanisms, such as fan speed adjustment and component isolation, are realized through out-of-band hardware fault management interfaces (such as I2C/SMBus, JTAG and GPIO); operating-system-level fault-handling mechanisms, such as process migration and page table remapping, are realized through in-band operating system fault management interfaces (for example, a state and fault handling agent implemented by calling operating system APIs).
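One way to picture the dispatch of decisions to hardware-level and operating-system-level mechanisms is a registry of named handlers. The sketch below only records what it would do; the class `FaultProcessor`, the action names and the handler bodies are hypothetical and do not call any real hardware or operating system interface.

```python
from typing import Callable, Dict, List


class FaultProcessor:
    """Hypothetical dispatcher from action names to handling routines."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], None]] = {}

    def register(self, action: str, handler: Callable[[str], None]) -> None:
        self._handlers[action] = handler

    def execute(self, action: str, target: str) -> bool:
        """Run the handler for an action; report False if none is registered."""
        handler = self._handlers.get(action)
        if handler is None:
            return False  # unknown action: leave it to the administrator
        handler(target)
        return True


performed: List[str] = []
processor = FaultProcessor()
# Stand-in for an out-of-band, hardware-level mechanism.
processor.register("raise_fan_speed", lambda t: performed.append(f"out-of-band: fan up for {t}"))
# Stand-in for an in-band, operating-system-level mechanism.
processor.register("migrate_process", lambda t: performed.append(f"in-band: migrate {t}"))
```

Returning `False` for an unregistered action gives the caller a clear signal to escalate, for example to the man-machine interface, instead of silently dropping the decision.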
Through the man-machine interface, the administrator can manually update the content of the fault management knowledge base or execute specific fault-handling actions. The interface can be implemented in several forms, such as a Web UI, a GUI or a CLI.
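Of the interface forms listed, a CLI is the simplest to sketch. The example below uses Python's standard `argparse` module; the tool name `faultctl`, the subcommands `set-strategy` and `run-action`, and the in-memory `strategies` and `pending_actions` containers are invented for illustration and are not part of the original text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="faultctl")  # hypothetical tool name
    commands = parser.add_subparsers(dest="command", required=True)

    set_cmd = commands.add_parser("set-strategy", help="map a fault to a handling strategy")
    set_cmd.add_argument("fault")
    set_cmd.add_argument("strategy")

    run_cmd = commands.add_parser("run-action", help="queue a specific fault-handling action")
    run_cmd.add_argument("handler")
    run_cmd.add_argument("target")
    return parser


def apply_command(args, strategies, pending_actions):
    """Apply a parsed administrator command to the in-memory stand-ins."""
    if args.command == "set-strategy":
        strategies[args.fault] = args.strategy  # manual knowledge base update
    elif args.command == "run-action":
        pending_actions.append((args.handler, args.target))  # manual fault handling
```

Parsing `["set-strategy", "cpu_overheat", "raise_fan_speed"]` and applying it would store that mapping in `strategies`, mirroring the manual knowledge base update; `run-action` corresponds to the administrator triggering a specific fault-handling action.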
Claims (4)
1. A computer system fault diagnosis, decision and processing method, characterized in that: the method comprises a fault management system that, according to the configuration, operating state and fault symptoms of the managed computer, intelligently uses the knowledge in a fault management knowledge base to configure and optimize itself autonomously, and on that basis diagnoses faults and handles them with appropriate strategies; the fault management system comprises: a fault management knowledge base (1), a state monitoring module (2), a fault knowledge learning and analysis module (3), a decision planning module (4), a fault processing module (5) and a man-machine interface (6), wherein:
The fault management knowledge base (1) contains fault diagnosis knowledge, fault-handling strategy knowledge and fault prediction knowledge; the fault management knowledge base is the foundation of the method;
The state monitoring module (2) is responsible for monitoring the system state;
The fault knowledge learning and analysis module (3) comprehensively analyzes the existing knowledge in the fault management knowledge base together with the state information collected from the state monitoring module, and reconfigures and updates the knowledge in the fault management knowledge base according to the analysis result; this module is the core of the method;
The decision planning module (4), according to the state information collected from the state monitoring module, queries the fault management knowledge base and decides whether the current system has a fault, which handling strategy should be applied to that fault, and whether an early warning is needed;
The fault processing module (5) is responsible for carrying out the actual fault-handling actions, including fan speed adjustment and component isolation, according to the decisions of the decision planning module;
The man-machine interface (6) lets the administrator manually update the content of the fault management knowledge base or execute specific fault-handling actions; it provides the interface through which the administrator interacts with the fault management system, as a useful supplement to the autonomic computing mechanism.
2. The method according to claim 1, characterized in that: the state monitoring module uses a combined out-of-band and in-band monitoring mode to obtain chip-level, board-level and system-level state and fault information of the computer system.
3. The method according to claim 1, characterized in that: the fault knowledge learning and analysis module, drawing on the large body of historical state and fault knowledge in the fault management knowledge base, uses cluster analysis algorithms to intelligently analyze the trend of future faults and the handling strategies to be adopted for specific faults, and writes the new knowledge obtained from the analysis back into the fault management knowledge base.
4. The method according to claim 1, characterized in that: the fault processing module combines hardware-level and operating-system-level fault-tolerance mechanisms to handle faults that have occurred or are latent.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210129006XA CN102662788A (en) | 2012-04-28 | 2012-04-28 | Computer system fault diagnosis decision and processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102662788A true CN102662788A (en) | 2012-09-12 |
Family
ID=46772287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210129006XA Pending CN102662788A (en) | 2012-04-28 | 2012-04-28 | Computer system fault diagnosis decision and processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102662788A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN2524291Y (en) * | 2002-01-11 | 2002-12-04 | 中国人民解放军第二炮兵工程学院技术开发中心 | Rapid breakdown test devices for automatic computer measure and control system |
CN101833497A (en) * | 2010-03-30 | 2010-09-15 | 山东高效能服务器和存储研究院 | Computer fault management system based on expert system method |
CN101866271A (en) * | 2010-06-08 | 2010-10-20 | 华中科技大学 | Security early warning system and method based on RAID |
CN102364444A (en) * | 2011-09-19 | 2012-02-29 | 浪潮电子信息产业股份有限公司 | System guiding method based on interaction of in-band system and out-of-band system |
US20120066376A1 (en) * | 2010-09-09 | 2012-03-15 | Hitachi, Ltd. | Management method of computer system and management system |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014040418A1 (en) * | 2012-09-17 | 2014-03-20 | 华为技术有限公司 | Internal fault handling method, device and system for virtual machine |
US9483368B2 (en) | 2012-09-17 | 2016-11-01 | Huawei Technologies Co., Ltd. | Method, apparatus, and system for handling virtual machine internal fault |
CN103106126A (en) * | 2013-01-16 | 2013-05-15 | 浪潮电子信息产业股份有限公司 | High-availability computer system based on virtualization |
WO2015035574A1 (en) * | 2013-09-11 | 2015-03-19 | 华为技术有限公司 | Failure processing method, computer system, and apparatus |
US9678826B2 (en) | 2013-09-11 | 2017-06-13 | Huawei Technologies Co., Ltd. | Fault isolation method, computer system, and apparatus |
CN104007994A (en) * | 2014-06-11 | 2014-08-27 | 焦点科技股份有限公司 | Updating method, upgrading method and upgrading system based on strategy storeroom interaction |
CN104007994B (en) * | 2014-06-11 | 2015-04-01 | 焦点科技股份有限公司 | Updating method, upgrading method and upgrading system based on strategy storeroom interaction |
CN105550056B (en) * | 2015-12-11 | 2019-08-06 | 中国航空工业集团公司西安航空计算技术研究所 | A kind of fault self-recovery system and its implementation based on system reconfiguration |
CN105550056A (en) * | 2015-12-11 | 2016-05-04 | 中国航空工业集团公司西安航空计算技术研究所 | System reconfiguration based fault self-recovery system and realization method therefor |
CN106921547A (en) * | 2017-01-25 | 2017-07-04 | 华为技术有限公司 | The apparatus and method of management equipment |
CN107562603A (en) * | 2017-09-25 | 2018-01-09 | 郑州云海信息技术有限公司 | A kind of intelligent fault alignment system and method based on linux |
CN109581995A (en) * | 2017-09-28 | 2019-04-05 | 上海微电子装备(集团)股份有限公司 | A kind of intelligent diagnosis system and method |
CN109581995B (en) * | 2017-09-28 | 2021-09-17 | 上海微电子装备(集团)股份有限公司 | Intelligent diagnosis system and method |
CN108304212A (en) * | 2018-01-05 | 2018-07-20 | 北京康拓科技有限公司 | A method of platform is monitored and is safeguarded |
CN109254895A (en) * | 2018-08-21 | 2019-01-22 | 山东超越数控电子股份有限公司 | A kind of high-performance server accident analysis prediction technique based on BMC |
CN109005072A (en) * | 2018-09-06 | 2018-12-14 | 郑州信大壹密科技有限公司 | The multistage monitoring and managing method of centralization based on strategy |
CN109005072B (en) * | 2018-09-06 | 2021-12-17 | 郑州信大壹密科技有限公司 | Centralized multi-level supervision system based on strategy |
CN114185275A (en) * | 2021-12-06 | 2022-03-15 | 红云红河烟草(集团)有限责任公司 | Fault diagnosis method and device of equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20120912 |