CN103995759B - High-availability computer system failure handling method and device based on core internal-external synergy - Google Patents


Publication number
CN103995759B
Authority
CN
China
Prior art keywords
fault
hardware
fault report
service
operating system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410215175.4A
Other languages
Chinese (zh)
Other versions
CN103995759A (en)
Inventor
廖湘科
颜跃进
李俊良
刘晓建
杨沙洲
姚望
汪黎
秦莹
周强
王非
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201410215175.4A
Publication of CN103995759A
Application granted
Publication of CN103995759B
Legal status: Active
Anticipated expiration

Landscapes

  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a high-availability computer system fault handling method and device based on cooperation inside and outside the kernel. The method comprises the steps of: 1) detecting service faults and hardware faults separately and outputting them through a fault report interface; 2) monitoring for fault reports, analyzing each report, handling the hardware fault or service fault according to the analysis result, recording a log, notifying the administrator, and then judging whether dual-computer hot standby is needed, in which case the designated dual-computer hot-standby software is notified to perform it. The device comprises a unified fault reporting subsystem and a unified fault handling subsystem, which correspond one-to-one to the steps of the method. The method and device manage software and hardware faults in a unified way and detect them efficiently and in time; the handling flow is simple, the fault handling rules are easy to extend, and the high availability of a computer system subject to software or hardware faults can be ensured.

Description

High-availability computer system fault handling method and device based on cooperation inside and outside the kernel
Technical field
The present invention relates to the field of high-availability management technology for computer systems, and in particular to a high-availability computer system fault handling method and device based on cooperation inside and outside the kernel.
Background art
Availability is an index of how stable and reliable a computer system is, and it is usually measured by the mean time between failures: the longer the mean time between failures, the higher the availability of the computer system. The factors that affect availability include both software and hardware. A software fault usually means that a program or piece of software of the computer system cannot work normally, or its normal use is affected, because of some disruptive factor; its scope of influence is generally the software itself and other software or programs that depend on it. A hardware fault usually means that physical hardware of the computer system cannot work normally, or its normal use is affected, because of some disruptive factor; a hardware fault has a comparatively large impact on the computer system and in serious cases can cause the system to crash.
Prior-art computer systems rely on hardware drivers to detect hardware faults, and usually use a periodic polling mechanism to detect service state for software faults. Once a fault is detected, it is handled immediately according to the default policy of the driver or program, and a log of the process is recorded. However, prior-art computer systems have the following problems in high-availability management: 1. software faults and hardware faults are handled and reported independently, and unified management of the two is lacking; 2. traditional backup technology monitors software faults inefficiently and cannot perceive hardware faults in time; 3. the handling flow for software and hardware faults is complex, and users cannot define handling rules.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the above problems of the prior art, to provide a high-availability computer system fault handling method and device based on cooperation inside and outside the kernel that manage software and hardware faults in a unified way, detect software and hardware faults efficiently and in time, have a simple handling flow, make fault handling rules easy to extend, and ensure the high availability of the computer system under software or hardware faults.
To solve the above technical problem, the technical solution provided by the present invention is as follows:
A high-availability computer system fault handling method based on cooperation inside and outside the kernel, implemented by the following steps:
1) establishing a fault report interface outside the operating system kernel; detecting, outside the operating system kernel, service faults including system service faults and application service faults, generating fault reports, and outputting them through the fault report interface; and, at the same time, detecting hardware faults inside the operating system kernel, generating fault reports, and outputting them through the same fault report interface;
2) monitoring the fault report interface for fault reports outside the operating system kernel; after a fault report is received, analyzing it and, according to the analysis result, either handling the hardware corresponding to a hardware fault inside the operating system kernel or handling the service corresponding to a service fault outside the operating system kernel; recording a log of the fault handling and sending a notification to the administrator; then judging according to preset rules whether dual-computer hot standby is required and, if so, notifying the designated dual-computer hot-standby software to perform it.
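As an illustration of what a single report format shared by both detection paths might look like, the following is a minimal Python sketch. The patent does not define the layout of a fault report, so the `FaultReport` class, its field names, and the newline-delimited JSON encoding are all assumptions made for this example.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from enum import Enum


class FaultType(str, Enum):
    SERVICE = "service"    # detected outside the kernel
    HARDWARE = "hardware"  # detected inside the kernel


@dataclass
class FaultReport:
    """Hypothetical unified report; field names are illustrative only."""
    fault_type: FaultType
    source: str                                   # service name or device id
    detail: dict = field(default_factory=dict)    # abnormal-state information
    timestamp: float = field(default_factory=time.time)

    def encode(self) -> bytes:
        """Serialize to one newline-terminated JSON record."""
        record = asdict(self)
        record["fault_type"] = self.fault_type.value
        return (json.dumps(record, sort_keys=True) + "\n").encode("utf-8")

    @staticmethod
    def decode(raw: bytes) -> "FaultReport":
        """Rebuild a report from a record produced by encode()."""
        record = json.loads(raw.decode("utf-8"))
        record["fault_type"] = FaultType(record["fault_type"])
        return FaultReport(**record)
```

Because both kinds of fault share one record type, the consumer of the fault report interface only has to branch on `fault_type` after decoding.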
Preferably, in step 1), detecting service faults including system service faults and application service faults outside the operating system kernel, generating fault reports, and outputting them through the fault report interface specifically comprises:
1.1.1) outside the operating system kernel, checking the state of the system services and application services in the operating system by polling; if any system service or application service is in an abnormal state, judging that a service fault has occurred;
1.1.2) after judging that a service fault has occurred, generating a fault report from the information about the abnormal state of the system service or application service, and outputting the fault report through the fault report interface.
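Steps 1.1.1) and 1.1.2) can be sketched as a single polling round. This is a minimal sketch under stated assumptions: the `poll_services` function, the per-service health-check callables, and the dictionary-shaped report are hypothetical and not taken from the patent, which does not say how service state is probed.

```python
from typing import Callable, Dict, List


def poll_services(checks: Dict[str, Callable[[], bool]],
                  report: Callable[[dict], None]) -> List[str]:
    """Run one polling round; report every service whose check fails."""
    failed = []
    for name, is_healthy in checks.items():
        try:
            healthy = is_healthy()
            reason = "check returned unhealthy"
        except Exception as exc:  # a crashing probe also counts as abnormal
            healthy = False
            reason = f"check raised {type(exc).__name__}"
        if not healthy:
            failed.append(name)
            # Step 1.1.2: build a report from the abnormal-state information
            # and hand it to the fault report interface (here: a callable).
            report({"fault_type": "service", "source": name, "detail": reason})
    return failed
```

A caller would typically invoke `poll_services` in a loop with a sleep between rounds; the polling interval is a tuning choice the patent leaves open.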
Preferably, in step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports, and outputting them through the fault report interface are as follows:
1.2.1) detecting the corresponding hardware state information through multiple hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines, and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects the on-site data of the corresponding hardware as hardware fault data according to preset rules;
1.2.2) encapsulating the hardware fault data into a fault report and storing it in a preset fault message queue;
1.2.3) scheduling and distributing the fault reports stored in the fault message queue according to the queue;
1.2.4) using a thread to output the scheduled fault reports through the fault report interface.
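The flow of steps 1.2.1) to 1.2.4) is a producer/consumer pipeline: monitoring points enqueue encapsulated fault data, and a dedicated thread drains the queue toward the report interface. The sketch below simulates that flow in user space with Python's standard `queue` and `threading` modules; the real mechanism runs inside the kernel, and every name here (`monitoring_point`, `reporter`, `run_demo`, the `ecc_errors` field) is hypothetical.

```python
import queue
import threading

fault_queue: "queue.Queue" = queue.Queue()  # stands in for the preset fault message queue
_STOP = object()                            # sentinel used to stop the reporter thread


def monitoring_point(device: str, status: dict, is_abnormal) -> bool:
    """Steps 1.2.1/1.2.2: on an abnormal state, wrap the on-site data and enqueue it."""
    if not is_abnormal(status):
        return False
    fault_queue.put({"fault_type": "hardware", "source": device, "detail": dict(status)})
    return True


def reporter(output) -> None:
    """Steps 1.2.3/1.2.4: take reports off the queue in FIFO order and output them."""
    while True:
        item = fault_queue.get()
        if item is _STOP:
            break
        output(item)


def run_demo() -> list:
    """Drive two monitoring points and return what the reporter thread delivered."""
    delivered = []
    worker = threading.Thread(target=reporter, args=(delivered.append,))
    worker.start()
    monitoring_point("dimm0", {"ecc_errors": 0}, lambda s: s["ecc_errors"] > 0)
    monitoring_point("dimm1", {"ecc_errors": 7}, lambda s: s["ecc_errors"] > 0)
    fault_queue.put(_STOP)
    worker.join()
    return delivered
```

Decoupling detection from output through a queue keeps the monitoring point cheap, which matters when it sits in an interrupt handler or driver path.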
Preferably, the detailed steps of step 2) are as follows:
2.1) outside the operating system kernel, monitoring the fault report interface for fault reports by means of a daemon process;
2.2) outside the operating system kernel, after a fault report is received, analyzing it and determining its fault type; if the fault type is a service fault, recovering the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judging whether the hardware corresponding to the fault report needs to be isolated; if isolation is needed, jumping to step 2.3); otherwise judging whether the hardware corresponding to the fault report needs to be recovered; if recovery is needed, jumping to step 2.4); otherwise jumping to step 2.5);
2.3) when the hardware corresponding to the fault report needs to be isolated, isolating that hardware inside the operating system kernel;
2.4) when the hardware corresponding to the fault report needs to be recovered, recovering that hardware inside the operating system kernel;
2.5) recording a log of the fault handling;
2.6) sending a notification to the administrator;
2.7) judging according to preset rules whether dual-computer hot standby is required; if so, notifying the designated dual-computer hot-standby software to perform dual-computer hot standby by calling its notification plug-in.
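The decision flow of steps 2.2) to 2.7) can be sketched as a small dispatcher. This is an illustrative sketch only: the `handle_fault` function, the rule names, and the action names are hypothetical, and the sketch assumes that isolation and recovery are alternatives, which is one reading of the "otherwise" wording in step 2.2).

```python
from typing import Callable, Dict, List


def handle_fault(report: dict,
                 rules: Dict[str, Callable[[dict], bool]],
                 actions: Dict[str, Callable[[dict], None]]) -> List[str]:
    """Return the ordered list of actions taken for one fault report."""
    taken = []

    def do(name: str) -> None:
        actions[name](report)
        taken.append(name)

    if report["fault_type"] == "service":
        do("recover_service")              # step 2.2, service branch
    elif rules["needs_isolation"](report):
        do("isolate_hardware")             # step 2.3
    elif rules["needs_recovery"](report):
        do("recover_hardware")             # step 2.4
    do("log")                              # step 2.5
    do("notify_admin")                     # step 2.6
    if rules["needs_hot_standby"](report):
        do("notify_hot_standby")           # step 2.7
    return taken
```

Keeping the rules in a table of predicates, rather than hard-coding them, mirrors the stated goal that handling rules should be easy to extend.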
The present invention also provides a high-availability computer system fault handling device based on cooperation inside and outside the kernel, comprising:
a unified fault reporting subsystem, configured to establish a fault report interface outside the operating system kernel; to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports, and output them through the fault report interface; and, at the same time, to detect hardware faults inside the operating system kernel, generate fault reports, and output them through the same fault report interface;
a unified fault handling subsystem, configured to monitor the fault report interface for fault reports outside the operating system kernel; after a fault report is received, to analyze it and, according to the analysis result, either handle the hardware corresponding to a hardware fault inside the operating system kernel or handle the service corresponding to a service fault outside the operating system kernel; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-computer hot standby is required and, if so, to notify the designated dual-computer hot-standby software to perform it.
Preferably, the unified fault reporting subsystem comprises a service detection module configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports, and output them through the fault report interface, the service detection module comprising:
a service state polling detection submodule, configured to check, outside the operating system kernel, the state of the system services and application services in the operating system by polling, and to judge that a service fault has occurred if any system service or application service is in an abnormal state;
a service fault report submodule, configured to, after a service fault is judged to have occurred, generate a fault report from the information about the abnormal state of the system service or application service and output the fault report through the fault report interface.
Preferably, the unified fault reporting subsystem comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports, and output them through the fault report interface, the hardware detection module comprising:
a hardware state monitoring submodule, configured to detect the corresponding hardware state information through multiple hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines, and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects the on-site data of the corresponding hardware as hardware fault data according to preset rules;
a hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
a fault message queue scheduling submodule, configured to schedule and distribute the fault reports stored in the fault message queue according to the queue;
a hardware fault data report submodule, configured to use a thread to output the scheduled fault reports through the fault report interface.
Preferably, the unified fault handling subsystem comprises:
a fault management daemon module, configured to monitor the fault report interface for fault reports outside the operating system kernel by means of a daemon process;
a fault handling engine, configured to analyze a fault report after it is received outside the operating system kernel and determine its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether the hardware corresponding to the fault report needs to be isolated and, if so, call the fault isolation module; otherwise to judge whether the hardware corresponding to the fault report needs to be recovered and, if so, call the fault recovery module;
a fault isolation module, configured to isolate the hardware corresponding to the fault report inside the operating system kernel when that hardware needs to be isolated;
a fault recovery module, configured to recover the hardware corresponding to the fault report inside the operating system kernel when that hardware needs to be recovered;
a logging module, configured to record a log of the fault handling;
a fault notification module, configured to send a notification to the administrator;
a dual-computer hot-standby processing module, configured to judge according to preset rules whether dual-computer hot standby is required and, if so, to notify the designated dual-computer hot-standby software to perform dual-computer hot standby by calling its notification plug-in.
The high-availability computer system fault handling method based on cooperation inside and outside the kernel of the present invention has the following technical effects. Service faults, including system service faults and application service faults, are detected outside the operating system kernel, and hardware faults are detected inside the operating system kernel; both generate fault reports that are output through the fault report interface established outside the operating system kernel. Combining hardware detection with service detection in this way achieves unified reporting of software faults (service faults) and hardware faults. Combining a kernel module with a daemon outside the kernel that obtains the unified reports then provides a unified handling mechanism and dual-computer hot-standby processing after reporting. This solves the problems in high-availability management that faults cannot be reported cooperatively, that software and hardware cannot be managed in a unified way, and that traditional hot-standby software cannot perceive faults in time. Software and hardware faults are detected efficiently and in time, the handling flow is simple, and the fault handling rules are easy to extend. Once fault data has been read and diagnosed, faults are handled through fault isolation and fault recovery, and the hot-standby software is notified to switch services between machines, which ensures the high availability of the computer system under the abnormal conditions of a software or hardware fault.
The high-availability computer system fault handling device based on cooperation inside and outside the kernel of the present invention corresponds one-to-one to the method of the present invention and therefore has the same technical effects as the method, which are not repeated here.
Brief description of the drawings
Fig. 1 is a schematic diagram of the basic flow of the method of the embodiment of the present invention.
Fig. 2 is a schematic flow diagram of service fault detection in the method of the embodiment of the present invention.
Fig. 3 is a schematic flow diagram of hardware fault detection in the method of the embodiment of the present invention.
Fig. 4 is a schematic diagram of the unified handling of service faults and hardware faults in the method of the embodiment of the present invention.
Fig. 5 is a schematic diagram of the framework structure of the device of the embodiment of the present invention.
Fig. 6 is a schematic diagram of the framework structure of the unified fault reporting subsystem in the device of the embodiment of the present invention.
Fig. 7 is a schematic diagram of the framework structure of the unified fault handling subsystem in the device of the embodiment of the present invention.
Detailed description of the embodiments
As shown in Fig. 1, the high-availability computer system fault handling method based on cooperation inside and outside the kernel of the present invention is implemented by the following steps:
1) establishing a fault report interface outside the operating system kernel; detecting, outside the operating system kernel, service faults including system service faults and application service faults, generating fault reports, and outputting them through the fault report interface; and, at the same time, detecting hardware faults inside the operating system kernel, generating fault reports, and outputting them through the same fault report interface;
2) monitoring the fault report interface for fault reports outside the operating system kernel; after a fault report is received, analyzing it and, according to the analysis result, either handling the hardware corresponding to a hardware fault inside the operating system kernel or handling the service corresponding to a service fault outside the operating system kernel; recording a log of the fault handling and sending a notification to the administrator; then judging according to preset rules whether dual-computer hot standby is required and, if so, notifying the designated dual-computer hot-standby software to perform it.
As shown in Fig. 2, in step 1) of the present embodiment, detecting service faults including system service faults and application service faults outside the operating system kernel, generating fault reports, and outputting them through the fault report interface specifically comprises:
1.1.1) outside the operating system kernel, checking the state of the system services and application services in the operating system by polling; if any system service or application service is in an abnormal state, judging that a service fault has occurred;
1.1.2) after judging that a service fault has occurred, generating a fault report from the information about the abnormal state of the system service or application service, and outputting the fault report through the fault report interface.
As shown in Fig. 3, in step 1) of the present embodiment, the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports, and outputting them through the fault report interface are as follows:
1.2.1) detecting the corresponding hardware state information through multiple hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines, and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects the on-site data of the corresponding hardware as hardware fault data according to preset rules;
1.2.2) encapsulating the hardware fault data into a fault report and storing it in a preset fault message queue;
1.2.3) scheduling and distributing the fault reports stored in the fault message queue according to the queue;
1.2.4) using a thread to output the scheduled fault reports through the fault report interface.
As shown in Fig. 4, the detailed steps of step 2) of the present embodiment are as follows:
2.1) outside the operating system kernel, monitoring the fault report interface for fault reports by means of a daemon process;
2.2) outside the operating system kernel, after a fault report is received, analyzing it and determining its fault type; if the fault type is a service fault, recovering the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judging whether the hardware corresponding to the fault report needs to be isolated; if isolation is needed, jumping to step 2.3); otherwise judging whether the hardware corresponding to the fault report needs to be recovered; if recovery is needed, jumping to step 2.4); otherwise jumping to step 2.5);
2.3) when the hardware corresponding to the fault report needs to be isolated, isolating that hardware inside the operating system kernel;
2.4) when the hardware corresponding to the fault report needs to be recovered, recovering that hardware inside the operating system kernel;
2.5) recording a log of the fault handling;
2.6) sending a notification to the administrator;
2.7) judging according to preset rules whether dual-computer hot standby is required; if so, notifying the designated dual-computer hot-standby software to perform dual-computer hot standby by calling its notification plug-in.
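Step 2.2) recovers a failed service "according to the service dependency description", but the format of that description is not given. One plausible reading, sketched below as an assumption, is a dependency graph in which a failed service and everything that transitively depends on it are restarted with dependencies first. The `recovery_order` function and the example service names are hypothetical; `graphlib.TopologicalSorter` is from the Python standard library (3.9 and later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def recovery_order(depends_on: Dict[str, Set[str]], failed: str) -> List[str]:
    """Return the failed service and all its dependents, dependencies first."""
    # Collect the failed service plus every service that transitively depends on it.
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for svc, deps in depends_on.items():
            if svc not in affected and deps & affected:
                affected.add(svc)
                changed = True
    # Restrict the graph to the affected services and sort it topologically,
    # so that each service is restarted only after the services it needs.
    sub = {svc: depends_on.get(svc, set()) & affected for svc in affected}
    return list(TopologicalSorter(sub).static_order())
```

For example, with `{"web": {"app"}, "app": {"db"}, "db": set()}` a failure of `db` yields the restart order `db`, `app`, `web`, while unrelated services are left untouched.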
As shown in Fig. 5, the high-availability computer system fault handling device based on cooperation inside and outside the kernel of the present embodiment comprises:
a unified fault reporting subsystem, configured to establish a fault report interface outside the operating system kernel; to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports, and output them through the fault report interface; and, at the same time, to detect hardware faults inside the operating system kernel, generate fault reports, and output them through the same fault report interface;
a unified fault handling subsystem, configured to monitor the fault report interface for fault reports outside the operating system kernel; after a fault report is received, to analyze it and, according to the analysis result, either handle the hardware corresponding to a hardware fault inside the operating system kernel or handle the service corresponding to a service fault outside the operating system kernel; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-computer hot standby is required and, if so, to notify the designated dual-computer hot-standby software to perform it.
As shown in Fig. 6, the unified fault reporting subsystem of the present embodiment comprises a service detection module configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports, and output them through the fault report interface, the service detection module comprising:
a service state polling detection submodule, configured to check, outside the operating system kernel, the state of the system services and application services in the operating system by polling, and to judge that a service fault has occurred if any system service or application service is in an abnormal state;
a service fault report submodule, configured to, after a service fault is judged to have occurred, generate a fault report from the information about the abnormal state of the system service or application service and output the fault report through the fault report interface.
As shown in Fig. 6, the unified fault reporting subsystem of the present embodiment comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports, and output them through the fault report interface, the hardware detection module comprising:
a hardware state monitoring submodule, configured to detect the corresponding hardware state information through multiple hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines, and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects the on-site data of the corresponding hardware as hardware fault data according to preset rules;
a hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
a fault message queue scheduling submodule, configured to schedule and distribute the fault reports stored in the fault message queue according to the queue;
a hardware fault data report submodule, configured to use a thread to output the scheduled fault reports through the fault report interface.
In the present embodiment, because the hardware state monitoring submodule detects hardware state information through multiple hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines, and the hardware drivers, early warning and rapid discovery of hardware faults are improved, and hardware faults are found more promptly and efficiently.
As shown in Fig. 7, the unified fault handling subsystem of the present embodiment comprises:
a fault management daemon module, configured to monitor the fault report interface for fault reports outside the operating system kernel by means of a daemon process;
a fault handling engine, configured to analyze a fault report after it is received outside the operating system kernel and determine its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether the hardware corresponding to the fault report needs to be isolated and, if so, call the fault isolation module; otherwise to judge whether the hardware corresponding to the fault report needs to be recovered and, if so, call the fault recovery module;
a fault isolation module, configured to isolate the hardware corresponding to the fault report inside the operating system kernel when that hardware needs to be isolated;
a fault recovery module, configured to recover the hardware corresponding to the fault report inside the operating system kernel when that hardware needs to be recovered;
a logging module, configured to record a log of the fault handling;
a fault notification module, configured to send a notification to the administrator;
a dual-computer hot-standby processing module, configured to judge according to preset rules whether dual-computer hot standby is required and, if so, to notify the designated dual-computer hot-standby software to perform dual-computer hot standby by calling its notification plug-in.
By combining a kernel module with a daemon process outside the kernel, the present embodiment designs the unified fault handling subsystem. In this subsystem, the fault handling engine uses the service dependency description and state detection for specific services to perceive the running state of the kernel and of key services in real time and to keep the system service states consistent, which improves the stability of the system.

By combining hardware detection with service detection, the fault handling device of the present embodiment designs a unified fault reporting framework as the concrete implementation of the unified fault reporting subsystem. The unified fault reporting framework inserts hardware state checkpoints into the processor and memory management code, the bus code and the device driver code, which improves early warning and fast discovery of hardware faults. The steps are as follows:
1) a hardware state checkpoint detects an abnormality;
2) the hardware state data are collected and encapsulated;
3) the data are put into a message queue;
4) the fault sending thread is called to send the data.

By combining a kernel module with a daemon process outside the kernel, the fault handling device of the present embodiment designs a unified fault handling framework as the concrete implementation of the unified fault handling subsystem. The unified fault handling framework uses the service dependency description and state detection for specific services to perceive the running state of the kernel and of key services in real time and to keep the system service states consistent. The steps are as follows:
1) the kernel or the service detection module reports an abnormality to the fault management daemon;
2) the fault management daemon performs fault handling;
3) the fault isolation module is notified to isolate the faulty hardware;
4) the fault recovery module is notified to complete fault recovery;
5) a fault notification is sent through the fault notification module;
6) a log is recorded through the logging module.

Based on the unified fault reporting framework and the unified fault handling framework, the fault handling device of the present embodiment improves on traditional multi-machine hot standby technology. Hardware and software fault information is obtained in real time through the unified fault reporting framework and handled by the unified fault handling framework, and the result is passed to the multi-machine hot standby software through an efficient event communication and callback mechanism; the hot standby software then carries out inter-machine service switchover or migration. The steps are as follows:
1) the multi-machine hot standby software registers a migration signal and the trigger rule for that signal with the fault management daemon;
2) the fault handling module handles the fault and sends the migration signal when the trigger rule is met;
3) the multi-machine hot standby software receives the migration signal and carries out inter-machine service migration.

The fault handling device of the present embodiment implements and integrates unified monitoring of hardware faults and software faults, ensuring that the administrator can obtain hardware and software fault information in real time. It applies the fault diagnosis, fault isolation and fault recovery modules to handle faults and notifies the multi-machine hot standby software to carry out inter-machine service switchover, ensuring the high availability of the computer system under the abnormal conditions of a software fault or a hardware fault.
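The registration and callback exchange between the hot standby software and the fault management daemon described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class `FaultManagementDaemon`, the method names, the `FaultReport` fields and the severity threshold are all hypothetical names chosen for illustration, and the local fault handling itself is elided.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FaultReport:
    """Hypothetical fault report; the field names are assumptions."""
    source: str      # "hardware" or "service"
    component: str   # e.g. a device or service name
    severity: int    # larger means more serious (assumed scale)


TriggerRule = Callable[[FaultReport], bool]
MigrationCallback = Callable[[FaultReport], None]


class FaultManagementDaemon:
    """Sketch of the daemon side of the event/callback mechanism."""

    def __init__(self) -> None:
        self._subscribers: List[Tuple[TriggerRule, MigrationCallback]] = []

    def register_migration_signal(self, rule: TriggerRule,
                                  callback: MigrationCallback) -> None:
        # Step 1: the hot standby software registers a signal and its trigger rule.
        self._subscribers.append((rule, callback))

    def handle_fault(self, report: FaultReport) -> int:
        # Step 2: local handling (isolation, recovery, logging) is elided here;
        # afterwards the migration signal is sent wherever a rule matches.
        fired = 0
        for rule, callback in self._subscribers:
            if rule(report):
                callback(report)  # Step 3 happens inside the callback.
                fired += 1
        return fired
```

In this sketch the hot standby software would call `register_migration_signal` once at start-up, for example with a rule such as `lambda r: r.severity >= 3`, and perform the inter-machine migration inside its callback.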
The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions that fall within the idea of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (8)

1. A fault handling method for a high-availability computer system based on cooperation inside and outside the kernel, characterized in that the implementation steps are as follows:
1) establishing a fault reporting interface outside the operating system kernel; detecting, outside the operating system kernel, service faults including system service faults and application service faults, generating fault reports and outputting them through the fault reporting interface; and at the same time detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface;
2) monitoring, outside the operating system kernel, the fault reporting interface for fault reports; after a fault report is received, analyzing the fault report; according to the analysis result, performing fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or performing fault handling outside the operating system kernel on the service corresponding to a service fault; recording a log of the fault handling and sending a notification to the administrator; then judging according to preset rules whether dual-computer hot standby is required and, if so, notifying the designated dual-computer hot standby software to perform dual-computer hot standby.
2. The fault handling method for a high-availability computer system based on cooperation inside and outside the kernel according to claim 1, characterized in that, in step 1), detecting outside the operating system kernel the service faults including system service faults and application service faults, generating fault reports and outputting them through the fault reporting interface specifically means:
1.1.1) performing, outside the operating system kernel, state detection by polling on the system services and application services in the operating system; if any system service or application service is in an abnormal state, judging that a service fault has occurred;
1.1.2) after judging that a service fault has occurred, generating a fault report from the information on the abnormal state of the system service or application service, and outputting the fault report through the fault reporting interface.
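Steps 1.1.1) and 1.1.2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the probe callables, the report fields and the function names `poll_services_once` and `run_poller` are hypothetical, and the fault reporting interface is modelled as a plain callable that accepts one report.

```python
import time
from typing import Callable, Dict

Probe = Callable[[], bool]            # returns True when the service is healthy
ReportInterface = Callable[[dict], None]


def poll_services_once(probes: Dict[str, Probe],
                       report_interface: ReportInterface) -> int:
    """Check every service once and report each abnormal one."""
    faults = 0
    for name, probe in probes.items():
        detail = ""
        try:
            healthy = probe()
        except Exception as exc:      # a failing probe counts as an abnormal state
            healthy, detail = False, repr(exc)
        if not healthy:
            # Step 1.1.2: build the fault report and output it through the interface.
            report_interface({"type": "service", "service": name,
                              "state": "abnormal", "detail": detail})
            faults += 1
    return faults


def run_poller(probes: Dict[str, Probe], report_interface: ReportInterface,
               interval_s: float, rounds: int) -> int:
    """Step 1.1.1: repeat the state detection in polling mode."""
    total = 0
    for _ in range(rounds):
        total += poll_services_once(probes, report_interface)
        time.sleep(interval_s)
    return total
```

A real probe might check a PID file or open a TCP connection to the service; the sketch leaves that choice to the caller.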
3. The fault handling method for a high-availability computer system based on cooperation inside and outside the kernel according to claim 2, characterized in that, in step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface are as follows:
1.2.1) detecting the corresponding hardware state information through multiple hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that hardware state monitoring point collecting the on-site data of the corresponding hardware as hardware fault data according to preset rules;
1.2.2) encapsulating the hardware fault data to generate a fault report and storing the fault report in a preset fault message queue;
1.2.3) scheduling and dispatching, according to the fault message queue, the fault reports stored in the fault message queue;
1.2.4) using a thread to output the dispatched fault reports through the fault reporting interface.
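The checkpoint, queue and sending-thread pipeline of steps 1.2.1) to 1.2.4) can be sketched in user-space Python. The patent places this logic inside the operating system kernel, so the sketch only mirrors the data flow; the class name `HardwareFaultReporter`, the `checkpoint` signature, the report fields and the first-in-first-out dispatch order are assumptions.

```python
import queue
import threading
from typing import Callable

_STOP = object()  # sentinel that tells the sending thread to exit


class HardwareFaultReporter:
    """Sketch: monitoring point -> encapsulation -> message queue -> sender thread."""

    def __init__(self, report_interface: Callable[[dict], None]) -> None:
        self._queue: "queue.Queue[object]" = queue.Queue()
        self._report_interface = report_interface
        self._thread = threading.Thread(target=self._send_loop, daemon=True)
        self._thread.start()

    def checkpoint(self, device: str, status: str,
                   collect: Callable[[], dict]) -> bool:
        """Called at a hardware state monitoring point; True if a fault was queued."""
        if status == "ok":
            return False
        # Steps 1.2.1 and 1.2.2: collect on-site data, encapsulate, enqueue.
        report = {"type": "hardware", "device": device,
                  "status": status, "data": collect()}
        self._queue.put(report)
        return True

    def _send_loop(self) -> None:
        # Steps 1.2.3 and 1.2.4: dispatch queued reports in FIFO order.
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            self._report_interface(item)

    def close(self) -> None:
        """Flush everything already queued, then stop the sending thread."""
        self._queue.put(_STOP)
        self._thread.join()
```

Decoupling the monitoring point from the sender through a queue keeps the checkpoint cheap, which matters when the checkpoint sits on an interrupt or driver path.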
4. The fault handling method for a high-availability computer system based on cooperation inside and outside the kernel according to claim 3, characterized in that the detailed steps of step 2) are as follows:
2.1) monitoring, outside the operating system kernel and by means of a daemon process, the fault reporting interface for fault reports;
2.2) after a fault report is received outside the operating system kernel, analyzing the fault report and judging its fault type; if the fault type is a service fault, recovering the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judging whether faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report; if faulty-hardware isolation is needed, jumping to step 2.3); otherwise judging whether faulty-hardware recovery needs to be performed on the hardware corresponding to the fault report; if faulty-hardware recovery is needed, jumping to step 2.4); otherwise jumping to step 2.5);
2.3) when faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report, performing faulty-hardware isolation inside the operating system kernel on the hardware corresponding to the fault report;
2.4) when faulty-hardware recovery needs to be performed on the hardware corresponding to the fault report, performing faulty-hardware recovery inside the operating system kernel on the hardware corresponding to the fault report;
2.5) recording a log of the fault handling;
2.6) sending a notification to the administrator;
2.7) judging according to preset rules whether dual-computer hot standby is required and, if so, notifying the designated dual-computer hot standby software to perform dual-computer hot standby by calling the notification plug-in of that software.
5. A fault handling device for a high-availability computer system based on cooperation inside and outside the kernel, characterized by comprising:
a unified fault reporting subsystem, configured to establish a fault reporting interface outside the operating system kernel; to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports and output them through the fault reporting interface; and at the same time to detect hardware faults inside the operating system kernel, generate fault reports and output them through the fault reporting interface;
a unified fault handling subsystem, configured to monitor, outside the operating system kernel, the fault reporting interface for fault reports; after a fault report is received, to analyze the fault report; according to the analysis result, to perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or to perform fault handling outside the operating system kernel on the service corresponding to a service fault; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-computer hot standby is required and, if so, to notify the designated dual-computer hot standby software to perform dual-computer hot standby.
6. The fault handling device for a high-availability computer system based on cooperation inside and outside the kernel according to claim 5, characterized in that the unified fault reporting subsystem comprises a service detection module configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports and output them through the fault reporting interface, the service detection module comprising:
a service state polling detection submodule, configured to perform, outside the operating system kernel, state detection by polling on the system services and application services in the operating system and, if any system service or application service is in an abnormal state, to judge that a service fault has occurred;
a service fault reporting submodule, configured to, after a service fault is judged to have occurred, generate a fault report from the information on the abnormal state of the system service or application service and output the fault report through the fault reporting interface.
7. The fault handling device for a high-availability computer system based on cooperation inside and outside the kernel according to claim 6, characterized in that the unified fault reporting subsystem comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports and output them through the fault reporting interface, the hardware detection module comprising:
a hardware state monitoring submodule, configured to detect the corresponding hardware state information through multiple hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that hardware state monitoring point collects the on-site data of the corresponding hardware as hardware fault data according to preset rules;
a hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data to generate a fault report and store the fault report in a preset fault message queue;
a fault message queue scheduling submodule, configured to schedule and dispatch, according to the fault message queue, the fault reports stored in the fault message queue;
a hardware fault data reporting submodule, configured to use a thread to output the dispatched fault reports through the fault reporting interface.
8. The fault handling device for a high-availability computer system based on cooperation inside and outside the kernel according to claim 7, characterized in that the unified fault handling subsystem comprises:
a fault management daemon module, configured to monitor, outside the operating system kernel and by means of a daemon process, the fault reporting interface for fault reports;
a fault handling engine, configured to, after a fault report is received outside the operating system kernel, analyze the fault report and judge its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report; if faulty-hardware isolation is needed, to call the fault isolation module; otherwise to judge whether faulty-hardware recovery needs to be performed on the hardware corresponding to the fault report and, if faulty-hardware recovery is needed, to call the fault recovery module;
a fault isolation module, configured to, when faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report, perform faulty-hardware isolation inside the operating system kernel on the hardware corresponding to the fault report;
a fault recovery module, configured to, when faulty-hardware recovery needs to be performed on the hardware corresponding to the fault report, perform faulty-hardware recovery inside the operating system kernel on the hardware corresponding to the fault report;
a logging module, configured to record a log of the fault handling;
a fault notification module, configured to send a notification to the administrator;
a dual-computer hot standby processing module, configured to judge according to preset rules whether dual-computer hot standby is required and, if so, to notify the designated dual-computer hot standby software to perform dual-computer hot standby by calling the notification plug-in of that software.
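The claims state that the fault handling engine recovers a failed service "according to the service dependency description" but do not say how that description is used. One plausible interpretation, offered here purely as an assumption, is to restart the failed service together with every service that depends on it, dependencies first. The Python sketch below shows that interpretation using the standard-library `graphlib` module (Python 3.9 or later); the function name `recovery_order` and the dictionary format of the dependency description are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def recovery_order(depends_on: Dict[str, List[str]], failed: str) -> List[str]:
    """Return the services to restart after `failed` breaks, dependencies first.

    `depends_on` maps each service to the services it requires (assumed format).
    """
    # Collect the failed service and everything that transitively depends on it.
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in affected and affected.intersection(deps):
                affected.add(service)
                changed = True

    # Restrict the dependency graph to the affected services and sort it so
    # that each service appears after the services it depends on.
    subgraph = {svc: [d for d in depends_on.get(svc, []) if d in affected]
                for svc in affected}
    return list(TopologicalSorter(subgraph).static_order())
```

With a description such as `{"db": [], "app": ["db"], "web": ["app"]}`, a failure of `db` leads to restarting `db`, then `app`, then `web`, while a failure of `web` touches only `web`.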
CN201410215175.4A 2014-05-21 2014-05-21 High-availability computer system failure handling method and device based on core internal-external synergy Active CN103995759B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410215175.4A CN103995759B (en) 2014-05-21 2014-05-21 High-availability computer system failure handling method and device based on core internal-external synergy

Publications (2)

Publication Number Publication Date
CN103995759A CN103995759A (en) 2014-08-20
CN103995759B true CN103995759B (en) 2015-04-29

Family

ID=51309932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410215175.4A Active CN103995759B (en) 2014-05-21 2014-05-21 High-availability computer system failure handling method and device based on core internal-external synergy

Country Status (1)

Country Link
CN (1) CN103995759B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10002056B2 (en) 2015-09-15 2018-06-19 Texas Instruments Incorporated Integrated circuit chip with cores asymmetrically oriented with respect to each other
CN106338982A (en) * 2016-09-26 2017-01-18 深圳前海弘稼科技有限公司 Fault processing method, fault processing device and server
CN106815114A (en) * 2017-01-12 2017-06-09 西安科技大学 A kind of computer system fault handling method based on software-hardware synergism
CN111367769B (en) * 2020-03-30 2023-07-21 浙江大华技术股份有限公司 Application fault processing method and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1841341A (en) * 2005-03-31 2006-10-04 冲电气工业株式会社 Information processing device, information processing method and information processing program
US8190946B2 (en) * 2008-05-30 2012-05-29 Fujitsu Limited Fault detecting method and information processing apparatus

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833497B (en) * 2010-03-30 2015-01-21 浪潮电子信息产业股份有限公司 Computer fault management system based on expert system method
CN102364448B (en) * 2011-09-19 2014-01-15 浪潮电子信息产业股份有限公司 Fault-tolerant method for computer fault management system
US9268591B2 (en) * 2012-10-18 2016-02-23 Vmware, Inc. Systems and methods for detecting system exceptions in guest operating systems
CN103279367A (en) * 2013-05-07 2013-09-04 浪潮电子信息产业股份有限公司 Kernel drive isolating system

Also Published As

Publication number Publication date
CN103995759A (en) 2014-08-20

Similar Documents

Publication Publication Date Title
US11360842B2 (en) Fault processing method, related apparatus, and computer
TWI746512B (en) Physical machine fault classification processing method and device, and virtual machine recovery method and system
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
JP4882845B2 (en) Virtual computer system
US7281040B1 (en) Diagnostic/remote monitoring by email
EP3142011B1 (en) Anomaly recovery method for virtual machine in distributed environment
EP3148116B1 (en) Information system fault scenario information collection method and system
CN104268061B (en) A kind of storage state monitoring method suitable for virtual machine
CN103607297A (en) Fault processing method of computer cluster system
CN103995759B (en) High-availability computer system failure handling method and device based on core internal-external synergy
CN106598790A (en) Server hardware failure detection method, apparatus of server, and server
CN112073262B (en) Cloud platform monitoring method, device, equipment and system
CN103324565B (en) Daily record monitoring method
CN105243004A (en) Failure resource detection method and apparatus
CN103207825A (en) Method and device for managing faults of entire equipment cabinet
CN104283718A (en) Network device and hardware fault diagnosis method used for network device
CN103605592A (en) Mechanism of detecting malfunctions of distributed computer system
CN108809729A (en) The fault handling method and device that CTDB is serviced in a kind of distributed system
CN112100019B (en) Multi-source fault collaborative analysis positioning method for large-scale system
CN116126772A (en) UART serial port management system and method applied to ARM server
CN115102838B (en) Emergency processing method and device for server downtime risk and electronic equipment
CN102231124B (en) A kind of guard method of tasks of embedded system
JP2009252006A (en) Log management system and method in computer system
CN105955864A (en) Power supply fault processing method, power supply module, monitoring management module and server
CN114915541B (en) System fault elimination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant