CN103995759B - High-availability computer system failure handling method and device based on core internal-external synergy

Info

Publication number: CN103995759B
Application number: CN201410215175.4A
Authority: CN
Inventors: 廖湘科; 颜跃进; 李俊良; 刘晓建; 杨沙洲; 姚望; 汪黎; 秦莹; 周强; 王非
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2014-05-21
Filing date: 2014-05-21
Publication date: 2015-04-29
Anticipated expiration: 2034-05-21
Also published as: CN103995759A

Abstract

The invention discloses a high-availability computer system failure handling method and device based on core internal-external synergy, that is, on cooperation between components inside and outside the operating system kernel. The method comprises the steps of: (1) detecting service failures and hardware failures separately and outputting them through a failure reporting interface; (2) monitoring the failure reporting interface, analyzing each failure report received, handling the hardware failure or service failure according to the analysis result, recording a log, notifying the administrator, and then judging whether dual-computer hot standby is needed, wherein the designated dual-computer hot standby software is notified to perform dual-computer hot standby if it is needed. The device comprises a unified failure reporting subsystem and a unified failure handling subsystem, which correspond one-to-one to the steps of the method. The method and device achieve unified management of software and hardware failures and detect software and hardware failures efficiently and promptly; the handling process is simple, the failure handling rules are easy to extend, and the high availability of a computer system suffering software or hardware failures can be ensured.

Description

High-availability computer system fault handling method and device based on collaboration inside and outside the kernel

Technical field

The present invention relates to the field of high-availability management of computer systems, and in particular to a high-availability computer system fault handling method and device based on collaboration inside and outside the kernel.

Background technology

The availability of a computer system is an index of how stable and reliable the system is, and is usually measured by the mean time between failures (MTBF). The longer the MTBF, the higher the availability of the computer system. The factors that affect availability include both software and hardware. A software fault usually means that a program or piece of software in the computer system cannot work normally, or its normal use is impaired, because it has been disrupted by some factor; the scope of a software fault is generally the software itself and the other software or programs that depend on it. A hardware fault usually means that physical hardware of the computer system cannot work normally, or its normal use is impaired, because it has been disrupted by some factor; hardware faults have a larger impact on the computer system and, in severe cases, can cause the system to crash.

In prior-art computer systems, hardware fault detection depends on hardware drivers, while software faults are usually detected by periodically polling service state. Once a fault has been detected, it is handled immediately according to the default policy of the driver or program, and the corresponding process log is recorded. However, prior-art computer systems have the following problems in high-availability management: 1. software faults and hardware faults are handled and reported independently, and unified management of software and hardware faults is lacking; 2. traditional hot-standby techniques monitor software faults inefficiently and cannot perceive hardware faults in time; 3. the handling flow for software and hardware faults is complicated, and users cannot define handling rules.

Summary of the invention

The technical problem to be solved by the present invention is as follows: in view of the above problems of the prior art, to provide a high-availability computer system fault handling method and device based on collaboration inside and outside the kernel that achieve unified management of software and hardware faults, detect software faults and hardware faults efficiently and promptly, have a simple handling flow and easily extended fault handling rules, and can ensure the high availability of a computer system in the presence of software or hardware faults.

To solve the above technical problem, the present invention provides the following technical solution:

A high-availability computer system fault handling method based on collaboration inside and outside the kernel, implemented by the following steps:

1) establishing a fault reporting interface outside the operating system kernel; detecting, outside the operating system kernel, service faults including system service faults and application service faults, generating fault reports and outputting them through said fault reporting interface; and at the same time detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through said fault reporting interface;

2) monitoring the fault reporting interface for fault reports outside the operating system kernel; after a fault report is received, analyzing the fault report and, according to the analysis result, either handling the fault inside the operating system kernel on the hardware corresponding to the hardware fault, or handling the fault outside the operating system kernel on the service corresponding to the service fault; recording a log of the fault handling and sending a notification to the administrator; then judging according to preset rules whether dual-computer hot standby is required and, if so, notifying the designated dual-computer hot standby software to perform dual-computer hot standby.

Preferably, in said step 1), detecting outside the operating system kernel the service faults including system service faults and application service faults, generating fault reports and outputting them through said fault reporting interface specifically comprises:

1.1.1) polling, outside the operating system kernel, the state of the system services and application services in the operating system; if any system service or application service is in an abnormal state, determining that a service fault has occurred;

1.1.2) after determining that a service fault has occurred, generating a fault report from the information about the abnormal state of the system service or application service, and outputting said fault report through said fault reporting interface.
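Steps 1.1.1) and 1.1.2) can be illustrated with a minimal user-space sketch. It is not the patented implementation: the names `FaultReport` and `poll_services`, the report fields, and the idea of passing one health-check callable per service are all assumptions made for illustration, and the `report` callable merely stands in for the fault reporting interface.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FaultReport:
    """Hypothetical report layout; the source does not specify the fields."""
    fault_type: str   # "service" or "hardware"
    source: str       # name of the failing service or hardware
    detail: str       # information about the abnormal state
    timestamp: float = field(default_factory=time.time)


def poll_services(checks: Dict[str, Callable[[], bool]],
                  report: Callable[[FaultReport], None]) -> List[str]:
    """Run one polling round (step 1.1.1) and report failures (step 1.1.2).

    `checks` maps a service name to a callable returning True when healthy.
    `report` stands in for the fault reporting interface.
    """
    failed: List[str] = []
    for name, is_healthy in checks.items():
        try:
            ok = is_healthy()
            detail = "status check returned an abnormal state"
        except Exception as exc:  # a crashing check also counts as abnormal
            ok = False
            detail = f"status check raised {exc!r}"
        if not ok:
            failed.append(name)
            report(FaultReport("service", name, detail))
    return failed
```

A caller would invoke `poll_services` periodically, for example from a loop that sleeps between rounds; the polling interval is left open by the source.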

Preferably, in said step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through said fault reporting interface are as follows:

1.2.1) detecting the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, said hardware state monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;

1.2.2) encapsulating the hardware fault data into a fault report and storing it in a preset fault message queue;

1.2.3) scheduling and dispatching the fault reports stored in the fault message queue;

1.2.4) using a thread to output the dispatched fault reports through said fault reporting interface.
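The pipeline of steps 1.2.1) to 1.2.4) (monitoring point, encapsulation, message queue, sending thread) can be sketched as follows. This is a user-space simulation only; the method itself performs these steps inside the operating system kernel. The class name `HardwareFaultReporter`, the dictionary report layout and the `is_abnormal` predicate are assumptions introduced for illustration.

```python
import queue
import threading
from typing import Callable, Dict, Optional


class HardwareFaultReporter:
    """User-space stand-in for the in-kernel reporting path (hypothetical)."""

    def __init__(self, sink: Callable[[Dict], None]) -> None:
        # Preset fault message queue (step 1.2.2); None is a stop marker.
        self._queue: "queue.Queue[Optional[Dict]]" = queue.Queue()
        self._sink = sink  # stands in for the fault reporting interface
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def checkpoint(self, device: str, state: Dict,
                   is_abnormal: Callable[[Dict], bool]) -> bool:
        """Monitoring point (step 1.2.1): enqueue a report if state is abnormal."""
        if not is_abnormal(state):
            return False
        # Encapsulate the collected on-site data as a fault report (step 1.2.2).
        report = {"fault_type": "hardware", "device": device,
                  "context": dict(state)}
        self._queue.put(report)
        return True

    def _drain(self) -> None:
        """Dispatch queued reports in FIFO order (steps 1.2.3 and 1.2.4)."""
        while True:
            report = self._queue.get()
            if report is None:
                break
            self._sink(report)

    def close(self) -> None:
        """Flush pending reports and stop the sending thread."""
        self._queue.put(None)
        self._thread.join()
```

FIFO dispatch is the simplest scheduling choice; the source does not say which scheduling policy the fault message queue uses, so a priority queue would be an equally plausible reading.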

Preferably, the detailed steps of said step 2) are as follows:

2.1) monitoring the fault reporting interface for fault reports outside the operating system kernel by means of a daemon process;

2.2) after a fault report is received outside the operating system kernel, analyzing the fault report and determining its fault type; if the fault type is a service fault, recovering the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judging whether the hardware corresponding to the fault report needs faulty hardware isolation; if faulty hardware isolation is needed, jumping to step 2.3); otherwise judging whether the hardware corresponding to the fault report needs faulty hardware recovery; if faulty hardware recovery is needed, jumping to step 2.4); otherwise jumping to step 2.5);

2.3) when the hardware corresponding to the fault report needs faulty hardware isolation, isolating, inside the operating system kernel, the hardware corresponding to the fault report;

2.4) when the hardware corresponding to the fault report needs faulty hardware recovery, recovering, inside the operating system kernel, the hardware corresponding to the fault report;

2.5) recording a log of the fault handling;

2.6) sending a notification to the administrator;

2.7) judging according to preset rules whether dual-computer hot standby is required; if so, notifying said dual-computer hot standby software to perform dual-computer hot standby by calling the notification plug-in of the designated dual-computer hot standby software.
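The decision flow of steps 2.2) to 2.7) can be sketched as a single dispatch function. The function name `handle_fault`, the `policy` and `actions` dictionaries and their keys are hypothetical. The sketch also assumes that isolation and recovery are alternatives for a given report, which is one reading of step 2.2); the text does not state whether recovery may follow isolation.

```python
from typing import Callable, Dict, List

Report = Dict[str, str]


def handle_fault(report: Report,
                 policy: Dict[str, Callable[[Report], bool]],
                 actions: Dict[str, Callable[[Report], None]]) -> List[str]:
    """Apply steps 2.2) to 2.7) to one fault report.

    Returns the names of the actions that were executed, in order.
    """
    steps: List[str] = []

    def run(name: str) -> None:
        actions[name](report)
        steps.append(name)

    # Step 2.2: branch on the fault type.
    if report["fault_type"] == "service":
        run("recover_service")          # uses the service dependency description
    elif report["fault_type"] == "hardware":
        if policy["needs_isolation"](report):
            run("isolate_hardware")     # step 2.3
        elif policy["needs_recovery"](report):
            run("recover_hardware")     # step 2.4

    run("write_log")                    # step 2.5
    run("notify_admin")                 # step 2.6

    # Step 2.7: preset rule decides whether to notify the hot standby software.
    if policy["needs_failover"](report):
        run("notify_hot_standby")
    return steps
```

Keeping the rules in `policy` separate from the `actions` mirrors the stated goal that fault handling rules be easy to extend: a new rule changes a predicate, not the flow.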

The present invention also provides a high-availability computer system fault handling device based on collaboration inside and outside the kernel, comprising:

a unified fault reporting subsystem, configured to establish a fault reporting interface outside the operating system kernel, to detect outside the operating system kernel service faults including system service faults and application service faults, generate fault reports and output them through said fault reporting interface, and at the same time to detect hardware faults inside the operating system kernel, generate fault reports and output them through said fault reporting interface;

a unified fault handling subsystem, configured to monitor the fault reporting interface for fault reports outside the operating system kernel; after a fault report is received, to analyze the fault report and, according to the analysis result, either handle the fault inside the operating system kernel on the hardware corresponding to the hardware fault, or handle the fault outside the operating system kernel on the service corresponding to the service fault; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-computer hot standby is required and, if so, to notify the designated dual-computer hot standby software to perform dual-computer hot standby.

Preferably, said unified fault reporting subsystem comprises a service detection module configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports and output them through said fault reporting interface, said service detection module comprising:

a service state polling detection submodule, configured to poll, outside the operating system kernel, the state of the system services and application services in the operating system, and to determine that a service fault has occurred if any system service or application service is in an abnormal state;

a service fault reporting submodule, configured to, after it is determined that a service fault has occurred, generate a fault report from the information about the abnormal state of the system service or application service, and output said fault report through said fault reporting interface.

Preferably, said unified fault reporting subsystem comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports and output them through said fault reporting interface, said hardware detection module comprising:

a hardware state monitoring submodule, configured to detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines and the hardware drivers, wherein, if the hardware state detected by any hardware state monitoring point is abnormal, said hardware state monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;

a hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;

a fault message queue scheduling submodule, configured to schedule and dispatch the fault reports stored in the fault message queue;

a hardware fault data reporting submodule, configured to use a thread to output the dispatched fault reports through said fault reporting interface.

Preferably, said unified fault handling subsystem comprises:

a fault management daemon module, configured to monitor the fault reporting interface for fault reports outside the operating system kernel by means of a daemon process;

a fault handling engine, configured to, after a fault report is received outside the operating system kernel, analyze the fault report and determine its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether the hardware corresponding to the fault report needs faulty hardware isolation and, if so, to call the fault isolation module; otherwise to judge whether the hardware corresponding to the fault report needs faulty hardware recovery and, if so, to call the fault recovery module;

a fault isolation module, configured to, when the hardware corresponding to the fault report needs faulty hardware isolation, isolate, inside the operating system kernel, the hardware corresponding to the fault report;

a fault recovery module, configured to, when the hardware corresponding to the fault report needs faulty hardware recovery, recover, inside the operating system kernel, the hardware corresponding to the fault report;

a logging module, configured to record a log of the fault handling;

a fault notification module, configured to send a notification to the administrator;

a dual-computer hot standby processing module, configured to judge according to preset rules whether dual-computer hot standby is required and, if so, to notify said dual-computer hot standby software to perform dual-computer hot standby by calling the notification plug-in of the designated dual-computer hot standby software.

The high-availability computer system fault handling method based on collaboration inside and outside the kernel according to the present invention has the following technical effects. Service faults, including system service faults and application service faults, are detected outside the operating system kernel, and the resulting fault reports are output through the fault reporting interface; at the same time, hardware faults are detected inside the operating system kernel, and the resulting fault reports are output through the same fault reporting interface, which is established outside the operating system kernel. Combining hardware detection with service detection in this way achieves unified reporting of software faults (service faults) and hardware faults. Combining kernel modules with an out-of-kernel daemon that obtains the unified reports provides the unified handling mechanism and the dual-computer hot standby processing that follow reporting. The invention thereby solves the problems that, in high-availability management of computer systems, faults cannot be reported cooperatively, software and hardware cannot be managed in a unified manner, and traditional hot standby software cannot perceive faults in time. Software faults and hardware faults are detected efficiently and promptly, the handling flow is simple, and the fault handling rules are easy to extend. After the fault data have been read and diagnosed, the fault is handled through fault isolation and fault recovery, and the multi-machine hot standby software is notified to carry out inter-machine service switchover, which ensures the high availability of the computer system under the abnormal conditions of a software fault or hardware fault.

The high-availability computer system fault handling device based on collaboration inside and outside the kernel according to the present invention corresponds completely to the high-availability computer system fault handling method based on collaboration inside and outside the kernel according to the present invention. It therefore has the same technical effects as the method, which are not repeated here.

Brief description of the drawings

Fig. 1 is a schematic diagram of the basic flow of the method of an embodiment of the present invention.

Fig. 2 is a schematic flow diagram of service fault detection in the method of the embodiment of the present invention.

Fig. 3 is a schematic flow diagram of hardware fault detection in the method of the embodiment of the present invention.

Fig. 4 is a schematic diagram of the unified handling flow for service faults and hardware faults in the method of the embodiment of the present invention.

Fig. 5 is a schematic diagram of the framework structure of the device of the embodiment of the present invention.

Fig. 6 is a schematic diagram of the framework structure of the unified fault reporting subsystem in the device of the embodiment of the present invention.

Fig. 7 is a schematic diagram of the framework structure of the unified fault handling subsystem in the device of the embodiment of the present invention.

Detailed description of the invention

As shown in Fig. 1, the high-availability computer system fault handling method based on collaboration inside and outside the kernel according to the present invention is implemented by the following steps:

1) establishing a fault reporting interface outside the operating system kernel; detecting, outside the operating system kernel, service faults including system service faults and application service faults, generating fault reports and outputting them through the fault reporting interface; and at the same time detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface;

As shown in Fig. 2, in step 1) of the present embodiment, detecting outside the operating system kernel the service faults including system service faults and application service faults, generating fault reports and outputting them through the fault reporting interface specifically comprises:

1.1.2) after determining that a service fault has occurred, generating a fault report from the information about the abnormal state of the system service or application service, and outputting the fault report through the fault reporting interface.

As shown in Fig. 3, in step 1) of the present embodiment, the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface are as follows:

1.2.1) detecting the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, the hardware state monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;

1.2.4) using a thread to output the dispatched fault reports through the fault reporting interface.

As shown in Fig. 4, the detailed steps of step 2) of the present embodiment are as follows:

2.5) recording a log of the fault handling;

2.6) sending a notification to the administrator;

2.7) judging according to preset rules whether dual-computer hot standby is required; if so, notifying the dual-computer hot standby software to perform dual-computer hot standby by calling the notification plug-in of the designated dual-computer hot standby software.

As shown in Fig. 5, the high-availability computer system fault handling device based on collaboration inside and outside the kernel according to the present embodiment comprises:

a unified fault reporting subsystem, configured to establish a fault reporting interface outside the operating system kernel, to detect outside the operating system kernel service faults including system service faults and application service faults, generate fault reports and output them through the fault reporting interface, and at the same time to detect hardware faults inside the operating system kernel, generate fault reports and output them through the fault reporting interface;

As shown in Fig. 6, the unified fault reporting subsystem of the present embodiment comprises a service detection module configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports and output them through the fault reporting interface, the service detection module comprising:

a service fault reporting submodule, configured to, after it is determined that a service fault has occurred, generate a fault report from the information about the abnormal state of the system service or application service, and output the fault report through the fault reporting interface.

As shown in Fig. 6, the unified fault reporting subsystem of the present embodiment comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports and output them through the fault reporting interface, the hardware detection module comprising:

a hardware state monitoring submodule, configured to detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines and the hardware drivers, wherein, if the hardware state detected by any hardware state monitoring point is abnormal, the hardware state monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;

a hardware fault data reporting submodule, configured to use a thread to output the dispatched fault reports through the fault reporting interface.

In the present embodiment, the hardware state monitoring submodule detects the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handling routines and the hardware drivers. This strengthens early warning and rapid discovery of hardware faults, and improves the timeliness and efficiency of hardware fault discovery.

As shown in Fig. 7, the unified fault handling subsystem of the present embodiment comprises:

a logging module, configured to record a log of the fault handling;

a fault notification module, configured to send a notification to the administrator;

a dual-computer hot standby processing module, configured to judge according to preset rules whether dual-computer hot standby is required and, if so, to notify the dual-computer hot standby software to perform dual-computer hot standby by calling the notification plug-in of the designated dual-computer hot standby software.

By combining kernel modules with an out-of-kernel daemon, the present embodiment provides the unified fault handling subsystem. In the unified fault handling subsystem, the fault handling engine uses the service dependency description and state detection for specific services to perceive the running state of the kernel and of key services in real time and to maintain the consistency of system service state, which improves the stability of the system.

By combining hardware detection with service detection, the device of the present embodiment provides a unified fault reporting framework as the concrete implementation of the unified fault reporting subsystem. The unified fault reporting framework inserts hardware state checkpoints into the processor and memory management code and into the bus and device driver code, which improves early warning and rapid discovery of hardware faults. The steps are as follows: 1. a hardware state checkpoint detects an abnormality; 2. the hardware state data are collected and encapsulated; 3. the data are put into the message queue; 4. the fault sending thread is called to send the data.

By combining kernel modules with an out-of-kernel daemon, the device of the present embodiment provides a unified fault handling framework as the concrete implementation of the unified fault handling subsystem. The unified fault handling framework uses the service dependency description and state detection for specific services to perceive the running state of the kernel and of key services in real time and to maintain the consistency of system service state. The steps are as follows: 1. the kernel or the service detection module reports the abnormality to the fault management daemon; 2. the fault management daemon handles the fault; 3. the fault isolation module is notified to isolate the faulty hardware; 4. the fault recovery module is notified to complete fault recovery; 5. a fault notification is sent by the fault notification module; 6. a log is recorded by the logging module.

Based on the unified fault reporting framework and the unified fault handling framework, the device of the present embodiment improves on traditional multi-machine hot standby technology: software and hardware fault information is obtained in real time through the unified fault reporting framework, the fault is handled by the unified fault handling framework, and the result is reported to the multi-machine hot standby software through an efficient event communication and callback mechanism, whereupon the latter carries out inter-machine service switchover or migration. The steps are as follows: 1. the multi-machine hot standby software registers a migration signal and a migration signal trigger rule with the fault management daemon; 2. the fault handling module handles the fault and, when the trigger rule is met, sends the migration signal; 3. the multi-machine hot standby software receives the migration signal and carries out inter-machine service migration.

The device of the present embodiment implements and integrates unified monitoring of hardware faults and software faults, so that the administrator can obtain software and hardware fault information in real time. The fault diagnosis, fault isolation and fault recovery modules are used to handle the fault, and the multi-machine hot standby software is notified to carry out inter-machine service switchover, which ensures the high availability of the computer system under the abnormal conditions of a software fault or hardware fault.
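The three migration-signal steps described above (registration, rule-triggered sending, migration on receipt) amount to a callback registry. The sketch below is an illustration under assumptions: the class name `FaultManagementDaemon`, the method names and the dictionary describing a handling result are hypothetical, and a real hot standby product would be reached through its own notification plug-in rather than a Python callable.

```python
from typing import Callable, Dict, List, Tuple

Result = Dict[str, str]
TriggerRule = Callable[[Result], bool]
MigrationCallback = Callable[[Result], None]


class FaultManagementDaemon:
    """Hypothetical registry of migration signals and trigger rules."""

    def __init__(self) -> None:
        self._subscribers: List[Tuple[TriggerRule, MigrationCallback]] = []

    def register_migration(self, trigger_rule: TriggerRule,
                           on_migrate: MigrationCallback) -> None:
        """Step 1: the hot standby software registers a signal and its rule."""
        self._subscribers.append((trigger_rule, on_migrate))

    def fault_handled(self, result: Result) -> int:
        """Step 2: after handling a fault, send the signal where the rule holds.

        Returns the number of migration signals sent.
        """
        sent = 0
        for trigger_rule, on_migrate in self._subscribers:
            if trigger_rule(result):
                on_migrate(result)  # step 3 happens inside the callback
                sent += 1
        return sent
```

For example, hot standby software might register `lambda r: r["outcome"] == "isolated"` so that a migration is triggered only when faulty hardware had to be isolated; that rule is an invented example, since the source leaves the trigger rules to configuration.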

The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions that fall within the concept of the present invention belong to its scope of protection. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A high-availability computer system fault handling method based on collaboration inside and outside the kernel, characterized in that it is implemented by the following steps:

2. The high-availability computer system fault handling method based on collaboration inside and outside the kernel according to claim 1, characterized in that, in said step 1), detecting outside the operating system kernel the service faults including system service faults and application service faults, generating fault reports and outputting them through said fault reporting interface specifically comprises:

3. The high-availability computer system fault handling method based on collaboration inside and outside the kernel according to claim 2, characterized in that, in said step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through said fault reporting interface are as follows:

4. The high-availability computer system fault handling method based on collaboration inside and outside the kernel according to claim 3, characterized in that the detailed steps of said step 2) are as follows:

2.5) recording a log of the fault handling;

2.6) sending a notification to the administrator;

5. A high-availability computer system fault handling device based on collaboration inside and outside the kernel, characterized in that it comprises:

6. The high-availability computer system fault handling device based on collaboration inside and outside the kernel according to claim 5, characterized in that said unified fault reporting subsystem comprises a service detection module configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports and output them through said fault reporting interface, said service detection module comprising:

7. The high-availability computer system fault handling device based on collaboration inside and outside the kernel according to claim 6, characterized in that said unified fault reporting subsystem comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports and output them through said fault reporting interface, said hardware detection module comprising:

8. The high-availability computer system fault handling device based on collaboration inside and outside the kernel according to claim 7, characterized in that said unified fault handling subsystem comprises:

a logging module, configured to record a log of the fault handling;

a fault notification module, configured to send a notification to the administrator;