CN103995759A - High-availability computer system failure handling method and device based on core internal-external synergy

Info

Publication number: CN103995759A
Application number: CN201410215175.4A
Authority: CN
Inventors: 廖湘科; 颜跃进; 李俊良; 刘晓建; 杨沙洲; 姚望; 汪黎; 秦莹; 周强; 王非
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2014-05-21
Filing date: 2014-05-21
Publication date: 2014-08-20
Anticipated expiration: 2034-05-21
Also published as: CN103995759B

Abstract

The invention discloses a high-availability computer system failure handling method and device based on cooperation inside and outside the kernel. The method comprises the steps of: (1) detecting service failures and hardware failures separately and outputting them through a failure reporting interface; (2) detecting a failure report, analyzing it, handling the hardware failure or service failure according to the analysis result, recording a log, notifying the administrator, and then judging whether dual-computer hot standby is needed, wherein the designated dual-computer hot standby software is notified to perform dual-computer hot standby if it is needed. The device comprises a unified failure reporting subsystem and a unified failure handling subsystem, which correspond exactly to the steps of the method. The method and device achieve unified management of software and hardware failures and detect them efficiently and promptly; the handling process is simple, the failure handling rules are easy to extend, and the high availability of a computer system under software or hardware failures can be ensured.

Description

High-availability computer system fault handling method and device based on cooperation inside and outside the kernel

Technical field

The present invention relates to the field of high-availability management of computer systems, and in particular to a high-availability computer system fault handling method and device based on cooperation inside and outside the kernel.

Background art

The availability of a computer system is an index of how reliable and stable the system is, and it is usually measured by the mean time between failures: the longer the mean time between failures, the higher the availability of the system. The factors that affect availability lie in both software and hardware. A software fault usually means that a program or piece of software in the computer system is damaged by some factor so that it cannot work normally or normal use is affected; its scope of influence is generally the software itself and the other software or programs that depend on it. A hardware fault usually means that physical hardware of the computer system is damaged by some factor so that it cannot work normally or normal use is affected; hardware faults have a greater impact on the computer system and, in serious cases, can cause the system to crash.

In the prior art, a computer system relies on hardware drivers to detect hardware faults and usually uses a periodic polling mechanism to detect service state for software faults. Once a fault has been detected, it is handled immediately according to the default policy of the driver or program, and each component records its own handling log. High-availability management in such systems has the following problems: 1. the computer system handles and reports software faults and hardware faults independently and lacks unified management of the two; 2. traditional backup techniques monitor software faults inefficiently and cannot perceive hardware faults in time; 3. the handling flow for software and hardware faults is complex, and users cannot define handling rules.

Summary of the invention

The technical problem to be solved by the present invention is, in view of the problems in the prior art, to provide a high-availability computer system fault handling method and device based on cooperation inside and outside the kernel that achieves unified management of software and hardware faults, detects software faults and hardware faults efficiently and promptly, has a simple handling flow and easily extended fault handling rules, and ensures the high availability of the computer system under software or hardware faults.

To solve the above technical problem, the technical solution provided by the present invention is:

A high-availability computer system fault handling method based on cooperation inside and outside the kernel, implemented by the following steps:

1) outside the operating system kernel, detecting service faults, including system service faults and application service faults, generating fault reports and outputting them through a fault reporting interface established outside the operating system kernel; at the same time, inside the operating system kernel, detecting hardware faults, generating fault reports and outputting them through the same fault reporting interface;

2) outside the operating system kernel, monitoring the fault reporting interface for fault reports; after a fault report is received, analyzing it and, according to the analysis result, either performing fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or performing fault handling outside the operating system kernel on the service corresponding to a service fault; recording a log of the fault handling and sending a notification to the administrator; then judging according to preset rules whether dual-computer hot standby is needed and, if so, notifying the designated dual-computer hot standby software to perform dual-computer hot standby.

Preferably, in step 1), detecting service faults outside the operating system kernel, including system service faults and application service faults, generating fault reports and outputting them through the fault reporting interface specifically means:

1.1.1) outside the operating system kernel, checking the state of the system services and application services in the operating system by polling; if any system service or application service is in an abnormal state, determining that a service fault has occurred;

1.1.2) after determining that a service fault has occurred, generating a fault report from the information about the abnormal state of the system service or application service, and outputting the fault report through the fault reporting interface.
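
Steps 1.1.1) and 1.1.2) can be illustrated with a minimal user-space sketch. It is only an assumption-laden illustration: the `FaultReport` fields, the `poll_services` function and the idea of representing each service by a health-check callable are hypothetical and are not defined by the patent, which does not specify a report format or a polling API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class FaultReport:
    """Hypothetical report layout; the patent does not define the fields."""
    fault_type: str   # "service" or "hardware"
    source: str       # name of the failed service or device
    detail: str       # information about the abnormal state
    timestamp: float = field(default_factory=time.time)


def poll_services(checks: Dict[str, Callable[[], bool]],
                  report: Callable[[FaultReport], None]) -> int:
    """Run one polling round (step 1.1.1) and report each fault (step 1.1.2).

    `checks` maps a service name to a health check; `report` stands in for
    the fault reporting interface. Returns the number of faults reported.
    """
    faults = 0
    for name, is_healthy in checks.items():
        try:
            healthy = is_healthy()
            detail = "health check returned False"
        except Exception as exc:  # a crashing check counts as an abnormal state
            healthy = False
            detail = f"health check raised {exc!r}"
        if not healthy:
            report(FaultReport("service", name, detail))
            faults += 1
    return faults
```

In a real deployment the round would be repeated on a timer and the `report` callable would write to whatever channel implements the fault reporting interface; both are left open here.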

Preferably, in step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface are as follows:

1.2.1) detecting the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects the on-site data of the corresponding hardware according to preset rules as hardware fault data;

1.2.2) encapsulating the hardware fault data into a fault report and storing it in a preset fault message queue;

1.2.3) scheduling and dispatching the fault reports stored in the fault message queue according to the queue;

1.2.4) using a thread to output the scheduled fault reports through the fault reporting interface.
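
The flow of steps 1.2.1) to 1.2.4) is a producer/consumer pipeline: monitoring points produce reports, a queue buffers them, and a sender thread drains the queue. The sketch below simulates that flow in user space with Python's standard `queue` and `threading` modules. It is not kernel code, and the class name `HardwareFaultPipeline`, the dictionary report layout and the first-in-first-out scheduling are assumptions made for illustration.

```python
import queue
import threading
from typing import Callable, Dict

_STOP = object()  # sentinel that tells the sender thread to exit


class HardwareFaultPipeline:
    """User-space simulation of monitoring point -> queue -> sender thread."""

    def __init__(self, interface: Callable[[dict], None]) -> None:
        self._queue: "queue.Queue[object]" = queue.Queue()
        self._interface = interface  # stands in for the fault reporting interface
        self._thread = threading.Thread(target=self._send_loop, daemon=True)
        self._thread.start()

    def monitoring_point(self, device: str, status: Dict[str, int],
                         is_abnormal: Callable[[Dict[str, int]], bool]) -> bool:
        """Step 1.2.1: check a status snapshot and collect data if abnormal."""
        if not is_abnormal(status):
            return False
        fault_data = {"device": device, "status": dict(status)}
        # Step 1.2.2: encapsulate the data as a report and enqueue it.
        self._queue.put({"fault_type": "hardware", **fault_data})
        return True

    def _send_loop(self) -> None:
        # Steps 1.2.3 and 1.2.4: take reports in FIFO order (an assumed
        # scheduling policy) and output them through the interface.
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            self._interface(item)

    def close(self) -> None:
        """Flush pending reports and stop the sender thread."""
        self._queue.put(_STOP)
        self._thread.join()
```

Decoupling detection from sending through a queue keeps the monitoring point cheap, which matters when it runs in an interrupt handler or driver path.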

Preferably, the detailed steps of step 2) are as follows:

2.1) outside the operating system kernel, monitoring the fault reporting interface for fault reports by means of a daemon process;

2.2) after a fault report is received, analyzing it outside the operating system kernel and determining its fault type; if the fault type is a service fault, recovering the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judging whether faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report; if so, jumping to step 2.3); otherwise judging whether faulty-hardware recovery needs to be performed on the hardware corresponding to the fault report; if so, jumping to step 2.4); otherwise jumping to step 2.5);

2.3) when faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report, isolating that hardware inside the operating system kernel;

2.4) when faulty-hardware recovery needs to be performed on the hardware corresponding to the fault report, recovering that hardware inside the operating system kernel;

2.5) recording a log of the fault handling;

2.6) sending a notification to the administrator;

2.7) judging according to preset rules whether dual-computer hot standby is needed; if so, notifying the designated dual-computer hot standby software to perform dual-computer hot standby by calling its notification plug-in.
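
The branching in steps 2.2) to 2.7) can be sketched as a single dispatch function. This is a hedged illustration only: the function name `handle_report`, the dictionary report layout, the per-device `policy` table and the action strings are all hypothetical, and the sketch treats isolation and recovery as alternatives, following the "otherwise" wording of step 2.2).

```python
from typing import Callable, Dict, List


def handle_report(report: Dict[str, str],
                  policy: Dict[str, str],
                  needs_failover: Callable[[Dict[str, str]], bool],
                  notify_standby: Callable[[Dict[str, str]], None]) -> List[str]:
    """Return the ordered list of actions taken for one fault report.

    `policy` maps a device name to "isolate" or "recover" (assumed format);
    `needs_failover` stands in for the preset rule of step 2.7 and
    `notify_standby` for the hot standby software's notification plug-in.
    """
    actions: List[str] = []
    if report["fault_type"] == "service":
        actions.append(f"recover-service:{report['source']}")        # step 2.2
    elif report["fault_type"] == "hardware":
        decision = policy.get(report["source"], "none")
        if decision == "isolate":
            actions.append(f"isolate-hardware:{report['source']}")   # step 2.3
        elif decision == "recover":
            actions.append(f"recover-hardware:{report['source']}")   # step 2.4
    actions.append("log")                                            # step 2.5
    actions.append("notify-admin")                                   # step 2.6
    if needs_failover(report):                                       # step 2.7
        notify_standby(report)
        actions.append("notify-hot-standby")
    return actions
```

Logging and administrator notification run for every report regardless of the branch taken, which matches steps 2.5) and 2.6) being reached from all paths.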

The present invention also provides a high-availability computer system fault handling device based on cooperation inside and outside the kernel, comprising:

a unified fault reporting subsystem, configured to detect service faults outside the operating system kernel, including system service faults and application service faults, generate fault reports and output them through a fault reporting interface established outside the operating system kernel, and at the same time to detect hardware faults inside the operating system kernel, generate fault reports and output them through the same fault reporting interface;

a unified fault handling subsystem, configured to monitor the fault reporting interface for fault reports outside the operating system kernel; after a fault report is received, to analyze it and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-computer hot standby is needed and, if so, notify the designated dual-computer hot standby software to perform dual-computer hot standby.

Preferably, the unified fault reporting subsystem comprises a service detection module configured to detect service faults outside the operating system kernel, including system service faults and application service faults, generate fault reports and output them through the fault reporting interface, and the service detection module comprises:

a service state polling detection submodule, configured to check the state of the system services and application services in the operating system by polling outside the operating system kernel and, if any system service or application service is in an abnormal state, to determine that a service fault has occurred;

a service fault reporting submodule, configured to generate, after a service fault has been determined, a fault report from the information about the abnormal state of the system service or application service, and to output the fault report through the fault reporting interface.

Preferably, the unified fault reporting subsystem comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports and output them through the fault reporting interface, and the hardware detection module comprises:

a hardware state monitoring submodule, configured to detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers, wherein, if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects the on-site data of the corresponding hardware according to preset rules as hardware fault data;

a hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;

a fault message queue scheduling submodule, configured to schedule and dispatch the fault reports stored in the fault message queue according to the queue;

a hardware fault data reporting submodule, configured to use a thread to output the scheduled fault reports through the fault reporting interface.

Preferably, the unified fault handling subsystem comprises:

a fault management daemon module, configured to monitor the fault reporting interface for fault reports outside the operating system kernel by means of a daemon process;

a fault handling engine, configured to analyze a fault report outside the operating system kernel after it is received and determine its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report and, if so, call the fault isolation module; otherwise to judge whether faulty-hardware recovery needs to be performed on that hardware and, if so, call the fault recovery module;

a fault isolation module, configured to isolate the hardware corresponding to the fault report inside the operating system kernel when faulty-hardware isolation needs to be performed on it;

a fault recovery module, configured to recover the hardware corresponding to the fault report inside the operating system kernel when faulty-hardware recovery needs to be performed on it;

a logging module, configured to record a log of the fault handling;

a fault notification module, configured to send a notification to the administrator;

a dual-computer hot standby processing module, configured to judge according to preset rules whether dual-computer hot standby is needed and, if so, to notify the designated dual-computer hot standby software to perform dual-computer hot standby by calling its notification plug-in.
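
The fault handling engine recovers a failed service "according to the service dependency description", but the text does not say what that description looks like or how it is used. One plausible reading, shown here purely as an assumption, is that the description maps each service to the services it depends on, and that recovery restarts the failed service followed by everything that depends on it, dependencies first. The function name `restart_order` is hypothetical; `graphlib.TopologicalSorter` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def restart_order(depends_on: Dict[str, List[str]], failed: str) -> List[str]:
    """Return the failed service and its transitive dependents, dependencies first."""
    # Collect every service that directly or indirectly depends on `failed`.
    affected: Set[str] = {failed}
    changed = True
    while changed:
        changed = False
        for svc, deps in depends_on.items():
            if svc not in affected and affected.intersection(deps):
                affected.add(svc)
                changed = True
    # Restrict the graph to the affected services and sort it topologically;
    # TopologicalSorter expects a mapping from node to its predecessors.
    sub = {s: [d for d in depends_on.get(s, []) if d in affected] for s in affected}
    return list(TopologicalSorter(sub).static_order())
```

For example, with `{"db": [], "api": ["db"], "web": ["api"]}` and a failure of `db`, the order is `db`, `api`, `web`. A cyclic description makes `static_order()` raise `graphlib.CycleError`, so a real engine would need to validate the description when it is loaded.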

The high-availability computer system fault handling method of the present invention has the following technical effects. Service faults, including system service faults and application service faults, are detected outside the operating system kernel and reported through the fault reporting interface, while hardware faults are detected inside the operating system kernel and reported through the same interface, which is established outside the kernel. Combining hardware detection with service detection in this way gives unified reporting of software faults (service faults) and hardware faults, and combining a kernel module with a daemon outside the kernel that receives the unified reports gives a unified handling mechanism and dual-computer hot standby processing after a fault is reported. This solves the problems that, in high-availability management, fault reports cannot be coordinated, software and hardware cannot be managed in a unified way, and traditional hot standby software cannot perceive faults in time. Detection of software faults and hardware faults is efficient and timely, the handling flow is simple, and the fault handling rules are easy to extend. On the basis of reading the fault data and diagnosing the fault, faults are handled through fault isolation and fault recovery, and the multi-machine hot standby software is notified to switch services between machines, which ensures the high availability of the computer system under the abnormal conditions of software or hardware faults.

The high-availability computer system fault handling device of the present invention corresponds exactly to the fault handling method of the present invention and therefore has the same technical effects, which are not repeated here.

Brief description of the drawings

Fig. 1 is a schematic diagram of the basic flow of the method of the embodiment of the present invention.

Fig. 2 is a schematic flow diagram of service fault detection in the method of the embodiment of the present invention.

Fig. 3 is a schematic flow diagram of hardware fault detection in the method of the embodiment of the present invention.

Fig. 4 is a schematic diagram of the unified handling process for service faults and hardware faults in the method of the embodiment of the present invention.

Fig. 5 is a schematic structural diagram of the device of the embodiment of the present invention.

Fig. 6 is a schematic structural diagram of the unified fault reporting subsystem in the device of the embodiment of the present invention.

Fig. 7 is a schematic structural diagram of the unified fault handling subsystem in the device of the embodiment of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, the implementation steps of the high-availability computer system fault handling method based on cooperation inside and outside the kernel are as follows:

1) outside the operating system kernel, detecting service faults, including system service faults and application service faults, generating fault reports and outputting them through a fault reporting interface established outside the operating system kernel; at the same time, inside the operating system kernel, detecting hardware faults, generating fault reports and outputting them through the same fault reporting interface;

As shown in Fig. 2, in step 1) of this embodiment, detecting service faults outside the operating system kernel, including system service faults and application service faults, generating fault reports and outputting them through the fault reporting interface specifically means:

1.1.2) after determining that a service fault has occurred, generating a fault report from the information about the abnormal state of the system service or application service, and outputting the fault report through the fault reporting interface.

As shown in Fig. 3, in step 1) of this embodiment, the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface are as follows:

1.2.1) detecting the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects the on-site data of the corresponding hardware according to preset rules as hardware fault data;

1.2.4) using a thread to output the scheduled fault reports through the fault reporting interface.

As shown in Fig. 4, the detailed steps of step 2) of this embodiment are as follows:

2.5) recording a log of the fault handling;

2.6) sending a notification to the administrator;

2.7) judging according to preset rules whether dual-computer hot standby is needed; if so, notifying the designated dual-computer hot standby software to perform dual-computer hot standby by calling its notification plug-in.

As shown in Fig. 5, the high-availability computer system fault handling device of this embodiment, based on cooperation inside and outside the kernel, comprises:

a unified fault reporting subsystem, configured to detect service faults outside the operating system kernel, including system service faults and application service faults, generate fault reports and output them through a fault reporting interface established outside the operating system kernel, and at the same time to detect hardware faults inside the operating system kernel, generate fault reports and output them through the same fault reporting interface;

As shown in Fig. 6, the unified fault reporting subsystem of this embodiment comprises a service detection module configured to detect service faults outside the operating system kernel, including system service faults and application service faults, generate fault reports and output them through the fault reporting interface, and the service detection module comprises:

a service fault reporting submodule, configured to generate, after a service fault has been determined, a fault report from the information about the abnormal state of the system service or application service, and to output the fault report through the fault reporting interface.

As shown in Fig. 6, the unified fault reporting subsystem of this embodiment comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports and output them through the fault reporting interface, and the hardware detection module comprises:

a hardware state monitoring submodule, configured to detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers, wherein, if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects the on-site data of the corresponding hardware according to preset rules as hardware fault data;

a hardware fault data reporting submodule, configured to use a thread to output the scheduled fault reports through the fault reporting interface.

In this embodiment, the hardware state monitoring submodule detects the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers. This improves early warning and rapid discovery of hardware faults and makes hardware fault discovery more timely and efficient.

As shown in Fig. 7, the unified fault handling subsystem of this embodiment comprises:

a logging module, configured to record a log of the fault handling;

a fault notification module, configured to send a notification to the administrator;

a dual-computer hot standby processing module, configured to judge according to preset rules whether dual-computer hot standby is needed and, if so, to notify the designated dual-computer hot standby software to perform dual-computer hot standby by calling its notification plug-in.

This embodiment designs the unified fault handling subsystem by combining a kernel module with a daemon process outside the kernel. In the unified fault handling subsystem, the fault handling engine uses the service dependency description and state detection for specific services to perceive the running state of the kernel and of key services in real time and to keep the system service state consistent, which improves the stability of the system.

By combining hardware detection with service detection, the device of this embodiment designs a unified fault reporting framework as the specific implementation of the unified fault reporting subsystem. The unified fault reporting framework inserts hardware state checkpoints into the processor and memory management code and into the bus and device driver code, which improves early warning and rapid discovery of hardware faults. The steps are as follows: 1. a hardware state checkpoint detects an abnormality; 2. the hardware state data are collected and encapsulated; 3. the data are put into the message queue; 4. the fault sending thread is called to send the data.

By combining a kernel module with a daemon process outside the kernel, the device of this embodiment designs a unified fault handling framework as the specific implementation of the unified fault handling subsystem. The unified fault handling framework uses the service dependency description and state detection for specific services to perceive the running state of the kernel and of key services in real time and to keep the system service state consistent. The steps are as follows: 1. the kernel or the service detection module reports an abnormality to the fault management daemon; 2. the fault management daemon performs fault handling; 3. the fault isolation module is notified to isolate the faulty hardware; 4. the fault recovery module is notified to complete fault recovery; 5. a fault notification is sent through the fault notification module; 6. a log is recorded through the logging module.

Based on the unified fault reporting framework and the unified fault handling framework, the device of this embodiment improves on traditional multi-machine hot standby technology. Specifically, real-time software and hardware fault information is obtained through the unified fault reporting framework, handling is completed through the unified fault handling framework, and the result is passed to the multi-machine hot standby software through an efficient event communication and callback mechanism; the hot standby software then switches or migrates services between machines. The steps are as follows: 1. the multi-machine hot standby software registers a migration signal and a trigger rule for the migration signal with the fault management daemon; 2. the fault handling module handles the fault and, when the trigger rule is met, sends the migration signal; 3. the multi-machine hot standby software receives the migration signal and migrates services between machines.

The device of this embodiment achieves unified, integrated monitoring of hardware faults and software faults and ensures that the administrator can obtain software and hardware fault information in real time. Faults are handled by applying the fault diagnosis, fault isolation and fault recovery modules, and the multi-machine hot standby software is notified to switch services between machines, which ensures the high availability of the computer system under the abnormal conditions of software or hardware faults.
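
The registration and callback exchange between the multi-machine hot standby software and the fault management daemon (register a migration signal with a trigger rule, send the signal when the rule is met, migrate on receipt) can be sketched as follows. The class `FaultManagementDaemon`, its method names and the dictionary report layout are hypothetical; the embodiment mentions an event communication and callback mechanism but does not define its interface.

```python
from typing import Callable, Dict, List, Tuple

Report = Dict[str, str]
Rule = Callable[[Report], bool]            # trigger rule for a migration signal
Callback = Callable[[str, Report], None]   # invoked in the hot standby software


class FaultManagementDaemon:
    """Keeps registered migration signals and fires them after fault handling."""

    def __init__(self) -> None:
        self._registrations: List[Tuple[str, Rule, Callback]] = []

    def register(self, signal: str, rule: Rule, callback: Callback) -> None:
        """Step 1: hot standby software registers a signal and its trigger rule."""
        self._registrations.append((signal, rule, callback))

    def on_fault_handled(self, report: Report) -> List[str]:
        """Step 2: after handling a fault, send every signal whose rule matches."""
        sent: List[str] = []
        for signal, rule, callback in self._registrations:
            if rule(report):
                callback(signal, report)  # step 3 happens inside the callback
                sent.append(signal)
        return sent
```

Because the rule is supplied at registration time, the hot standby software decides which handled faults justify a migration, and the daemon stays independent of any particular hot standby product.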

The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions that fall under the concept of the present invention belong to its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims

1. A high-availability computer system fault handling method based on cooperation inside and outside the kernel, characterized in that the implementation steps are as follows:

2. The high-availability computer system fault handling method based on cooperation inside and outside the kernel according to claim 1, characterized in that, in step 1), detecting service faults outside the operating system kernel, including system service faults and application service faults, generating fault reports and outputting them through the fault reporting interface specifically means:

3. The high-availability computer system fault handling method based on cooperation inside and outside the kernel according to claim 2, characterized in that, in step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface are as follows:

4. The high-availability computer system fault handling method based on cooperation inside and outside the kernel according to claim 3, characterized in that the detailed steps of step 2) are as follows:

2.5) recording a log of the fault handling;

2.6) sending a notification to the administrator;

5. A high-availability computer system fault handling device based on cooperation inside and outside the kernel, characterized by comprising:

6. The high-availability computer system fault handling device based on cooperation inside and outside the kernel according to claim 5, characterized in that the unified fault reporting subsystem comprises a service detection module configured to detect service faults outside the operating system kernel, including system service faults and application service faults, generate fault reports and output them through the fault reporting interface, and the service detection module comprises:

7. The high-availability computer system fault handling device based on cooperation inside and outside the kernel according to claim 6, characterized in that the unified fault reporting subsystem comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports and output them through the fault reporting interface, and the hardware detection module comprises:

8. The high-availability computer system fault handling device based on cooperation inside and outside the kernel according to claim 7, characterized in that the unified fault handling subsystem comprises:

a logging module, configured to record a log of the fault handling;

a fault notification module, configured to send a notification to the administrator;