CN103995759A - High-availability computer system failure handling method and device based on core internal-external synergy - Google Patents


Info

Publication number
CN103995759A
Authority
CN
China
Prior art keywords
fault
hardware
trouble report
service
operating system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410215175.4A
Other languages
Chinese (zh)
Other versions
CN103995759B (en)
Inventor
廖湘科
颜跃进
李俊良
刘晓建
杨沙洲
姚望
汪黎
秦莹
周强
王非
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201410215175.4A priority Critical patent/CN103995759B/en
Publication of CN103995759A publication Critical patent/CN103995759A/en
Application granted granted Critical
Publication of CN103995759B publication Critical patent/CN103995759B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a high-availability computer system failure handling method and device based on core internal-external synergy. The method comprises the steps of: (1) detecting service failures and hardware failures separately and outputting them through a failure reporting interface; (2) monitoring for and then analyzing failure reports, performing failure handling on the hardware failures or the service failures according to the analysis result, recording logs, notifying an administrator, and then judging whether dual-computer hot standby is needed, wherein the designated dual-computer hot standby software is notified to perform dual-computer hot standby if it is needed. The device comprises a unified failure reporting subsystem and a unified failure handling subsystem, which correspond one-to-one to the steps of the method. The method and device achieve unified management of software and hardware failures and detect software and hardware failures efficiently and promptly; the handling process is simple, failure handling rules are easy to extend, and the high availability of a computer system subjected to software or hardware failures can be ensured.

Description

High-availability computer system fault handling method and device based on cooperation inside and outside the kernel
Technical field
The present invention relates to the field of high-availability management for computer systems, and in particular to a fault handling method and device for high-availability computer systems based on cooperation inside and outside the operating system kernel.
Background art
The availability of a computer system is an index of how reliable and stable the system is, and it is usually measured by the mean time between failures: the longer the mean time between failures, the higher the availability of the system. The factors that affect availability are both software and hardware. A software fault generally means that a program or piece of software in the computer system has been disrupted by some factor so that it cannot work normally or its normal use is impaired; its scope of influence is generally the software itself and the other software or programs that depend on it. A hardware fault generally means that physical hardware of the computer system has been disrupted by some factor so that it cannot work normally or its normal use is impaired; hardware faults have a larger impact on the computer system and, in serious cases, can bring the system down.
Prior-art computer systems rely on hardware drivers to detect hardware faults, and usually use periodic polling of service state to detect software faults. Once a fault is detected, it is handled immediately according to the default policy of the driver or program, and each handler records its own log. Prior-art computer systems therefore have the following problems in high-availability management: 1. software and hardware faults are handled and reported independently, with no unified management of the two; 2. traditional backup technology monitors software faults inefficiently and cannot perceive hardware faults in time; 3. the handling flow for software and hardware faults is complicated, and users cannot define handling rules.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems of the prior art, to provide a fault handling method and device for high-availability computer systems based on cooperation inside and outside the kernel that manage software and hardware faults in a unified way, detect software and hardware faults efficiently and promptly, have a simple handling flow, make fault handling rules easy to extend, and ensure the high availability of the computer system under software or hardware faults.
To solve the above technical problem, the technical solution provided by the present invention is:
A fault handling method for high-availability computer systems based on cooperation inside and outside the kernel, implemented in the following steps:
1) outside the operating system kernel, detect service faults, including system service faults and application service faults, generate fault reports and output them through a fault reporting interface established outside the operating system kernel; at the same time, inside the operating system kernel, detect hardware faults, generate fault reports and output them through the same fault reporting interface;
2) outside the operating system kernel, monitor the fault reporting interface for fault reports; after a fault report is received, analyze it and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; record a log of the fault handling and send a notification to the administrator; then judge according to preset rules whether dual-machine hot standby is required and, if so, notify the designated dual-machine hot standby software to perform dual-machine hot standby.
Preferably, in step 1), detecting service faults (including system service faults and application service faults) outside the operating system kernel, generating fault reports and outputting them through the fault reporting interface specifically means:
1.1.1) outside the operating system kernel, check the state of the system services and application services in the operating system by polling; if any system service or application service is in an abnormal state, determine that a service fault has occurred;
1.1.2) after determining that a service fault has occurred, generate a fault report from the information about the abnormal state of the system service or application service, and output the fault report through the fault reporting interface.
Preferably, in step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface are as follows:
1.2.1) detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects on-site information about the corresponding hardware according to preset rules as hardware fault data;
1.2.2) encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
1.2.3) schedule and dispatch the fault reports stored in the fault message queue;
1.2.4) use a thread to output the scheduled fault reports through the fault reporting interface.
Preferably, the detailed steps of step 2) are as follows:
2.1) outside the operating system kernel, monitor the fault reporting interface for fault reports by means of a daemon;
2.2) after a fault report is received, analyze it outside the operating system kernel and determine its fault type; if the fault type is a service fault, recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judge whether the hardware corresponding to the fault report needs to be isolated; if isolation is needed, jump to step 2.3); otherwise judge whether the hardware corresponding to the fault report needs to be recovered; if recovery is needed, jump to step 2.4); otherwise jump to step 2.5);
2.3) when the hardware corresponding to the fault report needs to be isolated, isolate the faulty hardware inside the operating system kernel;
2.4) when the hardware corresponding to the fault report needs to be recovered, recover the faulty hardware inside the operating system kernel;
2.5) record a log of the fault handling;
2.6) send a notification to the administrator;
2.7) judge according to preset rules whether dual-machine hot standby is required; if so, notify the designated dual-machine hot standby software to perform dual-machine hot standby by calling its notification plug-in.
The present invention also provides a fault handling device for high-availability computer systems based on cooperation inside and outside the kernel, comprising:
A unified fault reporting subsystem, configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports and output them through a fault reporting interface established outside the operating system kernel, and at the same time to detect, inside the operating system kernel, hardware faults, generate fault reports and output them through the same fault reporting interface;
A unified fault handling subsystem, configured to monitor, outside the operating system kernel, the fault reporting interface for fault reports; to analyze each fault report after it is received and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-machine hot standby is required and, if so, notify the designated dual-machine hot standby software to perform dual-machine hot standby.
Preferably, the unified fault reporting subsystem comprises a service detection module for detecting, outside the operating system kernel, service faults including system service faults and application service faults, generating fault reports and outputting them through the fault reporting interface, and the service detection module comprises:
A service state polling submodule, configured to check, outside the operating system kernel, the state of the system services and application services of the operating system by polling, and to determine that a service fault has occurred if any system service or application service is in an abnormal state;
A service fault reporting submodule, configured to generate, after a service fault has been determined, a fault report from the information about the abnormal state of the system service or application service, and to output the fault report through the fault reporting interface.
Preferably, the unified fault reporting subsystem comprises a hardware detection module for detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface, and the hardware detection module comprises:
A hardware state monitoring submodule, configured to detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects on-site information about the corresponding hardware according to preset rules as hardware fault data;
A hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
A fault message queue scheduling submodule, configured to schedule and dispatch the fault reports stored in the fault message queue;
A hardware fault data reporting submodule, configured to use a thread to output the scheduled fault reports through the fault reporting interface.
Preferably, the unified fault handling subsystem comprises:
A fault management daemon module, configured to monitor, outside the operating system kernel, the fault reporting interface for fault reports by means of a daemon;
A fault handling engine, configured to analyze each fault report outside the operating system kernel after it is received and determine its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether the hardware corresponding to the fault report needs to be isolated and, if so, call the fault isolation module; otherwise to judge whether the hardware corresponding to the fault report needs to be recovered and, if so, call the fault recovery module;
A fault isolation module, configured to isolate the faulty hardware inside the operating system kernel when the hardware corresponding to the fault report needs to be isolated;
A fault recovery module, configured to recover the faulty hardware inside the operating system kernel when the hardware corresponding to the fault report needs to be recovered;
A logging module, configured to record a log of the fault handling;
A fault notification module, configured to send a notification to the administrator;
A dual-machine hot standby processing module, configured to judge according to preset rules whether dual-machine hot standby is required and, if so, to notify the designated dual-machine hot standby software to perform dual-machine hot standby by calling its notification plug-in.
The fault handling method of the present invention has the following technical effects. Service faults, including system service faults and application service faults, are detected outside the operating system kernel, and hardware faults are detected inside the operating system kernel; both are turned into fault reports and output through the fault reporting interface established outside the kernel. Combining hardware detection with service detection in this way achieves unified reporting of software faults (service faults) and hardware faults. Combining a kernel module with an out-of-kernel daemon that receives the unified reports then provides a unified handling mechanism and dual-machine hot standby processing once a fault has been reported. This solves the problems that, in the high-availability management of existing computer systems, fault reports cannot be coordinated, software and hardware cannot be managed in a unified way, and traditional hot standby software cannot perceive faults in time. Software and hardware faults are detected efficiently and promptly, the handling flow is simple, and fault handling rules are easy to extend. After the fault data have been read and diagnosed, the fault is handled through fault isolation and fault recovery, and the multi-machine hot standby software is notified to switch services between machines, which ensures the high availability of the computer system under the abnormal conditions of a software or hardware fault.
The fault handling device of the present invention corresponds fully to the fault handling method of the present invention and therefore has the same technical effects, which are not repeated here.
Brief description of the drawings
Fig. 1 is a schematic diagram of the basic flow of the method of the embodiment of the present invention.
Fig. 2 is a schematic flow diagram of service fault detection in the method of the embodiment of the present invention.
Fig. 3 is a schematic flow diagram of hardware fault detection in the method of the embodiment of the present invention.
Fig. 4 is a schematic diagram of the unified handling process for service faults and hardware faults in the method of the embodiment of the present invention.
Fig. 5 is a schematic diagram of the framework structure of the device of the embodiment of the present invention.
Fig. 6 is a schematic diagram of the framework structure of the unified fault reporting subsystem in the device of the embodiment of the present invention.
Fig. 7 is a schematic diagram of the framework structure of the unified fault handling subsystem in the device of the embodiment of the present invention.
Detailed description of the embodiments
As shown in Fig. 1, the fault handling method for high-availability computer systems based on cooperation inside and outside the kernel is implemented in the following steps:
1) outside the operating system kernel, detect service faults, including system service faults and application service faults, generate fault reports and output them through a fault reporting interface established outside the operating system kernel; at the same time, inside the operating system kernel, detect hardware faults, generate fault reports and output them through the same fault reporting interface;
2) outside the operating system kernel, monitor the fault reporting interface for fault reports; after a fault report is received, analyze it and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; record a log of the fault handling and send a notification to the administrator; then judge according to preset rules whether dual-machine hot standby is required and, if so, notify the designated dual-machine hot standby software to perform dual-machine hot standby.
As shown in Fig. 2, in step 1) of this embodiment, detecting service faults (including system service faults and application service faults) outside the operating system kernel, generating fault reports and outputting them through the fault reporting interface specifically means:
1.1.1) outside the operating system kernel, check the state of the system services and application services in the operating system by polling; if any system service or application service is in an abnormal state, determine that a service fault has occurred;
1.1.2) after determining that a service fault has occurred, generate a fault report from the information about the abnormal state of the system service or application service, and output the fault report through the fault reporting interface.
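The two polling steps above can be illustrated with a minimal Python sketch. It is not taken from the patent: the `FaultReport` fields, the `poll_services` function and the idea of passing each health check as a callable are assumptions made for illustration, and the `report` callback merely stands in for the fault reporting interface.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FaultReport:
    """Hypothetical fault report layout; the patent does not define one."""
    kind: str      # "service" or "hardware"
    source: str    # name of the abnormal service or device
    detail: str    # information about the abnormal state
    timestamp: float = field(default_factory=time.time)


def poll_services(checks: Dict[str, Callable[[], bool]],
                  report: Callable[[FaultReport], None]) -> List[str]:
    """Run one polling round and report every service found abnormal.

    `checks` maps a service name to a probe that returns True when healthy.
    `report` stands in for the fault reporting interface.
    """
    failed: List[str] = []
    for name, is_healthy in checks.items():
        try:
            ok = is_healthy()
            detail = "health check returned False"
        except Exception as exc:  # a probe that crashes counts as abnormal
            ok = False
            detail = f"health check raised {exc!r}"
        if not ok:
            failed.append(name)
            report(FaultReport("service", name, detail))
    return failed
```

A real detector would call `poll_services` periodically, for example from a timer loop, and the probes would query the actual service manager; both are outside the scope of this sketch.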
As shown in Fig. 3, in step 1) of this embodiment, the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface are as follows:
1.2.1) detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects on-site information about the corresponding hardware according to preset rules as hardware fault data;
1.2.2) encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
1.2.3) schedule and dispatch the fault reports stored in the fault message queue;
1.2.4) use a thread to output the scheduled fault reports through the fault reporting interface.
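The pipeline of steps 1.2.1) to 1.2.4) (monitoring point, encapsulation, message queue, sending thread) can be sketched as follows. This is a user-space Python simulation under stated assumptions, not kernel code: the class name `HardwareFaultReporter`, the threshold-based anomaly rule and the dictionary report format are all hypothetical, and the `interface` callable stands in for the fault reporting interface.

```python
import queue
import threading
from typing import Callable, Dict

_STOP = object()  # sentinel that tells the sender thread to exit


class HardwareFaultReporter:
    """Simulates the monitoring point -> queue -> sender thread pipeline."""

    def __init__(self, interface: Callable[[dict], None]) -> None:
        self._queue: "queue.Queue[object]" = queue.Queue()
        self._interface = interface
        self._thread = threading.Thread(target=self._send_loop, daemon=True)
        self._thread.start()

    def monitoring_point(self, device: str, status: Dict[str, int],
                         limits: Dict[str, int]) -> bool:
        """Check one device; on an anomaly, capture context and queue a report.

        Assumed rule: a reading is abnormal when it exceeds its limit.
        Readings without a configured limit are never treated as abnormal.
        """
        abnormal = {k: v for k, v in status.items() if v > limits.get(k, v)}
        if not abnormal:
            return False
        # Encapsulate the captured context as a fault report (step 1.2.2).
        report = {"kind": "hardware", "device": device, "context": abnormal}
        self._queue.put(report)
        return True

    def _send_loop(self) -> None:
        """Dispatch queued reports in FIFO order (steps 1.2.3 and 1.2.4)."""
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            self._interface(item)

    def close(self) -> None:
        """Flush everything already queued, then stop the sender thread."""
        self._queue.put(_STOP)
        self._thread.join()
```

Decoupling detection from delivery through a queue keeps the monitoring point short, which matters when it runs in an interrupt or driver path; the sketch uses a plain FIFO because the patent does not specify the scheduling policy.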
As shown in Fig. 4, the detailed steps of step 2) in this embodiment are as follows:
2.1) outside the operating system kernel, monitor the fault reporting interface for fault reports by means of a daemon;
2.2) after a fault report is received, analyze it outside the operating system kernel and determine its fault type; if the fault type is a service fault, recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judge whether the hardware corresponding to the fault report needs to be isolated; if isolation is needed, jump to step 2.3); otherwise judge whether the hardware corresponding to the fault report needs to be recovered; if recovery is needed, jump to step 2.4); otherwise jump to step 2.5);
2.3) when the hardware corresponding to the fault report needs to be isolated, isolate the faulty hardware inside the operating system kernel;
2.4) when the hardware corresponding to the fault report needs to be recovered, recover the faulty hardware inside the operating system kernel;
2.5) record a log of the fault handling;
2.6) send a notification to the administrator;
2.7) judge according to preset rules whether dual-machine hot standby is required; if so, notify the designated dual-machine hot standby software to perform dual-machine hot standby by calling its notification plug-in.
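The decision flow of steps 2.2) to 2.7) can be sketched as a single dispatch function. The function name `handle_fault`, the report keys, the `hardware_policy` table and the action names are illustrative assumptions; the sketch also assumes that isolation and recovery are alternatives for a given hardware fault, which is one reading of the "otherwise" wording in step 2.2).

```python
from typing import Callable, Dict, List

Report = Dict[str, str]  # hypothetical report shape: {"kind": ..., "source": ...}


def handle_fault(report: Report,
                 hardware_policy: Dict[str, str],
                 actions: Dict[str, Callable[[str], None]],
                 needs_hot_standby: Callable[[Report], bool]) -> List[str]:
    """Handle one fault report and return the names of the steps executed.

    `hardware_policy` maps a device name to "isolate" or "recover".
    `actions` supplies one callable per step; each receives the fault source.
    """
    taken: List[str] = []

    def run(step: str) -> None:
        actions[step](report["source"])
        taken.append(step)

    if report["kind"] == "service":
        run("recover_service")            # step 2.2, service branch
    else:
        decision = hardware_policy.get(report["source"], "none")
        if decision == "isolate":
            run("isolate_hardware")       # step 2.3
        elif decision == "recover":
            run("recover_hardware")       # step 2.4
    run("log")                            # step 2.5
    run("notify_admin")                   # step 2.6
    if needs_hot_standby(report):         # step 2.7
        run("notify_hot_standby")
    return taken
```

Passing the actions in as callables mirrors the patent's goal of easily extended handling rules: a new policy or notification channel can be supplied without changing the dispatch logic.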
As shown in Fig. 5, the fault handling device for high-availability computer systems based on cooperation inside and outside the kernel in this embodiment comprises:
A unified fault reporting subsystem, configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports and output them through a fault reporting interface established outside the operating system kernel, and at the same time to detect, inside the operating system kernel, hardware faults, generate fault reports and output them through the same fault reporting interface;
A unified fault handling subsystem, configured to monitor, outside the operating system kernel, the fault reporting interface for fault reports; to analyze each fault report after it is received and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-machine hot standby is required and, if so, notify the designated dual-machine hot standby software to perform dual-machine hot standby.
As shown in Fig. 6, the unified fault reporting subsystem of this embodiment comprises a service detection module for detecting, outside the operating system kernel, service faults including system service faults and application service faults, generating fault reports and outputting them through the fault reporting interface, and the service detection module comprises:
A service state polling submodule, configured to check, outside the operating system kernel, the state of the system services and application services of the operating system by polling, and to determine that a service fault has occurred if any system service or application service is in an abnormal state;
A service fault reporting submodule, configured to generate, after a service fault has been determined, a fault report from the information about the abnormal state of the system service or application service, and to output the fault report through the fault reporting interface.
As shown in Fig. 6, the unified fault reporting subsystem of this embodiment comprises a hardware detection module for detecting hardware faults inside the operating system kernel, generating fault reports and outputting them through the fault reporting interface, and the hardware detection module comprises:
A hardware state monitoring submodule, configured to detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that monitoring point collects on-site information about the corresponding hardware according to preset rules as hardware fault data;
A hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
A fault message queue scheduling submodule, configured to schedule and dispatch the fault reports stored in the fault message queue;
A hardware fault data reporting submodule, configured to use a thread to output the scheduled fault reports through the fault reporting interface.
In this embodiment, because the hardware state monitoring submodule detects hardware state information through monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers, hardware faults can be warned of early and discovered quickly, which improves the timeliness and efficiency of hardware fault discovery.
As shown in Fig. 7, the unified fault handling subsystem of this embodiment comprises:
A fault management daemon module, configured to monitor, outside the operating system kernel, the fault reporting interface for fault reports by means of a daemon;
A fault handling engine, configured to analyze each fault report outside the operating system kernel after it is received and determine its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether the hardware corresponding to the fault report needs to be isolated and, if so, call the fault isolation module; otherwise to judge whether the hardware corresponding to the fault report needs to be recovered and, if so, call the fault recovery module;
A fault isolation module, configured to isolate the faulty hardware inside the operating system kernel when the hardware corresponding to the fault report needs to be isolated;
A fault recovery module, configured to recover the faulty hardware inside the operating system kernel when the hardware corresponding to the fault report needs to be recovered;
A logging module, configured to record a log of the fault handling;
A fault notification module, configured to send a notification to the administrator;
A dual-machine hot standby processing module, configured to judge according to preset rules whether dual-machine hot standby is required and, if so, to notify the designated dual-machine hot standby software to perform dual-machine hot standby by calling its notification plug-in.
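The fault handling engine listed above recovers services "according to the service dependency description", but the text does not say how that description is structured or used. One plausible reading, sketched here purely as an assumption, is that the description is a map from each service to the services it depends on, and that recovery restarts the failed service followed by everything that depends on it, dependencies first. The function name `recovery_order` is hypothetical; the ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def recovery_order(depends_on: Dict[str, Set[str]], failed: str) -> List[str]:
    """Return the failed service and its dependents, dependencies first.

    `depends_on` maps each service to the set of services it requires.
    """
    # Collect the failed service plus every service that transitively
    # depends on it; these are the ones that may need restarting.
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for svc, deps in depends_on.items():
            if svc not in affected and deps & affected:
                affected.add(svc)
                changed = True
    # Restrict the graph to affected services and order it topologically,
    # so that each service is restarted only after what it depends on.
    sub = {svc: depends_on.get(svc, set()) & affected for svc in affected}
    return list(TopologicalSorter(sub).static_order())
```

For a chain in which `web` depends on `app` and `app` depends on `db`, a failure of `db` yields the order `db`, `app`, `web`. A circular dependency description makes `static_order` raise `graphlib.CycleError`, which a real engine would need to handle.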
In this embodiment, the unified fault handling subsystem is built by combining a kernel module with a daemon process running outside the kernel. Within this subsystem, the fault handling engine uses a description of service dependencies, together with state detection for specific services, to perceive the running state of the kernel and of key services in real time and to keep the system's service state consistent, which improves system stability.
The high-availability computer system fault handling device of this embodiment, based on cooperation inside and outside the kernel, combines hardware detection with service detection in a unified fault reporting framework, which is the concrete implementation of the unified fault reporting subsystem. The framework inserts hardware state checkpoints into the processor and memory management code and into the bus and device driver code, which improves early warning and rapid discovery of hardware faults. The steps are as follows: 1. a hardware state checkpoint detects an abnormality; 2. the hardware state data is collected and encapsulated; 3. the data is placed in a message queue; 4. the fault sending thread is called to send the data.
The device of this embodiment also combines a kernel module with a daemon process outside the kernel in a unified fault handling framework, which is the concrete implementation of the unified fault handling subsystem. The framework uses a description of service dependencies and state detection for specific services to perceive the running state of the kernel and of key services in real time and to keep the system's service state consistent. The steps are as follows: 1. the kernel or the service detection module reports an abnormality to the fault management daemon; 2. the fault management daemon performs fault handling; 3. the fault isolation module is notified to isolate the faulty hardware; 4. the fault recovery module is notified to complete fault recovery; 5. a fault notification is sent through the fault notification module; 6. a log is recorded through the logging module.
Building on the unified fault reporting framework and the unified fault handling framework, the device of this embodiment improves on traditional multi-machine hot standby technology. Real-time hardware and software fault information is obtained through the unified fault reporting framework, handling is completed through the unified fault handling framework, and the result is passed to the multi-machine hot standby software through an efficient event communication and callback mechanism; the hot standby software then performs the switchover or migration of services between machines. The steps are as follows: 1. the multi-machine hot standby software registers a migration signal and the trigger rule for that signal with the fault management daemon; 2. the fault processing module handles the fault and, when the trigger rule is met, sends the migration signal; 3. the multi-machine hot standby software receives the migration signal and carries out the migration of services between machines.
The device of this embodiment thus provides unified, integrated monitoring of hardware faults and software faults, so that the administrator can obtain hardware and software fault information in real time. It applies the fault diagnosis, fault isolation and fault recovery modules to handle faults and notifies the multi-machine hot standby software to switch services between machines, which guarantees the high availability of the computer system when a software fault or hardware fault occurs.
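The four hardware fault reporting steps described above (a checkpoint detects an abnormality, the state data is encapsulated, the data is queued, and a sending thread forwards it) can be illustrated with a minimal user-space sketch. It is not the patented kernel implementation: the names `FaultReport`, `FaultReporter`, `checkpoint` and the `send` callback are hypothetical, and Python's `queue.Queue` and `threading.Thread` stand in for whatever in-kernel queue and thread the real system uses.

```python
import queue
import threading
import time
from dataclasses import dataclass, field


@dataclass
class FaultReport:
    """Encapsulated hardware state data (field names are hypothetical)."""
    source: str   # checkpoint that raised the report, e.g. "memory"
    detail: dict  # hardware state data collected at the checkpoint
    timestamp: float = field(default_factory=time.time)


class FaultReporter:
    """Queue plus sender thread, standing in for the reporting framework."""

    def __init__(self, send):
        self._queue = queue.Queue()   # step 3: fault message queue
        self._send = send             # assumed delivery callback (the "interface")
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def checkpoint(self, source, state, is_abnormal):
        """Steps 1-3: detect an abnormality, encapsulate it, enqueue it."""
        if not is_abnormal(state):
            return False
        self._queue.put(FaultReport(source=source, detail=dict(state)))
        return True

    def _drain(self):
        """Step 4: the sending thread forwards queued reports in order."""
        while True:
            report = self._queue.get()
            if report is None:        # sentinel pushed by close()
                break
            self._send(report)

    def close(self):
        """Flush everything queued so far and stop the sending thread."""
        self._queue.put(None)
        self._thread.join()
```

Decoupling the checkpoint from delivery through a queue keeps the checkpoint itself cheap, which matters when it sits on a hot path such as memory management or a driver interrupt handler.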
The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions that fall within the concept of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (8)

1. A high-availability computer system fault handling method based on cooperation inside and outside the kernel, characterized in that the implementation steps are as follows:
1) outside the operating system kernel, detecting service faults, including system service faults and application service faults, generating fault reports, and outputting them through a fault reporting interface established outside the operating system kernel; at the same time, inside the operating system kernel, detecting hardware faults, generating fault reports, and outputting them through said fault reporting interface;
2) outside the operating system kernel, monitoring the fault reporting interface for fault reports; after a fault report is received, analyzing the fault report; according to the analysis result, performing fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or performing fault handling outside the operating system kernel on the service corresponding to a service fault; recording a log of the fault handling and sending a notification to the administrator; then judging according to preset rules whether dual-machine hot standby is required and, if it is required, notifying the designated dual-machine hot standby software to perform dual-machine hot standby.
2. The high-availability computer system fault handling method based on cooperation inside and outside the kernel according to claim 1, characterized in that, in said step 1), detecting service faults, including system service faults and application service faults, outside the operating system kernel, generating fault reports, and outputting them through said fault reporting interface specifically means:
1.1.1) outside the operating system kernel, performing state detection on the system services and application services of the operating system by polling; if any system service or application service is in an abnormal state, judging that a service fault has occurred;
1.1.2) after judging that a service fault has occurred, generating a fault report from the information on the abnormal state of the system service or application service, and outputting said fault report through said fault reporting interface.
3. The high-availability computer system fault handling method based on cooperation inside and outside the kernel according to claim 2, characterized in that, in said step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports, and outputting them through said fault reporting interface are as follows:
1.2.1) detecting the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, said hardware state monitoring point collects on-site information of the corresponding hardware according to preset rules as hardware fault data;
1.2.2) encapsulating the hardware fault data to generate a fault report and storing it in a preset fault message queue;
1.2.3) scheduling and dispatching the fault reports stored in the fault message queue according to the fault message queue;
1.2.4) using a thread to output the scheduled fault reports through said fault reporting interface.
4. The high-availability computer system fault handling method based on cooperation inside and outside the kernel according to claim 3, characterized in that the detailed steps of said step 2) are as follows:
2.1) outside the operating system kernel, monitoring the fault reporting interface for fault reports by means of a daemon process;
2.2) after a fault report is received, analyzing the fault report outside the operating system kernel and judging the fault type of the fault report; if the fault type is a service fault, recovering the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judging whether faulty hardware isolation needs to be performed on the hardware corresponding to the fault report; if faulty hardware isolation is needed, jumping to step 2.3); otherwise judging whether faulty hardware recovery needs to be performed on the hardware corresponding to the fault report; if faulty hardware recovery is needed, jumping to step 2.4); otherwise jumping to step 2.5);
2.3) when faulty hardware isolation needs to be performed on the hardware corresponding to the fault report, performing faulty hardware isolation on that hardware inside the operating system kernel;
2.4) when faulty hardware recovery needs to be performed on the hardware corresponding to the fault report, performing faulty hardware recovery on that hardware inside the operating system kernel;
2.5) recording a log of the fault handling;
2.6) sending a notification to the administrator;
2.7) judging according to preset rules whether dual-machine hot standby is required; if dual-machine hot standby is required, notifying said dual-machine hot standby software to perform dual-machine hot standby by calling the notification plug-in of the designated dual-machine hot standby software.
5. A high-availability computer system fault handling device based on cooperation inside and outside the kernel, characterized by comprising:
a unified fault reporting subsystem, configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports, and output them through a fault reporting interface established outside the operating system kernel, and at the same time to detect hardware faults inside the operating system kernel, generate fault reports, and output them through said fault reporting interface;
a unified fault handling subsystem, configured to monitor the fault reporting interface for fault reports outside the operating system kernel; after a fault report is received, to analyze the fault report; according to the analysis result, to perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or to perform fault handling outside the operating system kernel on the service corresponding to a service fault; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-machine hot standby is required and, if it is required, to notify the designated dual-machine hot standby software to perform dual-machine hot standby.
6. The high-availability computer system fault handling device based on cooperation inside and outside the kernel according to claim 5, characterized in that said unified fault reporting subsystem comprises a service detection module configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate fault reports, and output them through said fault reporting interface, said service detection module comprising:
a service state polling detection submodule, configured to perform state detection, outside the operating system kernel and by polling, on the system services and application services of the operating system, and to judge that a service fault has occurred if any system service or application service is in an abnormal state;
a service fault reporting submodule, configured to, after a service fault is judged to have occurred, generate a fault report from the information on the abnormal state of the system service or application service and output said fault report through said fault reporting interface.
7. The high-availability computer system fault handling device based on cooperation inside and outside the kernel according to claim 6, characterized in that said unified fault reporting subsystem comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports, and output them through said fault reporting interface, said hardware detection module comprising:
a hardware state monitoring submodule, configured to detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers and the hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, said hardware state monitoring point collects on-site information of the corresponding hardware according to preset rules as hardware fault data;
a hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data to generate a fault report and store it in a preset fault message queue;
a fault message queue scheduling submodule, configured to schedule and dispatch the fault reports stored in the fault message queue according to the fault message queue;
a hardware fault data reporting submodule, configured to use a thread to output the scheduled fault reports through said fault reporting interface.
8. The high-availability computer system fault handling device based on cooperation inside and outside the kernel according to claim 7, characterized in that said unified fault handling subsystem comprises:
a fault management daemon module, configured to monitor the fault reporting interface for fault reports outside the operating system kernel by means of a daemon process;
a fault handling engine, configured to analyze a fault report outside the operating system kernel after it is received and to judge the fault type of the fault report; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether faulty hardware isolation needs to be performed on the hardware corresponding to the fault report and, if so, to call the fault isolation module; otherwise to judge whether faulty hardware recovery needs to be performed on the hardware corresponding to the fault report and, if so, to call the fault recovery module;
a fault isolation module, configured to perform faulty hardware isolation inside the operating system kernel on the hardware corresponding to the fault report when such isolation is needed;
a fault recovery module, configured to perform faulty hardware recovery inside the operating system kernel on the hardware corresponding to the fault report when such recovery is needed;
a logging module, configured to record a log of the fault handling;
a fault notification module, configured to send a notification to the administrator;
a dual-machine hot standby processing module, configured to judge according to preset rules whether dual-machine hot standby is required and, if it is required, to notify said dual-machine hot standby software to perform dual-machine hot standby by calling the notification plug-in of the designated dual-machine hot standby software.
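The handling flow recited in claims 4 and 8 (classify the report, recover a service or isolate/recover hardware, log, notify the administrator, then apply the hot standby rule) can be illustrated with the following minimal sketch. It is only an illustration under stated assumptions, not the claimed implementation: every name here is hypothetical, the actions are returned as strings instead of being carried out, and the "service dependency description" is assumed to be a mapping from a service to the services that depend on it, so that a failed service is restarted before its dependents. The claims do not specify that format.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FaultReport:
    kind: str                      # "service" or "hardware"
    target: str                    # service name or device identifier
    needs_isolation: bool = False  # outcome of the isolation judgment
    needs_recovery: bool = False   # outcome of the recovery judgment


@dataclass
class FaultHandlingEngine:
    dependents: dict                               # assumed: service -> services depending on it
    standby_rule: Callable[[FaultReport], bool]    # preset rule for hot standby
    standby_plugin: Callable[[FaultReport], None]  # notification plug-in of the standby software
    log: list = field(default_factory=list)

    def recovery_order(self, service):
        """Failed service first, then everything that transitively depends on it."""
        order, pending = [], [service]
        while pending:
            name = pending.pop(0)
            if name not in order:
                order.append(name)
                pending.extend(self.dependents.get(name, []))
        return order

    def handle(self, report):
        actions = []
        if report.kind == "service":      # step 2.2, service branch
            actions += [f"restart:{s}" for s in self.recovery_order(report.target)]
        elif report.needs_isolation:      # step 2.3
            actions.append(f"isolate:{report.target}")
        elif report.needs_recovery:       # step 2.4 (the "otherwise" branch)
            actions.append(f"recover:{report.target}")
        self.log.append((report.target, list(actions)))  # step 2.5
        actions.append("notify-admin")    # step 2.6
        if self.standby_rule(report):     # step 2.7
            self.standby_plugin(report)
            actions.append("hot-standby")
        return actions
```

Passing the rule and the plug-in in as callables mirrors the claim wording, in which the standby decision is a preset rule and the notification goes through a plug-in supplied by the standby software rather than being built into the engine.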
CN201410215175.4A 2014-05-21 2014-05-21 High-availability computer system failure handling method and device based on core internal-external synergy Active CN103995759B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410215175.4A CN103995759B (en) 2014-05-21 2014-05-21 High-availability computer system failure handling method and device based on core internal-external synergy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410215175.4A CN103995759B (en) 2014-05-21 2014-05-21 High-availability computer system failure handling method and device based on core internal-external synergy

Publications (2)

Publication Number Publication Date
CN103995759A true CN103995759A (en) 2014-08-20
CN103995759B CN103995759B (en) 2015-04-29

Family

ID=51309932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410215175.4A Active CN103995759B (en) 2014-05-21 2014-05-21 High-availability computer system failure handling method and device based on core internal-external synergy

Country Status (1)

Country Link
CN (1) CN103995759B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1841341A (en) * 2005-03-31 2006-10-04 冲电气工业株式会社 Information processing device, information processing method and information processing program
US8190946B2 (en) * 2008-05-30 2012-05-29 Fujitsu Limited Fault detecting method and information processing apparatus
CN101833497A (en) * 2010-03-30 2010-09-15 山东高效能服务器和存储研究院 Computer fault management system based on expert system method
CN102364448A (en) * 2011-09-19 2012-02-29 浪潮电子信息产业股份有限公司 Fault-tolerant method for computer fault management system
US20140115575A1 (en) * 2012-10-18 2014-04-24 Vmware, Inc. Systems and methods for detecting system exceptions in guest operating systems
CN103279367A (en) * 2013-05-07 2013-09-04 浪潮电子信息产业股份有限公司 Kernel drive isolating system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107851054A (en) * 2015-09-15 2018-03-27 德克萨斯仪器股份有限公司 IC chip with multiple kernels
CN107851054B (en) * 2015-09-15 2022-02-08 德克萨斯仪器股份有限公司 Integrated circuit chip with multiple cores
US11269742B2 (en) 2015-09-15 2022-03-08 Texas Instruments Incorporated Integrated circuit chip with cores asymmetrically oriented with respect to each other
US11698841B2 (en) 2015-09-15 2023-07-11 Texas Instruments Incorporated Integrated circuit chip with cores asymmetrically oriented with respect to each other
CN106338982A (en) * 2016-09-26 2017-01-18 深圳前海弘稼科技有限公司 Fault processing method, fault processing device and server
CN106815114A (en) * 2017-01-12 2017-06-09 西安科技大学 A kind of computer system fault handling method based on software-hardware synergism
CN111367769A (en) * 2020-03-30 2020-07-03 浙江大华技术股份有限公司 Application fault processing method and electronic equipment
CN111367769B (en) * 2020-03-30 2023-07-21 浙江大华技术股份有限公司 Application fault processing method and electronic equipment

Also Published As

Publication number Publication date
CN103995759B (en) 2015-04-29

Similar Documents

Publication Publication Date Title
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
JP4882845B2 (en) Virtual computer system
CN104268061B (en) A kind of storage state monitoring method suitable for virtual machine
EP3148116B1 (en) Information system fault scenario information collection method and system
CN103324565B (en) Daily record monitoring method
CN106598790A (en) Server hardware failure detection method, apparatus of server, and server
CN103607297A (en) Fault processing method of computer cluster system
CN104102572A (en) Method and device for detecting and processing system faults
CN105224888B (en) A kind of data of magnetic disk array protection system based on safe early warning technology
CN102364448A (en) Fault-tolerant method for computer fault management system
CN101556679A (en) Method for processing failures in integrated front-end system and computer equipment
CN103995759B (en) High-availability computer system failure handling method and device based on core internal-external synergy
CN105243004A (en) Failure resource detection method and apparatus
CN111143167B (en) Alarm merging method, device, equipment and storage medium for multiple platforms
CN103490919A (en) Fault management system and fault management method
CN116126772A (en) UART serial port management system and method applied to ARM server
CN109062723A (en) The treating method and apparatus of server failure
CN105760241A (en) Exporting method and system for memory data
CN103207825A (en) Method and device for managing faults of entire equipment cabinet
CN111078445A (en) PSU power failure reason detection method and device
CN104283718A (en) Network device and hardware fault diagnosis method used for network device
CN103605592A (en) Mechanism of detecting malfunctions of distributed computer system
JP2014120001A (en) Monitoring device, monitoring method of monitoring object host, monitoring program, and recording medium
CN101131657A (en) System and method for assisting CPU to drive chips
CN115102838B (en) Emergency processing method and device for server downtime risk and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant