CN102521060A - Pseudo halt solving method of high-availability cluster system based on watchdog local detecting technique - Google Patents

Info

Publication number
CN102521060A
Authority
CN
China
Prior art keywords
watchdog feeding
pseudo halt
load
watchdog
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103629295A
Other languages
Chinese (zh)
Inventor
蔡强
王幸福
袁泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG NEWSTART TECHNOLOGY SERVICE Ltd
Original Assignee
GUANGDONG NEWSTART TECHNOLOGY SERVICE Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGDONG NEWSTART TECHNOLOGY SERVICE Ltd filed Critical GUANGDONG NEWSTART TECHNOLOGY SERVICE Ltd
Priority to CN2011103629295A priority Critical patent/CN102521060A/en
Publication of CN102521060A publication Critical patent/CN102521060A/en
Pending legal-status Critical Current

Abstract

The invention provides a pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique, belonging to the technical field of computer clusters. The method comprises the following steps: a watchdog feeding condition is preset and checked at a time interval T; if the condition is met, the watchdog is fed and the next detection is carried out after waiting for time T; otherwise the watchdog is not fed and the next detection is carried out after waiting for time T; when N consecutive detections fail, the watchdog times out and the system restarts; while the system restarts, the services running on the node are migrated to a backup node, guaranteeing the high availability and data safety of the system. The method solves the problem that, after a host enters pseudo halt and then recovers, the service runs on two nodes simultaneously, the disk array may be mounted by both nodes, and user data is lost; it thereby guarantees the running stability of the whole system. The method can be widely applied in the technical field of computer clusters.

Description

Pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique
Technical field
The invention belongs to the technical field of computer clusters, and in particular relates to a method for resolving pseudo halt in a high-availability cluster system.
Background
With the continuous expansion of computer applications and the rapid development of communication network technology, key sectors such as telecommunications, finance, and e-government place ever higher demands on the continuous operation of servers. Service interruptions caused by failures such as server crashes can bring incalculable losses. The customary countermeasure at present is to adopt a high-availability cluster system: even if one server fails, user services and data can switch rapidly to a backup server, so that the system as a whole continues to serve externally. This provides a strong guarantee for enterprise-critical applications that must run 24 hours a day, 365 days a year.
However, a problem that existing high-availability cluster systems find difficult to overcome is node pseudo halt. For example, when the volume of client requests reaches a certain level, the server's hardware resources can no longer meet demand and the server cannot provide service normally; it is in a semi-paralyzed state. The duration of this state is indeterminate, and whether the server will recover is also unknown. If the backup server takes over the service at this point, then once the host recovers from the pseudo halt the service runs on two nodes simultaneously, the disk array may be mounted by both nodes, and user data is lost. If the backup server does not take over the service, the cluster cannot provide service normally.
A method is therefore desired that resolves the pseudo halt problem and ensures that a high-availability cluster system provides service more safely and stably.
The existing watchdog, also called a watchdog timer, is a timer circuit. Typically, an I/O pin of the CPU is connected to the watchdog chip, and the program periodically drives this pin high to signal the watchdog (feeding the watchdog). If interference causes the program to run away and the CPU becomes trapped in an endless loop, the feeding operation can no longer be executed. The watchdog circuit, receiving no signal from the CPU within the timeout period, then sends a reset signal on the pin connected to the CPU reset pin; the CPU is reset and the system restarts.
Summary of the invention
In view of the above problems, the invention provides a pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique. The method uses the watchdog to detect and confirm the pseudo halt state and prevents the service from running on two nodes simultaneously.
The invention is realized by the following technical means: the pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique comprises the following steps:
A. When the cluster system starts, read the configuration file to obtain the watchdog feeding interval T and the maximum number of failed detections N; set the watchdog timeout to T × N and enable the watchdog;
B. Set the watchdog feeding condition;
C. Start the timer process, which checks every interval T whether the feeding condition is met; if it is met, feed the watchdog and perform the next detection after waiting for time T; otherwise go to step D;
D. The detection has failed: do not feed the watchdog, and perform the next detection after waiting for time T; when N consecutive detections have failed, the watchdog times out and the system restarts;
E. While the system restarts, the services running on this node are migrated to the backup node, guaranteeing the high availability and data safety of the system.
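The timing relationship in steps A to D can be illustrated with a small simulation. The sketch below is not the patent's implementation: `SimulatedWatchdog` and `run_feeder` are hypothetical names, time is a logical counter rather than a real clock, and no hardware watchdog device is touched. It only shows that a watchdog with timeout T × N fires after N consecutive failed detections spaced T apart.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SimulatedWatchdog:
    """Stand-in for a hardware watchdog; tracks logical time only (assumption)."""
    timeout: float          # T * N, as set in step A
    last_feed: float = 0.0  # time the watchdog was enabled or last fed

    def feed(self, now: float) -> None:
        self.last_feed = now

    def expired(self, now: float) -> bool:
        return now - self.last_feed >= self.timeout


def run_feeder(check: Callable[[float], bool], interval: float,
               max_failures: int, ticks: int) -> Optional[float]:
    """Run `ticks` detection rounds.

    Returns the logical time at which the simulated watchdog would reset
    the node, or None if it never times out.
    """
    dog = SimulatedWatchdog(timeout=interval * max_failures)  # step A
    now = 0.0
    for _ in range(ticks):
        now += interval          # wait T between detections
        if check(now):           # step C: feeding condition met
            dog.feed(now)
        if dog.expired(now):     # step D: N consecutive failures
            return now           # a real watchdog would reset the node here
    return None
```

With T = 5 and N = 3, a check that starts failing after time 15 leads to a simulated reset at time 30, that is, T × N after the last successful feed. A check that fails only intermittently never triggers the reset, because each success restarts the countdown.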
The invention may also be refined as follows:
In step B, the watchdog feeding condition is that the timer process is running normally.
In step B, the watchdog feeding condition is a measure of system load: the condition is met if the system load is below a threshold.
The system load is measured as follows: first, obtain the total number of CPUs in the system, Num; second, read the system's total load value Load over the most recent 5 to 20 minutes and compute the current average load LoadAvg = Load/Num; then compare LoadAvg with the system load threshold Thres specified in the configuration file. If LoadAvg is less than Thres, the feeding condition is met.
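The load test described above reduces to one division and one comparison. The following is a minimal sketch, assuming a Unix-like host where Python's `os.getloadavg()` is available and taking its 15-minute figure as Load (one value inside the 5 to 20 minute window the text allows). The function names are illustrative, not taken from the patent.

```python
import os


def load_ok(load: float, num_cpus: int, thres: float) -> bool:
    """Return True when LoadAvg = Load / Num is below the threshold Thres."""
    if num_cpus <= 0:
        raise ValueError("num_cpus must be positive")
    load_avg = load / num_cpus
    return load_avg < thres


def current_load_ok(thres: float) -> bool:
    """Apply load_ok to the live system (Unix-like hosts only)."""
    num = os.cpu_count() or 1      # cpu_count() may return None
    load = os.getloadavg()[2]      # 15-minute load average (assumed window)
    return load_ok(load, num, thres)
```

For example, a total load of 8.0 on 4 CPUs gives LoadAvg = 2.0, which meets the feeding condition for Thres = 2.5, whereas a total load of 12.0 gives 3.0 and does not.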
Alternatively, the system load is measured by at least one of the following: checking disk I/O load with iowait, or estimating memory load with vmstat.
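The text does not say how the iowait figure is obtained. One possible approach on Linux, shown here purely as an assumption, is to sample the aggregate `cpu` line of `/proc/stat` twice and compute the share of elapsed CPU time spent waiting on I/O. The helper names and the threshold parameter are hypothetical.

```python
from typing import List


def _cpu_fields(stat_line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into its leading counters."""
    parts = stat_line.split()
    if len(parts) < 6 or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line with an iowait field")
    # Counter order: user, nice, system, idle, iowait, irq, softirq, steal
    return [int(v) for v in parts[1:9]]


def iowait_fraction(before: str, after: str) -> float:
    """Fraction of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(_cpu_fields(before), _cpu_fields(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return deltas[4] / total   # index 4 is the iowait counter


def io_ok(before: str, after: str, thres: float) -> bool:
    """Feeding condition sketch: iowait share stays below a configured threshold."""
    return iowait_fraction(before, after) < thres
```

In practice the two samples would be read from `/proc/stat` one detection interval apart; the functions take the lines as strings so that the calculation can be checked without a live system.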
In step B, the watchdog feeding condition is determined as follows: first, read the configuration file and record the services that the watchdog needs to monitor together with their detection script information; then perform local detection of the services. If the service detection succeeds, the feeding condition is met.
The local service detection checks the availability of a service by having the detection script send any of the following: a TCP connection request, an SQL query, a service-specific message, or a packet header or packet body carrying protocol flag bits (covering both text and binary stream protocols).
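Of the detection modes listed above, the TCP connection request is the simplest to illustrate. The sketch below is an example under assumptions: `tcp_probe` is a hypothetical name, and the host, port, and two-second timeout are illustrative values that would normally come from the configuration file.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened in time."""
    try:
        # create_connection performs the connect; the context manager
        # closes the socket again immediately afterwards.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A successful connect only shows that the port is accepting connections. A service that accepts connections but cannot answer requests would need one of the deeper probes mentioned in the text, such as an SQL query or a service-specific message.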
The detection script is a detection program written in any of Python, Perl, shell, or C.
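Because the detection scripts may be written in different languages, a feeder process would most naturally invoke them as external commands. The sketch below assumes, as a convention not stated in the source, that a script signals success with exit status 0. The mapping of service names to command lines stands in for the detection script information read from the configuration file, and `all_services_ok` is a hypothetical name.

```python
import subprocess
from typing import Dict, List


def all_services_ok(checks: Dict[str, List[str]], timeout: float = 5.0) -> bool:
    """Run each service's detection command; True only if every one exits with 0."""
    for _service, argv in checks.items():
        try:
            result = subprocess.run(
                argv,
                timeout=timeout,              # a hung script counts as a failure
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except (OSError, subprocess.TimeoutExpired):
            return False                      # script missing, not runnable, or hung
        if result.returncode != 0:
            return False
    return True
```

Treating a timed-out script as a failed detection matters here: during pseudo halt the script itself may stall, and the watchdog must still go unfed.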
The watchdog is a hardware watchdog.
Compared with the prior art, the invention has the following beneficial effects:
1) The pseudo halt solving method provided by the invention first checks whether the host is in a pseudo halt state and, once pseudo halt is confirmed, uses the watchdog to restart the host, which guarantees that the services running on the host are stopped. This effectively avoids the situation in which, after the host recovers from pseudo halt, the service runs on two nodes simultaneously, the disk array is mounted by both nodes, and user data is lost. The stability of the whole system is thereby guaranteed.
2) The pseudo halt solving method provided by the invention first checks whether the host is in a pseudo halt state and, once pseudo halt is confirmed, uses the watchdog to restart the host. The backup node takes over the service while the host restarts, which solves the problem that the cluster cannot provide service normally while the host remains in pseudo halt, and guarantees service continuity.
Brief description of the drawings
Fig. 1 is a topology diagram of a high-availability multi-node cluster system;
Fig. 2 is a flow chart of the pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique according to the invention;
In the figures: 1. disk array; 2. active node; 3. backup node.
Detailed description
The invention is described in detail below with reference to the accompanying drawings and embodiments, so that its object, scheme, and effects can be further understood; the description does not limit the scope of protection of the appended claims.
Embodiment 1
The cluster is configured with L nodes (L ≥ 2), each of which has a hardware watchdog module. The cluster is deployed on site according to Fig. 1, and m services are configured for the cluster as required. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique comprises the following steps:
A. When the cluster system starts, read the configuration file to obtain the watchdog feeding interval T and the maximum number of failed detections N; set the watchdog timeout to T × N and enable the watchdog;
B. Set the watchdog feeding condition;
C. Start the timer process, which checks every interval T whether the feeding condition is met; if it is met, feed the watchdog and perform the next detection after waiting for time T; otherwise go to step D;
D. The detection has failed: do not feed the watchdog, and perform the next detection after waiting for time T; when N consecutive detections have failed, the watchdog times out and the system restarts;
E. While the system restarts, the services running on this node are migrated to the backup node, guaranteeing the high availability and data safety of the system.
After the cluster is started, a pseudo halt test is carried out. Services are running on node L2; an excessive number of clients access the services on L2 simultaneously, causing node L2 to enter pseudo halt. After L2 has been in pseudo halt for T × N seconds, the watchdog times out and node L2 is restarted; the services previously running on node L2 are migrated to the backup node, and the cluster system as a whole continues to provide service normally.
Embodiment 2
The cluster is configured with L nodes (L ≥ 2), each of which has a hardware watchdog module. The cluster is deployed on site according to Fig. 1, and m services are configured for the cluster as required. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique comprises the following steps:
A. When the cluster system starts, read the configuration file to obtain the watchdog feeding interval T and the maximum number of failed detections N; set the watchdog timeout to T × N and enable the watchdog;
B. Set the watchdog feeding condition to be that the timer process is running normally;
C. Start the timer process, which checks every interval T whether the feeding condition is met; if it is met, feed the watchdog and perform the next detection after waiting for time T; otherwise go to step D;
D. The detection has failed: do not feed the watchdog, and perform the next detection after waiting for time T; when N consecutive detections have failed, the watchdog times out and the system restarts;
E. While the system restarts, the services running on this node are migrated to the backup node, guaranteeing the high availability and data safety of the system.
After the cluster is started, a pseudo halt test is carried out. Services are running on node L2; a test program that induces system pseudo halt is run on L2, causing node L2 to enter pseudo halt. After L2 has been in pseudo halt for T × N seconds, the watchdog times out and node L2 is restarted; the services previously running on node L2 are migrated to the backup node, and the cluster system as a whole continues to provide service normally.
Embodiment 3
The cluster is configured with L nodes (L ≥ 2), each of which has a hardware watchdog module. The cluster is deployed on site according to Fig. 1, and m services are configured for the cluster as required. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique comprises the following steps:
A. When the cluster system starts, read the configuration file to obtain the watchdog feeding interval T and the maximum number of failed detections N; set the watchdog timeout to T × N and enable the watchdog;
B. Set the watchdog feeding condition as follows: first, obtain the total number of CPUs in the system, Num; second, read the system's total load value Load over the most recent 15 minutes and compute the current average load LoadAvg = Load/Num; then compare LoadAvg with the system load threshold Thres specified in the configuration file; if LoadAvg is less than Thres, the feeding condition is met;
C. Start the timer process, which checks every interval T whether the feeding condition is met; if it is met, feed the watchdog and perform the next detection after waiting for time T; otherwise go to step D;
D. The detection has failed: do not feed the watchdog, and perform the next detection after waiting for time T; when N consecutive detections have failed, the watchdog times out and the system restarts;
E. While the system restarts, the services running on this node are migrated to the backup node, guaranteeing the high availability and data safety of the system.
After the cluster is started, a system state test is carried out. Services are running on node L2; an excessive number of clients access the services on L2 simultaneously, driving the average load LoadAvg above the system load threshold Thres. After this state has lasted T × N seconds, the watchdog times out and node L2 is restarted; the services previously running on node L2 are migrated to the backup node, and the cluster system as a whole continues to provide service normally.
Embodiment 4
The cluster is configured with L nodes (L ≥ 2), each of which has a hardware watchdog module. The cluster is deployed on site according to Fig. 1, and m services are configured for the cluster as required. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique comprises the following steps:
A. When the cluster system starts, read the configuration file to obtain the watchdog feeding interval T and the maximum number of failed detections N; set the watchdog timeout to T × N and enable the watchdog;
B. Set the watchdog feeding condition as follows: first, read the configuration file and record the services that the watchdog needs to monitor together with their detection script information; then perform local detection of the services by having the detection script send a TCP connection request or an SQL query; if the service detection succeeds, the feeding condition is met;
C. Start the timer process, which checks every interval T whether the feeding condition is met; if it is met, feed the watchdog and perform the next detection after waiting for time T; otherwise go to step D;
D. The detection has failed: do not feed the watchdog, and perform the next detection after waiting for time T; when N consecutive detections have failed, the watchdog times out and the system restarts;
E. While the system restarts, the services running on this node are migrated to the backup node, guaranteeing the high availability and data safety of the system.
After the cluster is started, a system state test is carried out. Services are running on node L2; a test program run on L2 causes node L2 to enter pseudo halt, and local service detection fails N consecutive times. After this state has lasted T × N seconds, the watchdog times out and node L2 is restarted; the services previously running on node L2 are migrated to the backup node, and the cluster system as a whole continues to provide service normally.
The above embodiments are merely preferred embodiments of the invention and do not limit its scope of rights; equivalent variations made according to the claims of the invention therefore still fall within the scope covered by the invention.

Claims (9)

1. A pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique, characterized by comprising the following steps:
A. When the cluster system starts, read the configuration file to obtain the watchdog feeding interval T and the maximum number of failed detections N; set the watchdog timeout to T × N and enable the watchdog;
B. Set the watchdog feeding condition;
C. Start the timer process, which checks every interval T whether the feeding condition is met; if it is met, feed the watchdog and perform the next detection after waiting for time T; otherwise go to step D;
D. The detection has failed: do not feed the watchdog, and perform the next detection after waiting for time T; when N consecutive detections have failed, the watchdog times out and the system restarts;
E. While the system restarts, the services running on this node are migrated to the backup node, guaranteeing the high availability and data safety of the system.
2. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique according to claim 1, characterized in that: in step B, the watchdog feeding condition is that the timer process is running normally.
3. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique according to claim 1, characterized in that: in step B, the watchdog feeding condition is a measure of system load, and the condition is met if the system load is below a threshold.
4. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique according to claim 3, characterized in that the system load is measured as follows: first, obtain the total number of CPUs in the system, Num; second, read the system's total load value Load over the most recent 5 to 20 minutes and compute the current average load LoadAvg = Load/Num; then compare LoadAvg with the system load threshold Thres specified in the configuration file; if LoadAvg is less than Thres, the feeding condition is met.
5. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique according to claim 3, characterized in that: the system load is measured by at least one of checking disk I/O load with iowait and estimating memory load with vmstat.
6. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique according to claim 1, characterized in that: in step B, the watchdog feeding condition is determined as follows: first, read the configuration file and record the services that the watchdog needs to monitor together with their detection script information; then perform local detection of the services; if the service detection succeeds, the feeding condition is met.
7. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique according to claim 6, characterized in that: the local service detection checks the availability of a service by having the detection script send any of a TCP connection request, an SQL query, a service-specific message, or a packet header or packet body carrying protocol flag bits.
8. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique according to claim 7, characterized in that: the detection script is a detection program written in any of Python, Perl, shell, or C.
9. The pseudo halt solving method for a high-availability cluster system based on a watchdog local detection technique according to any one of claims 1 to 8, characterized in that: the watchdog is a hardware watchdog.
CN2011103629295A 2011-11-16 2011-11-16 Pseudo halt solving method of high-availability cluster system based on watchdog local detecting technique Pending CN102521060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103629295A CN102521060A (en) 2011-11-16 2011-11-16 Pseudo halt solving method of high-availability cluster system based on watchdog local detecting technique

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011103629295A CN102521060A (en) 2011-11-16 2011-11-16 Pseudo halt solving method of high-availability cluster system based on watchdog local detecting technique

Publications (1)

Publication Number Publication Date
CN102521060A true CN102521060A (en) 2012-06-27

Family

ID=46291995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103629295A Pending CN102521060A (en) 2011-11-16 2011-11-16 Pseudo halt solving method of high-availability cluster system based on watchdog local detecting technique

Country Status (1)

Country Link
CN (1) CN102521060A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103200257A (en) * 2013-03-28 2013-07-10 中标软件有限公司 Node in high availability cluster system and resource switching method of node in high availability cluster system
CN103533297A (en) * 2012-07-05 2014-01-22 英飞凌科技股份有限公司 Monitoring circuit with a signature watchdog
CN107426051A (en) * 2017-07-19 2017-12-01 北京华云网际科技有限公司 The monitoring method of the working condition of distributed cluster system interior joint, apparatus and system
CN107577575A (en) * 2017-09-06 2018-01-12 长沙曙通信息科技有限公司 A kind of disaster tolerant backup system management of monitor implementation method
CN110377487A (en) * 2019-07-11 2019-10-25 无锡华云数据技术服务有限公司 A kind of method and device handling high-availability cluster fissure
CN114911642A (en) * 2022-04-27 2022-08-16 北京计算机技术及应用研究所 Firmware restarting method based on UEFI event mechanism and watchdog

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101686261A (en) * 2009-09-01 2010-03-31 卡斯柯信号有限公司 RAC-based redundant server system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103533297A (en) * 2012-07-05 2014-01-22 英飞凌科技股份有限公司 Monitoring circuit with a signature watchdog
CN103533297B (en) * 2012-07-05 2018-07-27 英飞凌科技股份有限公司 Monitoring circuit with signature monitor
US10838795B2 (en) 2012-07-05 2020-11-17 Infineon Technologies Ag Monitoring circuit with a signature watchdog
CN103200257A (en) * 2013-03-28 2013-07-10 中标软件有限公司 Node in high availability cluster system and resource switching method of node in high availability cluster system
CN107426051A (en) * 2017-07-19 2017-12-01 北京华云网际科技有限公司 The monitoring method of the working condition of distributed cluster system interior joint, apparatus and system
CN107426051B (en) * 2017-07-19 2018-06-05 北京华云网际科技有限公司 The monitoring method of the working condition of distributed cluster system interior joint, apparatus and system
CN107577575A (en) * 2017-09-06 2018-01-12 长沙曙通信息科技有限公司 A kind of disaster tolerant backup system management of monitor implementation method
CN110377487A (en) * 2019-07-11 2019-10-25 无锡华云数据技术服务有限公司 A kind of method and device handling high-availability cluster fissure
CN114911642A (en) * 2022-04-27 2022-08-16 北京计算机技术及应用研究所 Firmware restarting method based on UEFI event mechanism and watchdog
CN114911642B (en) * 2022-04-27 2024-04-19 北京计算机技术及应用研究所 Firmware restarting method based on UEFI event mechanism and watchdog

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120627