CN102521060A

CN102521060A - Pseudo halt solving method of high-availability cluster system based on watchdog local detecting technique

Info

Publication number: CN102521060A
Application number: CN2011103629295A
Authority: CN
Inventors: 蔡强; 王幸福; 袁泉
Original assignee: GUANGDONG NEWSTART TECHNOLOGY SERVICE Ltd
Current assignee: GUANGDONG NEWSTART TECHNOLOGY SERVICE Ltd
Priority date: 2011-11-16
Filing date: 2011-11-16
Publication date: 2012-06-27

Abstract

The invention provides a pseudo halt solving method for a high-availability cluster system based on a watchdog local detecting technique, belonging to the technical field of computer clusters. The method comprises the following steps: a dog-feeding parameter condition is preset, and at every time interval T the condition is checked; if the condition is met, a dog-feeding operation is carried out and the next detection follows after waiting time T; otherwise no dog-feeding operation is carried out, and the next detection follows after waiting time T; when N consecutive detections fail, the watchdog times out and the system is restarted; while the system restarts, the services running on the node are migrated to a backup node, so as to guarantee the high availability and data safety of the system. This solves the problem that, after a host suffers a pseudo halt and then recovers, a service runs on two nodes simultaneously, the disk array is mounted twice, and user data is lost, and it guarantees the running stability of the whole system. The method can be widely applied in the technical field of computer clusters.

Description

Pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique

Technical field

The invention belongs to the technical field of computer clusters, and in particular relates to a method for solving pseudo halt in a high-availability cluster (High-availability clusters) system.

Background technology

With the continuous expansion of computer applications and the rapid development of communication network technology, key fields such as telecommunications, finance, and e-government place increasingly high demands on the continuous service of servers. A business interruption caused by a fault such as a server crash can bring incalculable losses. The customary means of coping with this is to adopt a high-availability cluster system: even if a server fails, customer services and data can be switched rapidly to a backup server, so that the whole system continues to serve normally, which provides a strong guarantee for an enterprise's key business applications running 24 hours x 365 days.

However, one problem that existing high-availability cluster systems find difficult to overcome is the pseudo halt of a node. For example, when the volume of client requests reaches a certain level, the server's hardware resources can no longer meet the demand and the server cannot provide normal external service, a semi-paralyzed condition. The duration of this condition is indefinite, and whether the server will recover is unknown. If the backup server takes over the business at this moment, then after the host recovers from the pseudo halt the service runs on two nodes simultaneously, the disk array is mounted twice, and user data is lost. If the backup server does not take over the business, the cluster cannot provide normal external service.

Therefore, a method is needed that solves the pseudo halt problem, so that a high-availability cluster system can provide service more safely and stably.

The existing watchdog technique, also called a watchdog timer, is a timer circuit. A watchdog chip is generally connected to an I/O pin of the CPU, and under program control the CPU periodically sends a high level to the watchdog through this pin (feeding the dog). Once the CPU falls into an endless loop because interference has caused the program to run away, the dog-feeding routine can no longer be executed. The watchdog circuit, no longer receiving the signal from the CPU, then issues a reset signal on its pin connected to the CPU reset pin, the CPU is reset, and the system restarts.
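The timer behaviour described above can be illustrated with a small software model. This is a minimal sketch rather than the hardware circuit itself: the class name `SimulatedWatchdog` and its methods are hypothetical, and the current time is passed in explicitly so that the example runs without actually waiting.

```python
class SimulatedWatchdog:
    """Software model of a watchdog timer (hypothetical, for illustration)."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        if timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        self.timeout_s = timeout_s
        # The countdown starts as soon as the watchdog is opened.
        self.deadline = now + timeout_s

    def feed(self, now: float) -> None:
        """Restart the countdown, as a dog-feeding signal would."""
        self.deadline = now + self.timeout_s

    def expired(self, now: float) -> bool:
        """True once no feed has arrived for timeout_s seconds."""
        return now >= self.deadline


wd = SimulatedWatchdog(timeout_s=10.0)
wd.feed(now=9.0)              # fed in time, so the countdown restarts
print(wd.expired(now=18.0))   # 9 s since the last feed
print(wd.expired(now=19.0))   # 10 s since the last feed; real hardware would reset the CPU
```

In the model, as in the circuit, the only way to avoid expiry is to keep feeding; a program stuck in an endless loop simply stops calling `feed`.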

Summary of the invention

In view of the above problems, the present invention provides a pseudo halt solving method for a high-availability cluster system based on a watchdog local detecting technique. The method uses the watchdog technique to detect and confirm the pseudo halt state, and prevents a service from running on two nodes simultaneously.

The present invention is realized through the following technical means: the pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique comprises the following steps:

A. When the cluster system starts, read the configuration file to obtain the dog-feeding time interval T and the maximum number of detection failures N; set the watchdog timeout to T * N, and open the watchdog;

B. Set the dog-feeding parameter condition;

C. Start the timer process, which checks at every time interval T whether the dog-feeding parameter condition is met; if it is met, carry out the dog-feeding operation and perform the next detection after waiting time T; otherwise execute step D;

D. The detection has failed: do not feed the dog, and perform the next detection after waiting time T; when N consecutive detections fail, the watchdog times out and the system restarts;

E. While the system restarts, the services running on this node are migrated to a backup node, guaranteeing the high availability and data safety of the system.
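Steps A to D can be sketched as a simulation over a sequence of detection results. This is an illustration under stated assumptions, not the patented implementation: the function name `simulate_feed_loop` is hypothetical, the first detection is assumed to happen T seconds after the watchdog is opened, and simulated time replaces real waiting and a real watchdog device.

```python
from typing import Iterable, Optional


def simulate_feed_loop(check_results: Iterable[bool],
                       interval_t: float,
                       max_failures_n: int) -> Optional[float]:
    """Return the simulated time at which the watchdog expires, or None.

    check_results holds the outcome of each detection (True = the
    dog-feeding parameter condition is met).
    """
    timeout = interval_t * max_failures_n  # step A: timeout is T * N
    now = 0.0
    last_feed = 0.0                        # watchdog opened at time 0
    for condition_met in check_results:
        now += interval_t                  # wait T before each detection
        if condition_met:
            last_feed = now                # step C: feed the dog
        elif now - last_feed >= timeout:   # step D: N failures in a row
            return now                     # watchdog timeout, node restarts
    return None


# T = 2 s, N = 3: two isolated failures are tolerated, three in a row are not.
results = [True, False, False, True, False, False, False]
print(simulate_feed_loop(results, interval_t=2.0, max_failures_n=3))
```

Because the timeout equals T * N, a single successful detection resets the count, so only N consecutive failures lead to a restart.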

The present invention can also be improved as follows:

In step B, the dog-feeding parameter condition is that the timer process is running normally.

In step B, the dog-feeding parameter condition is a measure of system load: the condition is met if the system load is below a threshold.

The method of measuring system load is as follows: first, obtain the total number Num of CPUs in the system; second, read the system's total load value Load over the most recent 5-20 minutes and calculate the current average load LoadAvg = Load / Num; then compare the current average load LoadAvg with the system load threshold Thres specified in the configuration file; if LoadAvg is less than Thres, the dog-feeding parameter condition is met.
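The load comparison can be sketched as follows. The function names are hypothetical, and the use of `os.getloadavg()` (available on Unix-like systems) with its 15-minute figure is an assumption chosen to fall inside the 5-20 minute window; the text does not name a specific data source.

```python
import os


def load_condition_met(total_load: float, cpu_count: int, threshold: float) -> bool:
    """Return True if LoadAvg = Load / Num is below Thres."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    load_avg = total_load / cpu_count
    return load_avg < threshold


def current_load_condition_met(threshold: float) -> bool:
    """Evaluate the condition on the running host (Unix-like systems only)."""
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
    _, _, load_15min = os.getloadavg()
    return load_condition_met(load_15min, os.cpu_count() or 1, threshold)


# Worked example: Load = 6.0 on Num = 4 CPUs gives LoadAvg = 1.5.
print(load_condition_met(6.0, 4, threshold=2.0))
```

Dividing by the CPU count makes one threshold usable across nodes with different numbers of processors.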

The system load is measured in at least one of the following ways: checking disk I/O load with iowait, or estimating memory load with vmstat.
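One way to obtain an iowait figure, assuming a Linux host, is to compare two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below works on the line text so that it can be tried without a live system; the function names are hypothetical and the source does not say how iowait is to be read.

```python
from typing import List


def parse_cpu_times(stat_line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user, nice, system, idle, iowait, irq, softirq, steal.
    # The guest counters that follow are already included in user/nice.
    times = [int(value) for value in fields[1:9]]
    if len(times) < 5:
        raise ValueError("line has no iowait field")
    return times


def iowait_fraction(before_line: str, after_line: str) -> float:
    """Share of CPU time spent waiting for I/O between two samples."""
    before = parse_cpu_times(before_line)
    after = parse_cpu_times(after_line)
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return deltas[4] / total if total > 0 else 0.0  # index 4 is iowait


sample_1 = "cpu  100 0 50 800 50 0 0 0 0 0"
sample_2 = "cpu  200 0 100 1500 200 0 0 0 0 0"
print(iowait_fraction(sample_1, sample_2))
```

On a real node the two lines would be read from `/proc/stat` a few seconds apart, and the resulting fraction compared with a configured threshold in the same way as LoadAvg.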

In step B, the dog-feeding parameter condition is as follows: first, read the configuration file and record the services that need to be detected by the watchdog together with their detection script information; then perform local detection of the services; if the service detection succeeds, the dog-feeding parameter condition is met.
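A minimal sketch of this condition is shown below. The configuration layout (one section per service with a `detect_cmd` key), the script paths, and the function names are all hypothetical, since the text does not specify a file format; a detection is assumed to succeed when its script exits with status 0.

```python
import configparser
import shlex
import subprocess
from typing import Dict, List

# Hypothetical configuration: one section per service to be detected.
EXAMPLE_CONFIG = """
[database]
detect_cmd = /opt/ha/check_db.sh

[web]
detect_cmd = /opt/ha/check_web.py --port 8080
"""


def load_detection_commands(config_text: str) -> Dict[str, List[str]]:
    """Record each service together with its detection command."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {name: shlex.split(parser[name]["detect_cmd"])
            for name in parser.sections()}


def run_detection(argv: List[str], timeout_s: float = 5.0) -> bool:
    """Run one detection script; exit status 0 counts as success."""
    try:
        result = subprocess.run(argv, timeout=timeout_s,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except (OSError, subprocess.TimeoutExpired):
        return False  # a missing or hung script is treated as a failure
    return result.returncode == 0


def feeding_condition_met(commands: Dict[str, List[str]]) -> bool:
    """The condition is met only if every configured service is detected."""
    return all(run_detection(argv) for argv in commands.values())


print(load_detection_commands(EXAMPLE_CONFIG))
```

Treating a timed-out script as a failed detection matters here, because a hung service is exactly the situation the method is meant to catch.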

The local detection of a service checks the availability of the service through a detection script in any of the following ways: sending a TCP connection request, an SQL query, a service-specific message, or a packet header or packet body carrying protocol flag bits (including text and binary stream protocols).
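The simplest of these probes, a TCP connection request, might look like the following sketch. The function name `tcp_service_alive` and the example port are assumptions made for illustration; an SQL query or a service-specific message would be a stricter test of the same kind.

```python
import socket


def tcp_service_alive(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical local service port; adjust to the service being detected.
    print(tcp_service_alive("127.0.0.1", 3306))
```

The timeout should be shorter than the detection interval T, so that a probe against a hung service still returns before the next detection is due.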

The detection script is a detection program written in any of python, perl, shell, or C.

The watchdog is a hardware watchdog.

Compared with the prior art, the present invention has the following beneficial effects:

1) The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique provided by the invention first checks whether the host is in a pseudo halt state and, once the pseudo halt is confirmed, uses the watchdog technique to restart the host, which guarantees that the services running on the host are stopped. This effectively avoids the problem that, after the host recovers from a pseudo halt, the service runs on two nodes simultaneously, the disk array is mounted twice, and user data is lost, and it guarantees the stability of the whole system.

2) The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique provided by the invention first checks whether the host is in a pseudo halt state and, once the pseudo halt is confirmed, uses the watchdog technique to restart the host. The backup node takes over the business while the host restarts, which solves the problem that the cluster cannot provide normal external service while the host remains in a pseudo halt, and guarantees continuity of service.

Description of drawings

Fig. 1 is a topology diagram of a high-availability multi-node cluster system;

Fig. 2 is a flow chart of the pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique of the present invention;

In the figures: 1. disk array; 2. active node; 3. backup node.

Embodiment

The present invention is described in detail below with reference to the accompanying drawings and embodiments, for a further understanding of the object, scheme, and effects of the invention, but not as a limitation on the protection scope of the appended claims.

Embodiment 1

The cluster is configured with L nodes (L >= 2), and each node has a hardware watchdog module. Field deployment is carried out according to Fig. 1, and m services are configured for the cluster as required. The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique comprises the following steps:

B. Set the dog-feeding parameter condition;

After the cluster is started, a system pseudo halt test is carried out. A service is running on node L2, and an excessive number of clients access the service on L2 simultaneously, causing node L2 to suffer a pseudo halt. After L2 has been in the pseudo halt state for T * N seconds, the watchdog times out and node L2 is restarted; the services previously running on node L2 are migrated to the backup node, and the whole cluster system can provide normal external service.

Embodiment 2

B. Set the dog-feeding parameter condition to be that the timer process is running normally;

After the cluster is started, a system pseudo halt test is carried out. A service is running on node L2, and a test program that induces a system pseudo halt is run on L2, causing node L2 to suffer a pseudo halt. After L2 has been in the pseudo halt state for T * N seconds, the watchdog times out and node L2 is restarted; the services previously running on node L2 are migrated to the backup node, and the whole cluster system can provide normal external service.

Embodiment 3

B. Set the dog-feeding parameter condition as follows: first, obtain the total number Num of CPUs in the system; second, read the system's total load value Load over the most recent 15 minutes and calculate the current average load LoadAvg = Load / Num; then compare the current average load LoadAvg with the system load threshold Thres specified in the configuration file; if LoadAvg is less than Thres, the dog-feeding parameter condition is met;

After the cluster is started, a system state test is carried out. A service is running on node L2, and an excessive number of clients access the service on L2 simultaneously, causing the average load LoadAvg to exceed the system load threshold Thres. After this state has lasted for T * N seconds, the watchdog times out and node L2 is restarted; the services previously running on node L2 are migrated to the backup node, and the whole cluster system can provide normal external service.

Embodiment 4

B. Set the dog-feeding parameter condition as follows: first, read the configuration file and record the services that need to be detected by the watchdog together with their detection script information; then perform local detection of the services by having the detection script send a TCP connection request or an SQL query; if the service detection succeeds, the dog-feeding parameter condition is met;

After the cluster is started, a system state test is carried out. A service is running on node L2, and a test program run on L2 causes node L2 to suffer a pseudo halt. After this state has lasted for T * N seconds, the local service detection has failed N consecutive times, the watchdog times out, and node L2 is restarted; the services previously running on node L2 are migrated to the backup node, and the whole cluster system can provide normal external service.

The above embodiments are merely preferred embodiments of the present invention and cannot be used to limit the scope of rights of the invention; therefore, equivalent variations made according to the claims of the present invention still fall within the scope covered by the invention.

Claims

1. A pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique, characterized in that it comprises the following steps:

B. Set the dog-feeding parameter condition;

2. The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique according to claim 1, characterized in that: in step B, the dog-feeding parameter condition is that the timer process is running normally.

3. The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique according to claim 1, characterized in that: in step B, the dog-feeding parameter condition is a measure of system load, and the condition is met if the system load is below a threshold.

4. The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique according to claim 3, characterized in that: the method of measuring system load is as follows: first, obtain the total number Num of CPUs in the system; second, read the system's total load value Load over the most recent 5-20 minutes and calculate the current average load LoadAvg = Load / Num; then compare the current average load LoadAvg with the system load threshold Thres specified in the configuration file; if LoadAvg is less than Thres, the dog-feeding parameter condition is met.

5. The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique according to claim 3, characterized in that: the system load is measured in at least one of the following ways: checking disk I/O load with iowait, or estimating memory load with vmstat.

6. The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique according to claim 1, characterized in that: in step B, the dog-feeding parameter condition is as follows: first, read the configuration file and record the services that need to be detected by the watchdog together with their detection script information; then perform local detection of the services; if the service detection succeeds, the dog-feeding parameter condition is met.

7. The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique according to claim 6, characterized in that: the local detection of a service checks the availability of the service through a detection script in any of the following ways: sending a TCP connection request, an SQL query, a service-specific message, or a packet header or packet body carrying protocol flag bits.

8. The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique according to claim 7, characterized in that: the detection script is a detection program written in any of python, perl, shell, or C.

9. The pseudo halt solving method of a high-availability cluster system based on a watchdog local detecting technique according to any one of claims 1-8, characterized in that: the watchdog is a hardware watchdog.