CN102394791A

CN102394791A - Downtime recovery method and system

Info

Publication number: CN102394791A
Application number: CN201110329567XA
Authority: CN
Inventors: 刘希猛
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2011-10-26
Filing date: 2011-10-26
Publication date: 2012-03-28

Abstract

The invention provides a downtime recovery method and system, relates to the communications field, and solves the problem that manual detection cannot meet the need for fast response to system failures, which lowers the working efficiency of the system. The method comprises the following steps: a monitoring server periodically and automatically sends heartbeat detection messages to a plurality of test servers in an intranet according to a preset intranet heartbeat time; the monitoring server receives the heartbeat detection results returned by the test servers in the intranet in response to the heartbeat detection messages; and when the heartbeat detection result returned by a test server shows that the network of that test server is down, the monitoring server sends a power-off restart instruction to the test server that is down. The technical scheme provided by the invention is suitable for a multi-server network and achieves automatic and efficient downtime detection and recovery.

Description

Downtime recovery method and system

Technical field

The present invention relates to the communications field, and in particular to a downtime recovery method and system.

Background technology

The rapid development of the IT industry is driving enterprise application software toward automation and intelligence. For developers of such complex application software, how to better test increasingly complex functional software is also a difficult problem, especially for test content that must run continuously over a long period. Test programs likewise need to become more intelligent to keep pace with the growing complexity of IT software and hardware functions.

As software application environments grow more complex, the probability of software errors keeps rising, and a critical requirement for software is that it can recover after the system fails. Detecting system errors in a timely manner therefore becomes the first problem to solve.

At present, system errors are generally detected and monitored manually. For servers undergoing continuous 7×24-hour testing, once a server goes down, staff cannot watch the running state of the machine in real time, and by the time manual inspection discovers the system error caused by the downtime, the server may already have been down for some time, wasting a large amount of test time.

In summary, existing software application environments are increasingly complex and the probability of errors keeps rising; manual detection cannot meet the need to respond quickly to system errors, which lowers the working efficiency of the system.

Summary of the invention

The invention provides a downtime recovery method and system, which solve the problem that manual detection cannot meet the need for fast response to system errors, resulting in low system working efficiency.

A downtime recovery method comprises:

a monitoring server periodically and automatically sends heartbeat detection messages to a plurality of test servers in an intranet according to a preset intranet heartbeat time;

the monitoring server receives the heartbeat detection results returned by each test server in the intranet in response to the heartbeat detection messages;

when the heartbeat detection result returned by a test server shows that the network of that test server is down, the monitoring server sends a power-off restart instruction to the test server that is down.
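The three steps above can be sketched as a short loop. The following Python is a minimal illustration only, not the patented implementation: the names `run_heartbeat_cycle` and `monitor`, the injected `probe` and `power_cycle` callables, and the bounded `cycles` argument are assumptions introduced here so that the sketch can be run and tested without real hardware.

```python
import time
from typing import Callable, List, Sequence

UP = "UP"      # test server is running normally
DOWN = "DOWN"  # test server network is down


def run_heartbeat_cycle(
    test_servers: Sequence[str],
    probe: Callable[[str], str],
    power_cycle: Callable[[str], None],
) -> List[str]:
    """Run one heartbeat cycle and return the servers that were restarted."""
    restarted = []
    for server in test_servers:
        # Send the heartbeat detection message and read the returned result.
        result = probe(server)
        if result == DOWN:
            # The network of this test server is down: send the
            # power-off restart instruction.
            power_cycle(server)
            restarted.append(server)
    return restarted


def monitor(
    test_servers: Sequence[str],
    probe: Callable[[str], str],
    power_cycle: Callable[[str], None],
    heartbeat_seconds: float,
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Repeat the cycle, waiting one intranet heartbeat time after each pass.

    `cycles` bounds the loop for illustration; a real monitor would
    presumably run until stopped.
    """
    history = []
    for _ in range(cycles):
        history.append(run_heartbeat_cycle(test_servers, probe, power_cycle))
        sleep(heartbeat_seconds)
    return history
```

Passing `probe` and `power_cycle` in as callables keeps the scheduling logic separate from the transport, so the same loop would work whether the heartbeat is a ping or some other message.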

Preferably, the heartbeat detection message is specifically a ping command.

Preferably, the heartbeat detection result covers two cases: the test server is running normally, or the network of the test server is down.

Preferably, the monitoring server sending a power-off restart instruction to the test server that is down is specifically:

the monitoring server uses an Intelligent Platform Management Interface (IPMI) management command to send the power-off restart instruction to the test server that is down.
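As a hedged illustration of what such an IPMI management command might look like, the sketch below builds an argument list for the widely used `ipmitool` utility. The patent does not name a specific tool or subcommand, so the use of `ipmitool`, the `lanplus` interface, the `chassis power cycle` subcommand, and the function name `ipmi_power_cycle_command` are all assumptions; `chassis power reset` would be another plausible choice.

```python
from typing import List


def ipmi_power_cycle_command(bmc_host: str, user: str, password_file: str) -> List[str]:
    """Build an ipmitool command that power-cycles the server behind a BMC.

    Assumption: the BMC is reachable over the network and accepts
    IPMI-over-LAN (lanplus) sessions for the given user.
    """
    return [
        "ipmitool",
        "-I", "lanplus",        # IPMI v2.0 over LAN
        "-H", bmc_host,         # address of the test server's BMC
        "-U", user,
        "-f", password_file,    # read the password from a file, not argv
        "chassis", "power", "cycle",
    ]
```

The returned list could be handed to `subprocess.run(cmd, check=True)` on the monitoring server; building the command separately from running it makes the choice of subcommand easy to inspect and change.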

Preferably, after the step in which the monitoring server sends a power-off restart instruction to the test server that is down when the heartbeat detection result returned by that test server shows that its network is down, the method further comprises:

the test server that is down restarts itself according to the power-off restart instruction received from the monitoring server.

Preferably, before the step in which the monitoring server periodically and automatically sends heartbeat detection messages to a plurality of test servers in the intranet according to the preset intranet heartbeat time, the method further comprises:

selecting, from the plurality of test servers in the intranet, one test server that is stable and lightly loaded as the monitoring server.

Preferably, the above downtime recovery method further comprises:

configuring the intranet heartbeat time so as to instruct the monitoring server to send heartbeat detection messages according to it, the intranet heartbeat time being greater than the time a test server needs to restart.

The invention also provides a downtime recovery system comprising a monitoring server and a plurality of test servers monitored by it; the monitoring server and the plurality of test servers are in the same intranet and are interconnected through that intranet.

The monitoring server is configured to periodically and automatically send heartbeat detection messages to the plurality of test servers in the intranet according to the preset intranet heartbeat time, to receive the heartbeat detection result returned by each test server in the intranet, and, when the heartbeat detection result returned by a test server shows that the network of that test server is down, to send a power-off restart instruction to the test server that is down.

The test server is configured to receive the heartbeat detection message sent by the monitoring server and to return to the monitoring server a heartbeat detection result in response to that message.

Preferably, the monitoring server uses an IPMI management command to send the power-off restart instruction.

Preferably, the test server is further configured to restart itself according to the power-off restart instruction received from the monitoring server.

The invention provides a downtime recovery method and system. The monitoring server periodically and automatically sends heartbeat detection messages to a plurality of test servers in the intranet according to the preset intranet heartbeat time, receives the heartbeat detection results returned by each test server in response to those messages, and, when the result returned by a test server shows that its network is down, sends a power-off restart instruction to that test server. This realizes automatic and timely detection of server downtime, shortens the response time to server downtime in the system, and solves the problem that manual detection cannot meet the need for fast response to system errors, resulting in low system working efficiency.

Description of drawings

Fig. 1 is a schematic structural diagram of a downtime recovery system provided by an embodiment of the invention;

Fig. 2 is a flow chart of a downtime recovery method provided by an embodiment of the invention.

Embodiment

To solve the above problem, embodiments of the invention provide a downtime recovery method and system. The embodiments are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments can be combined with one another arbitrarily.

An embodiment of the invention provides a downtime recovery system whose structure is shown in Fig. 1, comprising:

a monitoring server 101 and a plurality of test servers 102 monitored by it; the monitoring server 101 and the plurality of test servers 102 are in the same intranet and are interconnected through that intranet.

The monitoring server 101 is configured to periodically and automatically send heartbeat detection messages to the plurality of test servers 102 in the intranet according to the preset intranet heartbeat time, to receive the heartbeat detection result returned by each test server 102 in the intranet, and, when the heartbeat detection result returned by a test server 102 shows that this test server 102 is down, to send a power-off restart instruction to the test server 102 that is down.

The test server 102 is configured to receive the heartbeat detection message sent by the monitoring server 101 and to return to the monitoring server 101 a heartbeat detection result in response to that message.

Preferably, the heartbeat detection message sent by the monitoring server 101 is specifically a ping command, and the heartbeat detection result covers two cases: the test server 102 is normal, or the test server 102 is down.

Preferably, the monitoring server 101 uses an IPMI management command to send the power-off restart instruction.

Preferably, the test server 102 is further configured to restart itself according to the power-off restart instruction received from the monitoring server 101.

It should be noted that the monitoring server 101 and the test servers 102 are all ordinary servers in the intranet. In general, one stable and lightly loaded server is selected from the servers in the intranet as the monitoring server, and the other servers are monitored by it. As the working conditions of the servers in the intranet change, another server can also be manually configured as the monitoring server.

In combination with the above downtime recovery system, an embodiment of the invention provides a downtime recovery method. The flow for detecting servers in the intranet and controlling their recovery with this method is shown in Fig. 2 and comprises:

Step 201: from the plurality of test servers in the intranet, select one test server that is stable and lightly loaded as the monitoring server.

In this step, a server that runs continuously and stably is selected as the monitoring server.
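The selection in step 201 can be illustrated with a small sketch. The patent does not define how "stable" or "lightly loaded" is measured, so the `Candidate` fields (`uptime_days`, `load_average`), the 30-day threshold, and the function name `select_monitoring_server` are purely hypothetical stand-ins for whatever criteria an operator actually uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    uptime_days: float   # assumed proxy for "stable"
    load_average: float  # assumed proxy for "lightly loaded"


def select_monitoring_server(
    candidates: List[Candidate],
    min_uptime_days: float = 30.0,
) -> Optional[Candidate]:
    """Pick the least-loaded candidate among those considered stable."""
    stable = [c for c in candidates if c.uptime_days >= min_uptime_days]
    if not stable:
        return None  # no server qualifies; an operator would choose manually
    return min(stable, key=lambda c: c.load_average)
```

Filtering on stability first and then minimizing load mirrors the wording of the step, which names stability before load.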

Preferably, a baseboard management controller (BMC) interconnection network is built in the intranet. An IPMI tool software package is installed on the monitoring server, the BMC addresses of the test servers are configured, and the IPMI service is enabled. Most servers today integrate IPMI management in the BMC on the mainboard. After the monitoring server finds that the network of a test server is down, it can use an IPMI management command to send a power-off restart instruction to the BMC of that test server, restarting the server quickly and promptly restoring the underlying hardware system that supports the test service.

Step 202: configure the intranet heartbeat time so as to instruct the monitoring server to send heartbeat detection messages according to it.

In this step, the intranet heartbeat time is configured. It must be greater than the time a test server needs to restart; otherwise the monitoring server may again detect that the network of the same test server is down while that server is still restarting, and send the power-off restart instruction again, so that the same test server is power-cycled repeatedly. Setting the intranet heartbeat time greater than the restart time leaves the test server a buffer interval between powering off and powering back on during a restart.
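The constraint in step 202 is simple enough to check at configuration time. The following is a minimal sketch under stated assumptions: the function name `validate_heartbeat_time` and the idea of returning the spare buffer are introduced here for illustration, and the restart time is assumed to be a value the operator has measured.

```python
def validate_heartbeat_time(heartbeat_seconds: float, restart_seconds: float) -> float:
    """Return the buffer (in seconds) left after a restart completes.

    Raises ValueError when the heartbeat time is not greater than the
    restart time, because a probe could then hit a server that is still
    rebooting and trigger another power-off restart.
    """
    if heartbeat_seconds <= restart_seconds:
        raise ValueError(
            "intranet heartbeat time must be greater than the test server "
            "restart time to avoid repeated power-off restarts"
        )
    return heartbeat_seconds - restart_seconds
```

For example, with an illustrative restart time of 180 seconds, a heartbeat time of 300 seconds passes and leaves a 120-second buffer, while a heartbeat time of 60 seconds is rejected.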

Preferably, the startup items used for the monitoring server's monitoring of the test servers can also be configured. By calling the specific resource monitoring and launch flows of different test programs in the startup items, intelligent recovery of different continuous test tasks can be achieved.

Step 203: the monitoring server periodically and automatically sends heartbeat detection messages to the plurality of test servers in the intranet according to the preset intranet heartbeat time.

The simplest approach is to use a periodically issued network ping command as the heartbeat detection message.

Step 204: the monitoring server receives the heartbeat detection results returned by each test server in the intranet in response to the heartbeat detection messages.

A heartbeat detection program runs on each test server; it responds to the heartbeat detection message it receives and returns the current heartbeat detection result. The returned result is one of two kinds: the test server is running normally, or the network of the test server is down.

When step 204 detects the heartbeat through the ping command, a shell script in this step automatically derives the working state of the remote test server from the returned ping result: UP (the test server is running normally) or DOWN (the network of the test server is down).
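The UP/DOWN derivation can be sketched as follows. The text describes a shell script; Python is used here only as a stand-in, and the function names `ping_command`, `classify`, and `probe` are assumptions. The flags follow the Linux iputils `ping` syntax (`-c` for count, `-W` for reply timeout in seconds), which is also an assumption about the monitoring server's platform.

```python
import subprocess
from typing import List


def ping_command(host: str, timeout_seconds: int = 1) -> List[str]:
    """Build a single-probe ping command (Linux iputils syntax assumed)."""
    return ["ping", "-c", "1", "-W", str(timeout_seconds), host]


def classify(returncode: int) -> str:
    """Map the ping exit status to the two states named in the text."""
    # iputils ping exits with 0 when at least one reply was received.
    return "UP" if returncode == 0 else "DOWN"


def probe(host: str) -> str:
    """Ping the test server once and return "UP" or "DOWN"."""
    completed = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return classify(completed.returncode)
```

A single lost packet would mark a server DOWN in this sketch; a deployment might prefer to require several consecutive failures before restarting, but the text does not specify this.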

Step 205: when the heartbeat detection result returned by a test server shows that the network of that test server is down, the monitoring server sends a power-off restart instruction to the test server that is down.

Specifically, in this step the monitoring server sends a server restart instruction over the IPMI management network to the BMC of the test server that is down. The instruction acts directly on the power supply of the test server, so the test server can still be restarted quickly even when a test program has caused it to hang (a software crash).

Step 206: the test server that is down restarts itself according to the power-off restart instruction received from the monitoring server.

After the test server restarts, the resource monitoring and launch flow of the test program, which have been added to the startup flow of the test server, automatically restore the heartbeat detection program without manual intervention. This ensures that continuous testing resumes quickly and saves the manpower, material resources, and time cost of testing.
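One way to picture the startup flow described above is a routine that waits for the resources a test program needs and only then launches the programs in order. This is a hypothetical sketch: the patent does not specify which resources are checked or how, so `wait_for_resources`, `resume_tests`, the retry counts, and the callables passed in are all assumptions made for illustration.

```python
import time
from typing import Callable, Sequence, Tuple

ResourceCheck = Tuple[str, Callable[[], bool]]


def wait_for_resources(
    checks: Sequence[ResourceCheck],
    attempts: int = 10,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll each resource in order; return False if one never becomes ready."""
    for _name, is_ready in checks:
        for _ in range(attempts):
            if is_ready():
                break
            sleep(delay_seconds)
        else:
            # All attempts were used without the resource becoming ready.
            return False
    return True


def resume_tests(
    checks: Sequence[ResourceCheck],
    starters: Sequence[Callable[[], None]],
    **wait_options,
) -> bool:
    """Start the test programs in order once every resource is ready."""
    if not wait_for_resources(checks, **wait_options):
        return False
    for start in starters:
        start()
    return True
```

Registering something like `resume_tests` as a boot-time service would be one way to realize a startup item; the ordering of checks before launches reflects the emphasis the text places on sequencing resource monitoring and test program startup.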

Embodiments of the invention provide a downtime recovery method and system. The monitoring server periodically and automatically sends heartbeat detection messages to a plurality of test servers in the intranet according to the preset intranet heartbeat time, receives the heartbeat detection results returned by each test server in response to those messages, and, when the result returned by a test server shows that its network is down, sends a power-off restart instruction to that test server. This realizes automatic and timely detection of server downtime, shortens the response time to server downtime in the system, and solves the problem that manual detection cannot meet the need for fast response to system errors, resulting in low system working efficiency. The resource monitoring and launch flow required by the test program are added to the operating system startup flow of the test server. Once the server network is back to normal, the test program is restored automatically without manual intervention, which ensures that continuous testing resumes quickly and saves the manpower, material resources, and time cost of testing. The key to automatic recovery is coordinating the order of resource monitoring and test program startup.

Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments can be implemented by a computer program flow. The computer program can be stored in a computer-readable storage medium and is executed on a corresponding hardware platform (such as a system, equipment, apparatus, or device); when executed, it performs one of the steps of the method embodiment or a combination of them.

Alternatively, all or some of the steps of the above embodiments can be implemented with integrated circuits: the steps can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The invention is therefore not limited to any specific combination of hardware and software.

Each device, functional module, or functional unit in the above embodiments can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices.

When a device, functional module, or functional unit in the above embodiments is implemented as a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium. The storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.

Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. The protection scope of the invention shall therefore be determined by the protection scope of the claims.

Claims

1. A downtime recovery method, characterized by comprising:

2. The downtime recovery method according to claim 1, characterized in that the heartbeat detection message is specifically a ping command.

3. The downtime recovery method according to claim 1 or 2, characterized in that the heartbeat detection result covers two cases: the test server is running normally, or the network of the test server is down.

4. The downtime recovery method according to claim 1, characterized in that the monitoring server sending a power-off restart instruction to the test server that is down is specifically:

5. The downtime recovery method according to claim 1, characterized in that, after the step in which the monitoring server sends a power-off restart instruction to the test server that is down when the heartbeat detection result returned by that test server shows that its network is down, the method further comprises:

6. The downtime recovery method according to claim 1, characterized in that, before the step in which the monitoring server periodically and automatically sends heartbeat detection messages to a plurality of test servers in the intranet according to the preset intranet heartbeat time, the method further comprises:

7. The downtime recovery method according to claim 1, characterized in that the method further comprises:

8. A downtime recovery system, characterized by comprising a monitoring server and a plurality of test servers monitored by it, the monitoring server and the plurality of test servers being in the same intranet and interconnected through that intranet;

9. The downtime recovery system according to claim 8, characterized in that

the monitoring server uses an IPMI management command to send the power-off restart instruction.

10. The downtime recovery system according to claim 7, characterized in that

the test server is further configured to restart itself according to the power-off restart instruction received from the monitoring server.