CN102360324B - Failure recovery method and equipment for failure recovery - Google Patents

Info

Publication number
CN102360324B
Authority
CN
China
Prior art keywords
equipment
working instance
configuration file
run
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110335042.7A
Other languages
Chinese (zh)
Other versions
CN102360324A (en)
Inventor
孙奇辉
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201110335042.7A
Publication of CN102360324A
Application granted
Publication of CN102360324B

Landscapes

  • Hardware Redundancy (AREA)

Abstract

An embodiment of the invention provides a failure recovery method and a device for failure recovery. The failure recovery method comprises the following steps: a first device loads the configuration file of a working instance in order to run that instance; and the first device sends the configuration file to a second device, so that when the first device can no longer run the working instance, the second device runs it according to the configuration file. An embodiment of the invention also provides the device for failure recovery. With this technical scheme, the first device sends the configuration file to the second device so that the second device can recover the failed working instance of the first device, making the working instance effectively fault-tolerant. The invention offers high availability, and interruption of a working instance's operation caused by a single point of failure can be avoided.

Description

Fault recovery method and device for fault recovery
Technical field
The present invention relates to the computer field and, more specifically, to a fault recovery method in the computer field and a device for fault recovery.
Background
With the rapid growth in the number of users and the enormous increase in the volume of information, IT (Information Technology) systems in many industries, and especially in the telecommunications industry, are required not only to provide great processing power but also, increasingly, to be highly fault-tolerant. Even when facing massive volumes of data and concurrent processing, they must remain highly available.
When a running working instance fails, existing information technology systems tend simply to rerun the instance. For various reasons, however, the rerun may fail, so that the failed working instance is not recovered in time.
Summary of the invention
The invention provides a fault recovery method and devices for fault recovery, which solve the limitations of fault recovery in the prior art: a working instance can be made effectively fault-tolerant, with high availability.
In one aspect, the invention provides a fault recovery method, comprising: a first device loads the configuration file of a working instance in order to run the instance; the first device sends the configuration file to a second device, so that when the second device determines that the first device cannot run the instance, it recovers the instance from the configuration file.
In another aspect, the invention provides a fault recovery method, comprising: a second device receives, from a first device, and stores the configuration file of a working instance run by the first device; when the second device determines that the first device cannot run the instance, it recovers the instance from the configuration file.
In a further aspect, the invention provides a device for fault recovery, comprising: a loading module, configured to load the configuration file of a working instance in order to run the instance; and a first sending module, configured to send the configuration file to another device, so that when the other device determines that this device cannot run the instance, it recovers the instance from the configuration file.
In yet another aspect, the invention provides a device for fault recovery, comprising: a first receiving module, configured to receive, from another device, and store the configuration file of a working instance run by that device; and a recovery module, configured to recover the instance from the configuration file when it is determined that the other device cannot run the instance.
According to the above technical scheme, a failed working instance can be recovered at a remote site using its configuration file. The working instance is thereby made effectively fault-tolerant with high availability, and interruption of its operation due to a single point of failure is avoided.
Brief description of the drawings
To illustrate the technical scheme of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described here are obviously only some embodiments of the present invention; those skilled in the art can obtain further drawings from them without creative effort.
Fig. 1 is a flowchart of a fault recovery method according to an embodiment of the present invention.
Fig. 2 is a flowchart of another fault recovery method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an example of fault recovery in a distributed system.
Fig. 4 is a flowchart of fault recovery in the example shown in Fig. 3.
Fig. 5 is a sequence diagram of fault recovery in the example shown in Fig. 3.
Fig. 6 is a structural block diagram of a device for fault recovery according to an embodiment of the present invention.
Fig. 7 is a structural block diagram of another device for fault recovery according to an embodiment of the present invention.
Fig. 8 is a structural block diagram of a further device for fault recovery according to an embodiment of the present invention.
Fig. 9 is a structural block diagram of yet another device for fault recovery according to an embodiment of the present invention.
Detailed description of the embodiments
The technical scheme of the embodiments of the present invention is described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only a part, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from the described embodiments without creative effort fall within the protection scope of the present invention.
First, a fault recovery method 100 according to an embodiment of the present invention is described with reference to Fig. 1.
As shown in Fig. 1, the fault recovery method 100 comprises:
In S110, a first device loads the configuration file of a working instance in order to run the instance;
In S120, the first device sends the configuration file to a second device, so that when the second device determines that the first device cannot run the instance, it recovers the instance from the configuration file.
Method 100 is performed by the first device. The first device and the second device may be two devices in a distributed system; they may cooperate on the same task or perform different tasks. They may be in a master-slave relationship, in which the second device, acting as the slave, takes over the work when the first device, acting as the master, fails. They may also be in a peer-to-peer relationship, backing each other up, so that when one device fails another takes over its work. The present invention does not limit the relationship between the first device and the second device.
Working instances can run on both the first device and the second device. To run a working instance, a device must load that instance's configuration file. At least one working instance runs on the first device; whether any instance runs on the second device is not restricted. Suppose a working instance A1 runs on the first device and a hardware problem, software problem, or the like on the first device causes A1 to fail, so that A1 cannot continue running.
To deal with a failure of A1, the first device can send A1's configuration file to the second device in advance. When A1 fails, the second device can then recover A1 from the configuration file it obtained, so that A1 continues running as a newly created working instance B1 on the second device. By recovering the instance remotely from its configuration file in this way, the working instance is made effectively fault-tolerant with high availability, and interruption of its operation due to a single point of failure is avoided. S110 and S120 of the embodiment are described in detail next.
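The configuration-file-based takeover described above can be sketched in a few lines. The following is an illustrative outline only, not the patent's implementation; the class and method names (`ConfigBackup`, `storePeerConfig`, `recoverInstance`) and the string-based "config" are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: device A ships each instance's configuration file to
// device B in advance, so B can re-create the instance when A cannot run it.
public class ConfigBackup {
    // Configurations received from the peer, keyed by instance name.
    private final Map<String, String> peerConfigs = new HashMap<>();

    // Called whenever the peer sends (or re-sends) an instance's config file.
    public void storePeerConfig(String instanceName, String configContent) {
        peerConfigs.put(instanceName, configContent);
    }

    // Called once the peer is determined to be unable to run the instance;
    // returns the name of the locally re-created instance, or null if no
    // config was ever received for it.
    public String recoverInstance(String instanceName) {
        String config = peerConfigs.get(instanceName);
        if (config == null) {
            return null; // nothing to recover from
        }
        // A real system would load the config and start the instance here;
        // this sketch only reports the re-created instance's name.
        return instanceName + "-recovered";
    }
}
```

Note that recovery is only possible for instances whose configuration was backed up beforehand, which is why the patent has the first device push configuration files proactively rather than on failure.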
In S110, by loading the configuration file of the working instance, the first device can run the instance. The loading process is the same as in the prior art and is not repeated here.
In S120, the first device can send the configuration file of the working instance to the second device. If the instance later fails and cannot run normally on the first device, the second device can run the corresponding instance by loading the configuration file obtained from the first device, thereby recovering from the failure of the instance.
According to an embodiment of the invention, when a working instance running on the first device (denoted A1) fails, the first device may first attempt to recover the instance on the first device itself. If A1 is recovered successfully on the first device, the second device need not perform remote recovery from the configuration file.
It is also possible, however, that the first device fails to recover A1 locally. In that case, remote recovery of the instance must be performed on the second device. According to one embodiment of the present invention, when local recovery of A1 fails, the first device changes A1's state to a state indicating that the first device cannot run A1, so that the second device can determine this from the state. In another embodiment of the present invention, when local recovery of A1 fails, the first device sends the second device a request message asking the second device to run A1, from which the second device determines that the first device cannot run A1. In either case, once the second device has determined that A1 cannot run normally on the first device, it can perform fault recovery on the second device according to the configuration file of A1 it obtained. The instance may be recovered by continuing its subsequent processing, by re-executing it from the beginning, or in any other way that occurs to those skilled in the art.
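The "local restart first, remote recovery second" policy above can be expressed as a small decision routine. This is a hedged sketch under invented names (`RecoveryPolicy`, the retry bound, the `notifyPeer` callback); the patent itself does not fix a retry count or an API.

```java
// Illustrative policy: try local restart a bounded number of times; if all
// attempts fail, signal the peer (by changing state or sending a request,
// per the two variants described above) so it can take over remotely.
public class RecoveryPolicy {
    public enum Outcome { RECOVERED_LOCALLY, HANDED_OVER }

    public static Outcome recover(java.util.function.BooleanSupplier localRestart,
                                  int maxAttempts,
                                  Runnable notifyPeer) {
        for (int i = 0; i < maxAttempts; i++) {
            if (localRestart.getAsBoolean()) {
                return Outcome.RECOVERED_LOCALLY; // no remote recovery needed
            }
        }
        notifyPeer.run(); // local recovery exhausted: hand over to the peer
        return Outcome.HANDED_OVER;
    }
}
```

The design point is that the peer is only involved after local recovery is exhausted, which keeps the instance's runtime environment unchanged whenever possible.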
When a working instance fails, local recovery is preferred and remote recovery is performed afterwards, so that the instance's runtime environment is changed as little as possible and the occurrence of a fault does not have a large impact on the system. Combining local recovery with remote recovery improves the effectiveness of fault tolerance and avoids situations in which a single recovery mechanism cannot recover the fault. The combination therefore improves both the efficiency and the success rate of fault tolerance.
In addition, when remote recovery is needed, the first device may actively send a request message to the second device so that the second device performs the recovery; alternatively, the first device may merely change the state of the instance and leave it to the second device to discover the failure and perform remote recovery on its own initiative. Remote recovery can thus be carried out conveniently and flexibly, and is simple to implement.
In a specific implementation, the first device may periodically report to the second device the states of the working instances running on it. In that case, if the second device receives no such state report for a preset number of consecutive cycles (the number of cycles can be set in advance as needed), it can conclude that the first device has failed and cannot run the working instances, and can then run the corresponding instances by loading the configuration files obtained from the first device. If the first device has failed, it cannot run any of the working instances deployed on it.
Alternatively, in a specific implementation, the second device may periodically send status query messages to the first device to query the states of the working instances running on it. In that case, if the second device receives no state reply for a preset number of consecutive cycles (that is, the first device fails to respond to several consecutive status queries), it can likewise conclude that the first device has failed and cannot run the working instances, and can then run the corresponding instances by loading the configuration files obtained from the first device. Again, if the first device has failed, it cannot run any of the working instances deployed on it.
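Both detection schemes above reduce to counting silent cycles. A minimal sketch, with an invented class name and an illustrative threshold (the patent only says the number of cycles is preset as needed):

```java
// Missed-heartbeat counter: if no status report (or status-query reply)
// arrives for N consecutive cycles, the peer is presumed failed and all of
// its working instances become candidates for remote recovery.
public class HeartbeatMonitor {
    private final int missedCyclesThreshold;
    private int missedCycles = 0;

    public HeartbeatMonitor(int missedCyclesThreshold) {
        this.missedCyclesThreshold = missedCyclesThreshold;
    }

    // Any state report from the peer resets the counter.
    public void onStatusReceived() { missedCycles = 0; }

    // Called once per reporting/query cycle in which nothing arrived.
    public void onCycleElapsedWithoutStatus() { missedCycles++; }

    // True once the peer has been silent for the configured number of cycles.
    public boolean peerPresumedFailed() {
        return missedCycles >= missedCyclesThreshold;
    }
}
```

The same counter serves both the push variant (missed reports) and the pull variant (unanswered queries); only who initiates each cycle differs.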
According to one embodiment of the present invention, when the configuration file of a working instance is updated, the first device sends the updated configuration file of that instance to the second device.
That is, if the configuration files of one or more working instances on the first device are updated, the first device actively sends the updated files to the second device, so that the second device can recover from failures correctly using the latest configuration.
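Keeping the backup copy current can be as simple as pushing on change. The following sketch is illustrative only (class name, change detection by string comparison, and the `sendToPeer` callback are all invented for the example):

```java
// Push the new configuration to the peer whenever it actually changes, so
// stale configuration files are never used for remote recovery.
public class ConfigPublisher {
    private final java.util.Map<String, String> lastPushed = new java.util.HashMap<>();

    // Returns true if the config changed and was therefore pushed.
    public boolean pushIfChanged(String instance, String config,
                                 java.util.function.BiConsumer<String, String> sendToPeer) {
        if (config.equals(lastPushed.get(instance))) {
            return false; // unchanged, no need to resend
        }
        sendToPeer.accept(instance, config); // transmit the updated file
        lastPushed.put(instance, config);
        return true;
    }
}
```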
With the fault recovery method provided by this embodiment, a failed working instance can be recovered remotely using its configuration file; the instance is thereby made effectively fault-tolerant, with high availability, and interruption of its operation due to a single point of failure is avoided. In a distributed software system in particular, fault tolerance already exists at the distribution layer and the container layer; adding the working-instance-layer fault tolerance provided by this embodiment yields hierarchical fault tolerance, a stronger fault recovery capability, and higher reliability. Moreover, because faults can be recovered while instances are running, instances can be recovered dynamically and promptly, without introducing excessive delay that would affect system performance.
Fig. 2 is a flowchart illustrating the fault recovery method provided by the embodiment of the present invention from the perspective of the second device.
As shown in Fig. 2, the fault recovery method 200 comprises:
In S210, the second device receives, from the first device, and stores the configuration file of a working instance run by the first device;
In S220, when the second device determines that the first device cannot run the instance, it recovers the instance from the configuration file.
Method 200 is performed by the second device. Since the operations of the second device correspond to those of the first device described above, the description of method 200 may refer to the corresponding parts of method 100.
In S210, the configuration files that the second device receives from the first device may include the configuration files of all working instances running on the first device. When the first device loads the configuration file of a working instance, it can send that file to the second device, so that the second device too can run the corresponding instance by loading it.
The second device may also obtain the configuration files of only some of the working instances running on the first device: instances whose reliability requirements are higher or whose importance is greater may need their configuration files backed up on multiple devices to achieve efficient fault tolerance. In that case, the first device sends the second device the configuration files of those instances only.
Of course, as those skilled in the art will appreciate, the second device may also request from the first device the configuration files of certain working instances, in order to provide remote recovery and fault tolerance for them.
In S220, when the second device determines that the working instance A1 running on the first device can no longer run normally there, the second device can load the configuration file of A1 obtained in S210 and thereby run A1 on the second device. That A1 cannot run normally on the first device means A1 has failed; the failure may be caused by a software execution problem on the first device, by hardware resource limitations, or by other problems that prevent A1 from running normally.
Thus, although A1 failed on the first device, its processing can continue on the second device, providing effective fault tolerance. Moreover, there is no restriction that the recovered instance A1 belong to the same task as the instances the second device itself is running; the fault recovery method of this embodiment therefore breaks through the prior-art restriction that fault tolerance can only be provided within the same task, and has stronger availability and effectiveness.
According to embodiments of the invention, the second device can determine in several ways whether a working instance is running normally on the first device.
According to one embodiment of the present invention, before S220 the method may further comprise: the second device detects the state set for the working instance by the first device; when the state indicates that the first device cannot run the instance, the second device determines that the first device cannot run it.
Here, the first device can set a state for each of the working instances A1 to An running on it, the state showing whether the instance is running normally or has failed. For example, three states may be defined for a working instance: a first, normal state, indicating that the instance runs normally; a second, transitional state, indicating that the instance has failed but local recovery is being attempted; and a third, fault state, indicating that local recovery has failed and the instance cannot continue running locally. When the second device detects that an instance is in the third state, the instance cannot run normally on the first device. When it detects the first or second state, the second device determines that the instance has not definitively failed, so remote recovery is not yet needed. These states are only an example and impose no restriction on the protection scope of the invention; those skilled in the art will conceive of other ways of setting states through which the second device is informed whether remote fault recovery of a working instance is needed.
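The three-state scheme above, together with the peer's decision rule, can be sketched as follows. The enum constant names are invented for the example; the patent only describes the three states informally.

```java
// Illustrative encoding of the example states: only the third (fault) state
// tells the observing device that remote recovery is required.
public class InstanceState {
    public enum State {
        NORMAL,             // first state: running normally
        RECOVERING_LOCALLY, // second, transitional state: local restart in progress
        FAILED              // third state: local recovery failed, cannot continue locally
    }

    // Peer-side decision: recover remotely only once the owner has given up.
    public static boolean needsRemoteRecovery(State s) {
        return s == State.FAILED;
    }
}
```

Keeping the transitional state distinct prevents the peer from starting a duplicate instance while the first device is still restarting the original.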
According to one embodiment of the present invention, before S220 the method may further comprise: upon receiving from the first device a request message asking the second device to run the working instance, the second device determines that the first device cannot run the instance.
Here, the first device can send the request message to the second device when it finds that A1 has failed and cannot be recovered locally, or that A1 has failed but insufficient resources are available locally to recover it, or even when A1 has not failed but the first device wishes to migrate A1 to the second device for execution. On receiving the request message, the second device determines that A1 has failed and cannot run normally on the first device, and therefore decides to recover A1 and clear the fault.
In a specific implementation, the first device may periodically report to the second device the states of the working instances running on it. In that case, if the second device receives no such state report for a preset number of consecutive cycles (the number of cycles can be set in advance as needed), it can conclude that the first device has failed and cannot run the working instances, and can then run the corresponding instances by loading the configuration files obtained from the first device. If the first device has failed, it cannot run any of the working instances deployed on it.
Alternatively, in a specific implementation, the second device may periodically send status query messages to the first device to query the states of the working instances running on it. In that case, if the second device receives no state reply for a preset number of consecutive cycles (that is, the first device fails to respond to several consecutive status queries), it can likewise conclude that the first device has failed and cannot run the working instances, and can then run the corresponding instances by loading the configuration files obtained from the first device. Again, if the first device has failed, it cannot run any of the working instances deployed on it.
According to one embodiment of the present invention, after the configuration file of a working instance is updated, the second device receives the updated file sent by the first device and updates its stored copy accordingly. The second device thus keeps the latest configuration file, so that faults can be recovered correctly, avoiding the waste of resources and the ineffective recovery caused by using an outdated configuration file.
According to one embodiment of the present invention, after S220 the method may further comprise: when the second device fails to run the working instance, it writes a log recording the recovery failure, or sends an alarm message to a predetermined address or a third-party system.
That is, if the second device fails to recover A1, it can record a corresponding log entry, so that a technician can later handle A1 according to the log. The second device can also send an alarm message to a predetermined address or a third-party system, to give timely notice that A1 has failed. The predetermined address may be the address of a computer used by a technician to manage and maintain the system: the computer's IP (Internet Protocol) address, its MAC (Media Access Control) address, or any other address that uniquely identifies the computer. The third-party system may be a software system for fault monitoring, an administrator's computer, or the like.
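The two fallbacks above (log for later manual handling, alarm for immediate notice) can be combined in one handler. The `Alarm` interface and all names below are invented for this sketch; the patent does not specify an alarm API.

```java
// On remote-recovery failure: leave a log trail for operators AND raise an
// alarm toward a predetermined address / third-party monitoring system.
public class RecoveryFallback {
    public interface Alarm { void raise(String instance, String detail); }

    private final java.util.List<String> failureLog = new java.util.ArrayList<>();

    public void onRecoveryFailed(String instance, Alarm alarm) {
        String entry = "recovery failed for instance " + instance;
        failureLog.add(entry);        // record for later manual handling
        alarm.raise(instance, entry); // notify monitoring immediately
    }

    public java.util.List<String> log() { return failureLog; }
}
```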
With the fault recovery method provided by this embodiment, the second device, by obtaining the configuration file of a working instance from the first device, can recover a failed instance remotely; the instance is thereby made effectively fault-tolerant, with high availability, and interruption of its operation due to a single point of failure is avoided. In a distributed software system in particular, fault tolerance already exists at the distribution layer and the container layer; adding the working-instance-layer fault tolerance provided by this embodiment yields hierarchical fault tolerance, a stronger fault recovery capability, and higher reliability. Moreover, because faults can be recovered while instances are running, instances can be recovered dynamically and promptly, without introducing excessive delay that would affect system performance.
Next, the fault recovery method of the embodiments of the present invention is described with reference to a concrete example. In the following example, the working instances process the CDR files of different countries, and the first and second devices are Tomcat servers. As those skilled in the art will appreciate, however, the working instances may process other objects, and the first and second devices may be of other types; their concrete forms do not limit the protection scope of the invention. The example described in Figs. 3 to 5 serves only to better illustrate the inventive concept, so that those skilled in the art understand the technical scheme of the invention more fully, and imposes no restriction on the protection scope of the invention.
Fig. 3 shows Tomcat server A and Tomcat server B in a distributed system. Tomcat is a common open-source application server in the Java field. Each Tomcat server can run multiple working instances to execute the deployed tasks.
The example shown in Fig. 3 uses a software architecture from the J2EE (Java 2 Platform, Enterprise Edition) field. In this architecture, a Tomcat instance is created on a Tomcat server; a Servlet container is created in the Tomcat instance; a CTS (Continuous Telecom Server, a telecom server for continuous service) framework part is deployed and run in the Servlet container; and application programs, which appear as different working instances, are deployed in the Servlet container. The internal software structure of the Tomcat server here is only an example; the present invention does not restrict the implementation environment, and those skilled in the art can also deploy working instances on a device through other software architectures, such as other J2EE Servlet containers.
A distributed system formed of Tomcat servers can be divided into three layers: the distribution layer, the Tomcat layer, and the working-instance layer. At the distribution layer, multiple Tomcat servers are deployed to form the distributed system, which naturally protects against single points of failure. At the Tomcat layer, that is, in the CTS framework, a heartbeat detection mechanism can detect whether a Tomcat instance on another machine has failed; when such an instance fails, it can be restarted. The working-instance layer comprises multiple working instances and can also have a heartbeat detection mechanism, through which the failure of a working instance is detected.
As shown in Fig. 3, the CTS framework of Tomcat server A runs a working instance that processes the CDR files of Argentina and a working instance that processes the CDR files of Uruguay, referred to as the Argentina instance and the Uruguay instance. The CTS framework of Tomcat server B runs a working instance that processes the CDR files of Chile, referred to as the Chile instance.
The Tomcat servers form the distribution layer, the CTS frameworks the container layer, and the three working instances the working-instance layer. Fig. 3 shows the heartbeat detection between the CTS frameworks; when this mechanism detects that a CTS framework has failed, that framework can be restarted. In addition, a heartbeat detection mechanism, not shown in Fig. 3, also exists between the three working instances of the working-instance layer; when it detects the failure of a working instance, the instance can be recovered locally or remotely.
Usually one Tomcat instance is set up per Tomcat server. For ease of description, Fig. 3 shows Tomcat instance A running on Tomcat server A and Tomcat instance B running on Tomcat server B.
When the Argentinian working example in Tomcat example A breaks down, may be that this Argentina's working example breaks down, may be also that the CTS framework in Tomcat example A breaks down.If this Argentina's working example breaks down, can in the CTS framework in Tomcat example A, restart this working example.If the CTS framework in Tomcat example A breaks down, second layer heartbeat detection can detect the CTS framework breaking down so, and attempts restarting this CTS framework, and when CTS framework is restarted successfully, Argentinian working example is restored in this locality.The fault of Argentina's working example can be found by the heartbeat detection mechanism of the 3rd layer, and trigger local (being Tomcat example A) or long-range (being Tomcat example B) and carry out fault recovery.After the failure of Tomcat example A local recovery, can in Tomcat example B, recover Argentinian working example.If Tomcat example B recovers Argentinian working example success, Argentinian working example is at the CTS framework relaying reforwarding row of Tomcat example B.Thereby realize the recovery to working example.
In addition, conventionally, the sense cycle of the heartbeat detection of the 3rd layer is longer than the sense cycle of the heartbeat detection of the second layer.Heartbeat detection mechanism by container floor and working example layer can be recovered the working example breaking down, thereby it is fault-tolerant to realize multi-layer, has higher reliability.
Each working example has corresponding exemplary configuration file, and it is that working example is persisted to using its work and parameter configuration the homologue saving as file on hard disk, is similar to the concept of object serialization in Java.After distributed system operation, the working example on each equipment can be transferred to other equipment the configuration file of the working example at local runtime.Like this, working example AX in device A is not restarted and during cisco unity malfunction, other low equipment B of loading can be according to the configuration file of the working example AX breaking down, and on this equipment, the working example breaking down is recovered in strange land automatically, forms working example BX.For the working example AX of device A, its work is taken over by equipment B passively and is formed working example BX.In addition, working example AX is before breaking down, or before the overload of device A load, working example AX or device A can be transferred away the execution of working example AX on one's own initiative, to move working example BX in equipment B.Work handover on one's own initiative like this combines with the method for taking over of working passively, not only can realize fault and shift (failover), if to the rational threshold value of the load setting of equipment, can also realize dynamic load balancing, reach high high availability.
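To make the serialization analogy concrete, the following is a minimal sketch of persisting a working instance's parameters to a configuration file so that a peer device can rebuild an equivalent instance from it. The field names, the `java.util.Properties` format, and the class name are illustrative assumptions, not specified by the text.

```java
import java.io.*;
import java.util.Properties;

// Hypothetical sketch of an instance configuration file: the instance's
// parameters are persisted to disk so the file can be sent to a peer
// device, which rebuilds an equivalent working instance from it.
class InstanceConfig {
    private final Properties props = new Properties();

    InstanceConfig(String region, String cdrDir) {
        props.setProperty("region", region);
        props.setProperty("cdrDir", cdrDir);
    }

    private InstanceConfig(Properties loaded) {
        props.putAll(loaded);
    }

    String region() { return props.getProperty("region"); }
    String cdrDir() { return props.getProperty("cdrDir"); }

    // Persist the instance's parameters, like serializing an object.
    void save(File file) throws IOException {
        try (OutputStream out = new FileOutputStream(file)) {
            props.store(out, "working-instance configuration");
        }
    }

    // A peer device loads the file to recreate the working instance.
    static InstanceConfig load(File file) throws IOException {
        Properties p = new Properties();
        try (InputStream in = new FileInputStream(file)) {
            p.load(in);
        }
        return new InstanceConfig(p);
    }
}
```

In this sketch the file itself is what gets transferred between devices; the receiving device calls `load()` to obtain everything it needs to start its own copy of the instance.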
Between Tomcat instance A and Tomcat instance B, recovery of the failed Argentina working instance can be realized through the flow shown in Fig. 4.
In S410, Tomcat instance A and Tomcat instance B are initialized successfully and run normally: Tomcat instance A has loaded the configuration files of the Argentina and Uruguay working instances, and Tomcat instance B has loaded the configuration file of the Chile working instance.
In S415, Tomcat instances A and B start the Argentina, Uruguay, and Chile working instances they have each loaded, and these instances begin working; meanwhile, heartbeat detection between the CTS frameworks of Tomcat instances A and B is established.
In S420, if the CTS framework heartbeat detection in Tomcat instance A finds that the CTS framework in Tomcat instance B has failed, it can attempt to start the CTS framework in Tomcat instance B, and vice versa. This is restart-based recovery at the CTS framework level. This step can be performed periodically, and its position in the sequence is not restricted.
If the CTS framework heartbeat detection also finds that the configuration file of a working instance has been updated, the updated configuration file is synchronized to the local device.
In S425, the CTS frameworks in Tomcat instances A and B obtain each other's working-instance configuration files through the CTS framework heartbeat detection or another self-contained process.
In S430, the CTS frameworks in Tomcat instances A and B detect locally/remotely that the Argentina working instance has failed. After this, restarts and recovery at the working-instance layer can be used to eliminate the failure.
In S435, the CTS framework in Tomcat instance A sets the state of the Argentina working instance to "temporarily failed" and attempts to restart the Argentina working instance locally.
In S440, Tomcat instance A judges whether local failure recovery has succeeded. If the Argentina working instance has been recovered locally, the flow advances to S445; if not, it advances to S450. In this way, recovering the working instance locally takes priority.
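The local-first decision in S435–S450 can be sketched as follows. The retry limit and the state names are illustrative assumptions; the flow above only prescribes that local restart is attempted before the instance is handed to the remote peer.

```java
import java.util.function.BooleanSupplier;

// Sketch of S435-S450: try local restarts first, and only report
// "locally permanently failed" (handing the instance to the remote
// peer) when local restarts are exhausted. The retry limit is an
// illustrative assumption.
class LocalFirstRecovery {
    enum State { NORMAL, LOCAL_PERM_FAILED }

    // restart.getAsBoolean() returns true when a local restart succeeds.
    static State recover(BooleanSupplier restart, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (restart.getAsBoolean()) {
                return State.NORMAL;          // S445: recovered locally
            }
        }
        return State.LOCAL_PERM_FAILED;       // S450: remote peer takes over
    }
}
```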
In addition, if the local resources of Tomcat instance A do not meet predetermined requirements — for example, resource utilization exceeds an upper limit, or the resources required by the Argentina working instance cannot be provided — Tomcat instance A can proactively notify the CTS framework in the remote Tomcat instance B to recover the Argentina working instance, actively transferring the working instance to the remote device. In this case the Argentina working instance may not truly have failed, but because Tomcat instance A cannot support its operation, it can still be treated as having failed.
If Tomcat instance A recovers the Argentina working instance locally in S440, then in S445 the CTS framework and working instances in Tomcat instance A run normally, and Tomcat instance A restores the state of the Argentina working instance to "normal".
If Tomcat instance A fails to recover the Argentina working instance locally in S440, then in S450 the state of the Argentina working instance in Tomcat instance A is set to "locally permanently failed".
In S455, the CTS framework in Tomcat instance B finds that the working state of the Argentina working instance in Tomcat instance A is "locally permanently failed", indicating that local recovery has failed; Tomcat instance B therefore creates a new Argentina working instance in Tomcat instance B according to the configuration file obtained in S425. In this way, local recovery and remote recovery are combined, providing more paths for failure recovery and thus higher reliability.
In S460, Tomcat instance B determines whether remote recovery (recovering the Argentina working instance in Tomcat instance B) has succeeded. If so, the flow advances to S465; if not, to S470.
In S465, the CTS framework in Tomcat instance B completes the creation of the Argentina working instance on Tomcat instance B according to its configuration file, and executes the tasks of the Argentina working instance. In this case the recovery is completed by Tomcat instance B checking the state of the Argentina working instance recorded on Tomcat instance A; from the perspective of Tomcat instance A, this job takeover is passive.
In S470, having failed to recover the Argentina working instance remotely, Tomcat instance B changes the state of the Argentina working instance to "permanently failed".
In S475, the CTS framework in Tomcat instance B writes a log entry indicating that remote recovery of the Argentina working instance has failed, or sends an alarm to a user-specified address. This facilitates manual intervention, so that the failure of the Argentina working instance can be resolved more promptly.
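The overall escalation in S430–S475 can be sketched in one function. The helper names and return strings are illustrative; the ordering — local restart, then remote rebuild from the previously received configuration file, then log/alarm — follows the flow above.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Sketch of the S430-S475 escalation: local restart first; on local
// failure the remote peer rebuilds the instance from the configuration
// file it received earlier; if that also fails, log and alarm so an
// operator can intervene. Names and return values are illustrative.
class RecoveryFlow {
    static String recover(BooleanSupplier localRestart,
                          BooleanSupplier remoteRebuild,
                          List<String> log) {
        if (localRestart.getAsBoolean()) {
            return "NORMAL";                     // S445: recovered locally
        }
        // S450/S455: locally permanently failed; remote peer takes over
        if (remoteRebuild.getAsBoolean()) {
            return "NORMAL_ON_REMOTE";           // S465: rebuilt on the peer
        }
        log.add("remote recovery failed: alert operator");  // S470/S475
        return "PERM_FAILED";
    }
}
```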
The above distributed system comprising Tomcat server A and Tomcat server B can support hot upgrade: without shutting down the Tomcat servers, the configuration parameters of an instance can be modified, thereby changing its work behavior. When the configuration parameters of a working instance are modified, that working instance can promptly transfer its configuration file to other devices, so that those devices also hold the latest version of the working-instance configuration file and can recover the working instance correctly.
Next, with reference to Fig. 5, the detailed flow of recovering the Argentina working instance in the example shown in Fig. 3 is described. This detailed flow is likewise only an example, intended to aid understanding of the technical solution of the present invention, and does not limit the present invention in any way.
For ease of description, in the description of Fig. 5 below, CTS1 denotes the CTS framework in Tomcat instance A and CTS2 denotes the CTS framework in Tomcat instance B.
In S505, Tomcat instance A is initialized successfully and runs normally: it has loaded the configuration files of the Argentina and Uruguay working instances.
In S510, Tomcat instance B is initialized successfully and runs normally: it has loaded the configuration file of the Chile working instance.
In S515, Tomcat instance A starts the loaded Argentina and Uruguay working instances, which begin normal work.
In S520, Tomcat instance B starts the loaded Chile working instance, which begins normal work.
Then CTS1 establishes its heartbeat detection mechanism with CTS2. In S525, CTS1 sends a start-heartbeat (SetupHeartBeat) message to CTS2. In S530, CTS1 obtains heartbeat detection information from CTS2.
Then CTS2 establishes its heartbeat detection mechanism with CTS1. In S535, CTS2 sends a start-heartbeat message to CTS1. In S540, CTS2 obtains heartbeat detection information from CTS1.
The information obtained in S530 and S540 can include, but is not limited to, whether the network is connected, the states of the working instances, and whether any working-instance configuration file has been updated. If a working-instance configuration file has been updated, it can be synchronized to the local device. In addition, the order in which CTS1 and CTS2 establish their heartbeat detection mechanisms is not restricted to the order described above.
In S545, the CTS framework in Tomcat instance A detects locally that the Argentina working instance has failed, and the CTS framework in Tomcat instance B detects this failure remotely. The CTS framework in Tomcat instance A then sets the state of the Argentina working instance to "temporarily failed" and attempts to restart it locally.
In S550, CTS1 determines whether local recovery has succeeded. If it has, then in S552 the CTS framework and working instances in Tomcat instance A all run normally, and CTS1 restores the state of the Argentina working instance to "normal". If local recovery has failed, then in S554 the CTS framework in Tomcat instance A sets the state of the Argentina working instance to "locally permanently failed".
In this example, assume that CTS1 fails to recover the Argentina working instance locally, and the failure recovery flow continues.
In S555, the heartbeat detection of CTS2 runs periodically.
In S560, CTS2 detects that the state of the Argentina working instance in CTS1 is "locally permanently failed". In this case, CTS2 attempts to recover the Argentina working instance in Tomcat instance B.
In S565, CTS2 determines whether the remote recovery of the Argentina working instance has succeeded. If it has, then in S567 the CTS framework in Tomcat instance B creates a new Argentina working instance in Tomcat server B according to the configuration file of the Argentina working instance. If it has failed, then in S569 the CTS framework in Tomcat instance B sets the state of the Argentina working instance to "permanently failed".
If the remote recovery of the Argentina working instance has also failed, then in S570 the CTS framework in Tomcat instance B writes a log entry, or sends an alarm to a user-specified address — for example, the address of third-party monitoring and alarm software — to facilitate manual intervention. Through the above failure recovery flow, a failed working instance is first recovered locally, and remote recovery is performed after local recovery fails; this combination of local and remote recovery provides higher reliability.
The states of a working instance are not limited to those exemplified in Fig. 5; in other examples, additional states can be defined to characterize whether a working instance can be recovered locally or remotely. Moreover, in a concrete implementation of the failure recovery flow, different functions can be written to realize the different steps. For instance, a function load() can be written for S505 and S510, a function startup() for S515 and S520, and a function VerifyStatus() for S545, each realizing the operation of its corresponding step. Of course, the function names are examples and are not limiting. Furthermore, in this embodiment, local restart recovery and remote rebuild recovery can share the same function-call interface, and how success is determined can be decided by the programmer — for example, the return value of a VerifyStatus() call can indicate the current running state of the working instance. In addition, in the flow shown in Fig. 5, the CTS framework can define the interface functions required of each working instance: a startup() function starts the business working instance, a shutdown() function stops it, and a reactivate() function reactivates it; the default implementation of reactivate() can first call shutdown() and then call startup().
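The interface functions named above can be sketched directly. The default behavior of reactivate() — shutdown() followed by startup() — is as the text describes; the call-recording implementation below is purely illustrative.

```java
// Sketch of the per-instance interface described above: startup() starts
// the working instance, shutdown() stops it, and reactivate()'s default
// implementation calls shutdown() and then startup().
interface WorkingInstance {
    void startup();
    void shutdown();

    default void reactivate() {
        shutdown();
        startup();
    }
}

// Illustrative implementation that records the order of calls.
class RecordingInstance implements WorkingInstance {
    final StringBuilder calls = new StringBuilder();
    public void startup()  { calls.append("startup;"); }
    public void shutdown() { calls.append("shutdown;"); }
}
```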
According to the embodiments of the present invention, a failure recovery method that recovers a working instance according to its configuration file is provided. Fault tolerance can be applied hierarchically — at the distributed layer, the container layer, and the working-instance layer respectively — bringing higher reliability to the system. Whether the same working instance is flexibly deployed on one device or on multiple devices to form a homogeneous distributed system, or different working instances are deployed to form a distributed system of heterogeneous tasks, failure recovery can be realized. The embodiments of the present invention also propose combining active work handover with passive work takeover, which, together with dynamically updating and notifying the working-instance configuration files, can achieve extremely high availability.
The failure recovery methods according to the embodiments of the present invention have been described above; structural block diagrams of devices according to the embodiments of the present invention are described below. Because each of the following devices can be used to perform failure recovery, the operation of each of its modules can refer to the descriptions of the above methods and examples; to avoid repetition, only brief descriptions are given below.
Fig. 6 is a structural block diagram of a device 600 for failure recovery according to an embodiment of the present invention.
The device 600 comprises a loading module 610 and a first sending module 620. The device 600 can be a processing device in a distributed system, a server device, or any other computer device that will occur to those skilled in the art. The loading module 610 can be realized by a processor, and the first sending module 620 can be realized by an output interface.
The loading module 610 is configured to load the configuration file of a working instance so as to run the working instance. The first sending module 620 is configured to send the configuration file to another device, so that when the other device determines that the device 600 cannot run the working instance, the other device runs the working instance according to the configuration file.
The above and other operations and/or functions of the loading module 610 and the first sending module 620 can refer to the description of the above method 100 and the relevant portions; to avoid repetition, they are not described again.
With the device for failure recovery provided according to the embodiment of the present invention, by sending the configuration file to another device, the other device can recover, according to the configuration file, a working instance that has failed on this device, realizing effective fault tolerance, providing high availability, and preventing a single point of failure from interrupting the operation of the working instance. Moreover, because fault tolerance already exists at the distributed layer and the container layer of a distributed software system, adding the working-instance-layer fault tolerance provided by the embodiment of the present invention realizes hierarchical fault tolerance, and thus a stronger failure recovery capability and higher reliability. Meanwhile, because failures can be recovered while the working instances are running, working instances can be recovered dynamically and promptly, avoiding excessive delays that would affect system processing performance.
Fig. 7 is a structural block diagram of a device 700 for failure recovery according to an embodiment of the present invention.
The loading module 710 and the first sending module 720 of the device 700 are substantially identical to the loading module 610 and the first sending module 620 of the device 600.
According to an embodiment of the present invention, the device 700 can further comprise a recovery module 730, which can be realized by a processor. The recovery module 730 is configured to recover the working instance when the working instance fails.
According to an embodiment of the present invention, when the device 700 can locally recover a failed working instance through the recovery module 730, the device 700 can further comprise an updating module 740 and/or a second sending module 750; the updating module 740 can be realized by a processor, and the second sending module 750 can be realized by an output interface. The updating module 740 is configured to, when recovery of the working instance fails, change the state of the working instance to a state indicating that the device 700 cannot run the working instance, so that the other device determines from this state that the device 700 cannot run it. The second sending module 750 is configured to, when recovery of the working instance fails, send to the other device request information asking the other device to run the working instance, so that the other device determines from the request information that the device 700 cannot run it. In this way, by combining local recovery and remote recovery, the efficiency and success rate of failure recovery can be increased, failure resilience strengthened, and reliability improved.
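The two takeover triggers just described — the state flag set by the updating module and observed by the peer, and the explicit request sent by the second sending module — can be sketched as a single decision. The state and parameter names are illustrative assumptions.

```java
// Sketch of how the peer decides to take an instance over: either it
// observes (via heartbeat detection) a state saying the first device
// cannot run the instance, or it receives an explicit request message
// from the first device. Names are illustrative.
class TakeoverDecider {
    enum ObservedState { NORMAL, TEMP_FAILED, LOCAL_PERM_FAILED }

    static boolean shouldTakeOver(ObservedState state, boolean requestReceived) {
        // Passive path: the updating module set this state on the first device.
        boolean passive = (state == ObservedState.LOCAL_PERM_FAILED);
        // Active path: the second sending module sent an explicit request.
        return passive || requestReceived;
    }
}
```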
According to an embodiment of the present invention, the device 700 can further comprise a third sending module 760, which can be realized by an output interface. The third sending module 760 is configured to, when the configuration file of a working instance is updated, send the updated configuration file of that working instance to the other device. Thus, after the working instance fails, the other device can perform correct failure recovery according to the latest configuration file.
The above and other operations and/or functions of the recovery module 730, the updating module 740, the second sending module 750, and the third sending module 760 can refer to the description of the above method 100 and the relevant portions; to avoid repetition, they are not described again.
With the device for failure recovery provided according to the embodiment of the present invention, by performing failure recovery actively and/or passively, not only can working instances be made effectively fault-tolerant, but working instances on other devices can also be taken over when needed; therefore, the device has both higher reliability and high availability.
Fig. 8 is a structural block diagram of a device 800 for failure recovery according to an embodiment of the present invention.
The device 800 comprises a first receiving module 810 and a recovery module 820. The device 800 can be a processing device in a distributed system, a server device, or any other computer device that will occur to those skilled in the art. The first receiving module 810 can be realized by an input interface and a memory, and the recovery module 820 can be realized by a processor.
The first receiving module 810 is configured to receive from another device, and store, the configuration file of a working instance run by that other device. The recovery module 820 is configured to, upon determining that the other device cannot run the working instance, run the working instance according to the configuration file.
The above and other operations and/or functions of the first receiving module 810 and the recovery module 820 can refer to the description of the above method 200 and the relevant portions; to avoid repetition, they are not described again.
With the device for failure recovery provided according to the embodiment of the present invention, by obtaining the configuration file of a working instance from another device, the device can recover on itself a working instance that has failed on the other device, so that working instances are effectively fault-tolerant, high availability is achieved, and a single point of failure is prevented from interrupting the operation of the working instance. Particularly in a distributed software system, because fault tolerance already exists at the distributed layer and the container layer, adding the working-instance-layer fault tolerance provided by the embodiment of the present invention realizes hierarchical fault tolerance, and thus a stronger failure recovery capability and higher reliability. Meanwhile, because failures can be recovered while the working instances are running, working instances can be recovered dynamically and promptly, avoiding excessive delays that would affect system processing performance.
Fig. 9 is a structural block diagram of a device 900 for failure recovery according to an embodiment of the present invention.
The first receiving module 910 and the recovery module 920 of the device 900 are substantially identical to the first receiving module 810 and the recovery module 820 of the device 800.
According to an embodiment of the present invention, the device 900 can further comprise a detection module 930 and a first determination module 940, both of which can be realized by a processor. The detection module 930 is configured to detect the state of the working instance as set by the other device. The first determination module 940 is configured to determine that the other device cannot run the working instance when this state so indicates. Thus, when the first determination module 940 determines that the working instance cannot run normally, the device 900 can cause the recovery module 920 to perform failure recovery according to the configuration file.
According to an embodiment of the present invention, the device 900 can further comprise a second determination module 950, which can be realized by a processor. The second determination module 950 is configured to determine that the other device cannot run the working instance when request information sent by the other device, asking the device 900 to run the working instance, is received. Thus, when the second determination module 950 determines that the working instance has failed, the recovery module 920 can perform failure recovery on it according to the configuration file.
According to embodiments of the present invention, the device 900 can comprise the detection module 930, the first determination module 940, and the second determination module 950 simultaneously. In this way, the passive work takeover realized by the detection module 930 and the first determination module 940 can be combined with the active work handover realized with the second determination module 950; not only can failover be realized, but, if a reasonable threshold is set for device load, dynamic load balancing can also be achieved through active work handover, reaching very high availability.
According to an embodiment of the present invention, the device 900 can further comprise a second receiving module 960 and an updating module 970; the second receiving module 960 can be realized by an input interface, and the updating module 970 by a processor. The second receiving module 960 is configured to receive the updated configuration file sent by the other device after the configuration file of the working instance is updated. The updating module 970 is configured to update the stored configuration file according to the updated one. Thus, the device 900 can obtain the latest configuration file of the working instance, facilitating correct recovery when the working instance fails.
According to embodiments of the present invention, the device 900 can further comprise a logging module 980 and/or a sending module 990; the logging module 980 can be realized by a processor, and the sending module 990 by an output interface. The logging module 980 is configured to write a log of the working-instance recovery failure when the device 900 fails to run the working instance. The sending module 990 is configured to send alarm information to a predetermined address or a third-party system when the device 900 fails to run the working instance. By writing logs and/or sending alarms, timely manual intervention is facilitated when both local and remote recovery fail and the devices cannot recover automatically, so that the failure of the working instance can be resolved more quickly.
The above and other operations and/or functions of the detection module 930, the first determination module 940, the second determination module 950, the second receiving module 960, the updating module 970, the logging module 980, and the sending module 990 can refer to the description of the above method 200 and the relevant portions; to avoid repetition, they are not described again.
In addition, in a concrete implementation, each of the devices 600 and 700 in Fig. 6 and Fig. 7 can comprise a state reporting module for reporting the states of the working instances running on the devices 600 and 700. In this case, each of the devices 800 and 900 in Fig. 8 and Fig. 9 can comprise a state receiving module for receiving the states of the working instances running on the devices 600 and 700. When the state receiving module has received no state report from the state reporting module for at least one consecutive cycle (the number of cycles can be preset as needed), the state receiving module can determine that the devices 600 and 700 have failed, that is, that they cannot run the working instances; thereafter, the recovery module 820 in the devices 800 and 900 can run the relevant working instances by loading the configuration files previously obtained from the devices 600 and 700. If the devices 600 and 700 fail, they cannot run any of the working instances arranged on them.
Alternatively, in a concrete implementation, each of the devices 800 and 900 in Fig. 8 and Fig. 9 can comprise a state query module for periodically sending state query messages to the devices 600 and 700 to query the states of the working instances running on them. In this case, each of the devices 600 and 700 in Fig. 6 and Fig. 7 can comprise a state response module for responding to the state query messages by returning the states of the working instances running on the devices 600 and 700 to the state query module. When the state query module has received no such state for at least one consecutive cycle (the number of cycles can be preset as needed) — that is, when the state response module has failed to respond to the state query messages for several consecutive cycles — the state query module can determine that the devices 600 and 700 have failed, that is, that they cannot run the working instances; thereafter, the recovery module 820 in the devices 800 and 900 can run the relevant working instances by loading the configuration files previously obtained from the devices 600 and 700. If the devices 600 and 700 fail, they cannot run any of the working instances arranged on them.
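Both the report-based and the query-based schemes reduce to the same missed-cycle rule: declare the peer failed once no state has arrived for a preset number of consecutive cycles. A minimal sketch, with the threshold as an illustrative parameter:

```java
// Sketch of the missed-cycle rule shared by the report-based and
// query-based schemes: tick() is called once per detection cycle and
// returns true once the peer has been silent for the preset number of
// consecutive cycles, at which point its working instances become
// eligible for takeover by the recovery module.
class FailureDetector {
    private final int maxMissedCycles;
    private int missed = 0;

    FailureDetector(int maxMissedCycles) {
        this.maxMissedCycles = maxMissedCycles;
    }

    boolean tick(boolean stateReceived) {
        missed = stateReceived ? 0 : missed + 1;
        return missed >= maxMissedCycles;   // true: peer considered failed
    }
}
```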
With the device for failure recovery provided according to the embodiment of the present invention, by performing failure recovery actively and/or passively, not only can working instances be made effectively fault-tolerant, but working instances on other devices can also be taken over when needed; therefore, the device has both higher reliability and high availability.
Those skilled in the art will recognize that the various method steps and units described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the steps and components of each embodiment have been described above generally in terms of their functions. Whether these functions are executed in hardware or software depends on the particular application and the design constraints of the technical solution. Those skilled in the art can use different methods to realize the described functions for each specific application, but such implementations should not be considered as going beyond the scope of the present invention.
The method steps described in connection with the embodiments disclosed herein can be implemented in hardware, in a software program executed by a processor, or in a combination of the two. The software program can reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Although some embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various modifications can be made to these embodiments without departing from the principles and spirit of the present invention, and such modifications shall fall within the scope of the present invention.

Claims (10)

1. A fault recovery method for use in a distributed system, the distributed system comprising a first device and a second device, characterized in that the fault recovery method comprises:
loading, by the first device, a configuration file of a working instance to run the working instance;
sending, by the first device, the configuration file to the second device, so that the second device, upon determining that the first device cannot run the working instance, runs the working instance according to the configuration file; and
recovering, by the first device, the working instance when the working instance fails;
the fault recovery method further comprising:
changing, by the first device, when it fails to recover the working instance, the state of the working instance to a state indicating that the first device cannot run the working instance, so that the second device determines from this state that the first device cannot run the working instance; or
sending, by the first device, when it fails to recover the working instance, to the second device, request information requesting that the second device run the working instance, so that the second device determines from the request information that the first device cannot run the working instance.
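Read as a protocol, claim 1 describes a primary that preloads its configuration onto a standby and, on an unrecoverable local failure, signals the standby either passively (a published state flag) or actively (a takeover request). The following is a minimal sketch of that primary-side flow; all names (`PrimaryDevice`, `Peer`, the shared state dictionary) are illustrative assumptions and not part of the patent.

```python
# Hypothetical sketch of the primary-side flow of claim 1.
# The shared state dict stands in for whatever channel the two devices use.

class Peer:
    """Stands in for the second device."""
    def __init__(self):
        self.stored_config = None
        self.takeover_requested = False

    def store_config(self, config):
        # The second device receives and stores the configuration file.
        self.stored_config = config

    def request_takeover(self, config):
        # The second device is asked to run the working instance.
        self.takeover_requested = True


class PrimaryDevice:
    def __init__(self, config_file, peer, state_store, healthy=True):
        self.config_file = config_file    # configuration file of the working instance
        self.peer = peer                  # the second device
        self.state_store = state_store    # state visible to the second device
        self.healthy = healthy            # simulates whether local recovery can succeed
        self.instance_running = False

    def start(self):
        # Load the configuration file, run the working instance, and send the
        # configuration file to the second device in advance.
        self.run_instance()
        self.peer.store_config(self.config_file)

    def run_instance(self):
        if not self.healthy:
            raise RuntimeError("cannot run working instance")
        self.instance_running = True

    def on_instance_failure(self):
        self.instance_running = False
        try:
            self.run_instance()           # first try to recover locally
        except RuntimeError:
            # Recovery failed: publish a "cannot run" state for the peer to
            # detect (passive path), and/or actively request a takeover.
            self.state_store["instance_state"] = "CANNOT_RUN"
            self.peer.request_takeover(self.config_file)
```

Pushing the configuration file to the standby before any failure occurs is what lets the second device take over without contacting the failed first device.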
2. The fault recovery method according to claim 1, characterized in that it further comprises:
sending, by the first device, when the configuration file of the working instance is updated, the updated configuration file of the working instance to the second device.
3. A fault recovery method for use in a distributed system, the distributed system comprising a first device and a second device, characterized in that the fault recovery method comprises:
receiving and storing, by the second device, from the first device, the configuration file of the working instance run by the first device; and
running, by the second device, the working instance according to the configuration file upon determining that the first device cannot run the working instance;
the fault recovery method further comprising:
detecting, by the second device, the state of the working instance set by the first device, and determining, when the state indicates that the first device cannot run the working instance, that the first device cannot run the working instance; or
determining, by the second device, upon receiving request information sent by the first device requesting that the second device run the working instance, that the first device cannot run the working instance.
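The standby side of the protocol mirrors claim 3: the second device keeps a stored copy of the configuration file and takes over either when it detects the state set by the first device or when it receives an explicit takeover request. A minimal sketch follows; the names (`StandbyDevice`, the shared state dictionary) are illustrative assumptions.

```python
# Hypothetical sketch of the standby-side flow of claim 3.
# The shared state dict stands in for the channel used by the two devices.

class StandbyDevice:
    def __init__(self, state_store):
        self.state_store = state_store
        self.stored_config = None
        self.instance_running = False

    def store_config(self, config):
        # Receive and store the configuration file of the primary's
        # working instance, ahead of any failure.
        self.stored_config = config

    def poll_primary_state(self):
        # Passive path: the primary has marked the instance as unrunnable.
        if self.state_store.get("instance_state") == "CANNOT_RUN":
            self.take_over()

    def on_takeover_request(self, config):
        # Active path: the primary explicitly asked this device to take over.
        self.stored_config = config
        self.take_over()

    def take_over(self):
        # Run the working instance from the stored configuration file.
        if self.stored_config is not None:
            self.instance_running = True
```

Either detection path ends at the same `take_over` step, which is why the claims present the state flag and the request message as alternatives.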
4. The fault recovery method according to claim 3, characterized in that it further comprises:
receiving, by the second device, the updated configuration file sent by the first device after the configuration file of the working instance is updated, and updating the stored configuration file according to the updated configuration file.
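Claims 2 and 4 together keep the standby's stored configuration in step with the primary's, so that a later takeover uses current settings rather than a stale copy. A minimal sketch, with hypothetical names and the inter-device transport abstracted away:

```python
# Hypothetical sketch of the configuration-update synchronization
# in claims 2 and 4.

class StandbyStore:
    """Holds the standby device's copy of the working instance's configuration."""
    def __init__(self):
        self.config = None

    def receive(self, config):
        # Claim 4: replace the stored copy with the updated configuration
        # file, so a later takeover runs the instance with current settings.
        self.config = config


class PrimaryConfig:
    def __init__(self, config, standby):
        self.config = config
        self.standby = standby
        self.standby.receive(config)      # initial copy sent at load time

    def update(self, new_config):
        # Claim 2: whenever the configuration file is updated, the first
        # device sends the updated file to the second device.
        self.config = new_config
        self.standby.receive(new_config)
```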
5. The fault recovery method according to claim 3, characterized in that it further comprises:
recording, by the second device, when it fails to run the working instance, a log of the failure of the working instance, or sending alarm information to a predetermined address or a third-party system.
6. A device for fault recovery, the device being used in a distributed system, characterized in that the device comprises:
a loading module, configured to load a configuration file of a working instance to run the working instance;
a first sending module, configured to send the configuration file to another device in the distributed system, so that the other device, upon determining that the device cannot run the working instance, runs the working instance according to the configuration file; and
a recovery module, configured to recover the working instance when the working instance fails;
the device further comprising:
an update module, configured to, when recovery of the working instance fails, change the state of the working instance to a state indicating that the device cannot run the working instance, so that the other device determines from this state that the device cannot run the working instance; or
a second sending module, configured to, when recovery of the working instance fails, send to the other device request information requesting that the other device run the working instance, so that the other device determines from the request information that the device cannot run the working instance.
7. The device according to claim 6, characterized in that it further comprises:
a third sending module, configured to send, when the configuration file of the working instance is updated, the updated configuration file of the working instance to the other device.
8. A device for fault recovery, the device being used in a distributed system, characterized in that the device comprises:
a first receiving module, configured to receive, from another device in the distributed system, and store the configuration file of the working instance run by the other device; and
a recovery module, configured to run the working instance according to the configuration file upon determining that the other device cannot run the working instance;
the device further comprising:
a detection module, configured to detect the state of the working instance set by the other device, and a first determining module, configured to determine, when the state indicates that the other device cannot run the working instance, that the other device cannot run the working instance; or
a second determining module, configured to determine, upon receiving request information sent by the other device requesting that the device run the working instance, that the other device cannot run the working instance.
9. The device according to claim 8, characterized in that it further comprises:
a second receiving module, configured to receive the updated configuration file sent by the other device after the configuration file of the working instance is updated; and
an update module, configured to update the stored configuration file according to the updated configuration file.
10. The device according to claim 8, characterized in that it further comprises:
a logging module, configured to record, when the device fails to run the working instance, a log of the failure of the working instance; or
a sending module, configured to send, when the device fails to run the working instance, alarm information to a predetermined address or a third-party system.
CN201110335042.7A 2011-10-28 2011-10-28 Failure recovery method and equipment for failure recovery Active CN102360324B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110335042.7A CN102360324B (en) 2011-10-28 2011-10-28 Failure recovery method and equipment for failure recovery


Publications (2)

Publication Number Publication Date
CN102360324A CN102360324A (en) 2012-02-22
CN102360324B true CN102360324B (en) 2014-04-16

Family

ID=45585654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110335042.7A Active CN102360324B (en) 2011-10-28 2011-10-28 Failure recovery method and equipment for failure recovery

Country Status (1)

Country Link
CN (1) CN102360324B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412780B (en) * 2013-08-19 2017-04-12 浪潮(北京)电子信息产业有限公司 System, device and method for upgrading distributed file system
CN104283950B (en) * 2014-09-29 2019-01-08 杭州华为数字技术有限公司 A kind of method, apparatus and system of service request processing
US9836363B2 (en) * 2014-09-30 2017-12-05 Microsoft Technology Licensing, Llc Semi-automatic failover
CN104484167B (en) * 2014-12-05 2018-03-09 广州华多网络科技有限公司 Task processing method and device
CN104601668B (en) * 2014-12-24 2019-01-18 北京京东尚科信息技术有限公司 Data push method, device and system based on condition managing
CN105681108B (en) * 2016-03-15 2018-10-30 迈普通信技术股份有限公司 A kind of method and apparatus for realizing that configuration is synchronous
CN109117311A (en) * 2018-08-22 2019-01-01 郑州云海信息技术有限公司 A kind of fault recovery method and device
CN110177018A (en) * 2019-06-04 2019-08-27 北京百度网讯科技有限公司 For controlling the method and device of network state
CN111030871A (en) * 2019-12-23 2020-04-17 杭州迪普科技股份有限公司 Configuration information synchronization method and device based on dual-computer hot standby system
CN111813604B (en) * 2020-07-17 2022-06-10 济南浪潮数据技术有限公司 Data recovery method, system and related device of fault storage equipment
WO2023151547A1 (en) * 2022-02-14 2023-08-17 华为技术有限公司 Data processing system and method, and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW569573B (en) * 2002-05-07 2004-01-01 Accton Technology Corp Method of redundant management and system control function for recovering and erecting modular network equipment from agent module fault
CN1489047A (en) * 2002-10-09 2004-04-14 华为技术有限公司 Method for loading and synchronizing of software patch for embedded system
CN1529459A (en) * 2003-10-16 2004-09-15 港湾网络有限公司 Main-standby rotation realizing method facing to high-side exchange board
CN1946058A (en) * 2006-10-28 2007-04-11 武汉市中光通信公司 Soft exchange device allopatric disaster recovery solution system and its method for software exchange network
US7496579B2 (en) * 2006-03-30 2009-02-24 International Business Machines Corporation Transitioning of database service responsibility responsive to server failure in a partially clustered computing environment
CN101582787A (en) * 2008-05-16 2009-11-18 中兴通讯股份有限公司 Double-computer backup system and backup method



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant