CN101777020B - Fault tolerance method and system used for distributed program - Google Patents

Info

Publication number
CN101777020B
Authority
CN
China
Prior art keywords
fault
tolerant
client
strategy
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009102439440A
Other languages
Chinese (zh)
Other versions
CN101777020A (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Infobird Software Co Ltd
Original Assignee
Beijing Infobird Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Infobird Software Co Ltd
Priority to CN2009102439440A
Publication of CN101777020A
Application granted
Publication of CN101777020B
Current legal status: Active
Anticipated expiration

Abstract

The invention provides a fault-tolerance method and system for a distributed program that can start, in sequence, program processes deployed on different fault-tolerant clients. When any process that has a dependency relationship crashes, the fault-tolerant server performs the corresponding operation according to the strategy configured on it. The fault-tolerant clients only need to use an existing monitoring program to monitor processes, report process status, and receive and execute instructions sent by the fault-tolerant server or issued by operations and maintenance (O&M) personnel; the processes can then be started in sequence so that work proceeds normally. The system has a fault-tolerant server and at least one fault-tolerant client, and each fault-tolerant client monitors the status of the distributed-program processes that it runs itself. When a client detects an abnormal status, it notifies the fault-tolerant server, and the server uses a strategy execution module to perform restart fault-tolerance processing according to an automatic restart strategy or a manual restart strategy and according to the inter-process dependencies specified in a process dependency table.

Description

A fault-tolerance method and system for distributed programs
Technical field
The present application relates to a fault-tolerance method and system, and in particular to a fault-tolerance method and system for distributed programs.
Background art
A monitoring program is a program that monitors the running state of computer processes. When the monitoring program determines that a monitored process is not running or is behaving abnormally, it attempts to start or restart that process. To date, such monitoring programs have only been able to monitor processes on the local machine. They operate on the local processes in order: if one of them becomes abnormal, the monitoring program starts the processes it monitors according to the startup sequence.
However, as demands on computing capacity keep growing, many tasks can no longer be completed by a single computer. Against this background, distributed programs have emerged. Because they offer advantages such as resource sharing, load balancing, and high security, they are used more and more widely. A distributed program uses network technology to run cooperatively on several computers in a distributed computer system in order to complete a task jointly. This raises a new problem: existing single-machine process monitoring programs can hardly cope with monitoring the status of a distributed program. Consequently, when a distributed program fails, the administrator must go to each computer in the distributed computer system to check the process errors, which consumes a great deal of the administrator's time and energy.
Summary of the invention
The object of the present invention is to provide a fault-tolerance method and system for distributed programs that can start, in sequence, program processes deployed on different fault-tolerant clients. When any process that has a dependency relationship crashes, the fault-tolerant server can perform the relevant operation according to the strategy configured on it. The fault-tolerant client only needs to use an existing monitoring program to perform monitoring, report process status, and receive and execute instructions sent by the fault-tolerant server or issued by O&M personnel. The processes can then be started in sequence, which ensures that work proceeds normally.
The object of the invention is achieved as follows:
The invention provides a fault-tolerance method for distributed programs, for use in a fault-tolerant system having a fault-tolerant server and at least one fault-tolerant client. The method comprises the following steps: the fault-tolerant client uses its process status monitoring module to monitor the status of the distributed-program processes that it runs itself; when an abnormal process status is detected, the fault-tolerant client uses an abnormal-status information generation module to generate abnormal-status information and uses a communication module to send that information to the fault-tolerant server; the fault-tolerant server receives the abnormal-status information through its communication module; the fault-tolerant server uses a strategy execution module to perform restart fault-tolerance processing according to an automatic restart strategy or a manual restart strategy and according to the inter-process dependencies specified in the process dependency table, where the automatic or manual restart strategy is specified in advance using a strategy designation module; and after the restart fault-tolerance processing has been performed, the fault-tolerant client uses the communication module to report the result of the process start to the fault-tolerant server.
The invention also provides a fault-tolerant system for distributed programs, having a fault-tolerant server and at least one fault-tolerant client. The fault-tolerant server comprises: a communication module, which communicates with the fault-tolerant clients and receives the abnormal-status information they send; a strategy designation module, which specifies in advance the automatic restart strategy or manual restart strategy to be used for restart fault-tolerance processing; a strategy execution module, which performs the corresponding restart fault-tolerance processing according to the strategy specified in advance by the strategy designation module; a strategy database, which stores the automatic restart strategies and manual restart strategies; and a process dependency database, which stores the process dependency table representing the dependencies between the processes of the distributed program. The fault-tolerant client comprises: a process status monitoring module, which monitors the status of the distributed-program processes that the client runs itself; an abnormal-status information generation module, which generates abnormal-status information when an abnormal process status is detected; and a communication module, which communicates with the fault-tolerant server and sends the abnormal-status information to it.
With the above fault-tolerance method and system for distributed programs, an existing monitoring program can be reused. The program only needs a simple modification: adding an abnormal-status information generation module, which generates abnormal-status information when an abnormal process status is detected, and a communication module, which communicates with the fault-tolerant server and sends the abnormal-status information to it. The status of the distributed-program processes can then be monitored automatically, and when a process fails the response can be carried out automatically, which spares the administrator from checking the running state of processes on each computer in turn. In addition, when the deployment of the distributed-program processes changes, for example when a new process adds process relationships or when existing process relationships change, it is only necessary to modify or add the corresponding strategies in the strategy database and to add the corresponding fault-tolerant clients. No other configuration needs to change for the corresponding fault-tolerance mechanism to take effect.
Brief description of the drawings
Embodiments of the invention are illustrated and described below. Various aspects and advantages of the invention will become apparent from the following detailed description. In the accompanying drawings:
Fig. 1A is a block diagram of the fault-tolerant system for distributed programs according to the invention.
Fig. 1B is a physical deployment diagram of the fault-tolerant system for distributed programs according to the invention.
Fig. 2 is a flowchart of the fault-tolerance method for distributed programs according to one embodiment of the invention.
Fig. 3 is a flowchart of the fault-tolerance method based on the automatic restart strategy.
Fig. 4 is a flowchart of the fault-tolerance method based on the manual restart strategy.
Fig. 5 is a flowchart of the fault-tolerance method based on the automatic restart strategy together with an auxiliary strategy.
Detailed description of the embodiments
Embodiments of the invention are illustrated and described below.
Fig. 1A is a block diagram of the fault-tolerant system for distributed programs according to the invention. As shown in Fig. 1A, the fault-tolerant system comprises a fault-tolerant server 10, a fault-tolerant client 20, and a fault-tolerant client 30. All fault-tolerant servers and fault-tolerant clients are connected through a network, which includes but is not limited to a local area network, a wide area network, and so on. It should be understood that the configuration described here is for illustration only; the system may comprise any number of fault-tolerant servers and fault-tolerant clients.
The fault-tolerant server comprises a communication module 102, a strategy designation module 104, a strategy execution module 106, a strategy database 108, and a process dependency database 110. The communication module 102 communicates with the fault-tolerant clients and receives the abnormal-status information they send. The strategy designation module 104 specifies in advance the automatic restart strategy or manual restart strategy to be used for restart fault-tolerance processing; preferably, the strategy is specified in advance by the user through manual input with a computer input device such as a mouse or keyboard. The strategy execution module 106 performs the corresponding restart fault-tolerance processing according to the strategy specified in advance by the strategy designation module 104. The strategy database 108 stores the automatic restart strategies and manual restart strategies, and the process dependency database 110 stores the process dependency table representing the dependencies between the processes of the distributed program.
The fault-tolerant client comprises a process status monitoring module 202, an abnormal-status information generation module 204, and a communication module 206. The process status monitoring module 202 monitors the status of the distributed-program processes that the client runs itself. The abnormal-status information generation module 204 generates abnormal-status information when an abnormal process status is detected. The communication module 206 communicates with the fault-tolerant server and sends the abnormal-status information to it.
The structure above is the logical configuration of the application. That is, the fault-tolerant client and the fault-tolerant server are distinct logical structures, and in practice they can be deployed on the same physical node or on different ones. Because the application addresses fault tolerance for distributed programs, it is sufficient for the fault-tolerant clients to be placed on different physical nodes. For example, as shown in Fig. 1B, in a system with one fault-tolerant server and two fault-tolerant clients, the fault-tolerant server and one fault-tolerant client can be physically placed on node 1, and the other fault-tolerant client on node 2. It should be understood, however, that this arrangement is for illustration only. The system may comprise any number of fault-tolerant servers and fault-tolerant clients, and other arrangements are possible, for example placing the fault-tolerant server and the two fault-tolerant clients on three different nodes.
Strategies are described in detail below. The word "strategy" comes from the Strategy pattern in design patterns: it defines a family of algorithms, encapsulates each one, and makes them interchangeable. The Strategy pattern lets an algorithm vary without affecting the clients that use it. In the present invention, the strategy database stores the various strategies for handling abnormal process status. The strategies used comprise automatic restart strategies, manual restart strategies, and other auxiliary strategies. When the distributed program becomes abnormal, an automatic restart strategy or a manual restart strategy is executed, and an auxiliary strategy may optionally be executed at the same time.
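The Strategy pattern described above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the patent's implementation: the class names, the `execute` helper, and the returned strings are all hypothetical, and no real messaging is performed.

```python
from abc import ABC, abstractmethod


class RestartStrategy(ABC):
    """Common interface, so that concrete strategies are interchangeable."""

    @abstractmethod
    def handle(self, process_name):
        """Return a description of the action taken for the failed process."""


class AutomaticRestart(RestartStrategy):
    def handle(self, process_name):
        # A real server would send a restart instruction to the client here.
        return f"send restart instruction for {process_name}"


class ManualRestart(RestartStrategy):
    def __init__(self, channel):
        # Hypothetical channel labels such as "sms", "phone", or "email".
        self.channel = channel

    def handle(self, process_name):
        # A real server would notify O&M personnel and wait for confirmation.
        return f"notify O&M via {self.channel} about {process_name}"


class ErrorLogAuxiliary:
    """Optional auxiliary strategy: keeps a record of what was done."""

    def __init__(self):
        self.entries = []

    def record(self, message):
        self.entries.append(message)


def execute(strategy, process_name, auxiliary=None):
    """Run the chosen restart strategy, then the auxiliary strategy if given."""
    result = strategy.handle(process_name)
    if auxiliary is not None:
        auxiliary.record(result)
    return result


log = ErrorLogAuxiliary()
automatic_result = execute(AutomaticRestart(), "gateway")
manual_result = execute(ManualRestart("sms"), "gateway", auxiliary=log)
```

Because `execute` depends only on the `RestartStrategy` interface, swapping the automatic strategy for a manual one does not change the calling code, which is the property the text attributes to the pattern.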
An automatic restart strategy is a strategy in which, when the fault-tolerant server receives abnormal-status information from a fault-tolerant client, it sends the client an instruction that makes it restart the process automatically. Automatic restart strategies can be further divided into three kinds: restart only the process itself; restart the process itself and all processes whose sequence numbers come after it; and restart all processes from the beginning in sequence-number order.
A manual restart strategy is a strategy in which, when the fault-tolerant server receives abnormal-status information from a fault-tolerant client, it notifies the O&M personnel so that they can confirm manually whether the process needs to be restarted. Examples include: sending the O&M personnel a text message so that they can confirm the restart; telephoning the O&M personnel and playing a voice message saying that a given service is abnormal so that they can confirm the restart; and sending the O&M personnel an e-mail notification so that they can confirm the restart. It should be understood that these manual restart strategies are described for illustration only. During fault-tolerance processing, any one of them or any combination of several can be used, for example sending an e-mail alert and a text message at the same time, which is a combination of two strategies. In addition, any other suitable manual restart strategy not explicitly listed here can be used in the invention.
An auxiliary strategy is a strategy that performs a supplementary function while an automatic or manual restart strategy is being executed, for example recording an error log. An auxiliary strategy is not a necessary part of the fault-tolerance mechanism and can be omitted as required. In addition, any other suitable auxiliary strategy not explicitly listed here can be used in the invention.
Fig. 2 is a flowchart of the fault-tolerance method for distributed programs according to one embodiment of the invention. As shown in Fig. 2, the method starts at step S201. In step S201, the fault-tolerant client starts its main processing thread, and the fault-tolerant server starts its main processing thread and checks whether any fault-tolerant client has timed out. In step S202, the fault-tolerant client sends a login request to the fault-tolerant server, supplying a login account and password. In step S203, the fault-tolerant server receives the login request and checks whether the supplied account and password match the stored login account and password; if they match, the fault-tolerant server updates its information about the fault-tolerant client. In step S204, the fault-tolerant server sends a login reply to the fault-tolerant client, confirming that the client has logged in successfully. In step S205, the fault-tolerant client sends a heartbeat message to the fault-tolerant server. In step S206, the fault-tolerant server receives the heartbeat message and updates the heartbeat information of the fault-tolerant client. The fault-tolerant server also judges whether the physical link between the two is normal according to whether the fault-tolerant client is sending heartbeats. In step S207, the fault-tolerant server sends a heartbeat reply to the fault-tolerant client, confirming that the heartbeat message was received. In step S208, the fault-tolerant client uses its process status monitoring module to monitor the status of the distributed-program processes running on it. In step S209, when an abnormal process status is detected, the fault-tolerant client uses the abnormal-status information generation module to generate abnormal-status information and uses the communication module to send it to the fault-tolerant server. In step S210, the fault-tolerant server receives the abnormal-status information and, according to the automatic or manual restart strategy and the inter-process dependencies defined in the process dependency database, uses the strategy execution module to perform the predetermined restart fault-tolerance processing; the automatic or manual restart strategy is specified in advance by the O&M personnel using the strategy designation module. In step S211, the fault-tolerant client reports the process start result to the fault-tolerant server. The process then ends.
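The heartbeat bookkeeping of steps S201 and S205 to S207 could look roughly like the sketch below. The source does not give a timeout value or a data structure, so the `HeartbeatRegistry` class, its method names, and the 10-second timeout are assumptions chosen for illustration; times are passed in explicitly so the example is deterministic.

```python
class HeartbeatRegistry:
    """Tracks the last heartbeat per client (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        # Assumed parameter: the source does not state a timeout value.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # client name -> time of last heartbeat

    def record_heartbeat(self, client, now):
        """Step S206: update the heartbeat information of a client."""
        self.last_seen[client] = now

    def timed_out_clients(self, now):
        """Step S201: list clients whose last heartbeat is too old."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


registry = HeartbeatRegistry(timeout_seconds=10.0)
registry.record_heartbeat("fault-tolerant client 1", now=0.0)
registry.record_heartbeat("fault-tolerant client 2", now=8.0)
overdue = registry.timed_out_clients(now=12.0)
```

In this run only the first client is overdue, because its last heartbeat is 12 seconds old while the second client's is 4 seconds old.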
Fig. 3 is a flowchart of the fault-tolerance method based on the automatic restart strategy. As shown in Fig. 3, the method starts at step S301. In step S301, the process status monitoring module of the fault-tolerant client finds that process A, which it monitors, has become abnormal; it generates abnormal-status information and uses the communication module to send that message to the fault-tolerant server. In step S302, the fault-tolerant server receives the abnormal-status information for process A, looks up the process dependency table, and finds that process B depends on process A. In step S303, according to the automatic restart strategy specified in advance, the fault-tolerant server sends the fault-tolerant client an instruction to stop process B. In step S304, the fault-tolerant client receives the instruction, stops process B, and replies to the fault-tolerant server with a result message saying that process B was stopped successfully. In step S305, the fault-tolerant server receives that message and sends the fault-tolerant client a message to restart process A. In step S306, the fault-tolerant client executes the instruction to restart process A and replies to the fault-tolerant server with a result message saying that process A was restarted successfully. In step S307, the fault-tolerant server receives that message and sends an instruction to start process B. In step S308, the fault-tolerant client receives the instruction, starts process B, and sends the fault-tolerant server a message saying that process B was started successfully. In step S309, the fault-tolerant server receives that message. The fault-tolerance method then ends.
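The instruction order of steps S303 to S309 can be sketched as below. The function names and the `send` callable, which stands in for the communication module, are hypothetical. Stopping at the first failed instruction is also an assumption, since the text does not say what happens when a step fails.

```python
def restart_sequence(failed, dependents):
    """Build the ordered (action, process) instructions of Fig. 3."""
    steps = [("stop", name) for name in dependents]       # S303
    steps.append(("restart", failed))                      # S305
    steps.extend(("start", name) for name in dependents)  # S307
    return steps


def run_sequence(steps, send):
    """Send each instruction and wait for its result before the next one.

    `send` is a hypothetical callable that returns True when the client
    reports success. Aborting on failure is an assumption, not from the text.
    """
    completed = []
    for action, name in steps:
        if not send(action, name):
            break
        completed.append((action, name))
    return completed


steps = restart_sequence("A", ["B"])
done = run_sequence(steps, lambda action, name: True)
```

Sending one instruction at a time and waiting for the result message mirrors the request and reply pairs in the flowchart, where the server acts only after each success message arrives.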
Fig. 4 is a flowchart of the fault-tolerance method based on the manual restart strategy, for example telephoning the O&M personnel and playing an alarm recording. As shown in Fig. 4, the method starts at step S401. In step S401, the process status monitoring module of the fault-tolerant client finds that process A, which it monitors, has become abnormal; it generates abnormal-status information and uses the communication module to send that message to the fault-tolerant server. In step S402, the fault-tolerant server receives the abnormal-status information for process A, looks up the process dependency table, and finds that process B depends on process A. In step S403, according to the manual restart strategy specified in advance, the fault-tolerant server telephones the O&M personnel and plays the alarm recording. In step S404, the O&M personnel receive the call and confirm manually; the fault-tolerant client then stops process B and replies to the fault-tolerant server with a result message saying that process B was stopped successfully. In step S405, the fault-tolerant server receives that message and sends the fault-tolerant client a message to restart process A. In step S406, the fault-tolerant client executes the instruction to restart process A and replies to the fault-tolerant server with a result message saying that process A was restarted successfully. In step S407, the fault-tolerant server receives that message and sends an instruction to start process B. In step S408, the fault-tolerant client receives the instruction, starts process B, and sends the fault-tolerant server a message saying that process B was started successfully. In step S409, the fault-tolerant server receives that message. The fault-tolerance method then ends.
Fig. 5 is a flowchart of the fault-tolerance method based on the automatic restart strategy together with an auxiliary strategy. As shown in Fig. 5, the only difference from the method shown in Fig. 3 is that, after the fault-tolerant server receives the message that process B was started in step S309, the method also includes step S501, in which the fault-tolerant server records an error log for later review. Recording an error log is only one embodiment of an auxiliary strategy, however, and auxiliary strategies are not limited to it. For example, a fault-tolerance method based on the manual restart strategy together with an auxiliary strategy can also be used, for instance by adding error logging and any other suitable steps to the method shown in Fig. 4.
The process dependencies in the present application are described below by way of example.
Table 1 below lists three exemplary process names and the paths where the processes are stored. The three processes are the gateway, the control center, and the recording server. It should be understood that these processes are only examples; any other number of processes can also be used.
Table 1
Process name     | Full path of the process
Gateway          | C:\infobird\ibserver.exe
Control center   | d:\infobird\ibCtlServer.exe
Recording server | d:\infobird\ibMonitor.exe
The process dependency database stores the process dependency table, which defines the dependencies between the processes of the distributed program. The table has fields such as the unlock code (that is, the startup sequence number), the process name, the name of the client hosting the process, the number of seconds to wait before sending the next instruction, the timed restart setting, and the special restart flag. Each field is explained below with reference to Table 2, using the three processes control center, gateway, and recording server as examples. The unlock code represents the dependencies between processes.
Table 2 (timed start is what-you-see-is-what-you-get)
Unlock code | Process name     | Client hosting the process | Delay in seconds before sending the next instruction | Timed restart setting                                 | Special restart flag
1           | Control center   | Fault-tolerant client 1    | 1                                                    | time="17:5:5" WhichWeek="7" everydayRestart="TRUE"   | 0
2           | Gateway          | Fault-tolerant client 2    | 1                                                    | time="15:5:5" WhichWeek="7" everydayRestart="TRUE"   | 1
3           | Recording server | Fault-tolerant client 2    | 1                                                    | time="12:5:5" WhichWeek="7" everydayRestart="TRUE"   | 0
Taking Table 2 as an example, the gateway process, whose unlock code is 2, depends on the control center process, whose unlock code is 1. The recording server process, whose unlock code is 3, depends on the gateway process, whose unlock code is 2, and therefore depends indirectly on the control center process. The client name shows which fault-tolerant client each process of the distributed program resides on: the control center is on fault-tolerant client 1, and the gateway and recording server processes are on fault-tolerant client 2. For example, when the control center process becomes abnormal, fault-tolerant client 1, which hosts the control center, sends abnormal-status information to the fault-tolerant server. The fault-tolerant server receives this information, looks up the process dependency table, and finds that the gateway and the recording server depend on the control center. The fault-tolerant server then performs automatic or manual fault-tolerance processing according to the strategy specified in advance. Taking automatic fault-tolerance processing as an example, the fault-tolerant server sends fault-tolerant client 2 instructions to stop the gateway and recording server processes. After both processes have been stopped successfully, the fault-tolerant server sends fault-tolerant client 1 an instruction to start the control center, and then sends fault-tolerant client 2 instructions to start the gateway and the recording server in turn.
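The table lookup in this example can be sketched as follows. The `Entry` type and both function names are hypothetical. The sketch assumes the linear chain described for Table 2, in which every process with a larger unlock code depends, directly or indirectly, on the failed process.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    order: int   # unlock code
    name: str    # process name
    client: str  # client hosting the process


# Rows taken from Table 2 (only the fields needed for the lookup).
TABLE_2 = [
    Entry(1, "control center", "fault-tolerant client 1"),
    Entry(2, "gateway", "fault-tolerant client 2"),
    Entry(3, "recording server", "fault-tolerant client 2"),
]


def dependents_of(table, failed_name):
    """Entries with a larger unlock code than the failed process, in order."""
    failed = next(e for e in table if e.name == failed_name)
    later = (e for e in table if e.order > failed.order)
    return sorted(later, key=lambda e: e.order)


def stop_instructions_by_client(table, failed_name):
    """Group the processes to stop by the client that must stop them."""
    grouped = {}
    for entry in dependents_of(table, failed_name):
        grouped.setdefault(entry.client, []).append(entry.name)
    return grouped


affected = stop_instructions_by_client(TABLE_2, "control center")
```

For a control center failure, `affected` names only fault-tolerant client 2, with the gateway and the recording server, which matches the example in the text.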
The delay in seconds, the timed restart setting, and the special restart flag in the dependency table are optional. The delay in seconds is the time between receiving a trigger message and sending the next instruction. Setting this field avoids the situation in which individual routing failures in the network cause instructions to arrive in the wrong order. The timed restart setting ensures that processes are restarted at regular intervals. Unlock codes start from 1 and increase sequentially, so that whether a program has faulted (for example, an error dialog has popped up or the process has died) or has vanished (the process has disappeared and is not running), it can be repaired in startup order. Timed start is also what-you-see-is-what-you-get: the order of a timed start no longer depends on the startup order, the delay in seconds, or the special restart flag, and depends only on the configured time and date. In other words, if a timed start is configured, the restart is performed purely according to the time given in the timed restart setting. For example, Table 2 breaks the dependencies: every day the control center process is restarted at 17:5:5, the gateway process at 15:5:5, and the recording server process at 12:5:5. If no timed restart is wanted, it is disabled by setting WhichWeek to empty and everydayRestart to false, for example: WhichWeek="" everydayRestart="FALSE". The parameters time (hour:minute:second) and WhichWeek (day of the week, 1 to 7) make the application restart at the given time on the given day each week; if everydayRestart="TRUE", WhichWeek is ignored. As for the special restart flag: a value of 0 means restart only the process itself, regardless of the unlock code; a value of 1 means restart the process itself and all processes whose sequence numbers come after it; a value of 2 means all processes must be started again from the beginning in sequence-number order. In general the special restart flag is 1. The values 0 and 2 lift the restriction imposed by the unlock code and break the dependencies between processes; they are generally reserved for extension, or used only to implement unconventional start modes. This is comparable to a program that most people use in the normal way but in which the developer has left a back door, an unconventional route for the developer's own use in an emergency. When an unrecoverable situation such as a hardware fault or overheating occurs, the special restart flag can be set to 2, and all distributed programs can then be restarted in order. A special restart flag of 0 indicates that the process has no dependency on other processes and is independent; it is placed in this table only so that startup can be configured in one place, and the flag value 0 breaks its dependency on the other processes.
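The three special restart flag values described above can be expressed as a small selection function. This is a sketch only: the function name and the use of bare unlock codes to identify processes are assumptions made for brevity.

```python
def processes_to_restart(unlock_codes, failed_code, flag):
    """Select the unlock codes to restart for a given special restart flag.

    0: restart only the failed process itself.
    1: restart the failed process and every process after it.
    2: restart all processes from the beginning, in sequence order.
    """
    ordered = sorted(unlock_codes)
    if flag == 0:
        return [failed_code]
    if flag == 1:
        return [code for code in ordered if code >= failed_code]
    if flag == 2:
        return ordered
    raise ValueError(f"unknown special restart flag: {flag}")


only_self = processes_to_restart([1, 2, 3], failed_code=2, flag=0)
self_and_later = processes_to_restart([1, 2, 3], failed_code=2, flag=1)
everything = processes_to_restart([1, 2, 3], failed_code=2, flag=2)
```

With the gateway (unlock code 2) as the failed process, flag 0 selects only the gateway, flag 1 selects the gateway and the recording server, and flag 2 selects all three processes.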
The client configuration file and the server configuration file are introduced below.
The client configuration file is as follows:
<?xml version="1.0" encoding="GB2312"?>
<p AutoStart="true" name="fault-tolerant client" localPort="10011" ADServerIP="127.0.0.1"
   ADServerPort="10012">
  <apps>
    <app name="gateway" fullPathName="ibServer.exe" />
    <app name="Scankeyword" fullPathName="D: program tt" />
  </apps>
</p>
The client configuration file is explained as follows:
1. The name attribute of the <p> tag is the name of the client. It is required and must be unique.
2. The name attribute of the <app> tag is the name of the process. It is required and must be unique on the same client.
3. ADServerIP="127.0.0.1" ADServerPort="10012" are the IP address and port of the fault-tolerant server. They are also required.
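The rules above can be checked when the file is loaded. The following Python sketch uses the standard xml.etree.ElementTree module; the function name load_client_config and the shape of the returned dictionary are assumptions made for illustration, and the XML declaration is omitted from the embedded sample so that it can be parsed directly from a string.

```python
import xml.etree.ElementTree as ET

# Sample taken from the client configuration file, without the XML declaration.
CLIENT_XML = """<p AutoStart="true" name="fault-tolerant client" localPort="10011"
   ADServerIP="127.0.0.1" ADServerPort="10012">
  <apps>
    <app name="gateway" fullPathName="ibServer.exe" />
    <app name="Scankeyword" fullPathName="D: program tt" />
  </apps>
</p>"""


def load_client_config(xml_text):
    """Parse the client configuration and enforce the required attributes."""
    root = ET.fromstring(xml_text)
    # Rules 1 and 3: client name, server IP and server port are required.
    for attr in ("name", "ADServerIP", "ADServerPort"):
        if not root.get(attr):
            raise ValueError(f"<p> is missing required attribute {attr!r}")
    apps = {}
    for app in root.iter("app"):
        name = app.get("name")
        # Rule 2: process names are required and unique on one client.
        if not name:
            raise ValueError("<app> is missing required attribute 'name'")
        if name in apps:
            raise ValueError(f"duplicate process name on this client: {name!r}")
        apps[name] = app.get("fullPathName")
    return {
        "client": root.get("name"),
        "server": (root.get("ADServerIP"), int(root.get("ADServerPort"))),
        "apps": apps,
    }


config = load_client_config(CLIENT_XML)
print(config["client"], config["server"], list(config["apps"]))
```

The uniqueness of client names across clients cannot be checked from a single file; in this sketch that check would have to happen on the server side.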
The server configuration file is as follows:
<?xml version="1.0" encoding="GB2312"?>
<p AutoStart="true" localPort="10012">
<appsByOrder>
<app order="1" name="Scankeyword" whichClient="fault-tolerant client" afterSencsSendNextIns="1"
time="15:5:5" WhichWeek="7" everydayRestart="TRUE" specialRestartFlag="1" />
<app order="2" name="gateway" whichClient="fault-tolerant client" afterSencsSendNextIns="1"
time="15:5:5" WhichWeek="7" everydayRestart="TRUE" specialRestartFlag="0" />
</appsByOrder>
</p>
The server configuration file is explained as follows:
1. Startup sequence numbers begin at 1 and increase consecutively, so that processes can be restarted on schedule and so that, whether a program has failed (for example, popped up an error dialog or hung) or has vanished (the process has disappeared and is not running), it can be recovered in startup order.
2. To disable the timed restart service, set WhichWeek to empty and everydayRestart to false, for example: WhichWeek="" everydayRestart="FALSE"
3. The parameters time (hour:minute:second) and WhichWeek (day of the week, 1 to 7) specify at what time on which day of each week the application is restarted; if everydayRestart="TRUE", WhichWeek is ignored.
4. whichClient is the client on which the program resides. specialRestartFlag takes the value 0 (restart only the process itself, regardless of the startup sequence number), 1 (restart the process itself and all processes after it in the sequence), or 2 (restart all processes from the beginning, in sequence order).
5. Note that timed restart is "what you see is what you get": it is independent of the startup order, the delay in seconds, and the special restart flag.
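The timed restart rules in items 2 and 3 can be sketched in Python as follows. The function name timed_restart_due is hypothetical, the entry is represented as a plain dictionary of the <app> attributes, and the mapping of WhichWeek values to weekdays (1 = Monday through 7 = Sunday) is an assumption, since the text only gives the range 1 to 7.

```python
from datetime import datetime


def timed_restart_due(app, now):
    """Return True if the timed restart of this <app> entry fires at `now`."""
    every_day = app.get("everydayRestart", "").upper() == "TRUE"
    which_week = app.get("WhichWeek", "").strip()
    if not every_day and not which_week:
        # WhichWeek="" and everydayRestart="FALSE": timed restart is disabled.
        return False
    hour, minute, second = (int(part) for part in app["time"].split(":"))
    if (now.hour, now.minute, now.second) != (hour, minute, second):
        return False
    # everydayRestart="TRUE" makes WhichWeek irrelevant.
    # Assumption: 1 = Monday ... 7 = Sunday, as in datetime.isoweekday().
    return every_day or now.isoweekday() == int(which_week)


daily = {"time": "15:5:5", "WhichWeek": "7", "everydayRestart": "TRUE"}
weekly = {"time": "15:5:5", "WhichWeek": "7", "everydayRestart": "FALSE"}
disabled = {"time": "15:5:5", "WhichWeek": "", "everydayRestart": "FALSE"}

wednesday = datetime(2024, 1, 3, 15, 5, 5)
sunday = datetime(2024, 1, 7, 15, 5, 5)
print(timed_restart_due(daily, wednesday))
print(timed_restart_due(weekly, wednesday), timed_restart_due(weekly, sunday))
print(timed_restart_due(disabled, sunday))
```

Consistent with item 5, the check looks only at the time, WhichWeek and everydayRestart; the startup order, the delay in seconds, and the special restart flag play no part in it.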
With the client and server configuration files set as above, the distributed program (in this application, the control center, gateway, recording server and so on are used as examples) can be deployed in any order on any number of different physical machines. From the client configuration file, the client knows which processes it must monitor. The server configuration file is only one form of representation; the same information may also take the form of a database, of strategies stored in a database, and so on. However the programs are deployed, the dependencies between them, as given in the distributed process dependency table, remain fixed. Moreover, the above embodiment of the invention can automatically monitor the status of the distributed program's processes and automatically carry out a response when a process fails, thereby saving considerable manpower and material resources.
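Because the dependencies are fixed by the startup sequence numbers, the order of instructions the server issues after a failure can be derived from the table alone. The Python sketch below assumes that every process depends on the processes before it in the sequence and that dependents are stopped in reverse order before the failed process is restarted; the function name restart_plan and the instruction names "stop" and "restart" are hypothetical.

```python
def restart_plan(ordered_names, failed_name):
    """Build the (instruction, process) steps for a failure of failed_name.

    ordered_names lists the processes by ascending startup sequence number.
    Assumption: each later process depends on the earlier ones.
    """
    index = ordered_names.index(failed_name)
    dependents = ordered_names[index + 1:]
    # Stop the dependents first, last-started first.
    plan = [("stop", name) for name in reversed(dependents)]
    # Then restart the failed process itself.
    plan.append(("restart", failed_name))
    # Finally bring the dependents back in startup order.
    plan.extend(("restart", name) for name in dependents)
    return plan


# Process B depends on process A.
print(restart_plan(["A", "B"], "A"))
print(restart_plan(["A", "B"], "B"))
```

In a full implementation the server would presumably wait for the client's success reply to each step, and for the configured delay in seconds, before sending the next instruction; this sketch only computes the order.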
The embodiments of the present invention have been described above. However, the present invention is not limited to the scope of the above embodiments. Various modifications and improvements can be made to the above embodiments without departing from the spirit and scope of the present invention. The scope of the present invention is defined by the appended claims.

Claims (10)

1. A fault-tolerance method for a distributed program, used in a fault-tolerant system having a fault-tolerant server and at least one fault-tolerant client, the method comprising the following steps:
The fault-tolerant client uses its process status monitoring module to monitor the status of the processes of the distributed program that it runs;
When an abnormal process status is detected, the fault-tolerant client uses a process status abnormality information generation module to generate process status abnormality information, and uses a communication module to send the information to the fault-tolerant server;
The fault-tolerant server receives the process status abnormality information through its communication module;
The fault-tolerant server uses a strategy execution module to perform restart fault-tolerance processing according to an automatic restart strategy or a manual restart strategy and according to the dependencies between processes specified in a process dependency table, wherein the automatic restart strategy or manual restart strategy is specified in advance using a strategy specification module;
After the restart fault-tolerance processing is performed, the fault-tolerant client uses the communication module to report the result of the process startup to the fault-tolerant server.
2. A fault-tolerant system for a distributed program, having a fault-tolerant server and at least one fault-tolerant client, the fault-tolerant server comprising:
A communication module, used to communicate with the fault-tolerant client and to receive the process status abnormality information sent by the fault-tolerant client;
A strategy specification module, used to specify in advance the automatic restart strategy or manual restart strategy to be used in restart fault-tolerance processing;
A strategy execution module, which performs the corresponding restart fault-tolerance processing according to the strategy specified in advance by the strategy specification module and according to the dependencies between processes defined in the process dependency database;
A strategy database, which stores the automatic restart strategy and the manual restart strategy;
A process dependency database, which stores the process dependency table representing the dependencies between the processes of the distributed program;
The fault-tolerant client comprising:
A process status monitoring module, used to monitor the status of the processes of the distributed program that the client runs;
A process status abnormality information generation module, used to generate process status abnormality information when an abnormal process status is detected;
A communication module, used to communicate with the fault-tolerant server and to send the process status abnormality information to the fault-tolerant server.
3. The fault-tolerant system according to claim 2, wherein the restart fault-tolerance processing according to the automatic restart strategy is to send a restart command to the fault-tolerant client so that the restart is performed automatically.
4. The fault-tolerant system according to claim 2, wherein the restart fault-tolerance processing according to the manual restart strategy is to notify the system operation and maintenance personnel that an error has occurred in the running of the distributed program, so that they can manually confirm the restart.
5. The fault-tolerant system according to claim 4, wherein the ways of notifying the operation and maintenance personnel of the error in the running of the distributed program comprise one or more of: sending a text message to the operation and maintenance personnel, telephoning the operation and maintenance personnel to play a service abnormality voice message, and sending an email to the operation and maintenance personnel.
6. The fault-tolerant system according to claim 3 or 4, wherein the distributed program has at least a process A and a process B, the process dependency table defines process B as depending on process A, and, if an abnormality is detected in process A, the step of performing restart fault-tolerance processing using the strategy execution module comprises:
The client on which process B resides terminates process B and replies to the fault-tolerant server with an execution result message indicating that process B was terminated successfully;
The fault-tolerant server receives the message that process B was terminated successfully and sends a message to restart process A to the fault-tolerant client on which process A resides;
The fault-tolerant client on which process A resides executes the instruction to restart process A and replies to the fault-tolerant server with an execution result message indicating that process A was restarted successfully;
The fault-tolerant server receives the message that process A was restarted successfully and sends an instruction to restart process B to the client on which process B resides;
The fault-tolerant client on which process B resides receives the instruction to restart process B, restarts process B, and sends a message that process B was started successfully to the fault-tolerant server;
The fault-tolerant server receives the message that process B was started successfully.
7. The fault-tolerant system according to claim 3 or 4, wherein the distributed program has at least a process A and a process B, the process dependency table defines process B as depending on process A, and, if an abnormality is detected in process B, the step of performing restart fault-tolerance processing using the strategy execution module comprises:
The fault-tolerant client on which process B resides restarts process B and sends a message that process B was started successfully to the fault-tolerant server;
The fault-tolerant server receives the message that process B was started successfully.
8. The fault-tolerant system according to claim 2, further comprising performing fault-tolerance processing based on an auxiliary strategy, the auxiliary strategy being to record an error log.
9. The fault-tolerant system according to claim 2, wherein the process dependency table comprises a startup sequence number, a process name, and the name of the client on which the process resides; the ascending order of the startup sequence numbers indicates the dependencies between processes, and the client name indicates on which client the process is located.
10. The fault-tolerant system according to claim 9, wherein the process dependency table further comprises a delay in seconds, a timed restart setting, and a special restart flag; the delay in seconds indicates the time between receiving a trigger message and sending the next instruction; the timed restart setting ensures that a process can be restarted on a schedule; the special restart flag indicates the type of restart: when the special restart flag is 0, only the process itself is restarted, regardless of the startup sequence number; when the special restart flag is 1, the process itself and all processes after it in the sequence are restarted; and when the special restart flag is 2, all processes are started again from the beginning, in sequence order.
CN2009102439440A 2009-12-25 2009-12-25 Fault tolerance method and system used for distributed program Active CN101777020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102439440A CN101777020B (en) 2009-12-25 2009-12-25 Fault tolerance method and system used for distributed program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102439440A CN101777020B (en) 2009-12-25 2009-12-25 Fault tolerance method and system used for distributed program

Publications (2)

Publication Number Publication Date
CN101777020A CN101777020A (en) 2010-07-14
CN101777020B CN101777020B (en) 2012-12-05

Family

ID=42513489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102439440A Active CN101777020B (en) 2009-12-25 2009-12-25 Fault tolerance method and system used for distributed program

Country Status (1)

Country Link
CN (1) CN101777020B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102012843A (en) * 2010-11-19 2011-04-13 曙光信息产业(北京)有限公司 Task migration system
CN102404139B (en) * 2011-10-21 2014-01-15 浪潮电子信息产业股份有限公司 Method for increasing fault tolerance performance of application level of fault tolerance server
CN103634311B (en) * 2013-11-26 2016-01-20 腾讯科技(深圳)有限公司 Safety protecting method and device, terminal
CN104038364B (en) 2013-12-31 2015-09-30 华为技术有限公司 The fault-tolerance approach of distributed stream treatment system, node and system
CN104580408B (en) * 2014-12-24 2018-01-23 连云港杰瑞深软科技有限公司 A kind of method of moving distributing computing system and memory node fault tolerance information
CN106325861A (en) * 2016-08-18 2017-01-11 北京奇虎科技有限公司 Method and device used for managing distributed system
CN106557380A (en) * 2016-10-24 2017-04-05 深圳有麦科技有限公司 For the method that keeps server stable and its system
CN106598819B (en) * 2016-12-12 2019-07-26 世纪龙信息网络有限责任公司 The monitor processing method and system that the load of client patch uses
CN108845916B (en) * 2018-07-03 2022-02-22 中国联合网络通信集团有限公司 Platform monitoring and alarming method, device, equipment and computer readable storage medium
JP7100260B2 (en) * 2018-11-21 2022-07-13 富士通株式会社 Information processing equipment and information processing programs
CN109634769B (en) * 2018-12-13 2021-11-09 郑州云海信息技术有限公司 Fault-tolerant processing method, device, equipment and storage medium in data storage
CN109714202B (en) * 2018-12-21 2021-10-08 郑州云海信息技术有限公司 Client off-line reason distinguishing method and cluster type safety management system
CN111858177B (en) * 2020-07-22 2023-12-26 广州六环信息科技有限公司 Inter-process communication abnormality repairing method and device, electronic equipment and storage medium
CN111898158B (en) * 2020-07-23 2023-09-26 百望股份有限公司 Encryption method of OFD (optical frequency division) document

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1464397A (en) * 2002-06-10 2003-12-31 联想(北京)有限公司 System process protection method
CN1716871A (en) * 2004-06-28 2006-01-04 华为技术有限公司 Safety managing method for network control system
CN1818868A (en) * 2006-03-10 2006-08-16 浙江大学 Multi-task parallel starting optimization of built-in operation system
CN101212340A (en) * 2006-12-25 2008-07-02 中兴通讯股份有限公司 Method for restarting control nodes in automatic switching optical network

Also Published As

Publication number Publication date
CN101777020A (en) 2010-07-14

Similar Documents

Publication Publication Date Title
CN101777020B (en) Fault tolerance method and system used for distributed program
Sousa et al. Highly available intrusion-tolerant services with proactive-reactive recovery
US9087005B2 (en) Increasing resiliency of a distributed computing system through lifeboat monitoring
US20050237926A1 (en) Method for providing fault-tolerant application cluster service
CN110134518B (en) Method and system for improving high availability of multi-node application of big data cluster
US20210165693A1 (en) Control token and hierarchical dynamic control
JP2010537563A (en) Status remote monitoring and control device
CN105095008B (en) A kind of distributed task scheduling fault redundance method suitable for group system
CN104038390B (en) A kind of linux server clusters based on netlink unify peripheral hardware action listener method
CN104391777B (en) Cloud platform and its operation and monitoring method and device based on (SuSE) Linux OS
Veeraraghavan et al. Maelstrom: Mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently
US11075829B2 (en) Distributed monitoring in clusters with self-healing
CN111901422A (en) Method, system and device for managing nodes in cluster
CN103475696A (en) System and method for monitoring state of cloud computing cluster server
CN101771563A (en) Method for monitoring network service program
CN102916830B (en) Implement system for resource service optimization allocation fault-tolerant management
Sousa et al. State machine replication for the masses with bft-smart
CN113434327A (en) Fault processing system, method, equipment and storage medium
CN113312059B (en) Service processing system, method and cloud native system
CN110798339A (en) Task disaster tolerance method based on distributed task scheduling framework
CN111752962B (en) System and method for guaranteeing high availability and consistency of MHA cluster
CN113656209A (en) Resource processing method, device, equipment and storage medium
CN113434155A (en) Automatic deployment system in mixed cloud mode
CN103731291A (en) Data transmission structure and program development method of network server pool system
CN115313642A (en) Power system scene and configuration oriented trusteeship system and trusteeship method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice

Addressee: Bejing Infobird Software Co.,Ltd. Pang Shanshan

Document name: the First Notification of an Office Action

C14 Grant of patent or utility model
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Fault tolerance method and system used for distributed program

Effective date of registration: 20141209

Granted publication date: 20121205

Pledgee: Bank of Guiyang Limited by Share Ltd dolomite branch

Pledgor: Bejing Infobird Software Co.,Ltd.

Registration number: 2014990001054

PLDC Enforcement, change and cancellation of contracts on pledge of patent right or utility model
PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20150710

Granted publication date: 20121205

Pledgee: Bank of Guiyang Limited by Share Ltd dolomite branch

Pledgor: Bejing Infobird Software Co.,Ltd.

Registration number: 2014990001054

PLDC Enforcement, change and cancellation of contracts on pledge of patent right or utility model