CN101539863A - Task key system survival emergency recovery method based on quaternary nested restart - Google Patents

Info

Publication number
CN101539863A
Authority
CN
China
Prior art keywords
restart
level
recovery
restarting
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910071914A
Other languages
Chinese (zh)
Other versions
CN101539863B (en)
Inventor
王慧强
赵国生
王健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2009-04-29
Filing date
2009-04-29
Publication date
2009-09-23
Application filed by Harbin Engineering University
Priority to CN2009100719146A
Publication of CN101539863A
Application granted
Publication of CN101539863B
Legal status: Expired - Fee Related

Landscapes

  • Retry When Errors Occur (AREA)

Abstract

The invention provides an emergency recovery method for the survivability of mission-critical systems based on four-level (quaternary) nested restart. Its technical features are: the restart levels of different granularity in the system, and the restart recovery strategy for each level, are determined; a method for calculating restart priority is defined, the restart order of the modules in each restart layer is determined, and a recursively nested restart chain is established; and the implementation process of recursive restart emergency recovery is described formally using a stochastic Petri net (SPN) and a deterministic finite automaton (DFA). The invention has the following advantages: for different fault scenarios, restart objects of different granularity can be selected efficiently and the restart order determined, which reduces recovery time and recovery cost and strengthens the recoverability (resilience) of mission-critical system survivability. When system survivability has degraded to a certain degree, the internal state is cleared by stopping the continuing operation of an application, or by restarting the system, an application service, or a process, thereby releasing operating system resources and restoring system survivability.

Description

Emergency recovery method for mission-critical system survivability based on four-level (quaternary) nested restart
(1) Technical field
The present invention relates to emergency recovery methods for enhancing the survivability of mission-critical systems, and in particular to an emergency restart recovery method for mission-critical systems that require high survivability.
(2) Background
Emergency recovery, as the last resort for enhancing system survivability, focuses on ensuring that critical applications retain the ability to keep providing service. It strengthens the recoverability (resilience) of the system in the face of successful intrusions and malicious attacks, forms the last line of a defense-in-depth strategy, and is the active response that a survivable system takes in its final stage. The basic idea of emergency recovery is that when the survivability of a system or key service drops to a certain level, the continuing operation of the program is periodically terminated, the internal state of the continuously running system is cleared, and the system is restarted and restored to its initial state or to a "healthy" intermediate state. The survivability of the system or application service thereby recovers to some degree, and more serious failures that might occur later are prevented.
Many studies show that restart (reboot or rejuvenation) techniques can effectively eliminate some of the error and failure states that a system accumulates during operation, including errors and failures caused by deliberate attacks, so a restart can effectively restore a system or application service to its initial good state. However, current research on restart recovery strategies in survivable systems is still not deep, and rejuvenation has so far been applied only at the service level. Castelli simply divides restarts into two levels, service level and system level, but does not give a concrete method for determining the restart interval and the restart priority. Hong determines the recovery granularity from measured values of actual system resources; when the current value of resource consumption cannot reflect the degree of resource loss, this method is no longer suitable. Although Xie uses a semi-Markov process to optimize that method, its basic idea is unchanged. All of this research has limitations and remains coarse. By comparison, research in the software fault tolerance field on restart techniques after software failure has reached a mature stage. After the Recovery-Oriented Computing (ROC) project, developed jointly by UC Berkeley and Stanford University, proposed the Recursive Restartability (RR) technique, Candea proposed the microreboot technique. Its idea is to build a restart tree in advance, in which each node is an application or process that can be restarted independently. A restart begins at a node at the bottom of the restart tree; if performance cannot be recovered, the restart is pushed up one level to cover a wider scope. Both techniques require the software architecture to be known, and require that the software followed the principle of loose coupling between modules from the start of development, so that restarting one module does not affect the normal operation of the other modules. Wang Huiqiang proposed a general fast self-recovery method based on recursive microreboot, aimed mainly at improving system availability while reducing the mean time to recovery of the system. At present, existing research on restart recovery has not produced results aimed at the survivability field. The four-level recovery hierarchy proposed by the present invention differs essentially from the micro-restart and macro-restart mentioned in the research above, so the present invention is novel.
(3) Summary of the invention
The object of the present invention is to provide an emergency recovery method for mission-critical system survivability, based on four-level nested restart, that reduces recovery time, reduces recovery cost, and strengthens the survivability of mission-critical systems.
The object of the present invention is achieved as follows:
(1) The restart layers are divided into four levels: system level, service level, process level, and thread level;
(2) The restart priority index parameter is $K_s$, with $K_s = d \cdot \Delta p \cdot \sum \left( \text{resources released by restarting the controlled object} \,/\, \text{recovery cost of restarting the controlled object} \right)$, where $s$ denotes the controlled object that emergency recovery needs to restart, $d$ denotes the criticality level of $s$, and $\Delta p$ denotes the survivability status level of $s$. When determining the restart objects, the following rules are observed: a. within the same layer, the larger the $K_s$ of a controlled object, the higher its restart priority and the earlier it is restarted; b. across layers, if the $K_s$ of a process is greater than the $K_s$ of the application service it belongs to, the process and the service are each treated as a restart object, and the process is restarted before its application service; otherwise the process is not treated as a restart object and only the service is; if the $K_s$ of rebooting the whole system is the largest, only the whole system is treated as the restart object and simple periodic recovery is performed;
(3) Each specific emergency restart recovery process is described with an SPN, and the starting position of each restart recovery is controlled with a DFA.
The restart layers are divided into system level, service level, process level, and thread level, and the recovery strategies of the different restart levels are determined as follows:
(a) System-level recovery strategy: the system emergency restart recovery interval is $\eta$, with $\eta_i = MTBF_i - MTTR_i$ ($i = 0, 1, \ldots$); when system survivability drops to $H_{min}$ at time $\eta_i$, the system-level restart strategy is executed;
(b) Service-level recovery strategy: when the survivability of a key service drops to a predetermined threshold $P_{min}$, one service-level restart recovery is performed; while $P_{max} > P > P_{min}$, $M$ process-level recoveries are performed; after the $M$ process-level recoveries, one service-level recovery is performed so that the survivability of the service returns to its initial value $P_{max}$, and the $M$ process-level recoveries then start again;
(c) Process/thread-level recovery strategy: the ioctl() system call is used to read the checkpoint file; after a child process is created, the parent process returns to user space and waits until the recovery task finishes; the child process uses the clone() function to create multiple threads that restore the original task; after passing the last barrier, all threads leave kernel space and enter user space, and a callback function continues executing the code that follows process recovery.
The present invention targets mission-critical systems and proposes a four-level (system level, service level, process level, and thread level), recursively nested, fine-grained emergency recovery method that enhances system survivability. Its main technical features are: 1. the restart levels of different granularity in the system, and the restart recovery strategy of each level, are determined; 2. a method for calculating restart priority is defined, the restart order of the modules in each restart layer is determined, and a recursively nested restart chain is established; 3. a stochastic Petri net (SPN) and a deterministic finite automaton (DFA) are used to describe formally the implementation process of recursive restart emergency recovery.
Emergency recovery is a proactive, preemptive technique for enhancing survivability. The four-level recovery strategy in the present invention rests on the following considerations: planned recovery costs much less than unplanned system downtime; recovering a single application process (process-level recovery) costs much less than recovering a whole application (service-level recovery); and recovering a single application (service-level recovery) costs much less than recovering the whole application system (system-level recovery).
Compared with the traditional periodic system-level recovery method, the present invention has the following advantages: for different fault scenarios, restart objects of different granularity can be selected effectively and the restart order determined, which reduces recovery time and recovery cost and strengthens the recoverability (resilience) of mission-critical system survivability. When system survivability has degraded to a certain degree, the internal state is cleared by stopping the continuing operation of an application, or by restarting the system, an application service, or a process, thereby releasing operating system resources and restoring system survivability.
(4) Description of drawings
Fig. 1 is a diagram of the system-level recursive recovery strategy;
Fig. 2 is a diagram of the service-level recursive recovery strategy;
Fig. 3 is a diagram of the process/thread-level recovery strategy;
Fig. 4 is a diagram of the implementation process of the four-level nested emergency recovery strategy;
Fig. 5 is the DFA model of four-level nested emergency recovery;
Fig. 6 is a diagram of the implementation process of the k-level nested emergency recovery strategy;
Fig. 7 is a flow block diagram of the present invention.
(5) Embodiment
The present invention is described in more detail below by way of example, with reference to the accompanying drawings:
1. Determine the restart levels and the restart recovery strategy of each level
To implement a fine-grained nested emergency recovery strategy, the system must be divided into several restart layers, and the restart objects in each layer must be independent service entities that are atomic and can run concurrently. Restart recovery is therefore defined at four levels: system level, service level, process level, and thread level. System-level recovery takes the whole system as the restart object; service-level recovery takes an application service as the restart object; process-level recovery takes a specific process that executes while the system runs as the restart object; and thread-level recovery takes each unit thread within a process as the restart object. Several service-level recoveries are performed before each system-level recovery; several process-level recoveries are performed before each service-level recovery; and each process-level recovery is accompanied by several thread-level recoveries.
(1) System-level recovery strategy
Based on the failure distribution of the existing system, the system emergency restart recovery interval is defined as $\eta$, with $\eta_i = MTBF_i - MTTR_i$ ($i = 0, 1, \ldots$). As shown in Fig. 1, when the system has just started running it has its best survivability $H_{max}$. As unpredictable events such as failures, errors, and attacks occur, system survivability gradually declines. Suppose it drops to $H_{min}$ at time $\eta_i$; the system-level restart strategy is then executed, so that the survivability of the system quickly returns to its initial state.
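The interval rule for system-level recovery can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the sample MTBF/MTTR figures, and the numeric survivability scale are all hypothetical and introduced here only for the example.

```python
from typing import List


def restart_intervals(mtbf: List[float], mttr: List[float]) -> List[float]:
    """Return eta_i = MTBF_i - MTTR_i for each recovery cycle i."""
    if len(mtbf) != len(mttr):
        raise ValueError("mtbf and mttr must have the same length")
    return [b - r for b, r in zip(mtbf, mttr)]


def needs_system_restart(h_now: float, h_min: float) -> bool:
    """A system-level restart is due once survivability has dropped to H_min."""
    return h_now <= h_min


# Hypothetical figures in hours; they are not taken from the patent.
etas = restart_intervals([100.0, 90.0, 80.0], [2.0, 2.5, 3.0])
```

With these sample values the intervals shrink from cycle to cycle, which is the situation in which a periodic system-level restart is scheduled at each $\eta_i$.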
(2) Service-level recovery strategy
Under the service-level recovery strategy, whenever the survivability of a key service drops to a predetermined threshold $P_{min}$, one service-level restart recovery is performed. While the survivability $P$ of the key service satisfies $P_{max} > P > P_{min}$, $M$ process-level recoveries are performed, which yields a series of different recovered survivability values $P_{max}^{(1)}, P_{max}^{(2)}, \ldots, P_{max}^{(n)}$ for the service and the time intervals $\eta_0, \eta_1, \ldots, \eta_n$ at which recovery is performed, as shown in Fig. 2. Because process-level recovery cannot release all of the operating system resources that have been exhausted, the survivability value $P$ and the interval $\eta$ after each successive emergency recovery of the service decrease in turn. After a certain number of recoveries $N$, recovery operations become so frequent that the service can no longer be provided normally, until the service fails completely. Therefore, after $M$ process-level recoveries, one service-level recovery is performed so that the survivability of the service returns to its initial value $P_{max}$, and the $M$ process-level recoveries then start again. With this cycle repeated, and provided no other faults occur, the key service can be expected to keep running for a long time.
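The alternation between process-level and service-level recovery can be sketched as follows. This is a simplified sketch, assuming that survivability is a single number and that the caller tracks how many process-level recoveries have run since the last service-level recovery; `choose_recovery` and `simulate` are hypothetical names, and the fixed value 0.8 used for $P$ is an arbitrary sample.

```python
from typing import List


def choose_recovery(p: float, p_min: float,
                    process_recoveries_done: int, m: int) -> str:
    """Pick the next recovery action for one key service."""
    if p <= p_min:
        # Survivability has hit the threshold: restart the whole service.
        return "service"
    if process_recoveries_done >= m:
        # M process-level recoveries are used up: restore P to P_max.
        return "service"
    return "process"


def simulate(cycles: int, m: int) -> List[str]:
    """Run several recovery decisions with P held between P_min and P_max."""
    actions = []
    done = 0
    for _ in range(cycles):
        action = choose_recovery(p=0.8, p_min=0.5,
                                 process_recoveries_done=done, m=m)
        actions.append(action)
        # A service-level recovery resets the process-level counter.
        done = 0 if action == "service" else done + 1
    return actions


schedule = simulate(cycles=7, m=3)
```

With `m=3` the schedule is three process-level recoveries, one service-level recovery, and then three process-level recoveries again.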
(3) Process-level and thread-level recovery strategy
To avoid the large waste caused when a key service has to restart its computation from the beginning after a failure event, and to improve system availability as far as possible, checkpoints are set at suitable moments during the normal operation of the service. Each checkpoint saves the state of the service processes at that moment, and the dependencies between processes are tracked and recorded. After a service-interruption fault occurs, the associated processes are rolled back to a consistent state (checkpoint) of the service before the fault, and after state recovery, execution resumes from that checkpoint.
Process-level or thread-level recovery consists of restoring the memory image from the checkpoint file. The recovery mechanism uses the ioctl() system call to read the checkpoint file. After the child process is created, the parent process returns to user space, waits until the recovery task finishes, and then exits. The child process uses the clone() function to create multiple threads that restore the original task; one of these threads reads the basic information from the checkpoint file and restores the identifier of each thread and the relationships between them. After passing the last barrier, all threads leave kernel space and enter user space, and their callback functions continue executing the code that follows process recovery, as shown in Fig. 3.
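The text describes a kernel-level mechanism built on ioctl() and clone(). The sketch below is only a user-space analogy of the same sequence (read the checkpoint file, recreate the threads, hold them at a last barrier, then let each continue its work). It assumes a JSON checkpoint format and a trivial per-thread counter as the saved state; both are hypothetical and not part of the patent.

```python
import json
import os
import tempfile
import threading


def save_checkpoint(path, state):
    """Write the service state (one saved counter per thread) to a file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)


def restore_from_checkpoint(path):
    """Recreate the saved threads and resume each one after a barrier."""
    with open(path, "r", encoding="utf-8") as f:
        state = json.load(f)
    threads_state = state["threads"]           # thread id -> saved counter
    barrier = threading.Barrier(len(threads_state))
    results = {}
    lock = threading.Lock()

    def worker(tid, counter):
        # Every restored thread waits here until all threads are ready.
        barrier.wait()
        # Past the barrier: continue the task from the checkpointed value.
        with lock:
            results[tid] = counter + 1

    workers = [threading.Thread(target=worker, args=(int(tid), counter))
               for tid, counter in threads_state.items()]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "service.ckpt")     # hypothetical file name
    save_checkpoint(ckpt, {"threads": {"0": 10, "1": 20}})
    resumed = restore_from_checkpoint(ckpt)
```

The barrier plays the role of the last synchronization point in the description: no thread resumes its own work until every thread has been restored.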
2. Determine the restart order of the modules in each restart layer and establish the recursively nested restart chain
In general, the finer the recovery granularity, the shorter the service outage and the lower the recovery cost. However, the restart order cannot for that reason simply be fixed as thread level, process level, service level, system level. To determine the restart order, the present invention introduces the restart priority index parameter $K_s$, defined as $K_s = d \cdot \Delta p \cdot \sum \left( \text{resources released by restarting the controlled object} \,/\, \text{recovery cost of restarting the controlled object} \right)$, where $s$ denotes the controlled object that emergency recovery needs to restart, $d$ denotes the criticality level of $s$, and $\Delta p$ denotes the survivability status level of $s$.
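A small Python sketch of the priority index follows. The text does not say what the sum ranges over, so the sketch assumes one (released amount, recovery cost) pair per resource type; that assumption, the class and field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RestartObject:
    name: str
    level: str                                   # "system", "service", "process" or "thread"
    criticality: float                           # d
    survival_grade: float                        # delta p
    resource_terms: List[Tuple[float, float]]    # (released, recovery cost) pairs


def restart_priority(obj: RestartObject) -> float:
    """K_s = d * delta_p * sum(released / recovery cost)."""
    ratio_sum = sum(released / cost for released, cost in obj.resource_terms)
    return obj.criticality * obj.survival_grade * ratio_sum


# Hypothetical object: two resource types, e.g. memory and file handles.
worker = RestartObject("web-worker", "process", 2.0, 3.0, [(4.0, 2.0), (3.0, 1.0)])
k_worker = restart_priority(worker)
```

Here the ratio sum is 2 + 3 = 5, so the index is 2 × 3 × 5 = 30. An object that frees many resources at low recovery cost therefore gets a high priority.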
When determining the restart objects, the following rules are observed: (1) within the same layer, the larger the $K_s$ of a controlled object, the higher its restart priority and the earlier it is restarted; (2) across layers, if the $K_s$ of a process is greater than the $K_s$ of the application service it belongs to, the process and the service can each be treated as a restart object, and the process is restarted before its application service; otherwise the process is not treated as a restart object and only the service is. If the $K_s$ of rebooting the whole system is the largest, only the whole system needs to be treated as the restart object and simple periodic recovery is performed; the fine-grained recovery strategies of the service, process, and thread levels then need not be considered.
Because restarting an application service resets all the processes started while it runs, restarting the application service is equivalent to restarting all the processes that belong to it. Rebooting the system stops all application services running on it and releases all system resources, which is the most thorough recovery of system survivability. The rules above yield the restart objects at every level; these restart objects are then ordered by restart priority into a chain structure, which is the restart chain.
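The selection rules and the resulting restart chain can be sketched as below. This is an illustrative sketch only: it covers the system, service, and process layers named in the rules, takes precomputed $K_s$ values as input, and uses hypothetical names (`build_restart_chain`, the sample services and processes).

```python
from typing import Dict, List, Tuple


def build_restart_chain(system_k: float,
                        services: Dict[str, float],
                        processes: Dict[str, Tuple[str, float]]) -> List[str]:
    """Return restart objects ordered by descending K_s.

    services maps a service name to its K_s; processes maps a process
    name to (owning service name, K_s).
    """
    all_k = list(services.values()) + [k for _, k in processes.values()]
    # If rebooting the whole system has the largest K_s, restart only it.
    if all(system_k > k for k in all_k):
        return ["system"]
    candidates = [(k, name) for name, k in services.items()]
    for name, (service, k) in processes.items():
        # A process is a restart object only if it outranks its own service.
        if k > services[service]:
            candidates.append((k, name))
    # Higher K_s restarts first, so a qualifying process precedes its service.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [name for _, name in candidates]


chain = build_restart_chain(
    system_k=5.0,
    services={"web": 8.0, "db": 3.0},
    processes={"web-worker": ("web", 10.0), "db-writer": ("db", 2.0)},
)
```

In this example `db-writer` is left out of the chain because its index is below that of the `db` service, so restarting the service already covers it.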
3. Build the implementation process model of recursive restart emergency recovery
Each specific emergency restart recovery process is described with an SPN, and the starting position of each restart recovery is controlled with a DFA, which simplifies the SPN model and makes it easier to understand and analyze. The implementation process of the four-level nested emergency restart recovery strategy for a survivable system is shown in Fig. 4. The DFA model $M$ controls the association and the transfer positions between the recovery sub-processes, which avoids both the state-space explosion problem of the model and the inability of a DFA model to describe the implementation details of the strategy. In the figure, circles represent places, the black dots inside circles are tokens, and small rectangles represent transitions. $P_{avail}$, $P_{down}$, and $P_{reju}$ model the states that the system enters; $P_{clock}$ represents the clock that times recovery; the hollow boxes $T_{down}$, $T_{reju}$, and $T_{up}$ represent transitions whose firing times follow arbitrary distributions; and the solid box $T_{clock}$ represents a deterministic timed transition whose firing time $\eta_i$ is the optimal recovery interval. When a transition fires, the token moves from its input place to its output place, and the system state changes accordingly. A place that holds a token is marked 1 and a place without a token is marked 0, so the tuple $(P_{avail}, P_{down}, P_{clock}, P_{reju})$ identifies the state of the system within a recovery sub-process.
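The token game for a single recovery sub-process can be sketched in Python. The arcs between places and transitions are given only in Fig. 4, which is not reproduced in the text, so the input and output places assigned to each transition below are assumptions made for illustration; timing distributions are also left out.

```python
PLACES = ("P_avail", "P_down", "P_clock", "P_reju")

# Assumed arcs: transition name -> (input places, output places).
TRANSITIONS = {
    "T_down":  (("P_avail",), ("P_down",)),
    "T_up":    (("P_down",), ("P_avail",)),
    "T_clock": (("P_avail", "P_clock"), ("P_reju",)),
    "T_reju":  (("P_reju",), ("P_avail", "P_clock")),
}


def enabled(marking, name):
    """A transition is enabled when every input place holds a token."""
    inputs, _ = TRANSITIONS[name]
    return all(marking[p] == 1 for p in inputs)


def fire(marking, name):
    """Move tokens from the input places to the output places."""
    if not enabled(marking, name):
        raise ValueError(name + " is not enabled")
    inputs, outputs = TRANSITIONS[name]
    new_marking = dict(marking)
    for p in inputs:
        new_marking[p] = 0
    for p in outputs:
        new_marking[p] = 1
    return new_marking


def as_vector(marking):
    """Encode the marking as (P_avail, P_down, P_clock, P_reju)."""
    return tuple(marking[p] for p in PLACES)


m0 = {"P_avail": 1, "P_down": 0, "P_clock": 1, "P_reju": 0}
m1 = fire(m0, "T_clock")    # the clock expires and rejuvenation starts
m2 = fire(m1, "T_reju")     # rejuvenation completes
```

Under these assumed arcs, the marking vector goes from `(1, 0, 1, 0)` to `(0, 0, 0, 1)` during rejuvenation and back to `(1, 0, 1, 0)` afterwards, which is the 1/0 encoding the paragraph describes.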
The two large rectangles contain two groups of places, $\{P_{avail0}, P_{avail1}, \ldots, P_{availn}\}$ and $\{P_{clock0}, P_{clock1}, \ldots, P_{clockn}\}$, which represent the initial places and the timing clocks of the recovery sub-processes respectively; each recovery involves one place from each group. The first $n$ recoveries share the same transitions $T_{downi}$ and $T_{clocki}$; however, for different initial places, $T_{downi}$ follows a different random distribution $F_{1i}(t)$ ($0 \le i < n$), and the firing time $\eta_i$ ($0 \le i < n$) of $T_{clocki}$ also differs. After the recovery state is entered, the same transition $T_{Areju}$ fires. If the system enters the failed state at any moment, the same transition $T_{up}$ fires, following the same random distribution. The arc from the $P_{clock}$ group of places to the automaton $M$ is one input of $M$, from which the current system state $q_{ijk}$ is known. The three arcs from the transitions $T_{Areju}$, $T_{Sreju}$, and $T_{up}$ to the automaton $M$ merge into the other input of $M$; the firing of $T_{Areju}$, $T_{Sreju}$, or $T_{up}$ corresponds to an operation in $\Sigma$ executed by the system. The transfer function $\theta$ of $M$ reads these two inputs and obtains the next state of the system, which is the output of $M$ and controls the initial place of the next sub-process. In Fig. 4 the transitions for service-level, process-level, and thread-level restarts are drawn as one group, shown by the grey box; the total number of transitions in the group equals the sum of the restart counts of the three levels.
The state transition rules of the automaton are described formally in Fig. 5. In the figure, the finite states $q_{ijk}$ of $M$ control the tokens in the place groups $\{P_{avail0}, \ldots, P_{availn}\}$ and $\{P_{clock0}, \ldots, P_{clockn}\}$. The states are defined as follows. In the initial state $q_{000}$, the tokens are in $P_{avail0}$ and $P_{clock0}$. When the system enters the state $q_{ijk} = 00\ldots010\ldots00$ (the bit at position $y$ is 1), the tokens are in $P_{availy}$ and $P_{clocky}$, where

$$y = \sum_{k_1=0}^{j-1} \sum_{k_2=0}^{i-1} N[k_1] + N[k_2] + i + j + k + 1,$$

and at this point the system has executed $k$ thread-level restarts, $j$ process-level restarts, and $i$ service-level restarts. When the system reaches the state $q_{M\,N[m]\,N[y]}$ ($y = N[m]$), the tokens are in $P_{availn}$ and $P_{clockn}$, and the system-level restart is about to be executed.
There is one input to $M$ from the $P_{clock}$ group of places, from which the current system state $q_{ijk}$ is known. The $T_{Areju}$ group, $T_{Sreju}$, and $T_{up}$ provide the other input to $M$, from which the type of restart is known. The transfer function $\theta$ of $M$ reads the two inputs and obtains the next state of the system, and at the same time determines the operation the system should now perform. This is the output of $M$, which controls the initial place of the token in the next recovery sub-process and the restart object on which the system acts from that place.
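The role of the transfer function $\theta$ can be sketched as a small state machine over the counters $(i, j, k)$. This is a simplified sketch, not the automaton of Fig. 5: it assumes fixed per-level limits and assumes that the lower-level counters reset when a higher-level restart completes, neither of which is stated in the text. All names are hypothetical.

```python
from typing import List, Tuple

State = Tuple[int, int, int]     # (i, j, k): service, process, thread restarts done
Limits = Tuple[int, int, int]    # assumed maxima for the service, process, thread levels


def next_operation(state: State, limits: Limits) -> str:
    """Decide which restart level should run next from the given state."""
    i, j, k = state
    max_service, max_process, max_thread = limits
    if k < max_thread:
        return "thread"
    if j < max_process:
        return "process"
    if i < max_service:
        return "service"
    return "system"


def theta(state: State, completed: str, limits: Limits) -> Tuple[State, str]:
    """Read the current state and the restart type that just completed;
    return the next state and the next operation to perform."""
    i, j, k = state
    if completed == "thread":
        new_state = (i, j, k + 1)
    elif completed == "process":
        new_state = (i, j + 1, 0)
    elif completed == "service":
        new_state = (i + 1, 0, 0)
    elif completed == "system":
        new_state = (0, 0, 0)
    else:
        raise ValueError("unknown restart type: " + completed)
    return new_state, next_operation(new_state, limits)


def run_cycle(limits: Limits) -> List[str]:
    """List the operations from q_000 up to the system-level restart."""
    state = (0, 0, 0)
    op = next_operation(state, limits)
    ops = [op]
    while op != "system":
        state, op = theta(state, op, limits)
        ops.append(op)
    return ops


ops = run_cycle((1, 1, 1))
```

With one restart allowed per level, the cycle nests thread restarts inside process restarts inside service restarts, and ends with the system-level restart.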
With slight changes, Fig. 4 can represent a recursively nested emergency recovery strategy with an arbitrary number of levels k, as shown in Fig. 6.
Part of the code of a VS.NET example of process-level recovery management is as follows:
// Requires: using System; using System.Diagnostics; using System.Windows.Forms;
// List all processes currently running in the system
private void ListProcesses()
{
    Process[] ps;
    try
    {
        ps = Process.GetProcesses();
        // Update the contents of the process list
        lvProcesses.BeginUpdate();
        lvProcesses.Clear();
        // Add the columns
        lvProcesses.Columns.Add("Image name", 100, HorizontalAlignment.Left);
        lvProcesses.Columns.Add("Process ID", 60, HorizontalAlignment.Left);
        lvProcesses.Columns.Add("Priority", 60, HorizontalAlignment.Right);
        lvProcesses.Columns.Add("CPU time", 100, HorizontalAlignment.Right);
        lvProcesses.Columns.Add("Memory usage", 100, HorizontalAlignment.Right);
        // Add one list item per process
        foreach (Process p in ps)
        {
            ListViewItem lvi = new ListViewItem();
            lvi.Text = p.ProcessName;
            lvi.SubItems.Add(p.Id.ToString());
            lvi.SubItems.Add(p.BasePriority.ToString());
            lvi.SubItems.Add(p.TotalProcessorTime.Hours.ToString() + ":" +
                p.TotalProcessorTime.Minutes.ToString() + ":" +
                p.TotalProcessorTime.Seconds.ToString());
            lvi.SubItems.Add(p.WorkingSet.ToString());
            lvProcesses.Items.Add(lvi);
        }
        lvProcesses.EndUpdate();
    }
    catch (Exception e)
    {
        MessageBox.Show(e.Message);
    }
}
public int pid = 0;
// Get the process object by process ID, then display its properties
private void FormProp_Load(object sender, System.EventArgs e)
{
    if (pid == 0) return;
    Process p = Process.GetProcessById(pid);
    if (p == null) return;
    try
    {
        txtID.Text = p.Id.ToString();
        txtName.Text = p.ProcessName;
        txtStartTime.Text = p.StartTime.ToLongTimeString();
        txtPriority.Text = p.PriorityClass.ToString();
        txtVirtualMem.Text = p.VirtualMemorySize.ToString();
        txtWorkingSet.Text = p.WorkingSet.ToString();
        if (p.MainModule != null)
        {
            txtModuleName.Text = p.MainModule.FileName;
            txtModuleDescription.Text = p.MainModule.FileVersionInfo.FileDescription;
            txtModuleVersion.Text = p.MainModule.FileVersionInfo.FileVersion;
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(this, ex.Message, "An exception occurred",
            MessageBoxButtons.OK, MessageBoxIcon.Warning);
    }
    finally
    {
        // Release the resources associated with the Process object
        p.Close();
    }
}
// Create a ProcessStartInfo instance from the user's input in the form,
// then use that instance to start a new process
private void btnOK_Click(object sender, System.EventArgs e)
{
    ProcessStartInfo si = new ProcessStartInfo();
    si.FileName = txtFileName.Text;
    si.Arguments = txtParam.Text;
    si.ErrorDialog = chkErrDlg.Checked;
    si.UseShellExecute = chkUseShell.Checked;
    if (cbxVerb.SelectedIndex != -1)
        si.Verb = cbxVerb.Text;
    si.WorkingDirectory = txtWorkDir.Text;
    if (cbxStyle.SelectedIndex == 0)
        si.WindowStyle = ProcessWindowStyle.Maximized;
    else if (cbxStyle.SelectedIndex == 1)
        si.WindowStyle = ProcessWindowStyle.Minimized;
    else
        si.WindowStyle = ProcessWindowStyle.Normal;
    try
    {
        Process.Start(si);
    }
    catch (Exception ex)
    {
        MessageBox.Show(this, ex.Message, "An exception occurred",
            MessageBoxButtons.OK, MessageBoxIcon.Warning);
        // Keep the dialog open so the user can correct the input
        this.DialogResult = DialogResult.None;
    }
}
private void Form1_Load(object sender, System.EventArgs e)
{
    ListProcesses();
}
// Terminate the currently selected process
private void btnKillProcess_Click(object sender, System.EventArgs e)
{
    if (MessageBox.Show(this, "Confirm ending process: " + lvProcesses.SelectedItems[0].Text,
        "End process", MessageBoxButtons.OKCancel, MessageBoxIcon.Warning) == DialogResult.Cancel)
        return;
    int pid = Int32.Parse(lvProcesses.SelectedItems[0].SubItems[1].Text);
    Process p = Process.GetProcessById(pid);
    if (p == null) return;
    // Ask the process to close first; kill it only if it has no main window to close
    if (!p.CloseMainWindow())
        p.Kill();
    p.WaitForExit();
    p.Close();
    ListProcesses();
}
// Refresh the process list
private void btnRefresh_Click(object sender, System.EventArgs e)
{
    ListProcesses();
}
// Create a new process
private void btnNewProcess_Click(object sender, System.EventArgs e)
{
    FormStartInfo dlg = new FormStartInfo();
    if (dlg.ShowDialog() == DialogResult.OK)
        ListProcesses();
}
// Show the properties of the currently selected process
private void btnProcessProp_Click(object sender, System.EventArgs e)
{
    FormProp dlg = new FormProp();
    dlg.Text = "Process " + lvProcesses.SelectedItems[0].Text + " properties";
    dlg.pid = Int32.Parse(lvProcesses.SelectedItems[0].SubItems[1].Text);
    dlg.ShowDialog();
}
' Suspend process & resume process (Visual Basic)
Private Declare Function OpenProcess Lib "kernel32" (ByVal dwDesiredAccess As Long, _
    ByVal bInheritHandle As Long, ByVal dwProcessId As Long) As Long
Private Declare Function CloseHandle Lib "kernel32" (ByVal hObject As Long) As Long
Private Const SYNCHRONIZE = &H100000
Private Const STANDARD_RIGHTS_REQUIRED = &HF0000
Private Const PROCESS_ALL_ACCESS = (STANDARD_RIGHTS_REQUIRED Or SYNCHRONIZE Or &HFFF)
Private Declare Function NtSuspendProcess Lib "ntdll.dll" (ByVal hProc As Long) As Long
Private Declare Function NtResumeProcess Lib "ntdll.dll" (ByVal hProc As Long) As Long
Private Declare Function TerminateProcess Lib "kernel32" (ByVal hProcess As Long, _
    ByVal uExitCode As Long) As Long
Private hProcess As Long
Private Sub cmdClose_Click() ' Close the handle
    CloseHandle hProcess
End Sub
Private Sub cmdResume_Click() ' Resume the process
    If IsNumeric(txtPid.Text) Then
        hProcess = OpenProcess(PROCESS_ALL_ACCESS, False, CLng(txtPid.Text))
        If hProcess <> 0 Then
            NtResumeProcess hProcess
        End If
    End If
End Sub
Private Sub cmdTerminate_Click() ' Terminate the process
    If hProcess Then
        TerminateProcess hProcess, 0
    Else
        If IsNumeric(txtPid.Text) Then
            hProcess = OpenProcess(PROCESS_ALL_ACCESS, False, CLng(txtPid.Text))

Claims (2)

1, a kind of based on quaternary nested task key system survival emergency recovery method of restarting, it is characterized in that:
(1) layer be will restart and system-level, seeervice level, process level and four grades of thread-level will be divided into;
(2) restarting the priority index parameter is K s, K s=d* Δ p* ∑, wherein, s represents the crucial grade that needs the controlled object of restarting of emergency recovery, d to represent s, the existence situation grade that Δ p represents s; When determining to restart object, defer to following rule: a. at same one deck, the K of controlled object sBig more, it is high more to restart priority, preferentially more restarts; B. at different layers, if the K of a certain process sK greater than the application service under it sValue, then this process and service are restarted object as one respectively, and process has precedence over affiliated application service execution reboot operation; Otherwise then this process can not only will be served as one and restart object as restarting object; If guide the K of total system again sValue is maximum, then only with total system as restarting object, carry out simple periodically recovery;
(3) describe emergent each time specific implementation process of restarting recovery based on the SPN method, restart the home position of recovery simultaneously with DFA control each time.
2. The task key system survival emergency recovery method based on quaternary nested restart according to claim 1, characterized in that the restart layers, divided into system level, service level, process level and thread level, have the following recovery policies at each level:
(a) system-level recovery policy: the interval between system emergency restart recoveries is η, with η_i = MTBF_i - MTTR_i (i = 0, 1, ...); at time η_i the system survivability has dropped to H_min, and the system-level restart policy is executed at that point;
(b) service-level recovery policy: when the survival performance P of a key service drops to a predetermined threshold P_min, one service-level restart recovery is performed; while P_max > P > P_min, M process-level recoveries are performed; after the M process-level recoveries, one service-level recovery is performed to return the survival performance of the service to its initial value P_max, and a new round of M process-level recoveries begins;
(c) process/thread-level recovery policy: the checkpoint file is read through the system's ioctl() function call; after the child process is created, the parent process returns to user space and waits until the recovery task finishes, while the child process uses the clone() function to create multiple threads that restore the original task; after passing the last barrier, all threads leave kernel space and enter user space, where a callback function continues executing the process's code after recovery.
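The system-level interval in (a) and the alternation between process-level and service-level recovery in (b) can be sketched as follows. This is a minimal sketch under stated assumptions: the class ServiceRecoveryPolicy, the function system_restart_interval, the action strings, and the choice to do nothing when P is at or above P_max are all hypothetical and are not taken from the patent text.

```python
def system_restart_interval(mtbf, mttr):
    """eta_i = MTBF_i - MTTR_i, the interval before the next system-level restart."""
    return mtbf - mttr


class ServiceRecoveryPolicy:
    """Chooses the next recovery action for one key service (illustrative only)."""

    def __init__(self, p_min, p_max, m):
        self.p_min = p_min  # threshold that triggers a service-level restart
        self.p_max = p_max  # initial survival performance of the service
        self.m = m          # process-level recoveries allowed per round
        self.process_recoveries = 0

    def next_action(self, p):
        # Survival performance has fallen to the threshold: restart the service.
        if p <= self.p_min:
            self.process_recoveries = 0
            return "service_restart"
        # Assumption: at full performance nothing needs to be recovered.
        if p >= self.p_max:
            return "none"
        # After M process-level recoveries, one service-level recovery
        # restores the service and starts a new round.
        if self.process_recoveries >= self.m:
            self.process_recoveries = 0
            return "service_restart"
        self.process_recoveries += 1
        return "process_restart"
```

With M = 2 and P held between P_min and P_max, repeated calls to next_action produce two process-level recoveries followed by one service-level recovery, after which the cycle repeats.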
CN2009100719146A 2009-04-29 2009-04-29 Task key system survival emergency recovery method based on quaternary nested restart Expired - Fee Related CN101539863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100719146A CN101539863B (en) 2009-04-29 2009-04-29 Task key system survival emergency recovery method based on quaternary nested restart

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100719146A CN101539863B (en) 2009-04-29 2009-04-29 Task key system survival emergency recovery method based on quaternary nested restart

Publications (2)

Publication Number Publication Date
CN101539863A true CN101539863A (en) 2009-09-23
CN101539863B CN101539863B (en) 2012-10-31

Family

ID=41123065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100719146A Expired - Fee Related CN101539863B (en) 2009-04-29 2009-04-29 Task key system survival emergency recovery method based on quaternary nested restart

Country Status (1)

Country Link
CN (1) CN101539863B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018103045A1 (en) * 2016-12-08 2018-06-14 华为技术有限公司 Checkpoint creation method, device and system
CN111445118A (en) * 2020-03-24 2020-07-24 昆明理工大学 Task collaborative flow network model construction method and efficiency evaluation method for mine accident emergency rescue digital plan

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ZA200506983B (en) * 2004-10-01 2007-04-25 Microsoft Corp System and method for determining target tailback and target priority for a distributed file system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018103045A1 (en) * 2016-12-08 2018-06-14 华为技术有限公司 Checkpoint creation method, device and system
CN108604205A (en) * 2016-12-08 2018-09-28 华为技术有限公司 The creation method of test point, device and system
CN108604205B (en) * 2016-12-08 2021-02-12 华为技术有限公司 Test point creating method, device and system
CN111445118A (en) * 2020-03-24 2020-07-24 昆明理工大学 Task collaborative flow network model construction method and efficiency evaluation method for mine accident emergency rescue digital plan
CN111445118B (en) * 2020-03-24 2022-07-26 昆明理工大学 Task collaborative flow network model construction method and efficiency evaluation method for mine accident emergency rescue digital plan

Also Published As

Publication number Publication date
CN101539863B (en) 2012-10-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121031

Termination date: 20180429
