CN101539863A

CN101539863A - Emergency recovery method for mission-critical system survivability based on four-level nested restart

Info

Publication number: CN101539863A
Application number: CN200910071914A
Authority: CN
Inventors: Wang Huiqiang (王慧强); Zhao Guosheng (赵国生); Wang Jian (王健)
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2009-04-29
Filing date: 2009-04-29
Publication date: 2009-09-23
Anticipated expiration: 2029-04-29
Also published as: CN101539863B

Abstract

The invention provides an emergency recovery method for the survivability of mission-critical systems based on four-level nested restart. Its technical features are as follows: restart levels of different granularities and the restart recovery policy for each level are determined; a method for computing restart priority is defined, the restart order of the modules in each restart layer is determined, and a recursive nested restart chain is established; and the execution of recursive-restart emergency recovery is described formally with a stochastic Petri net (SPN) and a deterministic finite automaton (DFA). The invention has the following advantages: for different fault scenarios, restart objects of different granularities can be selected efficiently and the restart order determined, which reduces recovery time and recovery cost and strengthens the recoverability of a mission-critical system. When system survivability degrades to a certain degree, the internal state is cleared by stopping the running application or by restarting the system, an application service, or a process, thereby releasing operating system resources and restoring system survivability.

Description

Emergency recovery method for mission-critical system survivability based on four-level nested restart

(1) Technical field

The present invention relates to emergency recovery methods that strengthen the survivability of mission-critical systems, and in particular to an emergency restart recovery method for mission-critical systems that require high survivability.

(2) Background technology

Emergency recovery is the last resort for enhancing system survivability. It focuses on ensuring that critical applications can continue to provide service: it strengthens the recoverability (recovery resilience) of a system facing successful intrusions and malicious attacks, forms the last line of a defense-in-depth strategy, and is the final-stage active response taken by a survivable system. The basic idea of emergency recovery is that, when the survivability of the system or of a key service drops to a certain degree, the running program is terminated periodically, the internal state of the continuously running system is cleared, and the system is restarted and restored to its initial state or to a "healthy" intermediate state. This restores the survivability of the system or application service to some degree and prevents more serious failures that might otherwise occur later.

Many studies show that restart (reboot or rejuvenation) techniques can effectively eliminate some of the error and failure states that a system accumulates during operation, including errors and failures caused by deliberate attacks, so a restart can effectively restore a system or application service to its initial good state. However, research on restart recovery policies in survivable systems is still not deep, and existing work performs only service-level restart. Castelli simply divides restart into two levels, service level and system level, but does not give a concrete method for determining the restart interval and restart priority. Hong determines the recovery granularity from measured values of actual system resources; when the current value of resource consumption does not reflect the degree of resource loss, this method is no longer suitable. Although Xie uses a semi-Markov process to optimize the above method, its basic idea is unchanged. All of these studies have limitations and remain coarse. By comparison, research in the software fault-tolerance field on restart techniques after software failure has entered a mature stage. After Recursive Restartability (RR) was proposed in the Recovery-Oriented Computing (ROC) project, developed jointly by UC Berkeley and Stanford University, Candea proposed the microreboot technique. Its idea is to build a restart tree in advance, in which each node is an application or process that can be restarted independently. A restart operation begins from a node at the bottom of the restart tree; if that does not restore normal behavior, the restart is pushed up one level and carried out over a wider scope. Both techniques require the software architecture to be known: the software must follow the principle of loose coupling between modules from the start of development, so that restarting one module does not affect the normal operation of other modules. Wang Huiqiang proposed a general fast self-recovery method based on recursive microreboot, aimed mainly at improving system availability while reducing the mean time to recovery. At present, existing research on restart recovery has not yet produced results in the survivability field. The four-level recovery hierarchy proposed by the present invention differs essentially from the microreboot and macro-reboot mentioned in the above research, so the present invention is novel.

(3) Summary of the invention

The object of the present invention is to provide an emergency recovery method for mission-critical system survivability, based on four-level nested restart, that reduces recovery time, reduces recovery cost, and strengthens the survivability of mission-critical systems.

The object of the present invention is achieved as follows:

(1) the restart layers are divided into four levels: system level, service level, process level, and thread level;

(2) the restart priority index parameter is K_s, where K_s = d * Δp * ∑(resources released by restarting the controlled object / recovery cost incurred by restarting the controlled object); here s denotes the controlled object that needs to be restarted for emergency recovery, d denotes the criticality grade of s, and Δp denotes the survival-situation grade of s. Restart objects are determined according to the following rules: a. within the same layer, the larger the K_s of a controlled object, the higher its restart priority and the earlier it is restarted; b. across layers, if the K_s of a process is greater than the K_s of the application service to which it belongs, the process and the service are each treated as a restart object, and the process is restarted before the service; otherwise the process is not treated as a restart object and only the service is. If the K_s of rebooting the whole system is the largest, only the whole system is treated as a restart object and simple periodic recovery is performed;

(3) the execution of each emergency restart recovery is described with an SPN, and the starting position of each restart recovery is controlled with a DFA.

The restart layers are divided into system level, service level, process level, and thread level, and the recovery policies of the different restart layers are as follows:

(a) system-level recovery policy: the system emergency restart recovery interval is η, with η_i = MTBF_i - MTTR_i (i = 0, 1, ...); when system survivability falls to H_min at time η_i, the system-level restart policy is executed;

(b) service-level recovery policy: when the survivability of a key service falls to a predetermined threshold P_min, one service-level restart recovery is executed; when P_max > P > P_min, M process-level recoveries are executed; after the M process-level recoveries, one service-level recovery is executed so that the survivability of the service returns to its initial value P_max, and another round of M process-level recoveries begins;

(c) process/thread-level recovery policy: the system ioctl() call is used to read the checkpoint file; after a child process is created, the parent process returns to user space and waits until the recovery task finishes, while the child process uses clone() to create multiple threads that recover the original task; after passing the last synchronization barrier, all threads leave kernel space and enter user space, where their callback functions continue executing the code that follows process recovery.

The present invention targets mission-critical systems and proposes a four-level (system level, service level, process level, and thread level) recursively nested, fine-grained emergency recovery method that enhances system survivability. Its main technical features are: 1. the restart levels of different granularities in the system, and the restart recovery policy of each level, are determined; 2. a method for computing restart priority is defined, the restart order of the modules in each restart layer is determined, and a recursive nested restart chain is established; 3. the execution of recursive-restart emergency recovery is described formally using a stochastic Petri net (SPN) and a deterministic finite automaton (DFA).

Emergency recovery is a proactive, preemptive technique for enhancing survivability. The four-level recovery policy in the present invention rests on the following considerations: planned recovery costs much less than unplanned system downtime; recovering one application process (process-level recovery) costs much less than recovering the whole application (service-level recovery); and recovering one application (service-level recovery) costs much less than recovering the whole application system (system-level recovery).

Compared with the traditional periodic system-level recovery method, the present invention has the following advantages: for different fault scenarios, restart objects of different granularities can be selected effectively and the restart order determined, which reduces recovery time and recovery cost and strengthens the recoverability of a mission-critical system. When system survivability degrades to a certain degree, the running application is stopped, or the system, an application service, or a process is restarted to clear its internal state, thereby releasing operating system resources and restoring system survivability.

(4) Description of the drawings

Fig. 1 is a diagram of the system-level recursive recovery policy;

Fig. 2 is a diagram of the service-level recursive recovery policy;

Fig. 3 is a diagram of the process/thread-level recovery policy;

Fig. 4 shows the execution process of the four-level nested emergency recovery policy;

Fig. 5 is the DFA model of four-level nested emergency recovery;

Fig. 6 shows the execution process of the k-level nested emergency recovery policy;

Fig. 7 is a flow diagram of the present invention.

(5) Embodiments

The present invention is described in more detail below with reference to the accompanying drawings:

1. Determining the restart layers and the restart recovery policy of each level

Implementing a fine-grained nested emergency recovery policy requires dividing the system into several restart layers, and the restart objects in each layer must be independent service entities with atomicity and concurrency. Restart recovery is therefore defined at four levels: system level, service level, process level, and thread level. System-level recovery takes the whole system as the restart object; service-level recovery takes an application service as the restart object; process-level recovery takes a specific process executed during system operation as the restart object; and thread-level recovery takes an individual thread within a process as the restart object. Multiple service-level recoveries are performed before each system-level recovery; multiple process-level recoveries are performed before each service-level recovery; and multiple thread-level recoveries accompany each process-level recovery.

(1) System-level recovery policy

Depending on the failure distribution of the existing system, the system emergency restart recovery interval is defined as η, with η_i = MTBF_i - MTTR_i (i = 0, 1, ...). As shown in Fig. 1, when the system has just started running it has its best survivability H_max. As unpredictable events such as failures, errors, and attacks occur, system survivability gradually declines. Suppose it falls to H_min at time η_i; the system-level restart policy is then executed, which quickly restores system survivability to its initial state.
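The interval rule can be illustrated with a small Python sketch. The names `CycleStats`, `restart_interval`, and `needs_system_restart`, the use of hours as the time unit, and the numeric survivability scale are assumptions made for illustration; the description itself only gives η_i = MTBF_i - MTTR_i and the H_min threshold.

```python
from dataclasses import dataclass


@dataclass
class CycleStats:
    """Hypothetical per-cycle reliability figures, both in hours."""
    mtbf: float  # mean time between failures for cycle i
    mttr: float  # mean time to recover for cycle i


def restart_interval(stats: CycleStats) -> float:
    """Return eta_i = MTBF_i - MTTR_i, rejecting non-positive intervals."""
    eta = stats.mtbf - stats.mttr
    if eta <= 0:
        raise ValueError("MTTR must be smaller than MTBF")
    return eta


def needs_system_restart(h: float, h_min: float) -> bool:
    """True once the measured survivability h has fallen to H_min or below."""
    return h <= h_min


if __name__ == "__main__":
    eta = restart_interval(CycleStats(mtbf=720.0, mttr=2.0))
    print("schedule system-level restart after", eta, "hours")
    print("restart now?", needs_system_restart(h=0.3, h_min=0.3))
```

How the survivability value h is measured is left open here, as it is in the description.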

(2) Service-level recovery policy

Under the service-level recovery policy, whenever the survivability of a key service falls to a predetermined threshold P_min, one service-level restart recovery is executed. When the survivability P of a key service satisfies P_max > P > P_min, M process-level recoveries are executed, which yields a series of recovered survivability values for the service, P_max^(1), P_max^(2), ..., P_max^(n), and the time intervals between recoveries, η_0, η_1, ..., η_n, as shown in Fig. 2. Because these recoveries cannot release all of the exhausted operating system resources, the survivability value P and the interval η after each recovery decrease successively; after a certain number of recoveries N, recovery operations become so frequent that the service can no longer be provided normally, until the service fails completely. We therefore perform one service-level recovery after M process-level recoveries, so that the survivability of the service returns to its initial value P_max, and then begin another round of M process-level recoveries. This cycle repeats; if no other fault occurs, the key service can be expected to keep running for a long time.
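The alternation between process-level and service-level recovery can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class name `ServiceRecoveryPolicy`, the string labels, and the assumption that `decide` is called each time a recovery is due are all hypothetical.

```python
class ServiceRecoveryPolicy:
    """Pick the recovery level for one key service (illustrative sketch)."""

    def __init__(self, p_min: float, p_max: float, m: int) -> None:
        self.p_min = p_min              # threshold that forces a service-level restart
        self.p_max = p_max              # initial (best) survivability of the service
        self.m = m                      # process-level recoveries allowed per round
        self.process_recoveries = 0     # process-level recoveries done in this round

    def decide(self, p: float) -> str:
        """Return "none", "process", or "service" for the current survivability p."""
        if p >= self.p_max:
            return "none"
        # At or below P_min, or after M process-level recoveries, restart the
        # whole service and begin a new round.
        if p <= self.p_min or self.process_recoveries >= self.m:
            self.process_recoveries = 0
            return "service"
        self.process_recoveries += 1
        return "process"


if __name__ == "__main__":
    policy = ServiceRecoveryPolicy(p_min=0.2, p_max=1.0, m=2)
    for p in (0.8, 0.7, 0.6, 0.9, 0.1):
        print(p, policy.decide(p))
```

Resetting the counter on every service-level recovery is what lets the cycle repeat with a fresh round of M process-level recoveries.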

(3) Process-level and thread-level recovery policy

To avoid the waste caused by a key service recomputing from the beginning after a failure event, and to improve system availability, checkpoints are set at suitable moments during normal operation of the service. Each checkpoint saves the state of the service processes at that moment, and the dependencies between processes are tracked and recorded. After a service interruption fault occurs, the associated processes are rolled back to the consistent state (checkpoint) that preceded the fault; after state recovery, execution resumes from this checkpoint.

Process-level or thread-level recovery consists of restoring the memory image from the checkpoint file. The recovery mechanism uses the system ioctl() call to read the checkpoint file. After a child process is created, the parent process returns to user space, waits until the recovery task finishes, and then exits. The child process uses clone() to create multiple threads that recover the original task; one of these threads reads the basic information from the checkpoint file and restores the identifier of each thread and the relationships between them. After passing the last synchronization barrier, all threads leave kernel space and enter user space, where their callback functions continue executing the code that follows process recovery, as shown in Fig. 3.
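The checkpoint-and-resume idea can be shown with a user-level analogue in Python. This is deliberately not the kernel-level mechanism described above (no ioctl(), clone(), or memory image): it only saves a small state dictionary with `pickle` and resumes a loop from it. The function names, the state layout, and the `fail_at` parameter are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temporary file, then atomically replace the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def restore_checkpoint(path: str, default):
    """Load the last checkpoint, or return the default if none exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run_task(path: str, total_steps: int, fail_at=None) -> int:
    """Sum 0..total_steps-1, checkpointing after every step.

    fail_at simulates a fault just before the given step is executed.
    """
    state = restore_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["acc"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "task.ckpt")
        try:
            run_task(ckpt, 10, fail_at=5)
        except RuntimeError:
            pass  # the "process" died; restart it from the checkpoint
        print(run_task(ckpt, 10))
```

The second call does not recompute steps 0 to 4; it resumes from the state saved before the simulated fault, which is the saving the checkpoint policy aims for.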

2. Determining the restart order of the modules in each restart layer and establishing the recursive nested restart chain

In general, the finer the recovery granularity, the shorter the service outage and the lower the recovery cost. However, the restart order cannot simply be fixed as thread level, process level, service level, system level. To determine the restart order, the present invention introduces the restart priority index parameter K_s, defined as K_s = d * Δp * ∑(resources released by restarting the controlled object / recovery cost incurred by restarting the controlled object), where s denotes the controlled object that needs to be restarted for emergency recovery, d denotes the criticality grade of s, and Δp denotes the survival-situation grade of s.

Restart objects are determined according to the following rules: (1) within the same layer, the larger the K_s of a controlled object, the higher its restart priority and the earlier it is restarted; (2) across layers, if the K_s of a process is greater than the K_s of the application service to which it belongs, the process and the service are each treated as a restart object, and the process is restarted before the service; otherwise the process is not treated as a restart object and only the service is. If the K_s of rebooting the whole system is the largest, only the whole system needs to be treated as a restart object and simple periodic recovery is performed, without considering the fine-grained recovery policies at the service, process, and thread levels.

Because restarting an application service replaces all the processes that it started during operation, restarting the application service is equivalent to restarting all of its subordinate processes. Rebooting the system terminates all application services running on it and releases all system resources, and is the most thorough recovery of system survivability. The rules above yield the restart objects at each level; ordering these restart objects by restart priority produces a chain structure, the restart chain.
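One way to read these rules is sketched below in Python. Several points are assumptions rather than statements from the description: the ∑ is taken as a sum of (released resources / recovery cost) ratios over resource types, thread-level objects are omitted for brevity, the system object is kept in the chain as its own entry, and the names `RestartObject` and `build_restart_chain` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class RestartObject:
    name: str
    level: str                            # "system", "service", or "process"
    d: float                              # criticality grade of the object
    delta_p: float                        # survival-situation grade of the object
    ratios: Sequence[Tuple[float, float]]  # (released resources, recovery cost) pairs
    parent: Optional[str] = None          # owning service, for a process

    @property
    def k(self) -> float:
        """K_s = d * delta_p * sum(released / cost)."""
        return self.d * self.delta_p * sum(r / c for r, c in self.ratios)


def build_restart_chain(objects: List[RestartObject]) -> List[str]:
    """Return restart-object names ordered by descending K_s."""
    system = next(o for o in objects if o.level == "system")
    others = [o for o in objects if o.level != "system"]
    # If rebooting the whole system has the largest K_s, it is the only object.
    if all(system.k > o.k for o in others):
        return [system.name]
    k_of = {o.name: o.k for o in objects}
    chain = [system]
    for o in others:
        # A process qualifies only if its K_s exceeds that of its own service.
        if o.level == "service" or o.k > k_of[o.parent]:
            chain.append(o)
    # Descending K_s puts a qualifying process ahead of its service.
    chain.sort(key=lambda o: o.k, reverse=True)
    return [o.name for o in chain]


if __name__ == "__main__":
    objs = [
        RestartObject("system", "system", 1, 1, [(10, 10)]),
        RestartObject("web", "service", 2, 1, [(6, 2)], parent="system"),
        RestartObject("worker", "process", 3, 1, [(4, 1)], parent="web"),
        RestartObject("logger", "process", 1, 1, [(1, 1)], parent="web"),
    ]
    print(build_restart_chain(objs))
```

In this example the `logger` process is dropped because its K_s does not exceed that of its service, so restarting `web` covers it.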

3. Building the execution model of recursive-restart emergency recovery

The execution of each emergency restart recovery is described with an SPN, and the starting position of each restart recovery is controlled with a DFA; this simplifies the SPN model and makes it easier to understand and analyze. The execution of the four-level nested emergency restart recovery policy for a survivable system is shown in Fig. 4. The DFA model M controls the association and transfer between the recovery subprocesses, which avoids both the state-space explosion of the model and the inability of a DFA model to describe the details of policy execution. In the figure, circles represent places, black dots inside circles are tokens, and small rectangles represent transitions. P_avail, P_down, and P_reju model the states that the system enters, and P_clock represents the clock that times recovery. The hollow boxes T_down, T_reju, and T_up represent transitions whose firing times follow arbitrary distributions; the solid box T_clock represents a deterministic timed transition, whose fixed time η_i is the optimal recovery interval. When a transition fires, the token moves from its input place to its output place and the system state changes accordingly. If a place containing a token is written as 1 and a place without a token as 0, the place vector (P_avail, P_down, P_clock, P_reju) identifies the state of the system within a recovery subprocess.

The two large rectangles contain two groups of places, {P_avail0, P_avail1, ..., P_availn} and {P_clock0, P_clock1, ..., P_clockn}, which represent the initial place and the clock of each recovery subprocess respectively; each recovery involves one place from each group. The first n recoveries share the same transitions T_downi and T_clocki, but for different initial places T_downi follows a different random distribution F_1i(t) (0 ≤ i < n) and the firing time η_i (0 ≤ i < n) of T_clocki also differs; after the recovery state is entered, the same transition T_areju fires. If the system enters a failure state at any moment, the same transition T_up fires, following the same random distribution. The arc from the P_clock group of places to the automaton M is one input of M, from which the current state q_ijk of the system is known. The arcs from transitions T_areju, T_sreju, and T_up to M are merged into the other input of M; the firing of T_areju, T_sreju, or T_up corresponds to the operation in ∑ that the system performs. The transfer function θ of M reads these two inputs and obtains the next state of the system, which is the output of M and controls the initial place of the next subprocess. In Fig. 4 the transitions for service-level, process-level, and thread-level restart are shown as one group, in the gray box; the total number of transitions in the group equals the sum of the numbers of restarts at the three levels.

The state transition rules of the automaton are described formally in Fig. 5. In the figure, the finite states q_ijk of M control the tokens in the place groups {P_avail0, ..., P_availn} and {P_clock0, ..., P_clockn}. The states are defined as follows: in the initial state q_000, the tokens are at P_avail0 and P_clock0. When the system enters state q_ijk = 00...010...00 (bit y is 1), the tokens are at P_availy and P_clocky, where

y = \sum_{k_1 = 0}^{j - 1} \sum_{k_2 = 0}^{i - 1} N[k_1] + N[k_2] + i + j + k + 1,

and the system has by then executed k thread-level restarts, j process-level restarts, and i service-level restarts. When the system reaches state q_{M N[m] N[y]} (y = N[m]), the tokens are at P_availn and P_clockn, and the system-level restart is prepared.

The P_clock group of places provides one input to M, from which the current state q_ijk of the system is known; the transitions T_areju (group), T_sreju, and T_up provide the other input to M, from which the type of restart is known. The transfer function θ of M reads the two inputs and obtains the next state of the system together with the operation the system should perform at that point. This is the output of M, which controls the initial place of the token in the next recovery subprocess and the restart object that the system executes from that place.
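A toy version of the controlling automaton can make the role of θ concrete. The Python sketch below tracks only the counters (i, j, k) of the state q_ijk and omits the SPN timing entirely. The per-level limits, the rule that a higher-level restart resets the lower-level counters, and the names `State`, `Limits`, `due_restart`, `theta`, and `run_cycle` are all assumptions for illustration; the description does not fix these details.

```python
from typing import List, NamedTuple


class State(NamedTuple):
    i: int  # service-level restarts executed
    j: int  # process-level restarts executed
    k: int  # thread-level restarts executed


class Limits(NamedTuple):
    services: int   # service-level restarts before a system-level restart
    processes: int  # process-level restarts before a service-level restart
    threads: int    # thread-level restarts before a process-level restart


def due_restart(q: State, lim: Limits) -> str:
    """Choose the finest restart level that still has budget left."""
    if q.k < lim.threads:
        return "thread"
    if q.j < lim.processes:
        return "process"
    if q.i < lim.services:
        return "service"
    return "system"


def theta(q: State, op: str) -> State:
    """Transfer function: current state and restart type give the next state."""
    if op == "thread":
        return State(q.i, q.j, q.k + 1)
    if op == "process":
        return State(q.i, q.j + 1, 0)
    if op == "service":
        return State(q.i + 1, 0, 0)
    if op == "system":
        return State(0, 0, 0)
    raise ValueError("unknown restart type: " + op)


def run_cycle(lim: Limits) -> List[str]:
    """List the restart types from q_000 up to and including the system restart."""
    q, ops = State(0, 0, 0), []
    while True:
        op = due_restart(q, lim)
        ops.append(op)
        q = theta(q, op)
        if op == "system":
            return ops


if __name__ == "__main__":
    print(run_cycle(Limits(services=1, processes=1, threads=1)))
```

With one restart allowed per level, the sequence nests thread restarts inside process restarts inside service restarts before the single system-level restart returns the automaton to q_000.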

With small changes, Fig. 4 can represent a recursively nested emergency recovery policy with an arbitrary number of levels k, as shown in Fig. 6.

Part of the code of a VS.Net example of process-level recovery management is as follows:

// List all processes currently in the system
private void ListProcesses()
{
    Process[] ps;
    try
    {
        ps = Process.GetProcesses();
        // Update the contents of the process list
        lvProcesses.BeginUpdate();
        lvProcesses.Clear();
        // Add columns
        lvProcesses.Columns.Add("Image name", 100, HorizontalAlignment.Left);
        lvProcesses.Columns.Add("Process ID", 60, HorizontalAlignment.Left);
        lvProcesses.Columns.Add("Priority", 60, HorizontalAlignment.Right);
        lvProcesses.Columns.Add("CPU time", 100, HorizontalAlignment.Right);
        lvProcesses.Columns.Add("Memory usage", 100, HorizontalAlignment.Right);
        // Add list items
        foreach (Process p in ps)
        {
            ListViewItem lvi = new ListViewItem();
            lvi.Text = p.ProcessName;
            lvi.SubItems.Add(p.Id.ToString());
            lvi.SubItems.Add(p.BasePriority.ToString());
            lvi.SubItems.Add(p.TotalProcessorTime.Hours.ToString() + ":" + p.TotalProcessorTime.Minutes.ToString() + ":" + p.TotalProcessorTime.Seconds.ToString());
            lvi.SubItems.Add(p.WorkingSet.ToString());
            lvProcesses.Items.Add(lvi);
        }
        lvProcesses.EndUpdate();
    }
    catch (Exception e)
    {
        MessageBox.Show(e.Message);
    }
}

public int pid = 0;

// Get the process object by process ID, then show its properties
private void FormProp_Load(object sender, System.EventArgs e)
{
    if (pid == 0) return;
    Process p = Process.GetProcessById(pid);
    if (p == null) return;
    try
    {
        txtID.Text = p.Id.ToString();
        txtName.Text = p.ProcessName;
        txtStartTime.Text = p.StartTime.ToLongTimeString();
        txtPriority.Text = p.PriorityClass.ToString();
        txtVirtualMem.Text = p.VirtualMemorySize.ToString();
        txtWorkingSet.Text = p.WorkingSet.ToString();
        if (p.MainModule != null)
        {
            txtModuleName.Text = p.MainModule.FileName;
            txtModuleDescription.Text = p.MainModule.FileVersionInfo.FileDescription;
            txtModuleVersion.Text = p.MainModule.FileVersionInfo.FileVersion;
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(this, ex.Message, "Exception occurred", MessageBoxButtons.OK, MessageBoxIcon.Warning);
    }
    finally
    {
        p.Close();
    }
}

// Create a ProcessStartInfo instance from the user's input in the form,
// then use it to start a new process
private void btnOK_Click(object sender, System.EventArgs e)
{
    ProcessStartInfo si = new ProcessStartInfo();
    si.FileName = txtFileName.Text;
    si.Arguments = txtParam.Text;
    si.ErrorDialog = chkErrDlg.Checked;
    si.UseShellExecute = chkUseShell.Checked;
    if (cbxVerb.SelectedIndex != -1)
        si.Verb = cbxVerb.Text;
    si.WorkingDirectory = txtWorkDir.Text;
    if (cbxStyle.SelectedIndex == 0)
        si.WindowStyle = ProcessWindowStyle.Maximized;
    else if (cbxStyle.SelectedIndex == 1)
        si.WindowStyle = ProcessWindowStyle.Minimized;
    else
        si.WindowStyle = ProcessWindowStyle.Normal;
    try
    {
        Process.Start(si);
    }
    catch (Exception ex)
    {
        MessageBox.Show(this, ex.Message, "Exception occurred", MessageBoxButtons.OK, MessageBoxIcon.Warning);
        this.DialogResult = DialogResult.None;
    }
}

private void Form1_Load(object sender, System.EventArgs e)
{
    ListProcesses();
}

// Terminate the currently selected process
private void btnKillProcess_Click(object sender, System.EventArgs e)
{
    if (MessageBox.Show(this, "Confirm ending process: " + lvProcesses.SelectedItems[0].Text, "End process", MessageBoxButtons.OKCancel, MessageBoxIcon.Warning) == DialogResult.Cancel)
        return;
    int pid = Int32.Parse(lvProcesses.SelectedItems[0].SubItems[1].Text);
    Process p = Process.GetProcessById(pid);
    if (p == null) return;
    // Ask the main window to close first; kill the process only if that fails
    if (!p.CloseMainWindow())
        p.Kill();
    p.WaitForExit();
    p.Close();
    ListProcesses();
}

// Refresh the process list
private void btnRefresh_Click(object sender, System.EventArgs e)
{
    ListProcesses();
}

// Create a new process
private void btnNewProcess_Click(object sender, System.EventArgs e)
{
    FormStartInfo dlg = new FormStartInfo();
    if (dlg.ShowDialog() == DialogResult.OK)
        ListProcesses();
}

// Show the properties of the currently selected process
private void btnProcessProp_Click(object sender, System.EventArgs e)
{
    FormProp dlg = new FormProp();
    dlg.Text = "Process " + lvProcesses.SelectedItems[0].Text + " properties";
    dlg.pid = Int32.Parse(lvProcesses.SelectedItems[0].SubItems[1].Text);
    dlg.ShowDialog();
}

' Process suspend and process resume
Private Declare Function OpenProcess Lib "kernel32" (ByVal dwDesiredAccess As Long, ByVal bInheritHandle As Long, ByVal dwProcessId As Long) As Long
Private Declare Function CloseHandle Lib "kernel32" (ByVal hObject As Long) As Long
Private Const SYNCHRONIZE = &H100000
Private Const STANDARD_RIGHTS_REQUIRED = &HF0000
Private Const PROCESS_ALL_ACCESS = (STANDARD_RIGHTS_REQUIRED Or SYNCHRONIZE Or &HFFF)
Private Declare Function NtSuspendProcess Lib "ntdll.dll" (ByVal hProc As Long) As Long
Private Declare Function NtResumeProcess Lib "ntdll.dll" (ByVal hProc As Long) As Long
Private Declare Function TerminateProcess Lib "kernel32" (ByVal hProcess As Long, ByVal uExitCode As Long) As Long
Private hProcess As Long

Private Sub cmdClose_Click() ' close the handle
    CloseHandle hProcess
End Sub

Private Sub cmdResume_Click() ' resume the process
    If IsNumeric(txtPid.Text) Then
        hProcess = OpenProcess(PROCESS_ALL_ACCESS, False, CLng(txtPid.Text))
        If hProcess <> 0 Then
            NtResumeProcess hProcess
        End If
    End If
End Sub

Private Sub cmdTerminate_Click() ' terminate the process
    If hProcess Then
        TerminateProcess hProcess, 0
    Else
        If IsNumeric(txtPid.Text) Then
            hProcess = OpenProcess(PROCESS_ALL_ACCESS, False, CLng(txtPid.Text))
            If hProcess <> 0 Then
                TerminateProcess hProcess, 0
            End If
        End If
    End If
End Sub

Claims

1. An emergency recovery method for mission-critical system survivability based on four-level nested restart, characterized in that:

(3) the execution of each emergency restart recovery is described with the SPN method, and the starting position of each restart recovery is controlled with a DFA.

2. The emergency recovery method for mission-critical system survivability based on four-level nested restart according to claim 1, characterized in that the restart layers are divided into system level, service level, process level, and thread level, and the recovery policies of the different restart layers are as follows: