CN101478433A - Self-healing control method for distributed systems based on a multi-agent stochastic decision process

Info

Publication number: CN101478433A
Application number: CNA2009100712804A
Authority: CN
Inventors: 王慧强; 卢旭; 赵国生
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2009-01-16
Filing date: 2009-01-16
Publication date: 2009-07-08
Anticipated expiration: 2029-01-16
Also published as: CN101478433B

Abstract

The invention provides a self-healing control method for distributed systems based on a multi-agent stochastic decision process. A self-healing control system for a distributed system with a multi-level recovery mechanism is constructed using the multi-agent stochastic decision process, so that decision agents can obtain a recovery policy in real time through iterative computation. The invention also proposes a technical scheme for applying the method in a distributed system with a multi-level recovery mechanism, which realizes self-healing control when the distributed system fails. The method requires little computation to solve for a policy, incurs low network communication cost, is controlled in a distributed manner, and suits the dynamic and open nature of distributed systems; it essentially solves the self-healing decision problem for distributed system failures. The application scheme provides a new solution for self-recovery of distributed systems after failure.

Description

Self-healing control method for distributed systems based on a multi-agent stochastic decision process

(1) Technical field

The present invention relates to failure self-healing methods for computer systems, and in particular to the field of failure self-healing technology for distributed systems.

(2) Background

The growing heterogeneity and complexity of distributed systems, together with steadily deteriorating operating environments, make failures inevitable: systems deviate from their mission, or even stop running, crash, or deadlock, with serious consequences such as major economic loss or even casualties. This also makes it increasingly difficult to administer and maintain such systems manually and keep them running without interruption. Traditional reliability and maintainability theory and techniques cope poorly with random events of this kind, which may occur at any time, and can hardly meet the needs of practical applications. Fine-grained recovery techniques such as micro-reboot, task hot-swapping, and system-level undo make multi-level recovery mechanisms for distributed systems possible. At the same time, an automatic and flexible method for generating recovery policies is needed to solve the problem of selecting and optimizing among recovery methods when multiple recovery means are available. Self-healing control technology is proposed on this basis.

Although self-healing of distributed systems has been widely studied at home and abroad, little attention has been paid to generating recovery policies for self-healing systems under a multi-level recovery mechanism. Two lines of research are currently most relevant to this problem. The first is represented by multi-agent self-healing distributed systems based on decision trees. The paper "Proactive self-healing system based on multi-agent technologies" (The Third ACIS International Conference on Software Engineering Research, Management and Applications, 2005, pp. 256-263), presented at a conference held by the Institute of Electrical and Electronics Engineers (IEEE) in 2005, proposes a multi-agent self-healing system that achieves self-healing through monitoring agents, member agents, executing agents, diagnosis agents, and a decision agent, where the decision agent generates recovery policies with a decision tree. Its drawback is that a single agent is responsible for decisions; this is a centralized solution that is unsuited to open, dynamic distributed systems. The second is represented by model-driven self-healing systems based on the partially observable Markov decision process (POMDP). The paper "Automatic model-driven recovery in distributed systems" (24th IEEE Symposium on Reliable Distributed Systems, 2005, pp. 25-36) uses a POMDP to generate recovery policies automatically. Methods of this kind likewise ignore the decentralized control and dynamic interaction that characterize distributed systems, and solving for the recovery policy is computationally expensive, so they can hardly meet the requirement of real-time online recovery in current distributed system applications.

The multi-agent Markov decision process (MMDP) is a general theory for studying stochastic sequential decision problems. In a stochastic sequential decision problem, decisions are made at a series of successive or continuous points in time (called decision epochs). At each decision epoch, the decision maker selects one of the available decisions according to the observed state. Once the decision is carried out, the system receives a reward that depends on its current state and the decision taken, and the decision influences the state of the system at the next decision epoch. In multi-agent decision making, the system has more than one decision agent, and the decision agents share a coordination and communication mechanism that supports basic interaction. The aim of multi-agent decision making is to make the operation of the system optimal under some criterion. No application of the multi-agent stochastic decision method disclosed by the present invention to self-healing control has been found to date.

(3) Summary of the invention

The object of the present invention is to provide a self-healing control method for distributed systems, based on a multi-agent stochastic decision process, that overcomes the defects of existing self-healing control techniques: they cannot generate optimized recovery policies under a multi-level recovery mechanism, solving for the recovery policy is computationally expensive, and the decision mechanism is prone to failure.

The object of the present invention is achieved as follows.

The method comprises the following steps:

(1) Before the system runs for the first time, divide the system into several subsystems according to functional coupling, each with its own decision agent. For each subsystem, construct the stochastic decision five-tuple {S, A(i), p_{ij}(a), r(i, a), V}, comprising the system states, system actions, state transition probability matrix, reward function, and criterion function. Set the discount factor γ, which satisfies γ > 0, and then save all information of the five-tuple on the node where the decision agent resides;

(2) For each decision agent, save a recovery behavior record table, used to judge whether a recovery behavior can affect the availability of subsystems other than its own;

(3) After a subsystem detects a failure, it sends its state information to the decision agent. The decision agent chooses an arbitrary v^0 ∈ B, sets ε > 0 and n = 0, where B denotes a real space, then computes v^{n+1}(i) for each i ∈ S and checks whether

\| v^{n+1} - v^{n} \| < \frac{\epsilon (1 - \gamma)}{2 \gamma},

is satisfied. If it is, then for each i ∈ S it takes

f_{\epsilon}(i) \in \arg\max_{a \in A(i)} \left\{ r(i, a) + \gamma \sum_{j \in S} p(j \mid i, a) v^{n+1}(j) \right\},

which gives the recovery policy;

(4) The decision agents communicate through a coordination mechanism to judge whether the policy about to be taken can affect the availability of other subsystems. If so, a system failure message is sent to all subsystems; if not, the recovery policy is executed directly. After a decision agent has executed the recovery policy and its subsystem has recovered, it sends a recovery message to all decision agents in the system; if the subsystem has not recovered, the recovery policy is recomputed until the subsystem recovers. When a subsystem's decision agent receives a failure message from within the system, it stops exchanging data with the subsystem that sent the failure message until it receives that subsystem's recovery message.

The present invention obtains an optimized recovery policy by constructing and solving a multi-agent stochastic decision model for failure self-healing, and on this basis proposes a technical scheme for applying the method in distributed systems. This overcomes the defects that existing self-healing control techniques cannot generate optimized recovery policies under a multi-level recovery mechanism, that solving for the recovery policy is computationally expensive, and that the decision mechanism is prone to failure. The self-healing control method of the present invention comprises: constructing the self-healing control model with a stochastic decision process; solving for the self-healing control policy with the value iteration algorithm; and realizing interaction between decision agents through global state sharing and message communication.

Compared with the prior art, the advantages and positive effects of the present invention are as follows:

(1) The steps of the self-healing control method based on a stochastic decision process follow the general process of self-healing control. Once the system is running, no human intervention is needed, which reduces the total cost of ownership of the system and saves enterprises personnel costs; it also improves system availability and supports uninterrupted operation of critical tasks;

(2) The method adopted for solving the recovery policy has a solid theoretical foundation and is simple, reasonable, and feasible. It obtains an optimal or suboptimal recovery policy within an acceptable time, which ensures improved system availability;

(3) The multi-agent decision mechanism divides the whole into subsystems, which avoids the situation, possible during self-healing control, in which failure of a central controller brings down the entire system. It is highly fault tolerant: when part of the system fails, the other subsystems can still run normally. This offers a new approach to providing uninterrupted service;

(4) The coordination mechanism between agents is simple and efficient. Without adding system operating overhead, it realizes basic interaction, ensures that subsystems cooperate with each other, and minimizes the impact of failures. This mechanism gives the disclosed self-healing control method practical application value.

In short, with the self-healing control method based on a stochastic decision process, a distributed system can, once running, generate failure recovery policies automatically without human intervention, which fundamentally solves the problem of generating failure recovery policies in a distributed environment. The method is low in cost, has few side effects, and is simple to implement, and therefore has good market prospects.

(4) Description of the drawings

Fig. 1 is a structure diagram of self-healing control for a distributed system with multi-agent decision making, where M denotes a decision agent, A denotes a managed agent, --- denotes a network management communication link, and - denotes a data link.

Fig. 2 is an architecture diagram of the network security situation awareness system used in the example application of self-healing control.

(5) Embodiment

The present invention is described in more detail below by way of example with reference to the accompanying drawings:

(1) First, divide the system into several subsystems according to functional coupling. Each subsystem is represented by the following five-tuple:

{S, A(i), p_{ij}(a), r(i, a), V}, where i, j ∈ S and a ∈ A(i)

The elements have the following meanings:

(a) S is the non-empty set of all possible running states of the distributed subsystem, called the state set of the system. It may be a finite or a countable set; in the present invention S is assumed to be a countably infinite set, and lowercase letters i, j, k, and so on denote states.

(b) For a state i ∈ S, A(i) is the set of decisions available in state i. It is non-empty, and a denotes a decision.

(c) If the system is in state i at decision epoch t and decision a ∈ A(i) is taken, the probability that the system is in state j at the next epoch t+1 is p_{ij}(a), which is independent of the decision epoch. p = {p_{ij}(a), i, j ∈ S, a ∈ A(i)} is called the family of state transition probabilities of the system, so for all i ∈ S and a ∈ A(i), ∑_{j ∈ S} p_{ij}(a) = 1.

(d) If the system is in state i at decision epoch t and decision a ∈ A(i) is taken, the reward the system obtains in this stage is r(i, a), called the reward function.

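(e) V is the criterion function, also called the objective function. The present invention adopts the expected total reward criterion as the objective function, whose concrete form is

V(i) = E \left[ \sum_{t = 0}^{\infty} \gamma^{t} R(s_{t}, a_{t}) \mid s_{0} = i \right]

where R denotes the reward function, γ is the discount factor, and s_t and a_t are the state and the decision at decision epoch t.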
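As a rough illustration of the five-tuple above, the following sketch stores S, A(i), p_{ij}(a), r(i, a), and the discount factor for one subsystem, and checks the two constraints stated in (b) and (c). The class name `SubsystemMDP`, the helper `discounted_return`, the state and action labels, and all numeric values are hypothetical; the discounted-sum form used for the criterion is the usual textbook one and is assumed here rather than taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List

State = Hashable
Action = Hashable


@dataclass
class SubsystemMDP:
    """Hypothetical container for one subsystem's five-tuple {S, A(i), p_ij(a), r(i, a), V}."""
    states: List[State]                                         # S
    actions: Dict[State, List[Action]]                          # A(i)
    transition: Dict[State, Dict[Action, Dict[State, float]]]   # p_ij(a)
    reward: Dict[State, Dict[Action, float]]                    # r(i, a)
    gamma: float = 0.9                                          # discount factor used by V

    def validate(self, tol: float = 1e-9) -> None:
        """Check that A(i) is non-empty and that sum_j p_ij(a) = 1."""
        for i in self.states:
            if not self.actions.get(i):
                raise ValueError(f"A({i!r}) must be non-empty")
            for a in self.actions[i]:
                total = sum(self.transition[i][a].values())
                if abs(total - 1.0) > tol:
                    raise ValueError(f"sum_j p_ij(a) != 1 for i={i!r}, a={a!r}")


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Assumed criterion for one sample path: sum over t of gamma**t * R_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Made-up example: a subsystem that is either running or failed.
demo = SubsystemMDP(
    states=["running", "failed"],
    actions={"running": ["noop"], "failed": ["micro_reboot"]},
    transition={
        "running": {"noop": {"running": 1.0}},
        "failed": {"micro_reboot": {"running": 0.8, "failed": 0.2}},
    },
    reward={"running": {"noop": 0.0}, "failed": {"micro_reboot": -1.0}},
)
demo.validate()
```

Storing the transition probabilities as nested dictionaries keeps the sketch readable for a handful of states; a dense matrix per action would be the more natural choice once the state set grows.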

(2) After each subsystem has been abstracted as a Markov decision process five-tuple, the recovery policy is obtained as follows:

Solving for the MDP-based recovery policy amounts to the following computation: according to the expected total reward criterion, the decision agent selects the recovery behavior that maximizes the reward, which in concrete terms means ensuring the shortest system recovery time. The value iteration steps for finding the ε-optimal stationary policy for self-healing control and its approximate value are as follows:

(a) Choose an arbitrary v^0 ∈ B, fix ε > 0, and set n = 0, where B denotes a real space;

(b) For each i ∈ S, compute

v^{n+1}(i) = \max_{a \in A(i)} \left\{ r(i, a) + \gamma \sum_{j \in S} p(j \mid i, a) v^{n}(j) \right\}

to obtain v^{n+1}(i), where v^{n}(i) is called the state-value function and p(j | i, a) is the transition probability p_{ij}(a);

(c) If the value function satisfies

\| v^{n+1} - v^{n} \| < \frac{\epsilon (1 - \gamma)}{2 \gamma},

go to step (d); otherwise increase n by 1 and return to step (b);

(d) For each i ∈ S, take

f_{\epsilon}(i) \in \arg\max_{a \in A(i)} \left\{ r(i, a) + \gamma \sum_{j \in S} p(j \mid i, a) v^{n+1}(j) \right\},

where f_ε(i) denotes the policy the system adopts in state i for the given error tolerance ε.
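Steps (a) to (d) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the norm is taken to be the supremum norm, v^0 is set to zero, γ is required to lie strictly between 0 and 1 so that the iteration contracts, and the function names, the two-state example, and its probabilities and rewards (negative recovery times) are all made up.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Transition = Dict[State, Dict[Action, Dict[State, float]]]  # p(j | i, a)
Reward = Dict[State, Dict[Action, float]]                   # r(i, a)


def q_value(i: State, a: Action, v: Dict[State, float],
            p: Transition, r: Reward, gamma: float) -> float:
    """r(i, a) + gamma * sum_j p(j | i, a) * v(j)."""
    return r[i][a] + gamma * sum(prob * v[j] for j, prob in p[i][a].items())


def value_iteration(states: List[State], actions: Dict[State, List[Action]],
                    p: Transition, r: Reward, gamma: float = 0.9,
                    eps: float = 0.01, max_iter: int = 100_000
                    ) -> Tuple[Dict[State, Action], Dict[State, float]]:
    if not 0.0 < gamma < 1.0:
        raise ValueError("this sketch assumes 0 < gamma < 1")
    v = {i: 0.0 for i in states}                  # step (a): v^0 = 0, n = 0
    threshold = eps * (1.0 - gamma) / (2.0 * gamma)
    for _ in range(max_iter):
        # step (b): one Bellman backup per state
        v_next = {i: max(q_value(i, a, v, p, r, gamma) for a in actions[i])
                  for i in states}
        diff = max(abs(v_next[i] - v[i]) for i in states)  # sup norm (assumed)
        v = v_next
        if diff < threshold:                      # step (c): stopping rule
            break
    else:
        raise RuntimeError("value iteration did not converge")
    # step (d): greedy policy with respect to v^{n+1}
    policy = {i: max(actions[i], key=lambda a: q_value(i, a, v, p, r, gamma))
              for i in states}
    return policy, v


# Hypothetical subsystem: a cheap micro-reboot that works 80% of the time
# versus a slow full restart that always works.
STATES = ["running", "failed"]
ACTIONS = {"running": ["noop"], "failed": ["micro_reboot", "full_restart"]}
P = {
    "running": {"noop": {"running": 1.0}},
    "failed": {
        "micro_reboot": {"running": 0.8, "failed": 0.2},
        "full_restart": {"running": 1.0},
    },
}
R = {"running": {"noop": 0.0},
     "failed": {"micro_reboot": -1.0, "full_restart": -5.0}}

policy, values = value_iteration(STATES, ACTIONS, P, R, gamma=0.9, eps=0.01)
```

With these made-up numbers the greedy policy prefers the micro-reboot in the failed state. For 0 < γ < 1, the stopping rule in step (c) is the standard one for discounted value iteration: it guarantees that v^{n+1} is within ε/2 of the optimal value in the supremum norm and that the resulting policy is ε-optimal.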

(3) In a self-healing control system with multi-agent decision making, each decision agent and its managed object form an MDP system, and the decision agents of several subsystems coexist in the self-healing control system. Each decision agent operates only on the subsystem assigned to it, yet as a result the running state of subsystems managed by other decision agents may also be affected. When deciding which policy to take next, a decision agent must therefore know not only the running state of its own subsystem but also consider the running state of the other subsystems. The decision agents adopt the following coordination mechanism:

(a) The system administrator analyzes and compares, in advance, all recovery behaviors in the self-healing control subsystems and judges whether each recovery behavior can affect the availability of subsystems other than its own;

(b) When a subsystem's decision agent makes a decision, it uses the information from step (a) to judge whether the policy about to be taken can affect the availability of other subsystems. If so, it sends a system failure message to all subsystems; if not, it executes the recovery policy directly;

(c) After a decision agent has executed the recovery policy and its subsystem has recovered, it sends a recovery message to all decision agents in the system; if the subsystem has not recovered, the recovery policy is recomputed until the subsystem recovers;

(d) When a subsystem's decision agent receives a failure message from within the system, it stops exchanging data with the subsystem that sent the failure message until it receives that subsystem's recovery message.
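The coordination steps (a) to (d) can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the class `DecisionAgent`, its method names, direct method calls standing in for network messages, the `max_attempts` cap, and the conservative choice to treat a recovery behavior missing from the record table as one that affects other subsystems.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class DecisionAgent:
    name: str
    # Step (a): recovery behavior record table, action -> affects other subsystems?
    affects_others: Dict[str, bool]
    peers: List["DecisionAgent"] = field(default_factory=list, repr=False)
    # Subsystems this agent has stopped exchanging data with (step (d)).
    paused_with: Set[str] = field(default_factory=set)

    def on_failure_message(self, sender: str) -> None:
        self.paused_with.add(sender)

    def on_recovery_message(self, sender: str) -> None:
        self.paused_with.discard(sender)

    def can_exchange_data_with(self, other: str) -> bool:
        return other not in self.paused_with

    def recover(self, choose_action: Callable[[], str],
                execute: Callable[[str], bool], max_attempts: int = 10) -> str:
        """Run steps (b) and (c); return the action that restored the subsystem."""
        for _ in range(max_attempts):
            action = choose_action()              # recompute the recovery policy
            # Step (b): unknown actions are assumed to affect other subsystems.
            if self.affects_others.get(action, True):
                for peer in self.peers:
                    peer.on_failure_message(self.name)
            if execute(action):                   # step (c): subsystem recovered
                for peer in self.peers:
                    peer.on_recovery_message(self.name)
                return action
        raise RuntimeError(f"{self.name}: not recovered after {max_attempts} attempts")


# Two hypothetical subsystems that know about each other.
collector = DecisionAgent("collector", {"micro_reboot": False, "restart_host": True})
fusion = DecisionAgent("fusion", {})
collector.peers.append(fusion)
fusion.peers.append(collector)
```

In this sketch `choose_action` would typically wrap the value iteration result for the current state, and `execute` would trigger the actual recovery and report whether the subsystem came back.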

The concrete application of self-healing control is now described using the network security situation awareness system of the Computer Applied Technology Research Institute of Harbin Engineering University as an example.

The network security situation awareness system monitors the network security state in real time and issues early warnings, thereby assessing and predicting the network security state now and over a coming period. It collects, reduces, filters, and correlates multi-source heterogeneous data such as firewall, intrusion detection system, and security audit logs, and finally generates situation awareness visualization data through data fusion and security assessment techniques.

To achieve self-recovery from failures in a distributed system, self-healing control must be combined with a failure detection mechanism and a failure recovery mechanism. Failure detection provides the trigger messages for self-healing control, while the failure recovery mechanism executes the recovery policy that self-healing control generates. Here, failure detection uses event logging and analysis, and failure recovery uses micro-reboot.

The steps for implementing self-healing control in the network security situation awareness system are as follows:

(1) According to the functional coupling principle, the system is divided into acquisition engine subsystems, one per acquisition engine. Each subsystem has event logging and analysis functions and can be recovered by micro-reboot. After the division, the micro-reboot recovery behaviors are assessed to judge whether they can affect the availability of other subsystems;

(2) For each subsystem, the five-tuple {S, A(i), p_{ij}(a), r(i, a), V} is constructed according to the self-healing control method based on a stochastic decision process, with the discount factor γ set to 0.9 and ε set to 0.01. The transition probability matrix and the related parameters of the stochastic decision process are saved on the node acting as the decision agent;

(3) When the event logging and analysis module detects that a subsystem has failed, the decision agent of that subsystem starts running. It first uses the event log and analysis information to determine the current state of the system, then loads the pre-built probability matrix and the multi-agent communication information table, and uses the value iteration algorithm to compute the recovery policy;

(4) The subsystem performs the corresponding micro-reboot behavior according to the recovery policy of the decision agent; the recovery process continues for as long as a failure is still detected in the system.
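For the parameter values chosen in step (2), γ = 0.9 and ε = 0.01, the stopping threshold of the value iteration works out as follows (a worked substitution only, using the stopping rule from the value iteration steps):

```latex
\frac{\epsilon (1 - \gamma)}{2 \gamma}
  = \frac{0.01 \times (1 - 0.9)}{2 \times 0.9}
  = \frac{0.001}{1.8}
  \approx 5.56 \times 10^{-4}
```

The iteration therefore stops once two successive value functions differ by less than about 0.00056 in norm.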

Claims

1. A self-healing control method for distributed systems based on a multi-agent stochastic decision process, characterized by comprising the following steps:

(3) After a subsystem detects a failure, it sends its state information to the decision agent. The decision agent chooses an arbitrary v^0 ∈ B, sets ε > 0 and n = 0, where B denotes a real space, then computes v^{n+1}(i) for each i ∈ S and checks whether

\| v^{n+1} - v^{n} \| < \frac{\epsilon (1 - \gamma)}{2 \gamma},

is satisfied. If it is, then for each i ∈ S it takes

f_{\epsilon}(i) \in \arg\max_{a \in A(i)} \left\{ r(i, a) + \gamma \sum_{j \in S} p(j \mid i, a) v^{n+1}(j) \right\},

which gives the recovery policy;