CN102916830A

CN102916830A - Implement system for resource service optimization allocation fault-tolerant management

Info

Publication number: CN102916830A
Application number: CN2012103356095A
Authority: CN
Inventors: 陶飞; 程颖; 张霖
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2012-09-11
Filing date: 2012-09-11
Publication date: 2013-02-06
Anticipated expiration: 2032-09-11
Also published as: CN102916830B

Abstract

The invention relates to a system for implementing fault-tolerant management of resource service optimal allocation (RSOA). Fault-tolerant management mechanisms are designed according to the causes and types of the faults that arise during RSOA, so that each fault can be detected and resolved. The system comprises an information service module, an RSOA module, a fault detection module and a fault recovery module. It is modular, maintainable and extensible, can detect and resolve the various faults that occur during RSOA, and improves both the stability of the overall service manufacturing system and the reliability of RSOA. In particular, the system detects the common faults caused by virtual links, resources, tasks and applications during RSOA in a service manufacturing system, provides a resolution strategy for each, and thereby improves the reliability and quality of service of RSOA.

Description

A system for implementing fault-tolerant management of resource service optimal allocation

Technical field

The invention belongs to the field of fault-tolerant management for distributed manufacturing information integration systems. It relates specifically to a system for implementing fault-tolerant management of resource service optimal allocation: a fault-tolerant management framework for resource service optimal allocation in service-oriented manufacturing systems, together with the corresponding fault detection methods and an ECA-based (event-condition-action) fault resolution mechanism and method. The invention detects the common faults that arise during resource service optimal allocation in a service manufacturing system and provides a resolution strategy for each, improving the reliability and quality of service of the allocation.

Background technology

In a service manufacturing system (a cloud manufacturing (CMfg) system, manufacturing service system, manufacturing grid system and so on), resource service optimal allocation involves operations such as resource service search and matching, QoS evaluation, QoS extraction, resource service selection and resource service composition. Any of these operations may fail for some reason, causing the whole allocation to fail or become invalid. The main possible causes are:

1. The virtual link between two nodes of the service manufacturing system is disconnected, or its bandwidth suddenly drops below what is required;

2. An invoked resource service fails or changes state during execution, for example it is suddenly shut down or withdrawn, a resource service composition becomes invalid, the capability of the resource service suddenly drops, or it becomes overloaded;

3. A submitted or running task changes state, for example it is forcibly withdrawn by an administrator or user, its requirements are raised, it is suspended, or it is assigned an invalid resource service;

4. A problem arises in the application process, such as insufficient trust between the two transacting parties, incorrect access rights, or unreasonable or incorrect code design.

The phenomena above are collectively referred to as faults in the present invention. Once any of them occurs, resource service optimal allocation (RSOA) is suspended or fails. To improve the reliability and quality of service of RSOA, the following questions must therefore be answered: 1. Which faults can occur during RSOA? 2. How can an occurring fault be detected? 3. How should a detected fault be analyzed and recovered from?

These questions have not yet been studied in service manufacturing fields such as CMfg. To answer them and to realize fault-tolerant management during RSOA, thereby improving its reliability and quality of service, the present invention first analyzes and classifies the faults that may occur during RSOA, then on that basis studies a mechanism for implementing RSOA fault-tolerant management, together with the corresponding fault detection methods and resolution strategies.

Summary of the invention

The purpose of the present invention is to provide a mechanism for implementing fault-tolerant management of resource service optimal allocation that detects the common faults arising during RSOA in a service manufacturing system and provides a resolution strategy and method for each kind of fault, thereby improving the reliability and quality of service of the allocation.

The technical solution adopted by the present invention is a system for implementing fault-tolerant management of resource service optimal allocation (RSOA), comprising an information service module, a resource service optimal allocation module, a fault detection module and a fault recovery module.

The information service module mainly provides information and data support for fault detection, fault recovery and resource service optimal allocation.

The resource service optimal allocation module mainly implements functional operations such as resource service search, quality of service (QoS) evaluation, resource service selection and resource service composition.

The fault detection module monitors the state of each node in the service manufacturing system and of the running tasks and resources, performing monitoring and state analysis at all times. It analyzes and compiles statistics on the historical data of instances that exited normally or abnormally, makes a decision, and notifies the fault recovery module to handle the detected fault.

The fault recovery module is an ECA-based (Event-Condition-Action) fault resolution module for resource service optimal allocation that integrates multiple fault-tolerance mechanisms. It mainly comprises an event detector (Event Detector), a condition evaluator (Condition Evaluator), an action executor (Action Executor), a rule reasoning engine (Rule Engine), an ECA rule base (ECA Rules) and an ECA rule manager (ECA Rule Manager).

Fault detection in the fault detection module comprises virtual link (VL) related fault detection, resource service (RS) related fault detection, task (Task) related fault detection and application related fault detection.

Virtual link related fault detection mainly comprises virtual link disconnection fault (VL_Disconnect_Failure) detection and insufficient bandwidth fault (Bandwidth_Failure) detection. A VL_Disconnect_Failure can usually be detected by the system security policy or by middleware embedded in the service manufacturing system. Whether a fault has arisen between two entities because of bandwidth is judged using two indicators: communication time, and success communication rate (or reliability).

Resource service related fault detection mainly comprises resource service quit fault (RS_Quit_Failure) detection, resource service overload fault (RS_Overload_Failure) detection and resource service composition fault (RS_Composition_Failure) detection. RS_Quit_Failure is judged by a resource service detector that periodically and continuously checks the state of each resource service. RS_Overload_Failure is judged by evaluating the data-handling capacity, communication time and execution time of $RS_i$ to decide whether $RS_i$ is overloaded. RS_Composition_Failure is judged by checking whether the inter-concept mismatch detection rules, inter-data mismatch detection rules, attribute mismatch detection rules and QoS inconsistency detection rules are satisfied.

Task related fault detection mainly comprises task cancellation fault (Task_Cancel_Failure) and task suspension or hang-up fault (Task_Suspension_Failure) detection, and resource and task matching failure (Task_Resource_Mismatch_Failure) detection. Task_Suspension_Failure is judged by a task detector that periodically and continuously checks the current state of each task, namely whether the task is in the task suspended (Task_Suspended) queue or in the task terminated (Task_Terminated) state. For Task_Resource_Mismatch_Failure, a resource service matching algorithm is used to determine whether a basic matching fault, I/O matching fault, QoS matching fault or comprehensive matching fault has occurred. The task requirement change fault (Task_RequireChange_Failure) is detected in the same way as Task_Resource_Mismatch_Failure.

Application related fault detection mainly comprises trust fault (Trust_Failure) detection, and application design or coding fault (App_DesignCode_Failure) and access rights fault (App_AccessRight_Failure) detection. Trust_Failure is judged by comparing the trust value $T_{x \to y}$ between x and y, evaluated with the resource service Trust-QoS evaluation model, against the minimum trust requirement $T_{x \to y}^{\circ}$ of entity x towards y. App_DesignCode_Failure and App_AccessRight_Failure are mainly detected by the system security policy or by system middleware embedded in the service manufacturing system.

In an ECA (Event-Condition-Action) rule, the event (Event) is the occurrence that triggers the rule (Rule), the condition (Condition) is what must be satisfied for the rule to be activated, and the action (Action) is the command executed once the rule has been triggered. Here, a fault occurring during RSOA is defined as the Event of an ECA rule, the fault detection condition as its Condition, and the handling applied to the fault as its Action.

The handling applied to a fault is specifically rescheduling or rematching.

The event detector (Event Detector) mainly receives the fault messages sent by the fault detection module and analyzes the event (Event) of the detected fault. The condition evaluator (Condition Evaluator) evaluates the condition (Condition) associated with the detected Event to see whether it satisfies the condition of the corresponding ECA rule. The rule reasoning engine (Rule Engine) matches the detected Event against the rules in the ECA rule base by inference and finds a suitable rule for handling the detected fault. The action executor (Action Executor) executes the action of the selected ECA rule, according to the result of the Rule Engine's inference, to handle the fault. The ECA rule manager (ECA Rule Manager) manages the ECA rules, including modifying, adding and deleting rules. The ECA rule base (ECA Rules) stores the rules needed during fault resolution.

Compared with the prior art, the advantages of the present invention are:

(1) The method designs fault-tolerant management mechanisms according to the causes and categories of the faults that arise during resource service optimal allocation (RSOA), and implements the corresponding fault detection and resolution. The invention detects the common faults caused by virtual links, resources, tasks and applications during resource service optimal allocation in a service manufacturing system and provides a resolution strategy for each, improving the reliability and quality of service of the allocation.

(2) The invention comprises a fault-tolerant management framework for resource service optimal allocation, together with the corresponding fault detection methods and an ECA-based (event-condition-action) resolution mechanism and method. It is applicable to distributed, networked service manufacturing systems; it is dynamic, modular, maintainable and extensible; and it can detect and resolve the various faults in resource service optimal allocation, improving the stability of the whole service manufacturing system and the reliability of the allocation.

Description of drawings

Fig. 1 shows the fault-tolerant management architecture for resource service optimal allocation;

Fig. 2 shows ECA-based fault recovery;

Fig. 3 is the detection flowchart for Task_Resource_Mismatch_Failure;

Fig. 4 is the detection flowchart for Trust_Failure;

Table 1 lists some of the ECA rules for fault-tolerant management of resource service optimal allocation.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings.

The invention relates to a mechanism and method for implementing fault-tolerant management of resource service optimal allocation: the faults that may occur during RSOA are analyzed and classified, an RSOA fault-tolerant management architecture is developed on that basis, and the corresponding fault detection methods and resolution strategies are worked out.

Resource service optimal allocation is said to have failed if and only if at least one of the following occurs: 1. a resource crashes and stops providing service; 2. the availability of a resource falls below the minimum QoS standard of the task. In practice, the faults in resource service optimal allocation for service manufacturing systems such as cloud manufacturing are varied; the common ones are mainly related to four factors: virtual links, resources, tasks and applications.

(1) Virtual link related faults

A virtual link (VL) is a connection, in the broad sense, between two entities in the service manufacturing system. The faults caused by VLs are mainly the virtual link disconnection fault and the insufficient bandwidth fault.

(2) Resource service related faults

Resource services are the carriers that execute tasks, so the withdrawal or overload of a resource service, a change in its QoS or capability, or the composition of resource services may all cause an RSOA fault. The faults caused by resource services are mainly: the resource service quit fault, the resource service overload (or saturation) fault, the resource service composition fault, and faults caused by changes in resource service capability.

The resource service composition fault mainly covers: mismatch between resource service concepts, mismatch between data, attribute mismatch, and QoS inconsistency.

(3) Task related faults

During RSOA, a task may be cancelled, suspended and so on for various reasons, causing the allocation to fail. The RSOA faults caused by tasks are mainly: the task cancellation fault, the task suspension or hang-up fault, the resource and task matching failure, and faults caused by changes in task requirements.

(4) Application related faults

In the application process, RSOA may fail for reasons such as trust, access rights or coding, giving the trust fault, the application design or coding fault, and the access rights fault.

The four classes of faults above can reduce the efficiency and quality of service of the whole RSOAS. To provide fault tolerance during RSOA, and building on the RSOAS framework, the present invention proposes the RSOA fault-tolerant management architecture shown in Fig. 1.

The RSOAS fault-tolerance architecture consists of four parts: the information service module, the resource service optimal allocation module, the fault detection module and the fault recovery module. Achieving fault tolerance in the RSOAS hinges on detecting and resolving faults.

The invention accordingly comprises a fault-tolerant management framework for resource service optimal allocation, together with the corresponding fault detection methods and an ECA-based resolution mechanism and method.

Of the four modules shown in Fig. 1, the fault detection module and the fault recovery module are the key content of the invention.

(1) Information service module

The information service module mainly provides the directory information service (IIS), the resource information service (RIS), resource service encapsulation, and QoS database information and data support for fault detection, fault recovery and resource service optimal allocation.

The directory information service (IIS) organizes information, offers aggregated information queries, supports efficient queries across multiple RISs, and provides the information index and search function for the whole service-oriented manufacturing system. The IIS consists of three parts: generic registration handling, a pluggable directory structure and search handling.

The resource information service (RIS) runs on the resource side. It provides a uniform means of querying the configuration, capability and state of resources on the system platform, and can configure itself as an aggregate directory service. After authenticating and parsing the incoming requirements and tasks, the RIS dispatches the query request to one or more information providers according to the type of information requested, and then passes the resource feedback back to the user.

The resource service encapsulation template lets the platform manage effectively the information of the nodes taking part in collaborative manufacturing. Resources are classified and encapsulated according to their attributes (such as physical features, geographical location, dynamic characteristics, sensitivity and function), customer requirements (such as time, quality, price and service) and modes of use (such as discovery, proxy, monitoring and diagnosis). After a resource provider registers a resource on the platform, the resource is packaged into a resource service class template; when a client requests a resource service, it downloads the corresponding resource encapsulation template from the system platform and instantiates it for the specific task.

The QoS database supplies the QoS information of each resource service. The relevant QoS indicators are evaluated and measured and the QoS values compared, providing information and data support for the subsequent selection and composition of resource services.

(2) Resource service optimal allocation module

The resource service optimal allocation module mainly provides functional operations such as resource service search, QoS evaluation, resource service selection and resource service composition.

Resource service search offers various resource service information matching algorithms. According to the resource service requirements of the subtasks produced by task decomposition, it looks up the resource services that meet the requirements in the resource service library and generates the candidate resource service set (RSS).

QoS evaluation is applied to the large candidate set of resource services found to meet the user's requirements. Its purpose is to give the user and the system a quantitative basis for choosing the best resource service and carrying out the optimal allocation, and it is an important step in that allocation. The QoS information of each resource service in the RSS is extracted from the resource service description (OWL-S/WSDL) registered in the service library, or from the QoS database. The relevant QoS indicators are evaluated and measured and the QoS values compared, providing information and data support for the subsequent selection and composition of resource services.

Resource service selection: if the task submitted by the user requires a single resource service, the candidates in the RSS are comprehensively evaluated and ranked according to the QoS parameter requirements, and the best resource service is chosen to execute the task.

Resource service composition and selection: if the task submitted by the user requires multiple resource services, one resource service is chosen from each RSS and the choices are combined in a certain order into a composite resource service; the best combination among all possible combinations is then chosen to execute the task.
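
The selection and composition steps can be illustrated with a minimal Python sketch. The weighted-sum QoS score, the weights, the candidate data and all function names are illustrative assumptions; the text does not specify how the comprehensive evaluation is computed.

```python
from itertools import product

# Hypothetical QoS weights; QoS values are assumed normalized to [0, 1], higher is better.
WEIGHTS = {"time": 0.4, "cost": 0.3, "reliability": 0.3}

def qos_score(qos):
    """Weighted-sum score of one resource service (an assumed evaluation model)."""
    return sum(WEIGHTS[k] * qos[k] for k in WEIGHTS)

def select_best(rss):
    """Single-service demand: rank one candidate set (RSS) and return the best name."""
    return max(rss, key=lambda name: qos_score(rss[name]))

def compose_best(rss_list):
    """Multi-service demand: pick one service per RSS, keep the best combination."""
    best = max(
        product(*(rss.items() for rss in rss_list)),
        key=lambda combo: sum(qos_score(q) for _, q in combo),
    )
    return [name for name, _ in best]

# Two hypothetical candidate sets, one per subtask.
rss_a = {"RS1": {"time": 0.9, "cost": 0.5, "reliability": 0.8},
         "RS2": {"time": 0.6, "cost": 0.9, "reliability": 0.7}}
rss_b = {"RS3": {"time": 0.4, "cost": 0.4, "reliability": 0.9},
         "RS4": {"time": 0.8, "cost": 0.7, "reliability": 0.6}}

print(select_best(rss_a))            # RS1
print(compose_best([rss_a, rss_b]))  # ['RS1', 'RS4']
```

Enumerating every combination grows exponentially with the number of candidate sets, so this form only suits small RSSs; a real system would likely substitute a heuristic search.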

(3) Fault detection module

The fault detection module monitors the state of each node in the service manufacturing system and of the running tasks and resources, performing monitoring and state analysis at all times. Local detectors monitor the allocation workflow and the performance and running state of the resources and tasks involved, and a set of management services is provided, such as task state management and resource service state management. The module analyzes and compiles statistics on the historical data of instances that exited normally or abnormally, makes a decision, and notifies the fault recovery module to handle the detected fault.

The specific detection methods for the four classes of faults, related respectively to virtual links, resource services, tasks and applications, are described in detail below.

(1) Virtual link fault detection

1) VL_Disconnect_Failure detection

This fault can usually be detected by the system security policy or by middleware embedded in the service manufacturing system; for example, the GRAM service provided by Globus can be used to detect VL_Disconnect_Failure.

2) Bandwidth_Failure detection

Whether a fault has arisen between two entities in the system (denoted A and B) because of bandwidth is judged using two indicators: communication time (CT), and success communication rate (PSC) or reliability.

a) Judging by communication time

Let the virtual link between A and B be denoted VL(A, B), its total information exchange volume SumInfor(A, B), its transmission speed (bandwidth) V(A, B), and its waiting time Waite(A, B). The corresponding total communication time, denoted $T_c(A, B)$, is the sum of the transmission time and the waiting time.
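
Written out, and assuming that the transmission time is the exchange volume divided by the bandwidth (an interpretation of the sentence above, not a formula reproduced from the source), the total communication time would be:

```latex
T_c(A,B) = \frac{SumInfor(A,B)}{V(A,B)} + Waite(A,B)
```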

b) Judging by success communication rate (PSC) or reliability

Let the failure rates of the virtual link VL(A, B), node A and node B be α(A, B), α(A) and α(B) respectively. The reliability $S_C(A, B)$ of VL(A, B) then follows from the definition of reliability.

If the virtual link VL(A, B) fails to meet the user's minimum CT requirement or minimum PSC requirement, that is, if $T_c(A, B)$ exceeds the required communication time or $S_C(A, B)$ falls below the required success communication rate, the system determines that a Bandwidth_Failure has occurred.
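
The Bandwidth_Failure test can be sketched in Python as follows. This is a minimal illustration: the parameter names `max_ct` and `min_psc` are hypothetical stand-ins for the user's requirements, the transmission time is assumed to be volume divided by bandwidth, and the reliability $S_C(A, B)$ is taken as an input because its formula is not given here.

```python
def communication_time(sum_infor, bandwidth, wait):
    """T_c(A, B): transmission time plus waiting time.

    Assumes transmission time = exchange volume / bandwidth.
    """
    return sum_infor / bandwidth + wait

def bandwidth_failure(t_c, s_c, max_ct, min_psc):
    """True if the link misses either user requirement.

    t_c     -- total communication time T_c(A, B)
    s_c     -- link reliability S_C(A, B), computed elsewhere
    max_ct  -- hypothetical name for the communication-time requirement
    min_psc -- hypothetical name for the success-communication-rate requirement
    """
    return t_c > max_ct or s_c < min_psc

t_c = communication_time(sum_infor=800.0, bandwidth=100.0, wait=1.5)  # 8.0 + 1.5
print(bandwidth_failure(t_c, s_c=0.97, max_ct=10.0, min_psc=0.95))  # False
print(bandwidth_failure(t_c, s_c=0.97, max_ct=9.0, min_psc=0.95))   # True
```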

(2) Resource service related fault detection

1) RS_Quit_Failure detection

To detect whether an RS_Quit_Failure has occurred during resource service optimal allocation, the resource service detector periodically and continuously checks the state of each resource service. If a resource service does not respond, the system determines that an RS_Quit_Failure has occurred.

2) RS_Overload_Failure detection

Whether $RS_i$ is overloaded, that is, whether the system has produced an RS_Overload_Failure, is judged by evaluating the data-handling capacity (DC), communication time (CT) and execution time (ET) of $RS_i$.

Let the set of tasks assigned to $RS_i$ within a given period be $\Gamma_i = \{Task_1, Task_2, \ldots, Task_j, \ldots, Task_k\}$. For each $Task_j$, the relevant quantities are the amount of $RS_i$ that $Task_j$ requires; $d_{i,j}$, the volume of data access $Task_j$ needs when invoking $RS_i$; $V(i, j)$, the bandwidth of the virtual link between $Task_j$ and $RS_i$; and $et_j$, the execution time $RS_i$ needs to run $Task_j$. The ET, DC and CT of $RS_i$ during operation are computed from these quantities. Let the upper limits of ET, CT and DC for $RS_i$ be

$Lim_{RS_i}^{ET}$, $Lim_{RS_i}^{CT}$, $Lim_{RS_i}^{DC}$.

When the system detects that $RS_i$ satisfies any one of

$ET_{RS_i} > Lim_{RS_i}^{ET}$,

$CT_{RS_i} > Lim_{RS_i}^{CT}$,

$DC_{RS_i} > Lim_{RS_i}^{DC}$,

it determines that an RS_Overload_Failure has occurred.
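
The overload test can be sketched as below. Because the formulas for ET, DC and CT are not reproduced in the text, the simple summations used here are assumptions made only for illustration, as are the class and function names.

```python
from dataclasses import dataclass

@dataclass
class TaskLoad:
    et: float         # et_j: execution time RS_i needs for Task_j
    d: float          # d_{i,j}: data access volume Task_j needs from RS_i
    bandwidth: float  # V(i, j): virtual link bandwidth between Task_j and RS_i

def overload_metrics(tasks):
    """Aggregate ET, DC and CT over the task set (summation is an assumption)."""
    return {
        "ET": sum(t.et for t in tasks),
        "DC": sum(t.d for t in tasks),
        "CT": sum(t.d / t.bandwidth for t in tasks),
    }

def rs_overload_failure(metrics, limits):
    """Return the metrics that exceed their upper limit; non-empty means overload."""
    return [k for k in ("ET", "CT", "DC") if metrics[k] > limits[k]]

tasks = [TaskLoad(et=4.0, d=200.0, bandwidth=100.0),
         TaskLoad(et=3.0, d=100.0, bandwidth=50.0)]
limits = {"ET": 10.0, "CT": 3.0, "DC": 500.0}

exceeded = rs_overload_failure(overload_metrics(tasks), limits)
print(exceeded)  # ['CT'], so RS_Overload_Failure is reported
```

Returning the list of exceeded limits, not just a boolean, lets a recovery rule distinguish a compute-bound overload from a communication-bound one.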

3) RS_Composition_Failure detection

RS_Composition_Failure mainly covers four cases: mismatch between concepts, mismatch between data, attribute mismatch and QoS inconsistency.

a) Inter-concept mismatch detection rules

(1) If $RS_i$ is a subclass of $RS_k$ and $RS_k$ is not contained in $RS_j$, then there is a gap between $RS_i$ and $RS_j$. This is the detection rule for a mismatch between resource service concepts in which a gap exists between the concepts. (2) If $RS_i$ is a subclass of $RS_k$ and $RS_k$ is a subclass of $RS_j$, then $RS_i$ is a subclass of $RS_j$. This is the detection rule for a mismatch between resource service concepts in which $RS_i$ is a subclass of $RS_j$.
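
The two concept rules can be sketched over a small concept hierarchy. The hierarchy, the concept names and the simplified reading of rule (1) (a gap exists when no chain of superclasses links $RS_i$ to $RS_j$) are illustrative assumptions.

```python
# Hypothetical concept hierarchy: child concept -> direct parent concept.
PARENT = {"CNC_Milling": "Milling", "Milling": "Machining", "Welding": "Joining"}

def ancestors(concept):
    """All superclasses of a concept, nearest first."""
    out = []
    while concept in PARENT:
        concept = PARENT[concept]
        out.append(concept)
    return out

def is_subclass(rs_i, rs_j):
    """Rule (2): the subclass relation is transitive, so walk the whole chain."""
    return rs_j in ancestors(rs_i)

def has_concept_gap(rs_i, rs_j):
    """Rule (1), simplified: a gap exists when no superclass chain links RS_i to RS_j."""
    return rs_i != rs_j and not is_subclass(rs_i, rs_j)

print(is_subclass("CNC_Milling", "Machining"))    # True, via Milling
print(has_concept_gap("CNC_Milling", "Joining"))  # True
```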

b) Inter-data mismatch detection rules

(1) If DUnitTransfer($RS_i$) equals $RS_j$, then the corresponding parameters of $RS_i$ and $RS_j$ have the same data type but different units, where DUnitTransfer() is the data unit conversion function. (2) If DTypeTransfer($RS_i$) equals $RS_j$, then $RS_i$ and $RS_j$ have the same parameter concept but different data types, where DTypeTransfer() is the data type conversion function.

c) Attribute mismatch detection rule

If $RS_j$ requires more attribute parameters than $RS_i$ can provide, and the intersection of $RS_i$ and […] is not empty, then the attributes of $RS_i$ cannot satisfy the requirements of $RS_j$, where […] is the split function.

The resource service composition detection rules given here are only a subset; in practice, new rules can be designed and added as needed.

d) QoS inconsistency detection rule

Let […] and […] be the numbers of parameters of $RS_i$ and $RS_j$ respectively. If analysis shows that the QoS of two adjacent resource services $RS_i$ and $RS_j$ in a composite service is consistent, the composite service is valid; otherwise the system determines that an RS_Composition_Failure has occurred.

(3) Task related fault detection

1) Task_Cancel_Failure and Task_Suspension_Failure detection

To detect whether a Task_Cancel_Failure or Task_Suspension_Failure has occurred during resource service optimal allocation, the task detector periodically and continuously checks the current state of each task. If a task is in the Task_Suspended queue, the system determines that a Task_Suspension_Failure has occurred; if it is in the Task_Terminated state, the system determines that a Task_Cancel_Failure has occurred.
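
One pass of such a task detector might look like the following sketch. Representing the Task_Suspended queue and the Task_Terminated state as plain sets of task identifiers is an assumption made for brevity, as are the function names.

```python
def detect_task_fault(task_id, suspended_queue, terminated):
    """Map a task's current state to a fault name, or None if no fault is detected."""
    if task_id in suspended_queue:
        return "Task_Suspension_Failure"
    if task_id in terminated:
        return "Task_Cancel_Failure"
    return None

def poll_tasks(task_ids, suspended_queue, terminated):
    """One pass of the periodic check; returns {task_id: fault} for faulty tasks only."""
    faults = {}
    for task_id in task_ids:
        fault = detect_task_fault(task_id, suspended_queue, terminated)
        if fault is not None:
            faults[task_id] = fault
    return faults

faults = poll_tasks(["t1", "t2", "t3"], suspended_queue={"t2"}, terminated={"t3"})
print(faults)  # {'t2': 'Task_Suspension_Failure', 't3': 'Task_Cancel_Failure'}
```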

2) Task_Resource_Mismatch_Failure detection

Suppose resource service $RS_i$ is allocated to execute task $Task_j$. Following the resource service matching algorithm, let $\zeta_{Bas}$, $\zeta_{I/O}$, $\zeta_{QoS}$ and $\zeta$ be the basic matching threshold, I/O matching threshold, QoS matching threshold and comprehensive matching threshold set by the system or the user. Then:

(1) If the basic matching value of resource service $RS_i$ and task $Task_j$ is less than the basic matching threshold $\zeta_{Bas}$, the system determines that a basic matching fault has occurred;

(2) If the I/O matching value of resource service $RS_i$ and task $Task_j$ is less than the I/O matching threshold $\zeta_{I/O}$, the system determines that an I/O matching fault has occurred;

(3) If the QoS matching value of resource service $RS_i$ and task $Task_j$ is less than the QoS matching threshold $\zeta_{QoS}$, the system determines that a QoS matching fault has occurred;

(4) If the final matching value of resource service $RS_i$ and task $Task_j$ is less than the comprehensive matching threshold $\zeta$, the system determines that a comprehensive matching fault has occurred.

The detection flow for Task_Resource_Mismatch_Failure is shown in Fig. 3. Task_RequireChange_Failure is detected in the same way as Task_Resource_Mismatch_Failure.
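
The four threshold comparisons can be sketched as below. The dictionary keys and the numeric values are hypothetical, and the matching values themselves are assumed to come from a resource service matching algorithm that is not shown.

```python
def detect_mismatch(match, thresholds):
    """Return the matching faults for one (RS_i, Task_j) pair.

    match, thresholds -- dicts keyed by 'basic', 'io', 'qos', 'overall'
    (hypothetical key names for the basic, I/O, QoS and comprehensive values).
    """
    names = {
        "basic": "basic matching fault",
        "io": "I/O matching fault",
        "qos": "QoS matching fault",
        "overall": "comprehensive matching fault",
    }
    return [names[k] for k in names if match[k] < thresholds[k]]

thresholds = {"basic": 0.6, "io": 0.6, "qos": 0.7, "overall": 0.65}
match = {"basic": 0.9, "io": 0.8, "qos": 0.5, "overall": 0.7}

faults = detect_mismatch(match, thresholds)
print(faults)  # ['QoS matching fault']
```

Any non-empty result would be reported as a Task_Resource_Mismatch_Failure.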

(4) Application related fault detection

1) Trust_Failure detection

Let the two entities taking part in a transaction in RSOA be x and y. During optimal allocation, the trust value $T_{x \to y}$ between x and y can be evaluated with the resource service Trust-QoS evaluation model. If the minimum trust requirement of entity x towards y is $T_{x \to y}^{\circ}$, then when $T_{x \to y} < T_{x \to y}^{\circ}$ the system determines that a Trust_Failure has occurred, as shown in Fig. 4.

2) App_DesignCode_Failure and App_AccessRight_Failure detection

As with VL_Disconnect_Failure, App_DesignCode_Failure and App_AccessRight_Failure are mainly detected by the system security policy or by system middleware embedded in the service manufacturing system, chiefly using the relevant services or middleware provided by Globus.

(4) Fault recovery module

Once a fault has occurred and been detected, it must be repaired. Current fault-tolerance mechanisms are mainly the following:

1) Checkpoint-based task fault tolerance: the system sets checkpoints periodically, saving the correct state of the running program to reliable storage. When a fault occurs, the program is restored to the most recent saved state and resumes, minimizing the loss caused by the fault.

2) Retry-based task fault tolerance: during resource service optimal allocation, if a failed operation, or one that has not yet run, cannot be ignored, the system tries to re-execute it without changing the execution path. Retries are bounded by a maximum number of repetitions: if the abnormal activity has been re-executed up to that maximum and the problem is still unresolved, the repetition stops.

3) Replication-based task fault tolerance: copies of a task are run on different resources; as long as not all copies fail, the task eventually completes successfully.

4) Substitution-based fault tolerance: when a task fails, another task with the same function is run in its place.

5) Redundancy-based task fault tolerance: several different activities or execution paths that can accomplish the task are selected; their execution characteristics differ, but their function is the same.

6) User-defined exception handling: user-defined exceptions let the user define exception handling methods for particular tasks. If a fault occurs at run time, the exception handling method defined on that task is activated.
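
Two of the strategies above, retry (2) and substitution (4), can be combined in a short Python sketch. The function names and the simulated failing task are hypothetical; the sketch only illustrates the bounded-retry behaviour described in the list.

```python
def run_with_retry(operation, max_retries):
    """Retry strategy: re-run the same operation, unchanged, up to max_retries times."""
    last_error = None
    for _ in range(max_retries):
        try:
            return operation()
        except Exception as exc:  # broad on purpose: any failure counts as a fault here
            last_error = exc
    raise RuntimeError("retries exhausted") from last_error

def run_with_substitute(primary, substitute, max_retries=3):
    """Substitution strategy: fall back to an equivalent task once retries run out."""
    try:
        return run_with_retry(primary, max_retries)
    except RuntimeError:
        return substitute()

attempts = []

def flaky():
    """Simulated task that always fails, recording each attempt."""
    attempts.append(1)
    raise ConnectionError("resource service unavailable")

result = run_with_substitute(flaky, lambda: "done by substitute")
print(result, len(attempts))  # done by substitute 3
```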

The present invention also adopts event-condition-action (ECA) rule to support the RSOA fault-tolerant management except the above fault tolerant mechanism of comprehensive employing.Be the Event of eca rule by the fault definition that will occur in the RSOA process; Be the Condition of eca rule with the fault detect conditional definition; The processing that fault is made (such as again scheduling, coupling etc. again) is defined as the Action of eca rule.

With reference to typical ECA rules, the present invention designs the ECA-based RSOA fault resolution module shown in Figure 2. It mainly comprises an event detector, a condition evaluator, an action executor, a rule engine, an ECA rule base, and an ECA rule manager.

1) Event Detector: receives the fault messages sent by the fault detection module and analyzes them to identify the Event of the detected fault.

2) Condition Evaluator: evaluates the Condition associated with the detected Event to determine whether it satisfies the condition of the corresponding ECA rule.

3) Rule Engine: performs inference matching between the detected Event and the rules in the ECA rule base to find a suitable rule for handling the detected fault.

4) Action Executor: based on the result of the Rule Engine's inference, executes the Action of the selected ECA rule to handle the fault.

5) ECARules: the ECA rule base.

6) ECARule Manager: manages the ECA rules, including modifying, adding, and deleting rules.
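The cooperation of the six parts listed above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the module of Figure 2 itself: the class and function names, the dictionary-shaped fault message, and the sample rule (its load threshold and its "reschedule" action) are all hypothetical. Only the fault name `RS_Overload_Failure` and the division of responsibilities come from the text.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

Context = Dict[str, Any]  # hypothetical shape of a fault message


@dataclass
class ECARule:
    name: str
    event: str                            # fault type that triggers the rule
    condition: Callable[[Context], bool]  # fault detection condition
    action: Callable[[Context], str]      # handling, e.g. rescheduling or rematching


class ECARuleManager:
    """Adds, modifies and deletes rules; `rules` plays the role of the ECA rule base."""

    def __init__(self) -> None:
        self.rules: Dict[str, ECARule] = {}

    def add(self, rule: ECARule) -> None:
        self.rules[rule.name] = rule

    def modify(self, rule: ECARule) -> None:
        if rule.name not in self.rules:
            raise KeyError(rule.name)
        self.rules[rule.name] = rule

    def delete(self, name: str) -> None:
        del self.rules[name]


class EventDetector:
    def detect(self, message: Context) -> str:
        # Extract the Event (fault type) from a message sent by the fault detection module.
        return message["fault"]


class ConditionEvaluator:
    def holds(self, rule: ECARule, context: Context) -> bool:
        return rule.condition(context)


class RuleEngine:
    def __init__(self, manager: ECARuleManager, evaluator: ConditionEvaluator) -> None:
        self.manager = manager
        self.evaluator = evaluator

    def match(self, event: str, context: Context) -> Optional[ECARule]:
        # Return the first rule whose event matches and whose condition is satisfied.
        for rule in self.manager.rules.values():
            if rule.event == event and self.evaluator.holds(rule, context):
                return rule
        return None


class ActionExecutor:
    def execute(self, rule: ECARule, context: Context) -> str:
        return rule.action(context)


def handle_fault(message: Context, detector: EventDetector,
                 engine: RuleEngine, executor: ActionExecutor) -> Optional[str]:
    """Detect the event, match a rule, and execute its action (None if no rule applies)."""
    event = detector.detect(message)
    rule = engine.match(event, message)
    return executor.execute(rule, message) if rule is not None else None
```

A hypothetical rule could be registered with `manager.add(ECARule("overload", "RS_Overload_Failure", lambda c: c["load"] > 0.9, lambda c: "reschedule " + c["task"]))`. A message such as `{"fault": "RS_Overload_Failure", "load": 0.95, "task": "t1"}` would then be routed to that rule's action.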

In the proposed fault-tolerance mechanism for RSOA in service manufacturing systems, ECA rules are used directly to support fault recovery. For the faults and detection methods given above, the present invention designs the ECA rules shown in Table 1 to support RSOA fault recovery in service manufacturing systems such as CMfg.

The RSOA fault resolution rules for service manufacturing systems listed in Table 1 are only part of the ECA rule base. In practical applications, new rules can be designed as needed and added to the ECA rule base through the ECARule Manager.

Table 1

Claims

1. A system for implementing fault-tolerant management of resource service optimization allocation (RSOA), characterized in that the system comprises an information service module, a resource service optimization allocation module, a fault detection module, and a fault recovery module.

2. The system for implementing fault-tolerant management of resource service optimization allocation according to claim 1, characterized in that: fault detection in the fault detection module comprises virtual link (VL) related fault detection, resource service (RS) related fault detection, task (Task) related fault detection, and application related fault detection; virtual link related fault detection mainly comprises virtual link disconnection fault (VL_Disconnect_Failure) detection and insufficient bandwidth fault (Bandwidth_Failure) detection; the virtual link disconnection fault (VL_Disconnect_Failure) can usually be detected by the system security policy or by middleware embedded in the service manufacturing system; whether a bandwidth fault has occurred between two entities is judged using two indices: communication time, and successful communication rate (or reliability); resource service related fault detection mainly comprises resource service quit fault (RS_Quit_Failure) detection, resource service overload fault (RS_Overload_Failure) detection, and resource service composition fault (RS_Composition_Failure) detection; the resource service quit fault (RS_Quit_Failure) is judged by a resource service detector that periodically and continuously checks the state of each resource service; the resource service overload fault (RS_Overload_Failure) is judged by evaluating the data-processing capability, communication time, and execution time of $RS_i$ to determine whether $RS_i$ is overloaded; the resource service composition fault (RS_Composition_Failure) is judged by checking whether the concept mismatch detection rule, the data mismatch detection rule, the attribute mismatch detection rule, and the QoS inconsistency detection rule are satisfied; task related fault detection mainly comprises task cancellation fault (Task_Cancel_Failure) detection, task suspension fault (Task_Suspension_Failure) detection, and resource-task matching failure (Task_Resource_Mismatch_Failure) detection; the task suspension fault (Task_Suspension_Failure) is judged by a task detector that periodically and continuously checks the current state of each task, namely whether the task is in the task suspended (Task_Suspended) queue or has been terminated (Task_Terminated); the resource-task matching failure (Task_Resource_Mismatch_Failure) is detected with a resource service matching algorithm that determines whether a basic matching fault, an I/O matching fault, a QoS matching fault, or a comprehensive matching fault has occurred; the detection method for the task suspension fault (Task_Suspension_Failure) is the same as that for the resource-task matching failure (Task_Resource_Mismatch_Failure); application related fault detection mainly comprises trust fault (Trust_Failure) detection, application design or coding fault (App_DesignCode_Failure) detection, and access rights fault (App_AccessRight_Failure) detection; the trust fault (Trust_Failure) is judged by comparing the trust value $T_{x \to y}$ between x and y, evaluated with the resource service Trust-QoS evaluation model, against the minimum trust requirement $T^{\circ}_{x \to y}$ of entity x toward y; the application design or coding fault (App_DesignCode_Failure) and the access rights fault (App_AccessRight_Failure) are mainly detected by the system security policy or by system middleware embedded in the service manufacturing system.

3. The system for implementing fault-tolerant management of resource service optimization allocation according to claim 1, characterized in that: in an ECA (Event-Condition-Action) rule, the event (Event) is defined as the event that triggers a rule (Rule), the condition (Condition) is defined as the condition that must be satisfied to activate that rule (Rule), and the action (Action) is the action instruction to be executed after an ECA rule is triggered; a fault occurring in the RSOA process is defined as the event (Event) of an ECA rule; the fault detection condition is defined as the condition (Condition) of the ECA rule; and the handling of the fault is defined as the action (Action) of the ECA rule.

4. The system for implementing fault-tolerant management of resource service optimization allocation according to claim 3, characterized in that the handling of the fault is specifically rescheduling or rematching.

5. The system for implementing fault-tolerant management of resource service optimization allocation according to claim 1, characterized in that: the event detector (Event Detector) receives the fault messages sent by the fault detection module and analyzes them to identify the event (Event) of the detected fault; the condition evaluator (Condition Evaluator) evaluates the condition (Condition) associated with the detected event (Event) to determine whether it satisfies the condition of the corresponding ECA rule; the rule engine (Rule Engine) performs inference matching between the detected event (Event) and the rules in the ECA rule base to find a suitable rule for handling the detected fault; the action executor (Action Executor), based on the result of the Rule Engine's inference, executes the action of the selected ECA rule to handle the fault; the ECA rule manager (ECA Rule Manager) manages the ECA rules, including modifying, adding, and deleting rules; and the ECA rule base (ECA Rules) mainly stores the various rules required in the fault resolution process.