CN102916830B

CN102916830B - Implement system for resource service optimization allocation fault-tolerant management

Info

Publication number: CN102916830B
Application number: CN2012103356095A
Authority: CN
Inventors: 陶飞; 程颖; 张霖
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2012-09-11
Filing date: 2012-09-11
Publication date: 2013-12-11
Anticipated expiration: 2032-09-11
Also published as: CN102916830A

Abstract

The invention relates to a system for implementing fault-tolerant management of resource service optimal allocation (RSOA). Fault-tolerant management mechanisms are designed according to the causes and types of the faults that arise during RSOA, so that each fault can be detected and resolved. The system comprises an information service module, an RSOA module, a fault detection module and a fault recovery module. It offers good modularity, maintainability and extensibility, can effectively detect and resolve the various faults that arise during RSOA, and improves the stability of the whole service manufacturing system and the reliability of RSOA. In particular, the system detects the common faults caused by virtual links, resources, tasks and applications during RSOA in a service manufacturing system and provides a resolution strategy for each, which improves the reliability and service quality of RSOA.

Description

A system for implementing fault-tolerant management of resource service optimal allocation

Technical field

The invention belongs to the technical field of fault-tolerant management for distributed manufacturing information integration systems. Specifically, it relates to a system for implementing fault-tolerant management of resource service optimal allocation (RSOA): a fault-tolerant management framework for RSOA in service-oriented manufacturing systems, together with the corresponding fault detection methods and an ECA-based (Event-Condition-Action) fault resolution mechanism. The invention can effectively detect the common faults that occur during RSOA in a service manufacturing system and provides a resolution strategy for each, thereby improving the reliability and service quality of RSOA.

Background art

In a service manufacturing system (for example a cloud manufacturing (CMfg) system, a manufacturing service system or a manufacturing grid system), the operations involved in manufacturing resource service optimal allocation (RSOA) include resource service search and matching, QoS extraction, QoS evaluation, resource service selection and resource service composition. Any of these operations may fail for some reason and cause the whole allocation to fail or become invalid. The main possible causes are:

1. the virtual link between two nodes of the service manufacturing system is disconnected, or its bandwidth suddenly drops and can no longer meet the requirements;

2. an invoked resource service fails or changes state during execution, for example it is suddenly shut down or exits, a resource service composition becomes invalid, or the capability of the resource service suddenly drops or the service is overloaded;

3. a submitted or running task changes state, for example it is forcibly terminated by an administrator or the user, its requirements are raised, it is suspended, or it is assigned an invalid resource service;

4. problems arise in the application process, for example insufficient trust between the two parties to a transaction, wrong access rights, or unreasonable or incorrect code design.

The present invention refers to these phenomena collectively as faults. Once any of them occurs, RSOA may be suspended or fail. Therefore, to improve the reliability and service quality of RSOA, the following problems must be solved: (1) Which faults occur during RSOA? (2) How can the faults that occur be detected? (3) How can a detected fault be analysed and recovered from?

To date, these problems have not been studied in service manufacturing fields such as CMfg. To solve them, achieve fault-tolerant management during RSOA and improve the reliability and service quality of RSOA, the present invention first analyses and classifies the faults that may occur during RSOA, and on this basis studies a mechanism for implementing RSOA fault-tolerant management together with the corresponding fault detection methods and resolution strategies.

Summary of the invention

The purpose of the present invention is to provide a mechanism for implementing fault-tolerant management of resource service optimal allocation that effectively detects the common faults arising during RSOA in a service manufacturing system and provides a resolution strategy and method for each kind of fault, thereby improving the reliability and service quality of RSOA.

The technical solution adopted by the present invention is a system for implementing fault-tolerant management of resource service optimal allocation (RSOA), comprising an information service module, a resource service optimal allocation module, a fault detection module and a fault recovery module.

The information service module mainly provides information and data support for fault detection, fault recovery and resource service optimal allocation.

The resource service optimal allocation module mainly implements functional operations such as resource service search, quality of service (QoS) evaluation, resource service selection and resource service composition.

The fault detection module monitors the state of each node in the service manufacturing system and of the running tasks and resources, and analyses that state continuously. It analyses and collects statistics on the historical data of instances that exited normally or abnormally, makes a decision on each detected fault, and notifies the fault recovery module to handle it.

The fault recovery module is an ECA-based (Event-Condition-Action) fault resolution module for RSOA that combines several fault-tolerance mechanisms. It mainly comprises an event detector (Event Detector), a condition evaluator (Condition Evaluator), an action executor (Action Executor), a rule inference engine (Rule Engine), an ECA rule base (ECA Rules) and an ECA rule manager (ECA Rule Manager).

In the fault detection module, fault detection covers four categories: faults related to virtual links (VL), faults related to resource services (RS), faults related to tasks (Task), and faults related to applications.

Detection of virtual-link faults mainly comprises virtual link disconnection fault (VL_Disconnect_Failure) detection and insufficient bandwidth fault (Bandwidth_Failure) detection. A VL_Disconnect_Failure can usually be detected by the system security policy or by middleware embedded in the service manufacturing system. Whether a bandwidth-related fault has occurred between two entities is judged from two indicators: communication time, and successful communication rate (reliability).

Detection of resource-service faults mainly comprises resource service quit fault (RS_Quit_Failure) detection, resource service overload fault (RS_Overload_Failure) detection and resource service composition fault (RS_Composition_Failure) detection. An RS_Quit_Failure is judged by a resource service detector that periodically and continuously checks the state of each resource service. An RS_Overload_Failure is judged by evaluating the data-handling capacity, communication time and execution time of $RS_i$ to decide whether $RS_i$ is overloaded. An RS_Composition_Failure is judged by checking whether the detection rules for mismatches between concepts, mismatches between data, attribute mismatches and QoS inconsistency are satisfied.

Detection of task faults mainly comprises task cancellation fault (Task_Cancel_Failure) and task suspension fault (Task_Suspension_Failure) detection, and resource-task mismatch fault (Task_Resource_Mismatch_Failure) detection. For Task_Cancel_Failure and Task_Suspension_Failure, a task detector periodically and continuously checks the current state of each task and judges whether the task is in the task suspension (Task_Suspended) queue or has been terminated (Task_Terminated). For Task_Resource_Mismatch_Failure, the resource service matching algorithm is used to determine whether a basic matching fault, an I/O matching fault, a QoS matching fault or a comprehensive matching fault has occurred. The fault caused by a change in task requirements (Task_RequireChange_Failure) is detected in the same way as Task_Resource_Mismatch_Failure.

Detection of application faults mainly comprises trust fault (Trust_Failure) detection, and application design or coding fault (App_DesignCode_Failure) and access rights fault (App_AccessRight_Failure) detection. A Trust_Failure is judged by comparing the trust value $T_{x \to y}$ between x and y, evaluated with the resource service Trust-QoS evaluation model, against the minimum trust $T_{x \to y}^{\circ}$ that entity x requires of y. App_DesignCode_Failure and App_AccessRight_Failure are mainly detected by the system security policy or by system middleware embedded in the service manufacturing system.

In an ECA (Event-Condition-Action) rule, the event (Event) is the occurrence that triggers the rule, the condition (Condition) is what must be satisfied for the rule to be activated, and the action (Action) is the command executed once the rule has been triggered. Here, a fault occurring during RSOA is defined as the Event of an ECA rule, the fault detection condition as its Condition, and the handling of the fault as its Action.

The handling of a fault is specifically rescheduling or rematching.

The event detector (Event Detector) mainly receives the fault messages sent by the fault detection module and analyses the event (Event) of the detected fault. The condition evaluator (Condition Evaluator) evaluates the condition (Condition) associated with the detected event to see whether it satisfies the condition of the corresponding ECA rule. The rule inference engine (Rule Engine) matches the detected event against the rules in the ECA rule base by inference and finds a suitable rule to handle the detected fault. The action executor (Action Executor) executes the action of the selected ECA rule, according to the result of the Rule Engine's inference, to handle the fault. The ECA rule manager (ECA Rule Manager) manages the ECA rules, including modifying, adding and deleting rules. The ECA rule base (ECA Rules) mainly stores the rules needed during fault resolution.

Compared with the prior art, the present invention has the following advantages:

(1) Fault-tolerant management mechanisms are designed specifically according to the causes and classes of the faults that arise during resource service optimal allocation (RSOA), so that each fault can be detected and resolved. The invention can effectively detect the common faults caused by virtual links, resources, tasks and applications during RSOA in a service manufacturing system and provides a resolution strategy for each, which improves the reliability and service quality of RSOA.

(2) The invention comprises a framework for implementing RSOA fault-tolerant management, together with the corresponding fault detection methods and an ECA-based (event-condition-action) resolution mechanism. It is applicable to distributed, networked service manufacturing systems and offers good dynamism, modularity, maintainability and extensibility. It can effectively detect and resolve the various faults that arise during RSOA, improving the stability of the whole service manufacturing system and the reliability of RSOA.

Brief description of the drawings

Fig. 1 shows the RSOA fault-tolerant management architecture;

Fig. 2 shows ECA-based fault recovery;

Fig. 3 is the detection flow chart for Task_Resource_Mismatch_Failure;

Fig. 4 is the detection flow chart for Trust_Failure;

Table 1 lists part of the ECA rules for RSOA fault-tolerant management.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings.

The invention relates to a mechanism and method for implementing fault-tolerant management of resource service optimal allocation. By analysing and classifying the faults that may occur during RSOA, it studies an RSOA fault-tolerant management architecture and the corresponding concrete fault detection methods and resolution strategies.

Resource service optimal allocation is said to have failed if and only if at least one of the following two situations occurs: (1) a resource crashes and therefore stops providing service; (2) the availability of a resource falls below the minimum QoS standard of the task. In practice, the types of RSOA faults in service manufacturing systems such as cloud manufacturing are many and varied; the common faults are mainly related to four factors: virtual links, resources, tasks and applications.

(1) Faults related to virtual links

A virtual link (VL) is a connection, in the broad sense, between two entities in the service manufacturing system. The faults caused by a VL are mainly the virtual link disconnection fault and the insufficient bandwidth fault.

(2) Faults related to resource services

A resource service is the carrier that executes a task. Therefore the exit or overload of a resource service, a change in its QoS or capability, or the composition of resource services may all cause an RSOA fault. The faults caused by resource services are mainly: the resource service quit fault, the resource service overload (or saturation) fault, the resource service composition fault, and the fault caused by a change in resource service capability.

The resource service composition fault in turn mainly covers: mismatches between resource service concepts, mismatches between data, attribute mismatches, and QoS inconsistency.

(3) Faults related to tasks

During RSOA, a task may be cancelled or suspended for various reasons, causing the allocation to fail. The RSOA faults caused by tasks are mainly: the task cancellation fault, the task suspension fault, the resource-task mismatch fault, and the fault caused by a change in task requirements.

(4) Faults related to applications

In the application process, RSOA may fail for reasons such as trust, access rights or coding. Examples are the trust fault, the application design or coding fault, and the access rights fault.

The four classes of faults that may arise during RSOA reduce the efficiency and service quality of the whole RSOAS. To support fault tolerance during RSOA, and in combination with the RSOAS framework, the present invention proposes the RSOA fault-tolerant management architecture shown in Fig. 1.

The RSOAS fault-tolerance architecture consists of four parts: the information service module, the resource service optimal allocation module, the fault detection module and the fault recovery module. To make the RSOAS fault tolerant, the key problems are the detection and the resolution of faults.

The invention relates to a mechanism and method for implementing fault-tolerant management of resource service optimal allocation, comprising a framework for implementing that management and the corresponding fault detection methods and ECA-based resolution mechanism.

Of the four parts of the architecture in Fig. 1, the fault detection module and the fault recovery module are the key content of the present invention.

(1) Information service module

The information service module mainly provides the directory information service (IIS), the resource information service (RIS), resource service encapsulation and the QoS database, giving information and data support to fault detection, fault recovery and resource service optimal allocation.

The directory information service (IIS) organises information so that aggregate queries can be answered, supports efficient queries across multiple RIS instances, and provides information indexing and search for the whole service-oriented manufacturing system. The IIS consists of three parts: generic registration handling, a pluggable directory structure and search handling.

The resource information service (RIS) runs on the resource side. It provides a uniform means of querying the configuration, capability and state of resources on the system platform, and can itself be configured as an aggregate directory service. After authenticating and parsing the incoming demands and tasks, the RIS distributes each query to one or more information providers according to the type of the requested information, and then passes the information returned about the resources to the user.

The resource service encapsulation template enables the platform to manage effectively the information of the nodes taking part in collaborative manufacturing. Resources are classified and encapsulated according to their attributes (such as physical characteristics, geographical location, dynamic characteristics, sensitivity and function), customer demands (such as time, quality, price and service) and the way they are used (such as discovery, proxy, monitoring and diagnosis). After a resource provider registers a resource on the platform, the resource is encapsulated into a resource service class template. When a client requests a resource service, it downloads the corresponding resource encapsulation template from the system platform and instantiates it for the specific task.

The QoS information of the relevant resource services is extracted from the QoS database. The corresponding QoS indicator parameters are evaluated and measured and the QoS values compared, providing information and data support for the subsequent resource service selection and composition.

(2) Resource service optimal allocation module

The resource service optimal allocation module mainly provides functional operations such as resource service search, QoS evaluation, resource service selection and resource service composition.

Resource service search provides the various resource service information matching algorithms. According to the resource service demands of the subtasks produced by task decomposition, it finds the resource services in the resource service base that meet the requirements and generates a candidate resource service set (RSS).

QoS evaluation is applied to the large candidate resource service sets found to meet the user's demands. Its purpose is to give the user and the system a quantitative basis for choosing the best resource service and carrying out the optimal allocation, and it is an important step in resource service optimal allocation. The QoS information of the resource services in an RSS is extracted from the resource service descriptions (OWL-S/WSDL) registered in the service base or from the QoS database. The corresponding QoS indicator parameters are evaluated and measured and the QoS values compared, providing information and data support for the subsequent resource service selection and composition.

Resource service selection: if the task submitted by the user demands a single resource service, the candidates in the RSS are comprehensively evaluated and ranked according to the required QoS parameters, and the best resource service is selected to execute the task.

Resource service composition and selection: if the task submitted by the user demands multiple resource services, one resource service is selected from each RSS and the selected services are arranged in a certain order to form a composite resource service; the optimal composition is then chosen from all possible compositions to execute the task.
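The two selection cases above can be sketched as follows. This is only an illustrative sketch: the weighted-sum score, the QoS attribute names (`reliability`, `speed`) and the exhaustive enumeration of compositions are assumptions made for the example, since the text does not specify the ranking function or the search strategy.

```python
from itertools import product


def qos_score(qos, weights):
    """Weighted sum of normalised QoS values; higher is better (assumed scoring)."""
    return sum(weights[name] * qos[name] for name in weights)


def select_best(rss, weights):
    """Single-service demand: rank one candidate set (RSS) and return the best service."""
    return max(rss, key=lambda rs: qos_score(rs["qos"], weights))


def compose_best(rss_list, weights):
    """Multi-service demand: take one service from each RSS and keep the best composition."""
    def total(combo):
        return sum(qos_score(rs["qos"], weights) for rs in combo)
    return max(product(*rss_list), key=total)


# Hypothetical candidate sets for two subtasks.
weights = {"reliability": 0.6, "speed": 0.4}
rss_a = [{"name": "A1", "qos": {"reliability": 0.9, "speed": 0.5}},
         {"name": "A2", "qos": {"reliability": 0.7, "speed": 0.9}}]
rss_b = [{"name": "B1", "qos": {"reliability": 0.8, "speed": 0.8}},
         {"name": "B2", "qos": {"reliability": 0.6, "speed": 0.6}}]

best_single = select_best(rss_a, weights)
best_combo = [rs["name"] for rs in compose_best([rss_a, rss_b], weights)]
```

Exhaustive enumeration grows with the product of the RSS sizes, so a real system with large candidate sets would likely need a heuristic search instead.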

(3) Fault detection module

The fault detection module monitors the state of each node in the service manufacturing system and of the running tasks and resources, and analyses that state continuously. A local detector monitors the allocation workflow and the performance and running status of the resources and tasks involved, and provides a series of management services such as task state management and resource service state management. The module analyses and collects statistics on the historical data of instances that exited normally or abnormally, makes a decision on each detected fault, and notifies the fault recovery module to handle it.

The concrete detection methods for the four classes of faults, related to virtual links, resource services, tasks and applications, are described in detail below.

(1) Virtual link fault detection

1) VL_Disconnect_Failure detection

A VL_Disconnect_Failure can usually be detected by the system security policy or by middleware embedded in the service manufacturing system; for example, the GRAM service provided by Globus can be used to detect it.

2) Bandwidth_Failure detection

Whether a bandwidth-related fault has occurred between two entities in the system (denoted A and B) is judged from two indicators: communication time (CT), and successful communication rate (PSC), i.e. reliability.

A) Judgement by communication time

Let the virtual link between A and B be VL(A, B), its total information exchange volume be SumInfor(A, B), its transmission speed (bandwidth) be V(A, B) and its waiting time be Waite(A, B). The total communication time, denoted $T_c(A, B)$, is the sum of the transmission time and the waiting time.

B) Judgement by successful communication rate (PSC), or reliability

Let the failure rates of the virtual link VL(A, B), node A and node B be α(A, B), α(A) and α(B), respectively. The reliability $S_c(A, B)$ of VL(A, B) can then be obtained from the definition of reliability.

Let the user specify minimum requirements for CT and PSC. When the virtual link VL(A, B) violates either requirement, that is, when $T_c(A, B)$ exceeds the CT requirement or $S_c(A, B)$ falls below the PSC requirement, the system judges that a Bandwidth_Failure has occurred.
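A minimal sketch of the Bandwidth_Failure check is given below. Several things are assumptions made only for illustration: the transmission time is taken as volume divided by bandwidth, the reliability uses a simple series model with independent failures (the source formula is not reproduced in this text), and the threshold names `ct_limit` and `psc_min` are hypothetical.

```python
def communication_time(sum_infor, bandwidth, wait):
    """T_c(A, B): transmission time (assumed volume / bandwidth) plus waiting time."""
    return sum_infor / bandwidth + wait


def link_reliability(alpha_link, alpha_a, alpha_b):
    """S_c(A, B) under an ASSUMED series model with independent failures."""
    return (1 - alpha_link) * (1 - alpha_a) * (1 - alpha_b)


def bandwidth_failure(t_c, s_c, ct_limit, psc_min):
    """A fault is reported when either user requirement is violated."""
    return t_c > ct_limit or s_c < psc_min


# Hypothetical link: 800 units of data over a 100 units/s link with 2 s of waiting.
t_c = communication_time(sum_infor=800.0, bandwidth=100.0, wait=2.0)
s_c = link_reliability(alpha_link=0.5, alpha_a=0.0, alpha_b=0.0)
fault = bandwidth_failure(t_c, s_c, ct_limit=12.0, psc_min=0.9)
```

In this example the communication time is within its limit, so the fault is raised by the reliability indicator alone.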

(2) Resource-service-related fault detection

1) RS_Quit_Failure detection

To detect whether an RS_Quit_Failure has occurred during resource service optimal allocation, the resource service detector periodically and continuously checks the state of each resource service. If a resource service does not respond, the system judges that an RS_Quit_Failure has occurred.

2) RS_Overload_Failure detection

Whether $RS_i$ is overloaded, i.e. whether an RS_Overload_Failure has occurred in the system, is judged by evaluating the data-handling capacity (DC), communication time (CT) and execution time (ET) of $RS_i$.

Let the set of tasks assigned to $RS_i$ in a given period be $\Gamma_i = \{Task_1, Task_2, \ldots, Task_j, \ldots, Task_k\}$, where each $Task_j$ needs a certain quantity of $RS_i$. Let $d_{i,j}$ be the amount of data access required when $Task_j$ calls $RS_i$, $V(i, j)$ the bandwidth of the virtual link between $Task_j$ and $RS_i$, and $Et_j$ the execution time each $RS_i$ needs to carry out $Task_j$. The ET, DC and CT of $RS_i$ at run time are computed from these quantities.

Let the upper limits of ET, CT and DC for $RS_i$ be $Lim_{RS_i}^{ET}$, $Lim_{RS_i}^{CT}$ and $Lim_{RS_i}^{DC}$, respectively. When the system detects that $RS_i$ satisfies any one of

$ET_{RS_i} > Lim_{RS_i}^{ET}$,

$CT_{RS_i} > Lim_{RS_i}^{CT}$,

$DC_{RS_i} > Lim_{RS_i}^{DC}$,

the system judges that an RS_Overload_Failure has occurred.
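The overload test can be sketched as below. The aggregation formulas are assumptions, because the source equations for ET, DC and CT are not reproduced in this text: here ET is the sum of the execution times, DC the sum of the data access amounts, and CT the sum of each data amount divided by its link bandwidth. The per-task quantity of $RS_i$ is ignored for simplicity.

```python
def overload_indicators(tasks):
    """Aggregate ET, DC and CT for one resource service (ASSUMED to be plain sums)."""
    et = sum(t["et"] for t in tasks)           # total execution time
    dc = sum(t["d"] for t in tasks)            # total data access amount
    ct = sum(t["d"] / t["v"] for t in tasks)   # total transfer time over each link
    return {"ET": et, "CT": ct, "DC": dc}


def overload_failure(tasks, limits):
    """Return the indicators that exceed their limit; any entry means RS_Overload_Failure."""
    values = overload_indicators(tasks)
    return [name for name in ("ET", "CT", "DC") if values[name] > limits[name]]


# Hypothetical task set assigned to one resource service.
tasks = [{"d": 100, "v": 50, "et": 3},
         {"d": 200, "v": 50, "et": 4}]
limits = {"ET": 10, "CT": 5, "DC": 400}
exceeded = overload_failure(tasks, limits)
```

Returning the list of violated indicators, rather than a bare boolean, lets a recovery rule distinguish a communication bottleneck from a compute bottleneck.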

3) RS_Composition_Failure detection

RS_Composition_Failure mainly covers four situations: mismatch between concepts, mismatch between data, attribute mismatch and QoS inconsistency.

A) Detection rules for mismatches between concepts

(1) If $RS_i$ is a subclass of $RS_k$ and $RS_k$ is not contained in $RS_j$, then there is a gap between $RS_i$ and $RS_j$. This is the detection rule for a concept mismatch in which a gap exists between the concepts. (2) If $RS_i$ is a subclass of $RS_k$ and $RS_k$ is a subclass of $RS_j$, then $RS_i$ is a subclass of $RS_j$. This is the detection rule for a concept mismatch in which $RS_i$ is a subclass of $RS_j$.

B) Detection rules for mismatches between data

(1) If DUnitTransfer($RS_i$) equals $RS_j$, then the same parameter of $RS_i$ and $RS_j$ has the same data type but a different unit (dimension), where DUnitTransfer() is the data unit conversion function. (2) If DTypeTransfer($RS_i$) equals $RS_j$, then $RS_i$ and $RS_j$ have the same parameter concept but different data types, where DTypeTransfer() is the data type conversion function.

C) Detection rule for attribute mismatches

If $RS_j$ requires more attribute parameters than $RS_i$ can provide, and the intersection of $RS_i$ with the output of the split function is not empty, then the attributes of $RS_i$ cannot meet the requirements of $RS_j$.

The rules above are only part of the resource service composition detection rules; in practice, new rules can be designed and added as needed.

D) Detection rule for QoS inconsistency

Consider the numbers of parameters of $RS_i$ and $RS_j$. If analysis shows that the QoS of two adjacent resource services $RS_i$ and $RS_j$ in a composite service is consistent, the composite service is valid; otherwise the system judges that an RS_Composition_Failure has occurred.

(3) Task-related fault detection

1) Task_Cancel_Failure and Task_Suspension_Failure detection

To detect whether a Task_Cancel_Failure or Task_Suspension_Failure has occurred during resource service optimal allocation, the task detector periodically and continuously checks the current state of each task. When a task is in the Task_Suspended queue, the system judges that a Task_Suspension_Failure has occurred. If it is in Task_Terminated, the system judges that a Task_Cancel_Failure has occurred.
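One pass of the task detector might look like the sketch below. The state names Task_Suspended and Task_Terminated come from the text; the dictionary representation of task states and the `Task_Running` state are hypothetical, and a real detector would repeat this check periodically.

```python
def detect_task_faults(task_states):
    """Map each task's current state to the fault it implies, if any."""
    faults = {}
    for task_id, state in task_states.items():
        if state == "Task_Suspended":
            faults[task_id] = "Task_Suspension_Failure"
        elif state == "Task_Terminated":
            faults[task_id] = "Task_Cancel_Failure"
    return faults


# Hypothetical snapshot of three tasks taken by the detector.
snapshot = {"T1": "Task_Running", "T2": "Task_Suspended", "T3": "Task_Terminated"}
task_faults = detect_task_faults(snapshot)
```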

2) Task_Resource_Mismatch_Failure detection

Suppose resource service $RS_i$ is allocated to execute task $Task_j$. According to the resource service matching algorithm, let $\zeta_{bas}$, $\zeta_{I/O}$, $\zeta_{QoS}$ and $\zeta$ be the basic matching threshold, I/O matching threshold, QoS matching threshold and comprehensive matching threshold set by the system or the user. Then:

(1) if the basic matching value of resource service $RS_i$ and task $Task_j$ is less than the basic matching threshold $\zeta_{bas}$, the system judges that a basic matching fault has occurred;

(2) if the I/O matching value of resource service $RS_i$ and task $Task_j$ is less than the I/O matching threshold $\zeta_{I/O}$, the system judges that an I/O matching fault has occurred;

(3) if the QoS matching value of resource service $RS_i$ and task $Task_j$ is less than the QoS matching threshold $\zeta_{QoS}$, the system judges that a QoS matching fault has occurred;

(4) if the final matching value of resource service $RS_i$ and task $Task_j$ is less than the comprehensive matching threshold $\zeta$, the system judges that a comprehensive matching fault has occurred.

The detection flow for Task_Resource_Mismatch_Failure is shown in Fig. 3. Task_RequireChange_Failure is detected in the same way as Task_Resource_Mismatch_Failure.
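The four threshold comparisons can be sketched as follows. The dictionary keys and the numeric values are hypothetical, and the sketch reports every violated threshold at once; whether the flow in Fig. 3 instead stops at the first failed stage is not stated in the text, so that choice is an assumption.

```python
FAULT_LABELS = {
    "basic": "basic matching fault",
    "io": "I/O matching fault",
    "qos": "QoS matching fault",
    "overall": "comprehensive matching fault",
}


def mismatch_faults(match_values, thresholds):
    """Compare each matching value with its threshold and list the faults raised."""
    return [FAULT_LABELS[key]
            for key in ("basic", "io", "qos", "overall")
            if match_values[key] < thresholds[key]]


# Hypothetical matching values for one (RS_i, Task_j) pair.
thresholds = {"basic": 0.6, "io": 0.7, "qos": 0.8, "overall": 0.7}
match_values = {"basic": 0.9, "io": 0.75, "qos": 0.65, "overall": 0.72}
detected = mismatch_faults(match_values, thresholds)
```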

(4) Application-related fault detection

1) Trust_Failure detection

Let the two entities taking part in a transaction in RSOA be x and y. During the allocation, the trust value $T_{x \to y}$ between x and y can be evaluated with the resource service Trust-QoS evaluation model. If the minimum trust that entity x requires of y is $T_{x \to y}^{\circ}$, then when $T_{x \to y} < T_{x \to y}^{\circ}$ the system judges that a Trust_Failure has occurred, as shown in Fig. 4.

2) App_DesignCode_Failure and App_AccessRight_Failure detection

As with VL_Disconnect_Failure, App_DesignCode_Failure and App_AccessRight_Failure are mainly detected by the system security policy or by system middleware embedded in the service manufacturing system, chiefly the related services or middleware provided by Globus.

(4) Fault recovery module

When a fault occurs and is detected, it must be repaired. Current fault-tolerance mechanisms mainly include the following:

1) Checkpoint-based task fault tolerance: the system periodically sets checkpoints and saves the correct state of the running program to reliable storage. When a fault occurs, the program is restored to the most recent saved state and resumes running, which minimises the loss caused by the fault.

2) Retry-based task fault tolerance: during resource service optimal allocation, if a failed operation, whether already executed or not yet completed, cannot be ignored, the system tries to re-execute it without changing the execution path. Retries are limited by a maximum number of repetitions; if the abnormal activity has been repeated the maximum number of times without being resolved, repetition stops.

3) Backup-based task fault tolerance: the idea is to run replicated copies of a task on different resources; as long as not all copies fail, the task eventually runs successfully.

4) Alternative-based fault tolerance: when a task fails, it is replaced by running another task with the same function.

5) Redundancy-based task fault tolerance: the idea is to select several different activities or execution paths that can accomplish the task; although their execution characteristics differ, these activities or paths provide the same function.

6) Fault tolerance based on user-defined exceptions: user-defined exceptions allow the user to define exception handling methods for particular tasks. If a fault occurs at run time, the exception handling method defined on the task is activated.
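Two of the strategies listed above, retry with a maximum number of repetitions and substitution by a functionally equivalent alternative, can be combined as in the following sketch. The function names and the example tasks are hypothetical and are not taken from the patent.

```python
def run_with_retry(task, max_retries):
    """Retry strategy: re-run the same operation, unchanged, up to max_retries times."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for _ in range(max_retries):
        try:
            return task()
        except Exception as error:  # any failure counts as a fault in this sketch
            last_error = error
    raise last_error


def run_with_alternates(tasks, max_retries=3):
    """Alternative strategy: fall back to the next task offering the same function."""
    last_error = RuntimeError("no task supplied")
    for task in tasks:
        try:
            return run_with_retry(task, max_retries)
        except Exception as error:
            last_error = error
    raise last_error


calls = {"n": 0}


def flaky():
    """Hypothetical task that fails twice with a transient fault, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient fault")
    return "done"


def always_fails():
    """Hypothetical task whose resource service has quit for good."""
    raise RuntimeError("resource service quit")


result = run_with_alternates([always_fails, flaky], max_retries=3)
```

Here the first task exhausts its three attempts, after which the alternate succeeds on its third attempt.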

The present invention, except the above fault tolerant mechanism of comprehensive employing, also adopts event-condition-action (ECA) rule to support the RSOA fault-tolerant management.The Event that is eca rule by the fault definition by occurring in the RSOA process; The Condition that is eca rule by the fault detect conditional definition; The processing that fault is made (as dispatched again, coupling etc.) again is defined as the Action of eca rule.

With reference to typical ECA rules, the present invention designs the ECA-based RSOA fault elimination module shown in Fig. 2. It mainly comprises an event detector, a condition evaluator, an action executor, a rule engine, an ECA rule base, and an ECA rule manager.

1) Event Detector: receives the fault messages sent by the fault detection module and analyzes them to identify the Event of the fault.

2) Condition Evaluator: evaluates the Condition associated with the detected Event to see whether it satisfies the condition of the corresponding ECA rule.

3) Rule Engine: performs inference matching between the detected Event and the rules in the ECA rule base to find a suitable rule for handling the detected fault.

4) Action Executor: based on the result of the Rule Engine's inference, executes the Action of the selected ECA rule to handle the fault.

5) ECARules: be the eca rule storehouse.

6) ECARule Manager: manages the ECA rules, including rule modification, addition, and deletion.
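How the six parts might cooperate can be sketched as follows. This is a minimal sketch under stated assumptions, not the module of Fig. 2: all class and method names, the `fault_type` message field, the first-match selection policy, and the example rule are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

Context = Dict[str, Any]


@dataclass
class ECARule:
    name: str
    event: str
    condition: Callable[[Context], bool]
    action: Callable[[Context], str]


class ECARuleManager:
    """Maintains the rule base (ECARules): add, modify, delete."""

    def __init__(self) -> None:
        self.rules: Dict[str, ECARule] = {}  # the ECA rule base, keyed by rule name

    def add(self, rule: ECARule) -> None:
        self.rules[rule.name] = rule

    def modify(self, rule: ECARule) -> None:
        if rule.name not in self.rules:
            raise KeyError(rule.name)
        self.rules[rule.name] = rule

    def delete(self, name: str) -> None:
        del self.rules[name]


class EventDetector:
    def detect(self, message: Context) -> str:
        # Assumes the fault detection module tags each message with its fault type.
        return message["fault_type"]


class ConditionEvaluator:
    def satisfied(self, rule: ECARule, context: Context) -> bool:
        return rule.condition(context)


class RuleEngine:
    def __init__(self, manager: ECARuleManager, evaluator: ConditionEvaluator) -> None:
        self.manager = manager
        self.evaluator = evaluator

    def match(self, event: str, context: Context) -> Optional[ECARule]:
        # First rule whose event matches and whose condition holds (assumed policy).
        for rule in self.manager.rules.values():
            if rule.event == event and self.evaluator.satisfied(rule, context):
                return rule
        return None


class ActionExecutor:
    def execute(self, rule: ECARule, context: Context) -> str:
        return rule.action(context)


def resolve_fault(message: Context, detector: EventDetector,
                  engine: RuleEngine, executor: ActionExecutor) -> Optional[str]:
    """Detect the event, match a rule, and execute its action; None if no rule applies."""
    event = detector.detect(message)
    rule = engine.match(event, message)
    return executor.execute(rule, message) if rule is not None else None
```

A caller would register rules through `ECARuleManager.add` and pass each incoming fault message to `resolve_fault`; adding a rule at run time changes what the engine can match without touching the other components.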

In the proposed RSOA fault-tolerance mechanism for service manufacturing systems, ECA rules directly support fault recovery. For the faults and detection methods given above, the present invention designs the ECA rules shown in Table 1 to support RSOA fault recovery in service manufacturing systems such as CMfg.

The RSOA fault elimination rules for service manufacturing systems listed in Table 1 are only part of the ECA rule base. In practical applications, new rules can be designed as needed and added to the ECA rule base through the ECARule Manager.

Table 1

Claims

1. An implement system for resource service optimization allocation (RSOA) fault-tolerant management, characterized in that the system comprises an information service module, an RSOA module, a fault detection module, and a fault recovery module;

the information service module provides information and data support for fault detection, fault recovery, and RSOA;

the RSOA module implements the functional operations of resource service search, quality of service (QoS) evaluation, resource service optimal selection, and resource service composition;

the fault recovery module is an RSOA fault elimination module based on ECA (Event-Condition-Action) that combines multiple fault-tolerance mechanisms, and comprises an event detector (Event Detector), a condition evaluator (Condition Detector), an action executor (Action Executor), a rule engine (Rule Engine), an ECA rule base (ECARules), and an ECA rule manager (ECA Rule manager);

in the fault detection module, fault detection comprises virtual link (VL) related fault detection, resource service (RS) related fault detection, task (Task) related fault detection, and application related fault detection;

the virtual link related fault detection comprises virtual link disconnection failure (VL_Disconnect_Failure) detection and insufficient bandwidth failure (Bandwidth_Failure) detection; the virtual link disconnection failure (VL_Disconnect_Failure) is usually detected by the system security policy or by middleware embedded in the service manufacturing system; whether a failure between two entities is caused by bandwidth is judged by two indices, namely communication time and successful communication rate or reliability;

the resource service related fault detection comprises resource service quit failure (RS_Quit_Failure) detection, resource service overload failure (RS_Overload_Failure) detection, and resource service composition failure (RS_Composition_Failure) detection; the resource service quit failure (RS_Quit_Failure) is judged by a resource service detector that periodically and continuously checks the state of each resource service; the resource service overload failure (RS_Overload_Failure) is judged by evaluating the data processing capability, communication time, and execution time of RS_i to determine whether RS_i is overloaded; the resource service overload failure (RS_Overload_Failure) detection rule judges whether the inter-concept mismatch detection rule, the inter-data mismatch detection rule, the attribute mismatch detection rule, and the QoS inconsistency detection rule are satisfied;

the task related fault detection comprises task cancellation failure (Task_Cancel_Failure) detection, task suspension or hang failure (Task_Suspension_Failure) detection, and resource-task mismatch failure (Task_Resource_Mismatch_Failure) detection; the task suspension or hang failure (Task_Suspension_Failure) is judged by a task detector that periodically and continuously checks the current state of each task, according to whether the task is in the task suspended (Task_Suspended) and task terminated (Task_Terminated) queues; the resource-task mismatch failure (Task_Resource_Mismatch_Failure) is detected with a resource service matching algorithm, which determines whether a basic matching failure, an I/O matching failure, a QoS matching failure, or a comprehensive matching failure has occurred; the detection method for the task suspension or hang failure (Task_Suspension_Failure) is the same as that for the resource-task mismatch failure (Task_Resource_Mismatch_Failure);

the application related fault detection comprises trust failure (Trust_Failure) detection, application design or coding failure (App_DesignCode_Failure) detection, and access right failure (App_AccessRight_Failure) detection; the trust failure (Trust_Failure) is judged by comparing the trust value T_{x→y} between x and y, evaluated with the resource service Trust-QoS evaluation model, against the minimum trust requirement T_{x→y}^{0} of entity x for y; the application design or coding failure (App_DesignCode_Failure) and the access right failure (App_AccessRight_Failure) are detected by the system security policy or by system middleware embedded in the service manufacturing system.

2. The implement system for RSOA fault-tolerant management according to claim 1, characterized in that: in an ECA (Event-Condition-Action) rule, the event (Event) is defined as the event that triggers a rule (Rule), the condition (Condition) is defined as the condition that must be satisfied to activate the rule (Rule), and the action (Action) is the action instruction to be executed after an ECA rule is triggered; a fault occurring during RSOA is defined as the event (Event) of an ECA rule; the fault detection condition is defined as the condition (Condition) of the ECA rule; and the handling of the fault is defined as the action (Action) of the ECA rule.

3. The implement system for RSOA fault-tolerant management according to claim 2, characterized in that the handling of the fault is specifically re-scheduling or re-matching.

4. The implement system for RSOA fault-tolerant management according to claim 1, characterized in that: the event detector (Event Detector) receives the fault messages sent by the fault detection module and analyzes them to identify the event (Event) of the fault; the condition evaluator (Condition Evaluator) evaluates the condition (Condition) associated with the detected event (Event) to see whether it satisfies the condition of the corresponding ECA rule; the rule engine (Rule Engine) performs inference matching between the detected event (Event) and the rules in the ECA rule base to find a suitable rule for handling the detected fault; the action executor (Action Executor) executes the action of the selected ECA rule to handle the fault, according to the result of the Rule Engine's inference; the ECA rule manager (ECA Rule Manager) manages the ECA rules, including rule modification, addition, and deletion; and the ECA rule base (ECA Rules) stores the rules required during fault elimination.