CN102916830B - Implement system for resource service optimization allocation fault-tolerant management - Google Patents


Info

Publication number
CN102916830B
Authority
CN
China
Prior art keywords
fault
task
failure
rule
resource service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2012103356095A
Other languages
Chinese (zh)
Other versions
CN102916830A (en
Inventor
陶飞
程颖
张霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN2012103356095A priority Critical patent/CN102916830B/en
Publication of CN102916830A publication Critical patent/CN102916830A/en
Application granted granted Critical
Publication of CN102916830B publication Critical patent/CN102916830B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a system for implementing fault-tolerant management of resource service optimal allocation (RSOA). Fault-tolerant management mechanisms are designed according to the causes and types of the faults that arise during RSOA, so that each fault can be detected and eliminated. The system comprises an information service module, an RSOA module, a fault detection module, and a fault recovery module. It is modular, maintainable, and extensible; it can detect and eliminate the faults that arise during RSOA, and thereby improves the stability of the whole service manufacturing system and the reliability of RSOA. In particular, the system detects the common faults caused by virtual links, resources, tasks, and applications during RSOA in a service manufacturing system, provides a corresponding elimination strategy for each, and improves the reliability and service quality of RSOA.

Description

A system for implementing fault-tolerant management of resource service optimal allocation
Technical field
The invention belongs to the technical field of fault-tolerant management for distributed manufacturing information integration systems. Specifically, it relates to a system for implementing fault-tolerant management of resource service optimal allocation (RSOA): a fault-tolerant management framework for RSOA in service-oriented manufacturing systems, together with the corresponding fault detection methods and ECA-based (Event-Condition-Action) fault elimination mechanisms and methods. The invention can detect the common faults that occur during RSOA in a service manufacturing system and provides a corresponding elimination strategy for each, thereby improving the reliability and service quality of RSOA.
Background
In service manufacturing systems (such as cloud manufacturing (CMfg) systems, manufacturing service systems, and manufacturing grid systems), the operations involved in optimally allocating manufacturing resource services, including resource service search and matching, QoS evaluation, QoS extraction, resource service selection, and resource service composition, may fail for various reasons and cause the whole allocation to fail or become invalid. The main possible causes are:
1. The virtual link between two nodes in the service manufacturing system is disconnected, or its bandwidth drops suddenly and can no longer meet the requirements;
2. An invoked resource service fails or changes state during execution, for example it is suddenly shut down or exits, a resource service composition becomes invalid, the capability of the resource service drops suddenly, or it becomes overloaded;
3. A submitted or running task changes state, for example it is forcibly terminated by an administrator or the user, its requirements are raised, it is suspended, or it is assigned an invalid resource service;
4. Problems arise in the application process, for example insufficient trust between the two transacting parties, incorrect access rights, or unreasonable or incorrect code design.
These phenomena are collectively referred to as faults in the present invention. Once any of them occurs, resource service optimal allocation (RSOA) is suspended or fails. Therefore, to improve the reliability and service quality of RSOA, the following problems must be solved: 1. Which faults can occur during RSOA? 2. How can the faults that occur be detected? 3. How can the detected faults be analyzed and recovered from?
No related research on these problems currently exists in service manufacturing fields such as CMfg. To solve them, realize fault-tolerant management during RSOA, and improve the reliability and service quality of RSOA, the present invention first analyzes and classifies the faults that may occur during RSOA, then on this basis studies the mechanism for implementing RSOA fault-tolerant management, together with the corresponding fault detection methods and elimination strategies.
Summary of the invention
The purpose of the invention is as follows: the mechanism for implementing fault-tolerant management of resource service optimal allocation can detect the common faults produced during RSOA in a service manufacturing system and provides corresponding elimination strategies and methods for the various faults, thereby improving the reliability and service quality of RSOA in the service manufacturing system.
The technical solution adopted by the invention is a system for implementing fault-tolerant management of resource service optimal allocation (RSOA), comprising an information service module, a resource service optimal allocation module, a fault detection module, and a fault recovery module.
The information service module mainly provides information and data support for fault detection, fault recovery, and resource service optimal allocation.
The resource service optimal allocation module mainly implements functions such as resource service search, quality of service (QoS) evaluation, resource service selection, and resource service composition.
The fault detection module monitors the state of every node in the service manufacturing system and of the running tasks and resources, and performs state analysis continuously. It analyzes and collects statistics on the historical data of instances that exited normally or abnormally, makes a decision on each detected fault, and notifies the fault recovery module to handle it.
The fault recovery module is an ECA-based (Event-Condition-Action) RSOA fault elimination module that combines multiple fault-tolerance mechanisms. It mainly comprises an event detector (Event Detector), a condition evaluator (Condition Evaluator), an action executor (Action Executor), a rule reasoning engine (Rule Engine), an ECA rule base (ECA Rules), and an ECA rule manager (ECA Rule Manager).
In the fault detection module, fault detection covers faults related to virtual links (VL), resource services (RS), tasks (Task), and applications.
Virtual-link-related fault detection mainly comprises detection of the virtual link disconnection fault (VL_Disconnect_Failure) and of the insufficient bandwidth fault (Bandwidth_Failure). VL_Disconnect_Failure can usually be detected by the system security policy or by middleware embedded in the service manufacturing system. Whether a fault has occurred between two entities because of bandwidth is judged by two indicators: communication time, and successful communication rate (reliability).
Resource-service-related fault detection mainly comprises detection of the resource service quit fault (RS_Quit_Failure), the resource service overload fault (RS_Overload_Failure), and the resource service composition fault (RS_Composition_Failure). RS_Quit_Failure is judged by a resource service detector that checks the state of each resource service periodically and continuously. RS_Overload_Failure is judged by evaluating the data-handling capacity, communication time, and execution time of $RS_i$ to determine whether $RS_i$ is overloaded. RS_Composition_Failure is judged by checking whether any of the following detection rules is satisfied: the rules for mismatch between concepts, the rules for mismatch between data, the attribute mismatch rule, and the QoS inconsistency rule.
Task-related fault detection mainly comprises detection of the task cancellation fault (Task_Cancel_Failure), the task suspension fault (Task_Suspension_Failure), and the task-resource matching failure (Task_Resource_Mismatch_Failure). For Task_Cancel_Failure and Task_Suspension_Failure, a task detector checks the current state of each task periodically and continuously, and the judgment is made according to whether the task is in the task suspended (Task_Suspended) queue or the task terminated (Task_Terminated) queue. For Task_Resource_Mismatch_Failure, a resource service matching algorithm determines whether a basic matching fault, an I/O matching fault, a QoS matching fault, or a comprehensive matching fault has occurred. The detection method for Task_Suspension_Failure is stated to be the same as that for Task_Resource_Mismatch_Failure.
Application-related fault detection mainly comprises detection of the trust fault (Trust_Failure), the application design or coding fault (App_DesignCode_Failure), and the access rights fault (App_AccessRight_Failure). Trust_Failure is judged by comparing the trust value $T_{x \to y}$ between x and y, evaluated with the resource service Trust-QoS evaluation model, against the minimum trust $T^{\circ}_{x \to y}$ that entity x requires of y. App_DesignCode_Failure and App_AccessRight_Failure are mainly detected by the system security policy or by system middleware embedded in the service manufacturing system.
In the ECA (Event-Condition-Action) rules, the event (Event) is defined as the event that triggers a rule (Rule), the condition (Condition) as the condition that must be satisfied to activate the rule, and the action (Action) as the operation executed once the ECA rule has been triggered. A fault occurring during RSOA is defined as the Event of an ECA rule; the fault detection condition is defined as the Condition; and the handling applied to the fault is defined as the Action.
The handling applied to a fault is specifically rescheduling or rematching.
The event detector (Event Detector) mainly receives the fault messages sent by the fault detection module and analyzes them to identify the fault event (Event). The condition evaluator (Condition Evaluator) evaluates the condition (Condition) associated with the detected event to see whether it satisfies the condition of the corresponding ECA rule. The rule reasoning engine (Rule Engine) matches the detected event against the rules in the ECA rule base by inference to find a suitable rule for handling the detected fault. The action executor (Action Executor) executes the action of the selected ECA rule, according to the result of the Rule Engine's inference, to handle the fault. The ECA rule manager (ECA Rule Manager) manages the ECA rules, including modifying, adding, and deleting rules. The ECA rule base (ECA Rules) stores the rules required during fault elimination.
Compared with the prior art, the invention has the following advantages:
(1) The fault-tolerant management mechanism is designed specifically according to the causes and types of the faults that arise during resource service optimal allocation (RSOA), and implements the corresponding fault detection and elimination. The invention can detect the common faults caused by virtual links, resources, tasks, and applications during RSOA in a service manufacturing system and provides a corresponding elimination strategy for each, which improves the reliability and service quality of RSOA.
(2) The invention includes an RSOA fault-tolerant management framework together with the corresponding fault detection methods and ECA-based (Event-Condition-Action) elimination mechanisms and methods. It is applicable to distributed, networked service manufacturing systems; it is dynamic, modular, maintainable, and extensible; and it can detect and eliminate the various faults that arise during RSOA, improving the stability of the whole service manufacturing system and the reliability of RSOA.
Brief description of the drawings
Fig. 1 shows the RSOA fault-tolerant management architecture;
Fig. 2 shows ECA-based fault recovery;
Fig. 3 is the detection flow chart for Task_Resource_Mismatch_Failure;
Fig. 4 is the detection flow chart for Trust_Failure;
Table 1 lists some of the ECA rules for RSOA fault-tolerant management.
Detailed description
The invention is described in further detail below with reference to the accompanying drawings.
The invention relates to a mechanism and method for implementing fault-tolerant management of resource service optimal allocation. By analyzing and classifying the faults that may occur during RSOA, it derives an RSOA fault-tolerant management architecture and the corresponding concrete fault detection methods and elimination strategies.
Resource service optimal allocation is said to have failed if and only if at least one of the following two situations occurs: 1. a resource crashes and stops providing service; 2. the availability of a resource falls below the minimum QoS standard of the task. In practice, the fault types of RSOA in service manufacturing systems such as cloud manufacturing are varied; the common faults mainly relate to four factors: virtual links, resources, tasks, and applications.
(1) Virtual-link-related faults
A virtual link (VL) is a connection, in the broad sense, between two entities in the service manufacturing system. The faults caused by a VL are mainly the virtual link disconnection fault and the insufficient bandwidth fault.
(2) Resource-service-related faults
A resource service is the carrier that executes tasks. Therefore the exit or overload of a resource service, a change in its QoS or capability, or the composition of resource services can all cause an RSOA fault. The faults caused by resource services are mainly: the resource service quit fault, the resource service overload (or saturation) fault, the resource service composition fault, and faults caused by a change in resource service capability.
The resource service composition fault mainly covers: mismatch between resource service concepts, mismatch between data, attribute mismatch, and QoS inconsistency.
(3) Task-related faults
During RSOA, a task may be cancelled or suspended for various reasons, causing the allocation to fail. The RSOA faults caused by tasks are mainly: the task cancellation fault, the task suspension fault, the task-resource matching failure, and faults caused by a change in task requirements.
(4) Application-related faults
In the application process, RSOA may fail for reasons such as trust, access rights, or coding, giving rise to the trust fault, the application design or coding fault, and the access rights fault.
During RSOA, the four classes of faults above can reduce the efficiency and service quality of the whole RSOA system (RSOAS). To support fault tolerance during RSOA, and building on the RSOAS framework, the invention proposes the RSOA fault-tolerant management architecture shown in Fig. 1.
The RSOAS fault-tolerance architecture consists of four parts: the information service module, the resource service optimal allocation module, the fault detection module, and the fault recovery module. To make the RSOAS fault tolerant, the key is to solve fault detection and fault elimination.
The invention relates to a mechanism and method for implementing fault-tolerant management of RSOA, including an RSOA fault-tolerant management framework and the corresponding fault detection methods and ECA-based elimination mechanisms and methods.
Of the four parts of the architecture in Fig. 1, the fault detection module and the fault recovery module are the key content of the invention.
(1) Information service module
The information service module mainly provides the directory information service (IIS), the Resource Information Service (RIS), resource service encapsulation, and QoS database information and data support for fault detection, fault recovery, and resource service optimal allocation.
The directory information service (IIS) organizes information, supports aggregate information queries and efficient queries across multiple RISs, and provides information indexing and search for the whole service-oriented manufacturing system. The IIS consists of three parts: generic registration handling, a pluggable directory structure, and search handling.
The Resource Information Service (RIS) runs on the resource side and provides a uniform means of querying the configuration, capability, and state of resources on the system platform; it can also be configured to register itself with an aggregate directory service. After authenticating and parsing the incoming requirements and tasks, the RIS dispatches the query request to one or more information providers according to the type of the requested information, and then passes the resource information they return to the user.
The resource service encapsulation template enables the platform to manage effectively the information of the nodes that take part in collaborative manufacturing. Resources are classified and encapsulated according to resource attributes (such as physical characteristics, geographic location, dynamic behavior, sensitivity, and function), customer requirements (such as time, quality, price, and service), and mode of use (such as discovery, proxy, monitoring, and diagnosis). After a resource provider registers a resource on the platform, the resource is encapsulated into a resource service class template; when a client requests a resource service, it downloads the corresponding encapsulation template from the system platform and instantiates it for the specific task.
The QoS database supplies the QoS information of the relevant resource services. The corresponding QoS indicator parameters are evaluated and measured, and QoS comparisons are made, to provide information and data support for subsequent resource service selection and composition.
(2) Resource service optimal allocation module
The resource service optimal allocation module mainly provides functions such as resource service search, QoS evaluation, resource service selection, and resource service composition.
Resource service search provides the various resource service information matching algorithms. According to the resource service requirements of the subtasks obtained by task decomposition, it finds the qualifying resource services in the resource service library and generates a candidate resource service set (RSS).
QoS evaluation is applied to the large candidate sets returned by the search that meet the user's requirements. Its purpose is to give the user and the system a quantitative basis for selecting the best resource service and performing optimal allocation, and it is an important step in RSOA. The QoS information of the resource services in the RSS is extracted from the resource service descriptions (OWL-S/WSDL) registered in the service library or from the QoS database. The corresponding QoS indicator parameters are evaluated and measured, and QoS comparisons are made, to provide information and data support for subsequent resource service selection and composition.
Resource service selection: if the task submitted by the user requires a single resource service, the candidates in the RSS are evaluated comprehensively and ranked according to the required QoS parameters, and the best resource service is selected to execute the task.
Resource service composition and selection: if the user's task requires multiple resource services, one resource service is chosen from each RSS, the chosen services are composed in a certain order into a composite resource service, and the best of all possible compositions is selected to execute the task.
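The selection and composition steps can be illustrated with a minimal Python sketch. The text does not specify how QoS is scored or aggregated, so the weighted-sum score, the additive score of a composition, and all names (`qos_score`, `select_best`, `best_composition`) are assumptions made purely for illustration.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

QoS = Dict[str, float]
Candidate = Tuple[str, QoS]  # (resource service name, its QoS values)


def qos_score(qos: QoS, weights: Dict[str, float]) -> float:
    """Assumed scoring: weighted sum of QoS indicators, higher is better."""
    return sum(weights[key] * qos[key] for key in weights)


def select_best(rss: Sequence[Candidate], weights: Dict[str, float]) -> str:
    """Single-service task: rank one candidate set and pick the best."""
    return max(rss, key=lambda item: qos_score(item[1], weights))[0]


def best_composition(
    rss_list: Sequence[Sequence[Candidate]], weights: Dict[str, float]
) -> Tuple[str, ...]:
    """Multi-service task: take one candidate from each RSS, in order,
    and keep the combination with the highest total score (assumed)."""
    best = max(
        product(*rss_list),
        key=lambda combo: sum(qos_score(q, weights) for _, q in combo),
    )
    return tuple(name for name, _ in best)
```

With an additive score the exhaustive search reduces to picking the best candidate per set; the enumeration is kept because other aggregation rules need not decompose this way.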
(3) Fault detection module
The fault detection module monitors the state of every node in the service manufacturing system and of the running tasks and resources, and performs state analysis continuously. Local detectors monitor the allocation workflow and the performance and running status of the resources and tasks involved, and provide a series of management services such as task state management and resource service state management. The module analyzes and collects statistics on the historical data of instances that exited normally or abnormally, makes a decision on each detected fault, and notifies the fault recovery module to handle it.
The concrete detection methods for the four classes of faults, related to virtual links, resource services, tasks, and applications, are described in detail below.
(1) Virtual-link-related fault detection
1) VL_Disconnect_Failure detection
This fault can usually be detected by the system security policy or by middleware embedded in the service manufacturing system; for example, the GRAM service provided by Globus can be used to detect VL_Disconnect_Failure.
2) Bandwidth_Failure detection
Whether a fault has occurred between two entities in the system (denoted A and B) because of bandwidth is judged by two indicators: communication time (CT), and successful communication rate (PSC), also called reliability.
A) Judgment by communication time
Let VL(A, B) denote the virtual link between A and B, SumInfor(A, B) the total amount of information exchanged over VL(A, B), V(A, B) the transmission speed (bandwidth), and Waite(A, B) the waiting time. The total communication time, denoted $T_c(A, B)$, is the sum of the transmission time and the waiting time.
B) Judgment by successful communication rate (PSC), or reliability
Let the failure rates of the virtual link VL(A, B), node A, and node B be α(A, B), α(A), and α(B), respectively. The reliability $S_c(A, B)$ of VL(A, B) then follows from the definition of reliability.
Let the user's minimum CT and PSC requirements be given by [formula image BDA00002124510900071] and [formula image BDA00002124510900072], respectively. When the virtual link VL(A, B) satisfies [formula image BDA00002124510900073] or [formula image BDA00002124510900074], the system determines that a Bandwidth_Failure has occurred.
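The bandwidth check can be sketched in Python as follows. The formula images are not available in this text, so three things here are assumptions rather than the patent's own formulas: the transmission time is taken as SumInfor(A, B) / V(A, B), the reliability is modelled as a series system of the link and its two nodes, and the failure test is "communication time above its limit or reliability below its minimum". The names `VirtualLink`, `ct_limit`, and `psc_min` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VirtualLink:
    sum_infor: float   # SumInfor(A, B): total information exchanged
    bandwidth: float   # V(A, B): transmission speed
    wait: float        # Waite(A, B): waiting time
    alpha_link: float  # failure rate of VL(A, B)
    alpha_a: float     # failure rate of node A
    alpha_b: float     # failure rate of node B


def communication_time(vl: VirtualLink) -> float:
    """T_c(A, B): transmission time plus waiting time."""
    return vl.sum_infor / vl.bandwidth + vl.wait


def reliability(vl: VirtualLink) -> float:
    """S_c(A, B) under an assumed series model: link and both nodes up."""
    return (1 - vl.alpha_link) * (1 - vl.alpha_a) * (1 - vl.alpha_b)


def bandwidth_failure(vl: VirtualLink, ct_limit: float, psc_min: float) -> bool:
    """Assumed test: the link is too slow or not reliable enough."""
    return communication_time(vl) > ct_limit or reliability(vl) < psc_min
```

For example, a link carrying 100 units at speed 10 with a waiting time of 2 has a communication time of 12, so it passes a limit of 15 but fails a limit of 10.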
(2) Resource-service-related fault detection
1) RS_Quit_Failure detection
To detect whether an RS_Quit_Failure has occurred during resource service optimal allocation, the resource service detector checks the state of each resource service periodically and continuously. If a resource service does not respond, the system determines that an RS_Quit_Failure has occurred.
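One pass of such a detector might look like the following Python sketch. The probe interface (a callable per resource service that returns True when the service responds) is a hypothetical stand-in for whatever status check the platform provides; a real detector would call this function on a timer.

```python
from typing import Callable, Dict, List


def detect_quit_failures(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the resource services that did not respond in this pass.

    A probe that raises is treated the same as one that returns False.
    """
    failed = []
    for name, probe in probes.items():
        try:
            alive = probe()
        except Exception:
            alive = False
        if not alive:
            failed.append(name)  # report as RS_Quit_Failure
    return failed
```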
2) RS_Overload_Failure detection
Whether $RS_i$ is overloaded, that is, whether the system has produced an RS_Overload_Failure, is judged by evaluating the data-handling capacity (DC), communication time (CT), and execution time (ET) of $RS_i$.
Let the set of tasks assigned to $RS_i$ within a certain time period be $\Gamma_i = \{Task_1, Task_2, \ldots, Task_j, \ldots, Task_k\}$, where the amount of $RS_i$ required by $Task_j$ is [formula image BDA00002124510900075]. Let $d_{i,j}$ be the amount of data access required for $Task_j$ to invoke $RS_i$, $V(i, j)$ the bandwidth of the virtual link between $Task_j$ and $RS_i$, and $et_j$ the execution time $RS_i$ needs to execute $Task_j$. During operation, the ET, DC, and CT of $RS_i$ are computed as [formula image BDA00002124510900076].
Let the upper limits on the ET, DC, and CT of $RS_i$ be $\mathrm{Lim}^{ET}_{RS_i}$, $\mathrm{Lim}^{DC}_{RS_i}$, and $\mathrm{Lim}^{CT}_{RS_i}$, respectively. When the system detects that $RS_i$ satisfies any one of $ET_{RS_i} > \mathrm{Lim}^{ET}_{RS_i}$, $CT_{RS_i} > \mathrm{Lim}^{CT}_{RS_i}$, or $DC_{RS_i} > \mathrm{Lim}^{DC}_{RS_i}$, it determines that an RS_Overload_Failure has occurred.
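The overload test can be sketched as follows. The formulas for ET, DC, and CT are in an image that is not available here, so the aggregation below (ET as the sum of $et_j$, DC as the sum of $d_{i,j}$, CT as the sum of $d_{i,j} / V(i, j)$ over the assigned tasks) is an assumption, as are the names `TaskLoad` and `overload_failure`.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskLoad:
    data_volume: float  # d_{i,j}: data access needed to invoke RS_i
    bandwidth: float    # V(i, j): link bandwidth between Task_j and RS_i
    exec_time: float    # et_j: time RS_i needs to execute Task_j


def overload_failure(
    tasks: Sequence[TaskLoad], lim_et: float, lim_dc: float, lim_ct: float
) -> bool:
    """True if any aggregate exceeds its upper limit (sums are assumed)."""
    et = sum(t.exec_time for t in tasks)
    dc = sum(t.data_volume for t in tasks)
    ct = sum(t.data_volume / t.bandwidth for t in tasks)
    return et > lim_et or dc > lim_dc or ct > lim_ct
```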
3) RS_Composition_Failure detection
RS_Composition_Failure mainly covers four cases: mismatch between concepts, mismatch between data, attribute mismatch, and QoS inconsistency.
A) Detection rules for mismatch between concepts
(1) If $RS_i$ is a subclass of $RS_k$ and $RS_k$ is not contained in $RS_j$, then there is a gap between $RS_i$ and $RS_j$. This is the detection rule for a mismatch between resource service concepts in which a gap exists between the concepts. (2) If $RS_i$ is a subclass of $RS_k$ and $RS_k$ is a subclass of $RS_j$, then $RS_i$ is a subclass of $RS_j$. This is the detection rule for a mismatch between resource service concepts in which $RS_i$ is a subclass of $RS_j$.
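The two concept rules amount to a reachability test in a class hierarchy. The sketch below assumes a single-inheritance hierarchy stored as a child-to-parent mapping, which is a simplification; the concept names in the usage note are hypothetical.

```python
from typing import Dict, Optional


def is_subclass(child: str, ancestor: str, parent_of: Dict[str, Optional[str]]) -> bool:
    """Walk parent links upward; True if `ancestor` is reached."""
    seen = set()
    current = parent_of.get(child)
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)  # guards against cycles in malformed data
        current = parent_of.get(current)
    return False


def concept_relation(rs_i: str, rs_j: str, parent_of: Dict[str, Optional[str]]) -> str:
    """Classify RS_i against RS_j as 'exact', 'subclass' (rule 2), or 'gap' (rule 1)."""
    if rs_i == rs_j:
        return "exact"
    if is_subclass(rs_i, rs_j, parent_of):
        return "subclass"
    return "gap"
```

For instance, with `Lathe` under `MachineTool` under `Equipment`, `Lathe` is reported as a subclass of `Equipment` and as having a gap to an unrelated concept such as `Software`.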
B) Detection rules for mismatch between data
(1) If DUnitTransfer($RS_i$) equals $RS_j$, then the same parameter of $RS_i$ and $RS_j$ has the same data type but a different unit, where DUnitTransfer() is the data unit conversion function. (2) If DTypeTransfer($RS_i$) equals $RS_j$, then $RS_i$ and $RS_j$ have the same parameter concept but different data types, where DTypeTransfer() is the data type conversion function.
C) Detection rule for attribute mismatch
If $RS_j$ requires more attribute parameters than $RS_i$ can provide, and the intersection of $RS_i$ and [formula image BDA00002124510900081] is not empty, then the attributes of $RS_i$ cannot meet the requirements of $RS_j$, where [formula image BDA00002124510900082] is the split function.
The rules above are only some of the detection rules for resource service composition; in practice, new rules can be designed and added as needed.
D) Detection rule for QoS inconsistency
Let [formula missing in source] and [formula image BDA00002124510900084] be the numbers of parameters of $RS_i$ and $RS_j$, respectively. If analysis shows that the QoS of two adjacent resource services $RS_i$ and $RS_j$ in a composite service is consistent, the composite service is valid; otherwise the system determines that an RS_Composition_Failure has occurred.
(3) Task-related fault detection
1) Task_Cancel_Failure and Task_Suspension_Failure detection
To detect whether a Task_Cancel_Failure or Task_Suspension_Failure has occurred during resource service optimal allocation, the task detector checks the current state of each task periodically and continuously. If a task is in the Task_Suspended queue, the system determines that a Task_Suspension_Failure has occurred. If it is in the Task_Terminated queue, the system determines that a Task_Cancel_Failure has occurred.
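The queue-based classification can be written as a small Python function. Representing the Task_Suspended and Task_Terminated queues as collections of task identifiers is an assumption; the function name is hypothetical.

```python
from typing import Iterable, Optional


def classify_task_fault(
    task_id: str, suspended: Iterable[str], terminated: Iterable[str]
) -> Optional[str]:
    """Map a task's queue membership to a fault name, or None if healthy."""
    if task_id in set(suspended):
        return "Task_Suspension_Failure"
    if task_id in set(terminated):
        return "Task_Cancel_Failure"
    return None
```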
2) Task_Resource_Mismatch_Failure detection
Suppose resource service $RS_i$ is allocated to execute $Task_j$. Following the resource service matching algorithm, let $\zeta_{bas}$, $\zeta_{I/O}$, $\zeta_{QoS}$, and $\zeta$ be the basic matching threshold, I/O matching threshold, QoS matching threshold, and comprehensive matching threshold set by the system or the user. Then:
(1) If the basic matching value of $RS_i$ and $Task_j$ is less than the basic matching threshold $\zeta_{bas}$, the system determines that a basic matching fault has occurred;
(2) If the I/O matching value of $RS_i$ and $Task_j$ is less than the I/O matching threshold $\zeta_{I/O}$, the system determines that an I/O matching fault has occurred;
(3) If the QoS matching value of $RS_i$ and $Task_j$ is less than the QoS matching threshold $\zeta_{QoS}$, the system determines that a QoS matching fault has occurred;
(4) If the final matching value of $RS_i$ and $Task_j$ is less than the comprehensive matching threshold $\zeta$, the system determines that a comprehensive matching fault has occurred.
The detection flow for Task_Resource_Mismatch_Failure is shown in Fig. 3. Task_RequireChange_Failure is detected in the same way as Task_Resource_Mismatch_Failure.
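The four threshold comparisons can be sketched in Python as below. How the matching values themselves are computed is left to the resource service matching algorithm, which this text does not give; the dictionary keys (`basic`, `io`, `qos`, `overall`) are hypothetical names for the four values and thresholds.

```python
from typing import Dict, List

# The four checks, in the order listed in the text.
CHECKS = [
    ("basic", "basic matching fault"),
    ("io", "I/O matching fault"),
    ("qos", "QoS matching fault"),
    ("overall", "comprehensive matching fault"),
]


def mismatch_faults(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return every matching fault whose value is below its threshold."""
    return [label for key, label in CHECKS if scores[key] < thresholds[key]]
```

An empty result means that no Task_Resource_Mismatch_Failure was detected for the pair.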
(4) Application-related fault detection
1) Trust_Failure detection
Let the two entities taking part in a transaction in RSOA be x and y. During allocation, the trust value $T_{x \to y}$ between x and y can be evaluated with the resource service Trust-QoS evaluation model. If the minimum trust that entity x requires of y is $T^{\circ}_{x \to y}$, then when $T_{x \to y} < T^{\circ}_{x \to y}$ the system determines that a Trust_Failure has occurred, as shown in Fig. 4.
2) App_DesignCode_Failure and App_AccessRight_Failure detection
As with VL_Disconnect_Failure, App_DesignCode_Failure and App_AccessRight_Failure are mainly detected by the system security policy or by system middleware embedded in the service manufacturing system, chiefly using the relevant services or middleware provided by Globus.
(4) fault recovery module
When occurring and fault detected, must be repaired it.Current failure tolerant mechanism mainly contains following several:
1) task based on the checkpoint strategy is fault-tolerant: system is passed through periodically Checkpointing, correct status when program is moved is saved in reliable memory equipment, when breaking down, return to nearest state and resume operation, thereby at utmost reducing the loss that barrier brings for some reason.
2) the task fault-tolerant strategy based on retry: distribute rationally in running in resource service; if the operation of breaking down has been carried out or do not have the operation of carrying out not ignore; system can attempt re-executing this operation in the situation that do not change execution route; retry is to the constraint of maximum number of repetitions, if repeatedly the execute exception activity until maximum times still be not resolved stop repetitive operation.
3) the task fault-tolerant strategy based on backup: its thought is that a task is carried out to copy backup on different resources, so long as not all backups, all makes mistakes, and task is final just can successful operation.
4) fault-tolerant strategy based on alternative: when task breaks down, the task of having identical function by the operation another one substitutes.
5) task based on redundancy is fault-tolerant: its thought is to select a plurality of different executed activity or the paths that can realize task, although different execution features is arranged, the function of these activities or execution route is identical.
6) based on self-defined abnormal fault-tolerant strategy: user-defined abnormal permission user defines various abnormality eliminating methods for special duty.If break down in running, activate and be defined in the abnormality eliminating method on this task.
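The retry-based and alternative-based strategies in the list above can be combined as in the following sketch. It is only an illustration under stated assumptions: the names `run_with_retry`, `run_with_alternatives`, and `TaskFailed` are hypothetical, tasks are modeled as plain Python callables, and any exception is treated as a fault.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class TaskFailed(Exception):
    """Raised when a task (and any alternatives) could not be completed."""


def run_with_retry(task: Callable[[], T], max_retries: int) -> T:
    """Re-run the same task, without changing the execution path, until it
    succeeds or the maximum number of repetitions is reached."""
    last_error: Optional[Exception] = None
    for _ in range(1 + max_retries):  # one initial attempt plus the retries
        try:
            return task()
        except Exception as err:  # a real system would catch specific faults
            last_error = err
    raise TaskFailed(f"gave up after {max_retries} retries") from last_error


def run_with_alternatives(tasks: Sequence[Callable[[], T]], max_retries: int) -> T:
    """Try the primary task first, then functionally equivalent alternatives,
    applying the retry policy to each in turn."""
    for task in tasks:
        try:
            return run_with_retry(task, max_retries)
        except TaskFailed:
            continue  # fall through to the next alternative
    raise TaskFailed("primary task and all alternatives failed")
```

Bounding the retries keeps a persistent fault from blocking the allocation indefinitely; once the budget is spent, control passes to an alternative task with the same function.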
In addition to combining the fault-tolerance mechanisms above, the present invention uses event-condition-action (ECA) rules to support RSOA fault-tolerant management. A fault occurring in the RSOA process is defined as the Event of an ECA rule; the fault detection condition is defined as the Condition of the ECA rule; and the handling of the fault (such as rescheduling or rematching) is defined as the Action of the ECA rule.
With reference to typical ECA rules, the present invention designs the ECA-based resource service optimal allocation fault resolution module shown in Fig. 2. It mainly comprises an event detector, a condition evaluator, an action executor, a rule engine, an ECA rule library, and an ECA rule manager.
1) Event Detector: receives the fault messages sent by the fault detection module and analyzes the Event of the detected fault.
2) Condition Evaluator: evaluates the Condition associated with the detected Event to see whether it satisfies the condition of the corresponding ECA rule.
3) Rule Engine: matches the detected Event against the rules in the ECA rule library by reasoning, and finds a suitable rule to handle the detected fault.
4) Action Executor: executes the action of the selected ECA rule, according to the result of the Rule Engine's reasoning, to handle the fault.
5) ECARules: the ECA rule library.
6) ECARule Manager: manages the ECA rules, including modifying, adding, and deleting rules.
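The interaction of these six components can be sketched as follows. This is a simplified illustration, not the patent's implementation: all class and function names are hypothetical, the condition evaluation is folded into the rule engine's matching step, and the example rule is invented for demonstration because the actual rules of Table 1 are not reproduced in the text.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class FaultEvent:
    """Fault message as received by the event detector."""
    name: str                                   # e.g. "Trust_Failure"
    context: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ECARule:
    event: str                                  # Event: fault that triggers the rule
    condition: Callable[[FaultEvent], bool]     # Condition: fault detection condition
    action: Callable[[FaultEvent], str]         # Action: handling of the fault


class ECARuleManager:
    """Owns the rule library and supports adding and deleting rules."""

    def __init__(self) -> None:
        self.rules: List[ECARule] = []

    def add(self, rule: ECARule) -> None:
        self.rules.append(rule)

    def remove(self, rule: ECARule) -> None:
        self.rules.remove(rule)


class RuleEngine:
    """Finds the first rule whose event matches and whose condition holds."""

    def __init__(self, manager: ECARuleManager) -> None:
        self.manager = manager

    def match(self, event: FaultEvent) -> Optional[ECARule]:
        for rule in self.manager.rules:
            if rule.event == event.name and rule.condition(event):
                return rule
        return None


def handle_fault(engine: RuleEngine, event: FaultEvent) -> Optional[str]:
    """Action executor: run the selected rule's action, or do nothing."""
    rule = engine.match(event)
    return rule.action(event) if rule is not None else None


# Illustrative rule only: on a trust failure, request rematching.
manager = ECARuleManager()
manager.add(ECARule(
    event="Trust_Failure",
    condition=lambda e: e.context["trust"] < e.context["min_trust"],
    action=lambda e: "rematch",
))
engine = RuleEngine(manager)
result = handle_fault(engine, FaultEvent("Trust_Failure", {"trust": 0.4, "min_trust": 0.6}))
```

Keeping rules as data in a library, separate from the engine, is what allows new fault types to be supported by adding rules rather than changing the recovery code.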
In the proposed fault-tolerance mechanism for resource service optimal allocation in a service manufacturing system, ECA rules directly support fault recovery. For the faults and detection methods given above, the present invention designs the ECA rules shown in Table 1 to support fault recovery during resource service optimal allocation in service manufacturing systems such as CMfg.
The fault resolution rules listed in Table 1 are only part of the ECA rule library. In practical applications, new rules can be designed as needed and added to the ECA rule library through the ECARule Manager.
Table 1
[Table 1 is reproduced as images in the original publication: Figure BDA00002124510900101 and Figure BDA00002124510900111.]

Claims (4)

1. A system for implementing fault-tolerant management of resource service optimal allocation, characterized in that the system comprises an information service module, a resource service optimal allocation module, a fault detection module, and a fault recovery module;
the information service module provides information and data support for fault detection, fault recovery, and resource service optimal allocation;
the resource service optimal allocation module implements the functions of resource service search, quality of service (QoS) evaluation, resource service selection, and resource service composition;
the fault detection module monitors the state of each node in the service manufacturing system and the running tasks and resources, monitoring at all times and performing state analysis; it analyzes and compiles statistics on the historical data of instances that exited normally or abnormally, makes decisions, and notifies the fault recovery module to handle the detected faults;
the fault recovery module is an ECA (Event-Condition-Action) based resource service optimal allocation fault resolution module that combines multiple fault-tolerance mechanisms, and comprises an event detector (Event Detector), a condition evaluator (Condition Detector), an action executor (Action Executor), a rule reasoning engine (Rule Engine), an ECA rule library (ECARules), and an ECA rule manager (ECA Rule manager);
in the fault detection module, fault detection comprises virtual link (VL) related fault detection, resource service (RS) related fault detection, task (Task) related fault detection, and application related fault detection; virtual link related fault detection comprises virtual link failure (VL_Disconnect_Failure) detection and insufficient bandwidth failure (Bandwidth_Failure) detection; a virtual link failure (VL_Disconnect_Failure) can usually be detected by the system security policy or by middleware embedded in the service manufacturing system; whether a failure has arisen between two entities because of bandwidth is judged using two indices, namely communication time and the successful communication rate or reliability; resource service related fault detection comprises resource service quit failure (RS_Quit_Failure) detection, resource service overload failure (RS_Overload_Failure) detection, and resource service composition failure (RS_Composition_Failure) detection; a resource service quit failure (RS_Quit_Failure) is judged by a resource service detector that periodically and continuously checks the state of each resource service; a resource service overload failure (RS_Overload_Failure) is judged by assessing the data-handling capacity, communication time, and execution time of $RS_i$ to determine whether $RS_i$ is overloaded; a resource service overload failure (RS_Overload_Failure) is judged by checking whether the inter-concept mismatch detection rule, the inter-data mismatch detection rule, the attribute mismatch detection rule, and the QoS inconsistency detection rule are satisfied; task related fault detection comprises task cancel failure (Task_Cancel_Failure) and task suspension failure (Task_Suspension_Failure) detection, and resource-task mismatch failure (Task_Resource_Mismatch_Failure) detection; a task suspension failure (Task_Suspension_Failure) is judged by a task detector that periodically and continuously checks the current state of each task and whether the task is in the task suspended (Task_Suspended) queue and the task terminated (Task_Terminated) queue; a resource-task mismatch failure (Task_Resource_Mismatch_Failure) uses a resource service matching algorithm to determine whether a basic matching failure, an I/O matching failure, a QoS matching failure, or a comprehensive matching failure has occurred; the detection method for the task suspension failure (Task_Suspension_Failure) is the same as that for the resource-task mismatch failure (Task_Resource_Mismatch_Failure); application related fault detection comprises trust failure (Trust_Failure) detection, and application design or coding failure (App_DesignCode_Failure) and access right failure (App_AccessRight_Failure) detection; a trust failure (Trust_Failure) is judged by comparing the trust value $T_{x \to y}$ between x and y, assessed with the resource service Trust-QoS assessment model, with the minimum degree of trust $T^{\circ}_{x \to y}$ that entity x requires of y; application design or coding failures (App_DesignCode_Failure) and access right failures (App_AccessRight_Failure) are detected by the system security policy or by middleware embedded in the service manufacturing system.
2. The system for implementing fault-tolerant management of resource service optimal allocation according to claim 1, characterized in that: in an ECA (Event-Condition-Action) rule, the event (Event) is defined as the event that triggers a rule (Rule), the condition (Condition) is defined as the condition that must be satisfied to activate the rule (Rule), and the action (Action) is the action command to be executed after an ECA rule is triggered; a fault occurring in the RSOA process is defined as the event (Event) of an ECA rule; the fault detection condition is defined as the condition (Condition) of the ECA rule; and the handling of the fault is defined as the action (Action) of the ECA rule.
3. The system for implementing fault-tolerant management of resource service optimal allocation according to claim 2, characterized in that: the handling of the fault is specifically rescheduling or rematching.
4. The system for implementing fault-tolerant management of resource service optimal allocation according to claim 1, characterized in that: the event detector (Event Detector) receives the fault messages sent by the fault detection module and analyzes the event (Event) of the detected fault; the condition evaluator (Condition Evaluator) evaluates the condition (Condition) associated with the detected event (Event) to see whether it satisfies the condition of the corresponding ECA rule; the rule reasoning engine (Rule Engine) matches the detected event (Event) against the rules in the ECA rule library by reasoning, and finds a suitable rule to handle the detected fault; the action executor (Action Executor) executes the action of the selected ECA rule, according to the result of the Rule Engine's reasoning, to handle the fault; the ECA rule manager (ECA Rule Manager) manages the ECA rules, including modifying, adding, and deleting rules; and the ECA rule library (ECA Rules) stores the various rules required in the fault resolution process.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012103356095A CN102916830B (en) 2012-09-11 2012-09-11 Implement system for resource service optimization allocation fault-tolerant management

Publications (2)

Publication Number Publication Date
CN102916830A CN102916830A (en) 2013-02-06
CN102916830B true CN102916830B (en) 2013-12-11

Family

ID=47615068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012103356095A Expired - Fee Related CN102916830B (en) 2012-09-11 2012-09-11 Implement system for resource service optimization allocation fault-tolerant management

Country Status (1)

Country Link
CN (1) CN102916830B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106341281A (en) * 2016-11-10 2017-01-18 福州智永信息科技有限公司 Distributed fault detection and recovery method of linux server
CN107040406B (en) * 2017-03-14 2020-08-11 西安电子科技大学 End cloud cooperative computing system and fault-tolerant method thereof
CN108289034B (en) * 2017-06-21 2019-04-09 新华三大数据技术有限公司 A kind of fault discovery method and apparatus
CN108021827A (en) * 2017-12-07 2018-05-11 中科开元信息技术(北京)有限公司 A kind of method and system based on area mechanism structure security system
CN114296983B (en) * 2021-12-30 2022-08-12 重庆允成互联网科技有限公司 Trigger operation record-based flow exception handling method and storage medium
CN114580911B (en) * 2022-03-04 2023-07-25 重庆大学 Site-factory mixed service and resource scheduling method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645036B (en) * 2009-09-11 2011-06-01 兰雨晴 Method for automatically distributing test tasks based on capability level of test executor
CN101958917B (en) * 2010-03-24 2013-02-06 北京航空航天大学 Cloud manufacturing system-oriented method for measuring and enhancing flexibility of resource service composition
CN102624870A (en) * 2012-02-01 2012-08-01 北京航空航天大学 Intelligent optimization algorithm based cloud manufacturing computing resource reconfigurable collocation method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131211

Termination date: 20190911
