CN102916830A - Implement system for resource service optimization allocation fault-tolerant management - Google Patents

Implement system for resource service optimization allocation fault-tolerant management

Publication number
CN102916830A
Authority
CN
China
Prior art keywords
fault
task
failure
rule
resource service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012103356095A
Other languages
Chinese (zh)
Other versions
CN102916830B (en)
Inventor
陶飞
程颖
张霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN2012103356095A
Publication of CN102916830A
Application granted
Publication of CN102916830B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The invention relates to an implementation system for fault-tolerant management of resource service optimal allocation (RSOA). Fault-tolerant management mechanisms are designed according to the causes and types of the faults that arise during RSOA, so that the corresponding faults can be detected and eliminated. The system comprises an information service module, an RSOA module, a fault detection module and a fault recovery module. It offers good modularity, maintainability and extensibility, can effectively detect and eliminate the various faults that occur during RSOA, and improves the stability of the whole service-oriented manufacturing system and the reliability of RSOA. In particular, the system can detect the common faults caused by virtual links, resources, tasks, applications and the like during RSOA in a service-oriented manufacturing system and provides a corresponding elimination strategy for each, which improves the reliability and service quality of RSOA.

Description

An implementation system for fault-tolerant management of resource service optimal allocation
Technical field
The invention belongs to the technical field of fault-tolerant management for distributed manufacturing information integration systems. It specifically relates to an implementation system for fault-tolerant management of resource service optimal allocation, comprising a fault-tolerant management framework for resource service optimal allocation in service-oriented manufacturing systems, together with the corresponding fault detection methods and an ECA-based fault elimination mechanism and method. The invention can effectively detect the common faults that arise during resource service optimal allocation in a service-oriented manufacturing system and provides a corresponding elimination strategy for each, which effectively improves the reliability and service quality of resource service optimal allocation.
Background technology
In a service-oriented manufacturing system (such as a cloud manufacturing (CMfg) system, a manufacturing service system, or a manufacturing grid system), the operations involved in optimally allocating manufacturing resource services include resource service search and matching, QoS assessment, QoS extraction, resource service selection, and resource service composition. Any of these operations may fail for some reason and thereby cause the whole optimal allocation to fail or become invalid. The main possible causes are:
1. The virtual link between two nodes in the service-oriented manufacturing system is disconnected, or its bandwidth drops suddenly and can no longer meet the demand;
2. An invoked resource service fails or changes state during execution, for example it is suddenly shut down or withdrawn, a resource service composition becomes invalid, the capability of the resource service drops suddenly, or it is overloaded;
3. A submitted or running task changes state, for example it is forcibly withdrawn by an administrator or the user, its requirements are raised, it is suspended, or it is assigned an invalid resource service;
4. Problems arise in the application process, such as insufficient trust between the two parties to a transaction, wrong access rights, or unreasonable or incorrect code design.
These phenomena are collectively referred to as faults in the present invention. Once any of them occurs, resource service optimal allocation (RSOA) is suspended or fails. Therefore, to improve the reliability and service quality of RSOA, the following questions must be answered: 1. Which faults can occur in the RSOA process? 2. How can a fault be detected when it occurs? 3. How can a detected fault be analyzed and recovered from?
No related research on these questions exists at present in service-oriented manufacturing fields such as CMfg. To answer them, realize fault-tolerant management in the RSOA process, and improve the reliability and service quality of RSOA, the present invention first analyzes and classifies the faults that may occur in the RSOA process, then studies on that basis a mechanism for realizing RSOA fault-tolerant management, together with the corresponding fault detection methods and elimination strategies.
Summary of the invention
The purpose of the present invention is to provide a mechanism for fault-tolerant management of resource service optimal allocation that can effectively detect the common faults arising in the RSOA process of a service-oriented manufacturing system and provide corresponding elimination strategies and methods for the various faults, thereby effectively improving the reliability and service quality of resource service optimal allocation in the service-oriented manufacturing system.
The technical solution adopted by the present invention is an implementation system for fault-tolerant management of resource service optimal allocation (RSOA). The system comprises an information service module, a resource service optimal allocation module, a fault detection module and a fault recovery module;
The information service module mainly provides information and data support for fault detection, fault recovery and resource service optimal allocation;
The resource service optimal allocation module mainly implements functional operations such as resource service search, quality of service (QoS) assessment, resource service selection and resource service composition;
The fault detection module monitors the state of each node in the service-oriented manufacturing system and of the running tasks and resources, monitoring continuously and analyzing their state; it analyzes and collects statistics on the historical data of instances that exited normally or abnormally, makes a decision, and notifies the fault recovery module to handle the detected fault;
The fault recovery module is an RSOA fault elimination module based on ECA (Event-Condition-Action) that integrates multiple fault-tolerance mechanisms. It mainly comprises an event detector (Event Detector), a condition evaluator (Condition Evaluator), an action executor (Action Executor), a rule reasoning engine (Rule Engine), an ECA rule base (ECA Rules) and an ECA rule manager (ECA Rule Manager).
Wherein, the fault detection performed by the fault detection module comprises virtual link (VL) related fault detection, resource service (RS) related fault detection, task (Task) related fault detection, and application related fault detection;
Virtual link related fault detection mainly comprises virtual link disconnection fault (VL_Disconnect_Failure) detection and insufficient bandwidth fault (Bandwidth_Failure) detection; a VL_Disconnect_Failure can usually be detected by the system security policy or by middleware embedded in the service-oriented manufacturing system; whether a fault has arisen between two entities because of bandwidth is judged using two indicators, the communication time and the successful communication rate (or reliability);
Resource service related fault detection mainly comprises resource service quit fault (RS_Quit_Failure) detection, resource service overload fault (RS_Overload_Failure) detection, and resource service composition fault (RS_Composition_Failure) detection; an RS_Quit_Failure is judged by a resource service detector that checks the state of each resource service periodically and without interruption; an RS_Overload_Failure is judged by assessing the data-handling capacity, communication time and execution time of RS_i to decide whether RS_i is overloaded; an RS_Composition_Failure is judged by checking whether the inter-concept mismatch detection rules, the inter-data mismatch detection rules, the attribute mismatch detection rules or the QoS inconsistency detection rules are satisfied;
Task related fault detection mainly comprises task cancellation fault (Task_Cancel_Failure) and task suspension fault (Task_Suspension_Failure) detection, and resource and task matching failure (Task_Resource_Mismatch_Failure) detection; a Task_Suspension_Failure is judged by a task detector that checks the current state of each task periodically and without interruption to see whether the task is in the task suspended (Task_Suspended) queue or in the task terminated (Task_Terminated) state; a Task_Resource_Mismatch_Failure is judged with a resource service matching algorithm that determines whether a basic matching fault, an I/O matching fault, a QoS matching fault or a comprehensive matching fault has occurred; the detection method for the fault caused by a change in task requirements (Task_RequireChange_Failure) is the same as that for Task_Resource_Mismatch_Failure;
Application related fault detection mainly comprises trust fault (Trust_Failure) detection, application design or coding fault (App_DesignCode_Failure) detection, and access rights fault (App_AccessRight_Failure) detection; a Trust_Failure is judged by comparing the trust value T_{x→y} between entities x and y, assessed with the resource service Trust-QoS assessment model, against the minimum degree of trust T°_{x→y} that entity x requires of y; App_DesignCode_Failure and App_AccessRight_Failure are mainly detected by the system security policy or by system middleware embedded in the service-oriented manufacturing system.
Wherein, in an ECA (Event-Condition-Action) rule, the event (Event) is defined as the occurrence that triggers the rule (Rule), the condition (Condition) is defined as the condition that must be satisfied for the rule to be activated, and the action (Action) is the instruction executed once the ECA rule has been triggered; a fault occurring in the RSOA process is defined as the Event of an ECA rule; the fault detection condition is defined as the Condition of the ECA rule; the handling applied to the fault is defined as the Action of the ECA rule.
Wherein, the handling applied to the fault is specifically rescheduling or rematching.
Wherein, the event detector (Event Detector) mainly receives the fault messages sent by the fault detection module and analyzes them to identify the fault event (Event); the condition evaluator (Condition Evaluator) mainly assesses the conditions (Condition) associated with the detected event (Event) to see whether they satisfy the condition of the corresponding ECA rule; the rule reasoning engine (Rule Engine) mainly matches the detected event (Event) against the rules in the ECA rule base by inference to find a suitable rule for handling the detected fault; the action executor (Action Executor) mainly executes the action of the selected ECA rule, according to the result of the Rule Engine's inference, to handle the fault; the ECA rule manager (ECA Rule Manager) manages the ECA rules, including modifying, adding and deleting rules; the ECA rule base (ECA Rules) mainly stores the various rules needed in the fault elimination process.
Compared with the prior art, the advantages of the present invention are:
(1) The invention designs fault-tolerant management mechanisms according to the causes and categories of the faults that arise in the resource service optimal allocation (RSOA) process, and thereby realizes the corresponding fault detection and elimination. It can effectively detect the common faults caused by virtual links, resources, tasks, applications and the like during resource service optimal allocation in a service-oriented manufacturing system and provides a corresponding elimination strategy for each, which effectively improves the reliability and service quality of resource service optimal allocation.
(2) The invention comprises a fault-tolerant management framework for resource service optimal allocation, together with the corresponding fault detection methods and an ECA (event-condition-action) based elimination mechanism and method. It is applicable to distributed, networked service-oriented manufacturing systems, has good dynamism, modularity, maintainability and extensibility, can effectively detect and eliminate the various faults in the resource service optimal allocation process, and improves the stability of the whole service-oriented manufacturing system and the reliability of resource service optimal allocation.
Description of drawings
Fig. 1 shows the fault-tolerant management architecture for resource service optimal allocation;
Fig. 2 shows ECA-based fault recovery;
Fig. 3 is the Task_Resource_Mismatch_Failure detection flow chart;
Fig. 4 is the Trust_Failure detection flow chart;
Table 1 lists some of the ECA rules for fault-tolerant management of resource service optimal allocation.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
The invention relates to a mechanism and method for fault-tolerant management of resource service optimal allocation: the faults that may occur in the RSOA process are analyzed and classified, an RSOA fault-tolerant management architecture is developed on that basis, and the corresponding concrete fault detection methods and elimination strategies are studied.
Resource service optimal allocation is said to have failed if and only if at least one of the following two situations occurs: 1. a resource stops providing service because it has crashed; 2. the availability of a resource does not reach the minimum QoS standard of the task. In practice, the fault types of resource service optimal allocation in service-oriented manufacturing systems such as cloud manufacturing are varied, and the common faults are mainly related to four factors: virtual links, resources, tasks, and applications.
(1) Virtual link related faults
A virtual link (VL) is a connection, in the broad sense, between two entities in a service-oriented manufacturing system. The faults caused by VLs are mainly the virtual link disconnection fault and the insufficient bandwidth fault.
(2) Resource service related faults
Resource services are the carriers that execute tasks. Therefore the withdrawal or overload of a resource service, a change in its QoS or capability, the composition of resource services, and so on may all cause RSOA faults. The faults caused by resource services are mainly: the resource service quit fault, the resource service overload (or saturation) fault, the resource service composition fault, and faults caused by changes in resource service capability.
The resource service composition fault mainly covers: mismatches between resource service concepts, mismatches between data, attribute mismatches, and QoS inconsistency.
(3) Task related faults
In the RSOA process, a task may be cancelled, suspended and so on for various reasons, causing the optimal allocation to fail. The RSOA faults caused by tasks are mainly: the task cancellation fault, the task suspension fault, the resource and task matching failure, and faults caused by changes in task requirements.
(4) Application related faults
In the application process, RSOA may fail for reasons such as trust, access rights or coding, for example: the trust fault, the application design or coding fault, and the access rights fault.
In the RSOA process, the four classes of faults above can reduce the efficiency and service quality of the whole RSOAS. To support fault tolerance in the RSOA process, and in conjunction with the RSOAS framework, the present invention proposes the RSOA fault-tolerant management architecture shown in Fig. 1.
The RSOAS fault-tolerance architecture consists of four parts: the information service module, the resource service optimal allocation module, the fault detection module and the fault recovery module. To realize fault tolerance in RSOAS, the key problems to solve are fault detection and fault elimination.
The present invention relates to a mechanism and method for fault-tolerant management of resource service optimal allocation, comprising a fault-tolerant management framework for resource service optimal allocation, together with the corresponding fault detection methods and an ECA-based elimination mechanism and method.
As shown in Fig. 1, the RSOAS fault-tolerance architecture consists of the information service module, the resource service optimal allocation module, the fault detection module and the fault recovery module, of which the fault detection module and the fault recovery module are the key content of the present invention.
(1) Information service module
The information service module mainly provides the directory information service (IIS), the Resource Information Service (RIS), resource service encapsulation, and QoS database information and data support for fault detection, fault recovery and resource service optimal allocation.
The directory information service (IIS) organizes information, provides aggregated information queries, supports efficient queries across multiple RISs, and also provides information indexing and search functions for the whole service-oriented manufacturing system. IIS consists of three parts: general registration processing, a pluggable directory structure, and search processing.
The Resource Information Service (RIS) runs on the resource side, provides a uniform means of querying the configuration, capability and state of resources in the system platform, and can be configured to work with an aggregate directory service. After RIS authenticates and parses the incoming requirements and tasks, it dispatches each query request to one or more information providers according to the type of the requested information. RIS then passes the information returned for the resource on to the user.
The resource service encapsulation template enables the platform to manage effectively the information of the nodes participating in collaborative manufacturing. Resources are classified and encapsulated according to their attributes (such as physical characteristics, geographical position, dynamic characteristics, sensitivity and function), customer demands (such as time, quality, price and service), and modes of use (such as discovery, proxy, monitoring and diagnosis). After a resource provider registers a resource on the platform, the resource is encapsulated into a resource service class template; when a client requests a resource service, it downloads the corresponding resource encapsulation template from the system platform and completes the instantiation for the specific task.
The QoS information of the corresponding resource services is extracted from the QoS database. The corresponding QoS index parameters are assessed and measured, and QoS comparisons are carried out, providing information and data support for the subsequent resource service selection and composition.
(2) Resource service optimal allocation module
The resource service optimal allocation module mainly provides functional operations such as resource service search, QoS assessment, resource service selection and resource service composition.
Resource service search provides various resource service information matching algorithm services. According to the resource service requirements of the subtasks obtained by task decomposition, it searches the resource service library for the resource services that meet the requirements and generates the candidate resource service set (RSS).
QoS assessment is performed on the large candidate resource service set found to meet the user's requirements. Its purpose is to provide a quantitative basis for the user and the system to select the best resource service and carry out resource service optimal allocation, and it is an important step in resource service optimal allocation. The QoS information of the resource services in RSS is extracted from the resource service information (OWL-S/WSDL) registered in the service library or from the QoS database. The corresponding QoS index parameters are assessed and measured, and QoS comparisons are carried out, providing information and data support for the subsequent resource service selection and composition.
Resource service selection: if the task submitted by the user requires a single resource service, the candidate resource services in RSS are assessed comprehensively and ranked according to their QoS parameter information, and the best resource service is selected to execute the task.
Resource service composition and selection: if the task submitted by the user requires multiple resource services, one resource service is selected from each RSS and these are combined in a certain order into a composite resource service; the optimal combination is then selected from all possible combinations to execute the task.
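The two selection cases can be illustrated with a minimal Python sketch. It is not the patented algorithm: the single `qos` score per service, the service names, and the rule that a composite score is the sum of its members' scores are all assumptions made for illustration, and the exhaustive enumeration is only practical for small candidate sets.

```python
from itertools import product


def score(service):
    # Hypothetical aggregate QoS score; higher is better (assumption).
    return service["qos"]


def select_best(rss):
    """Single-service demand: rank the candidates of one RSS, keep the best."""
    return max(rss, key=score)


def best_composition(rss_list):
    """Multi-service demand: pick one service from each RSS, keep the best combination."""
    # Assumption: a combination is scored by the sum of its members' scores.
    return max(product(*rss_list), key=lambda combo: sum(score(s) for s in combo))


# Hypothetical candidate resource service sets.
rss_a = [{"name": "mill-1", "qos": 0.72}, {"name": "mill-2", "qos": 0.90}]
rss_b = [{"name": "paint-1", "qos": 0.65}, {"name": "paint-2", "qos": 0.40}]

best_single = select_best(rss_a)
best_combo = best_composition([rss_a, rss_b])
```

Because `product` enumerates every combination, the cost grows multiplicatively with the size of each RSS; a real system would likely need a heuristic search instead.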
(3) Fault detection module
The fault detection module monitors the state of each node in the service-oriented manufacturing system and of the running tasks and resources, monitoring continuously and analyzing their state. Local detectors monitor the optimal allocation flow and the performance and running condition of the resources and tasks involved, and provide a series of management services, such as task state management and resource service state management. The module analyzes and collects statistics on the historical data of instances that exited normally or abnormally, makes a decision, and notifies the fault recovery module to handle the detected fault.
The concrete detection methods for the four classes of faults (virtual link related, resource service related, task related and application related) are described in detail below.
(1) Virtual link fault detection
1) VL_Disconnect_Failure detection
A VL_Disconnect_Failure can usually be detected by the system security policy or by middleware embedded in the service-oriented manufacturing system; for example, the GRAM service provided by Globus can be used to detect VL_Disconnect_Failure.
2) Bandwidth_Failure detection
Whether a fault has arisen between two entities in the system (denoted A and B) because of bandwidth is judged using two indicators: the communication time (CT) and the successful communication rate (PSC), or reliability.
A) Judgment by communication time
Let the virtual link between A and B be denoted VL(A, B), the total amount of information exchanged over VL(A, B) be SumInfor(A, B), the transmission speed (bandwidth) be V(A, B), and the waiting time be Waite(A, B). The corresponding total communication time, denoted T_c(A, B), is the sum of the transmission time and the waiting time: T_c(A, B) = SumInfor(A, B) / V(A, B) + Waite(A, B).
B) Judgment by successful communication rate (PSC) or reliability
If the failure rates of virtual link VL(A, B), node A and node B are α(A, B), α(A) and α(B) respectively, the reliability S_C(A, B) of VL(A, B) can be obtained from the definition of reliability.
If the minimum CT and PSC requirements of the user are [formula image: Figure BDA00002124510900071] and [formula image: Figure BDA00002124510900072] respectively, then the system judges that a Bandwidth_Failure has occurred when virtual link VL(A, B) satisfies [formula image: Figure BDA00002124510900073] or [formula image: Figure BDA00002124510900074].
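The bandwidth check can be sketched in Python as follows. The communication time follows the definition in the text (transmission time plus waiting time). Two parts are assumptions, because the original formulas survive only as images: the reliability is modeled as the product of the survival probabilities of the link and both nodes (independent failures), and a fault is declared when the communication time exceeds the user's limit or the reliability falls below the user's requirement. All function and parameter names are illustrative.

```python
def communication_time(sum_infor, bandwidth, wait):
    """T_c(A, B): transmission time SumInfor/V plus the waiting time."""
    return sum_infor / bandwidth + wait


def link_reliability(alpha_link, alpha_a, alpha_b):
    """S_C(A, B) under an ASSUMED independent-failure (series) model."""
    return (1 - alpha_link) * (1 - alpha_a) * (1 - alpha_b)


def bandwidth_failure(t_c, s_c, ct_limit, psc_limit):
    """ASSUMED criterion: the link is too slow OR not reliable enough."""
    return t_c > ct_limit or s_c < psc_limit


# Hypothetical figures: 800 units of data over a 100 units/s link, 2 s of waiting.
t_c = communication_time(sum_infor=800.0, bandwidth=100.0, wait=2.0)
s_c = link_reliability(alpha_link=0.01, alpha_a=0.02, alpha_b=0.03)
failed = bandwidth_failure(t_c, s_c, ct_limit=8.0, psc_limit=0.9)
```

With these numbers the communication time is 10 s against a limit of 8 s, so the sketch reports a Bandwidth_Failure even though the reliability requirement is met.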
(2) Resource service related fault detection
1) RS_Quit_Failure detection
To detect whether an RS_Quit_Failure has occurred during resource service optimal allocation, the resource service detector checks the state of each resource service periodically and without interruption. If a resource service does not respond, the system judges that an RS_Quit_Failure has occurred.
2) RS_Overload_Failure detection
Whether RS_i is overloaded, that is, whether the system has produced an RS_Overload_Failure, is judged by assessing the data-handling capacity (DC), communication time (CT) and execution time (ET) of RS_i.
Let the set of tasks assigned to RS_i within a certain period be Γ_i = {Task_1, Task_2, ..., Task_j, ..., Task_k}, where the quantity of RS_i needed by Task_j is [formula image: Figure BDA00002124510900075], d_{i,j} is the amount of data access required for Task_j to call RS_i, V(i, j) is the bandwidth of the virtual link between Task_j and RS_i, and et_j is the execution time RS_i needs to carry out Task_j. The ET, DC and CT of RS_i during operation are then calculated from these quantities. Let the upper limits of ET, DC and CT for RS_i be Lim^{ET}_{RS_i}, Lim^{DC}_{RS_i} and Lim^{CT}_{RS_i} respectively. When the system detects that RS_i satisfies any one of ET_{RS_i} > Lim^{ET}_{RS_i}, CT_{RS_i} > Lim^{CT}_{RS_i} or DC_{RS_i} > Lim^{DC}_{RS_i}, it judges that an RS_Overload_Failure has occurred.
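A minimal Python sketch of the overload test is given below. The threshold comparison (any one of ET, DC or CT exceeding its upper limit) follows the text. The aggregation formulas are assumptions, since the originals are not reproduced here: ET is taken as the sum of the et_j, DC as the sum of the d_{i,j}, and CT as the sum of d_{i,j} / V(i, j) over the assigned task set. The `TaskLoad` class and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskLoad:
    et: float         # et_j: execution time RS_i needs for Task_j
    data: float       # d_{i,j}: data access amount for Task_j calling RS_i
    bandwidth: float  # V(i, j): virtual link bandwidth between Task_j and RS_i


def load_metrics(tasks):
    """Return (ET, DC, CT) for one resource service under the ASSUMED sums."""
    et = sum(t.et for t in tasks)
    dc = sum(t.data for t in tasks)
    ct = sum(t.data / t.bandwidth for t in tasks)
    return et, dc, ct


def rs_overload_failure(tasks, lim_et, lim_dc, lim_ct):
    """True if any one metric exceeds its upper limit, as stated in the text."""
    et, dc, ct = load_metrics(tasks)
    return et > lim_et or dc > lim_dc or ct > lim_ct


# Hypothetical task set assigned to one resource service.
tasks = [TaskLoad(et=4.0, data=100.0, bandwidth=50.0),
         TaskLoad(et=6.0, data=300.0, bandwidth=100.0)]

metrics = load_metrics(tasks)
overloaded = rs_overload_failure(tasks, lim_et=12.0, lim_dc=500.0, lim_ct=4.0)
```

Here only the communication time (5.0 against a limit of 4.0) exceeds its bound, which is enough to flag the service as overloaded.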
3) RS_Composition_Failure detection
RS_Composition_Failure mainly covers four situations: mismatches between concepts, mismatches between data, attribute mismatches, and QoS inconsistency.
A) Inter-concept mismatch detection rules
(1) If RS_i is a subclass of RS_k and RS_k is not contained in RS_j, then there is a gap between RS_i and RS_j. This is the detection rule for a mismatch between resource service concepts in which a gap exists between the concepts. (2) If RS_i is a subclass of RS_k and RS_k is a subclass of RS_j, then RS_i is a subclass of RS_j. This is the detection rule for a mismatch between resource service concepts in which RS_i is a subclass of RS_j.
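The subclass reasoning behind these rules can be sketched in Python. This is a simplified reading, not the patent's rule set: the ontology is a hypothetical single-parent hierarchy, rule (2) is implemented as a transitive walk up the parent chain, and a gap is taken to mean that RS_i is neither equal to RS_j nor a (transitive) subclass of it.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
PARENT = {"CNC_Milling": "Milling", "Milling": "Machining"}


def is_subclass(child, ancestor, parent=PARENT):
    """Rule (2): subclass relations chain transitively up the hierarchy."""
    node = child
    while node in parent:
        node = parent[node]
        if node == ancestor:
            return True
    return False


def has_concept_gap(rs_i, rs_j, parent=PARENT):
    """Simplified rule (1): a gap exists when RS_i cannot be placed under RS_j."""
    return rs_i != rs_j and not is_subclass(rs_i, rs_j, parent)


chain_ok = is_subclass("CNC_Milling", "Machining")
gap = has_concept_gap("CNC_Milling", "Painting")
```

A production system would more plausibly delegate this check to an ontology reasoner over the registered service descriptions; the dictionary here only stands in for that.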
B) Inter-data mismatch detection rules
(1) If DUnitTransfer(RS_i) equals RS_j, then the same parameter of RS_i and RS_j has the same data type but a different dimension (unit), where DUnitTransfer() is the data dimension conversion function. (2) If DTypeTransfer(RS_i) equals RS_j, then RS_i and RS_j have the same parameter concept but different data types, where DTypeTransfer() is the data type conversion function.
C) Attribute mismatch detection rule
If RS_j requires more attribute parameters than RS_i can provide, and the intersection of RS_i and [formula image: Figure BDA00002124510900081] is not empty, then the attributes of RS_i cannot satisfy the requirements of RS_j, where [formula image: Figure BDA00002124510900082] is the split function.
The resource service composition detection rules above are only a subset; in practice, new rules can be designed and added as needed.
D) QoS inconsistency detection rule
Let the numbers of parameters of RS_i and RS_j be denoted [formula image missing] and [formula image: Figure BDA00002124510900084] respectively. If analysis shows that the QoS of two adjacent resource services RS_i and RS_j in the composite service is consistent, the composite service is valid; otherwise the system judges that an RS_Composition_Failure has occurred.
(3) Task related fault detection
1) Task_Cancel_Failure and Task_Suspension_Failure detection
To detect whether a Task_Cancel_Failure or Task_Suspension_Failure has occurred during resource service optimal allocation, the task detector checks the current state of each task periodically and without interruption. If a task is in the Task_Suspended queue, the system judges that a Task_Suspension_Failure has occurred. If it is in the Task_Terminated state, the system judges that a Task_Cancel_Failure has occurred.
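One polling pass of such a task detector can be sketched in Python. The data structures are assumptions: the Task_Suspended queue is modeled as a `deque` and the terminated tasks as a set, and the function name and task identifiers are hypothetical.

```python
from collections import deque


def classify_task(task_id, suspended_queue, terminated):
    """Map a task's current state to the fault name used in the text, or None."""
    if task_id in suspended_queue:
        return "Task_Suspension_Failure"
    if task_id in terminated:
        return "Task_Cancel_Failure"
    return None


# Hypothetical snapshot taken by the detector during one check.
suspended = deque(["Task_2"])
terminated = {"Task_5"}

states = {t: classify_task(t, suspended, terminated)
          for t in ("Task_1", "Task_2", "Task_5")}
```

In a running system this classification would be repeated on a timer, with each non-`None` result forwarded to the fault recovery module.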
2) Task_Resource_Mismatch_Failure detection
Suppose resource service RS_i is allocated to execute task Task_j. According to the resource service matching algorithm, let ζ_Bas, ζ_I/O, ζ_QoS and ζ be the basic matching threshold, I/O matching threshold, QoS matching threshold and comprehensive matching threshold set by the system or the user. Then:
(1) If the basic matching value of resource service RS_i and task Task_j is less than the basic matching threshold ζ_Bas, the system judges that a basic matching fault has occurred;
(2) If the I/O matching value of resource service RS_i and task Task_j is less than the I/O matching threshold ζ_I/O, the system judges that an I/O matching fault has occurred;
(3) If the QoS matching value of resource service RS_i and task Task_j is less than the QoS matching threshold ζ_QoS, the system judges that a QoS matching fault has occurred;
(4) If the overall matching value of resource service RS_i and task Task_j is less than the comprehensive matching threshold ζ, the system judges that a comprehensive matching fault has occurred.
The detection flow for Task_Resource_Mismatch_Failure is shown in Fig. 3. The detection method for Task_RequireChange_Failure is the same as that for Task_Resource_Mismatch_Failure.
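The four threshold checks above can be sketched in Python as follows. The matching values are taken as given inputs (the matching algorithm itself is not shown), the dictionary keys and numeric values are hypothetical, and the sketch reports every violated threshold rather than stopping at the first one; the flow in Fig. 3 may differ on that point.

```python
# Keys are hypothetical labels; fault names follow the text.
FAULT_NAMES = {
    "basic": "basic matching fault",
    "io": "I/O matching fault",
    "qos": "QoS matching fault",
    "overall": "comprehensive matching fault",
}


def detect_mismatch(values, thresholds):
    """Return the faults whose matching value falls below its threshold."""
    return [FAULT_NAMES[k] for k in FAULT_NAMES if values[k] < thresholds[k]]


# Hypothetical thresholds (zeta values) and matching values for RS_i and Task_j.
thresholds = {"basic": 0.6, "io": 0.7, "qos": 0.8, "overall": 0.75}
values = {"basic": 0.9, "io": 0.65, "qos": 0.85, "overall": 0.70}

faults = detect_mismatch(values, thresholds)
```

An empty list means the pairing of RS_i and Task_j passes all four checks.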
(4) Application related fault detection
1) Trust_Failure detection
Let the two entities taking part in a transaction in RSOA be x and y. During optimal allocation, the trust value T_{x→y} between x and y can be assessed with the resource service Trust-QoS assessment model. If the minimum degree of trust that entity x requires of y is T°_{x→y}, then the system judges that a Trust_Failure has occurred when T_{x→y} < T°_{x→y}, as shown in Fig. 4.
2) App_DesignCode_Failure and App_AccessRight_Failure detection
As with VL_Disconnect_Failure, App_DesignCode_Failure and App_AccessRight_Failure are mainly detected by the system security policy or by system middleware embedded in the service-oriented manufacturing system, chiefly using the related services or middleware provided by Globus.
(4) Fault recovery module
When a fault occurs and is detected, it must be repaired. Current fault-tolerance mechanisms mainly include the following:
1) Checkpoint-based task fault tolerance: the system sets checkpoints periodically and saves the correct state of the running program to reliable storage; when a fault occurs, it restores the most recent saved state and resumes execution, minimizing the loss caused by the fault.
2) Retry-based task fault tolerance: during resource service optimization allocation, if the failed operation, whether or not it has already been executed, cannot be ignored, the system may try to re-execute it without changing the execution path. Retry is bounded by a maximum number of repetitions; if the failing activity is still unresolved after the maximum number of attempts, repetition stops.
3) Backup-based task fault tolerance: copies of a task are run on different resources; unless all copies fail, the task eventually completes successfully.
4) Substitution-based fault tolerance: when a task fails, it is replaced by running another task with the same function.
5) Redundancy-based task fault tolerance: several different activities or execution paths that can accomplish the task are selected; their execution characteristics differ, but their function is the same.
6) User-defined exception-based fault tolerance: user-defined exceptions let users define exception handling methods for specific tasks. If a fault occurs at run time, the exception handler defined on that task is activated.
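As an illustration of how two of these strategies can be combined, the following Python sketch applies bounded retry (strategy 2) and then falls back to functionally equivalent alternatives (strategy 4). It is a sketch under assumptions, not the patent's implementation: the function name `run_with_fault_tolerance`, its parameters, and the use of exceptions to signal a fault are all hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def run_with_fault_tolerance(
    primary: Callable[[], T],
    alternatives: Sequence[Callable[[], T]] = (),
    max_attempts: int = 3,
) -> T:
    """Retry each candidate up to max_attempts times, then try the next one.

    primary      -- the task as originally scheduled
    alternatives -- tasks assumed to provide the same function as primary
    max_attempts -- upper bound on executions per candidate (must be >= 1)
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for candidate in (primary, *alternatives):
        for _ in range(max_attempts):
            try:
                # Re-execute the same operation without changing its path.
                return candidate()
            except Exception as exc:  # a fault is modelled as an exception
                last_error = exc
    # Every candidate exhausted its attempts: stop repeating and report.
    raise RuntimeError("all candidates failed") from last_error
```

A caller would pass the scheduled task as `primary` and any equivalent resource services as `alternatives`; a `RuntimeError` signals that neither retry nor substitution resolved the fault.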
In addition to combining the fault-tolerance mechanisms above, the present invention uses event-condition-action (ECA) rules to support RSOA fault-tolerant management. A fault occurring in the RSOA process is defined as the Event of an ECA rule; the fault detection condition is defined as the Condition of the ECA rule; and the handling applied to the fault (such as rescheduling or re-matching) is defined as the Action of the ECA rule.
With reference to typical ECA rules, the present invention designs the ECA-based resource service optimization allocation fault resolution module shown in Figure 2. It mainly comprises an event detector, a condition evaluator, an action executor, a rule engine, an ECA rule base, and an ECA rule manager.
1) Event Detector: receives the failure messages sent by the fault detection module and identifies the Event of the detected fault.
2) Condition Evaluator: evaluates the Condition associated with the detected Event to see whether it satisfies the condition of the corresponding ECA rule.
3) Rule Engine: matches the detected Event against the rules in the ECA rule base by reasoning, and finds a suitable rule to handle the detected fault.
4) Action Executor: executes the action of the selected ECA rule, according to the Rule Engine's reasoning result, to handle the fault.
5) ECARules: the ECA rule base.
6) ECARule Manager: manages the ECA rules, including modifying, adding, and deleting rules.
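The interplay of the six components listed above can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the class and function names (`ECARule`, `ECARuleManager`, `handle_fault`), the dictionary shape of a failure message, and the example rule are hypothetical and are not taken from Table 1.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

FailureMessage = Dict[str, Any]  # assumed shape of a message from fault detection


@dataclass
class ECARule:
    event: str                                   # fault name, e.g. "Trust_Failure"
    condition: Callable[[FailureMessage], bool]  # fault detection condition
    action: Callable[[FailureMessage], str]      # handling applied to the fault


@dataclass
class ECARuleManager:
    """Holds the ECA rule base and supports adding and deleting rules."""
    rules: List[ECARule] = field(default_factory=list)

    def add(self, rule: ECARule) -> None:
        self.rules.append(rule)

    def remove(self, event: str) -> None:
        self.rules = [r for r in self.rules if r.event != event]


def handle_fault(message: FailureMessage, manager: ECARuleManager) -> Optional[str]:
    event = message["event"]                     # Event Detector
    for rule in manager.rules:                   # Rule Engine: match event to rules
        if rule.event == event and rule.condition(message):  # Condition Evaluator
            return rule.action(message)          # Action Executor
    return None                                  # no applicable rule


# Illustrative rule only: re-match when trust falls below the required minimum.
manager = ECARuleManager()
manager.add(ECARule(
    event="Trust_Failure",
    condition=lambda m: m["trust"] < m["min_trust"],
    action=lambda m: "re-matching",
))
result = handle_fault(
    {"event": "Trust_Failure", "trust": 0.4, "min_trust": 0.6}, manager
)
```

Keeping the rules in a separate manager object mirrors the description: new rules can be added or removed without touching the detection or execution logic.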
In the proposed fault-tolerance mechanism for resource service optimization allocation in service manufacturing systems, ECA rules directly support fault recovery. For the faults and detection methods given above, the present invention designs the ECA rules shown in Table 1 to support RSOA fault recovery in service manufacturing systems such as CMfg.
The RSOA fault resolution rules listed in Table 1 are part of the ECA rule base. In practice, new rules can be designed as needed and added to the ECA rule base through the ECARule Manager.
Table 1 (reproduced as images in the original publication: Figure BDA00002124510900101 and Figure BDA00002124510900121)

Claims (5)

1. An implementation system for resource service optimization allocation (RSOA) fault-tolerant management, characterized in that: the system comprises an information service module, a resource service optimization allocation module, a fault detection module and a fault recovery module;
The information service module mainly provides information and data support for fault detection, fault recovery and resource service optimization allocation;
The resource service optimization allocation module mainly implements functional operations such as resource service search, quality of service (QoS) evaluation, resource service selection and resource service composition;
The fault detection module is responsible for monitoring the state of each node in the service manufacturing system and of the running tasks and resources, monitoring continuously and performing state analysis; it analyzes and compiles statistics on the historical data of instances that exited normally or abnormally, makes a decision, and notifies the fault recovery module to handle the detected fault;
The fault recovery module is an ECA (Event-Condition-Action) based resource service optimization allocation fault resolution module that integrates multiple fault-tolerance mechanisms, and mainly comprises an event detector (Event Detector), a condition evaluator (Condition Detector), an action executor (Action Executor), a rule reasoning engine (Rule Engine), an ECA rule base (ECA Rules) and an ECA rule manager (ECA Rule Manager).
2. The implementation system for resource service optimization allocation fault-tolerant management according to claim 1, characterized in that: fault detection in the fault detection module comprises virtual link (VL) related fault detection, resource service (RS) related fault detection, task (Task) related fault detection, and application related fault detection;
The virtual link related fault detection mainly comprises virtual link fault (VL_Disconnect_Failure) detection and insufficient bandwidth fault (Bandwidth_Failure) detection; a virtual link fault (VL_Disconnect_Failure) can usually be detected by the system security policy or by middleware embedded in the service manufacturing system; whether a fault has arisen between two entities because of bandwidth is judged from two indices, namely communication time and successful communication rate (or reliability);
The resource service related fault detection mainly comprises resource service quit fault (RS_Quit_Failure) detection, resource service overload fault (RS_Overload_Failure) detection, and resource service composition fault (RS_Composition_Failure) detection; a resource service quit fault (RS_Quit_Failure) is judged by a resource service detector that periodically and continuously checks the state of each resource service; a resource service overload fault (RS_Overload_Failure) is judged by evaluating the data processing capacity, communication time and execution time of $RS_i$ to determine whether $RS_i$ is overloaded; a resource service overload fault (RS_Overload_Failure) is judged by checking whether the inter-concept mismatch detection rules, inter-data mismatch detection rules, attribute mismatch detection rules and QoS inconsistency detection rules are satisfied;
The task related fault detection mainly comprises task cancellation fault (Task_Cancel_Failure) detection, task pause or suspension fault (Task_Suspension_Failure) detection, and resource and task matching failure (Task_Resource_Mismatch_Failure) detection; a task pause or suspension fault (Task_Suspension_Failure) is judged by a task detector that periodically and continuously checks the current state of each task, namely whether the task is in the task suspended (Task_Suspended) queue or the task terminated (Task_Terminated) state; a resource and task matching failure (Task_Resource_Mismatch_Failure) is judged by using a resource service matching algorithm to determine whether a basic matching fault, an I/O matching fault, a QoS matching fault or a comprehensive matching fault has occurred; the detection method for the task pause or suspension fault (Task_Suspension_Failure) is the same as that for the resource and task matching failure (Task_Resource_Mismatch_Failure);
The application related fault detection mainly comprises trust fault (Trust_Failure) detection, application design or coding fault (App_DesignCode_Failure) detection, and access right fault (App_AccessRight_Failure) detection; a trust fault (Trust_Failure) is judged by comparing the trust value $T_{x \to y}$ between x and y, evaluated with the resource service Trust-QoS evaluation model, with the minimum trust degree $T_{x \to y}^{\circ}$ that entity x requires of y; application design or coding faults (App_DesignCode_Failure) and access right faults (App_AccessRight_Failure) are detected mainly by the system security policy or by system middleware embedded in the service manufacturing system.
3. The implementation system for resource service optimization allocation fault-tolerant management according to claim 1, characterized in that: in an ECA (Event-Condition-Action) rule, the event (Event) is defined as the event that triggers a rule (Rule), the condition (Condition) is defined as the condition that must be satisfied to activate the rule (Rule), and the action (Action) is the action instruction to be executed once the ECA rule is triggered; a fault occurring in the RSOA process is defined as the event (Event) of an ECA rule; the fault detection condition is defined as the condition (Condition) of the ECA rule; and the handling applied to the fault is defined as the action (Action) of the ECA rule.
4. The implementation system for resource service optimization allocation fault-tolerant management according to claim 3, characterized in that: the handling applied to the fault is specifically rescheduling or re-matching.
5. The implementation system for resource service optimization allocation fault-tolerant management according to claim 1, characterized in that: the event detector (Event Detector) receives the failure messages sent by the fault detection module and identifies the event (Event) of the detected fault; the condition evaluator (Condition Evaluator) evaluates the condition (Condition) associated with the detected event (Event) to see whether it satisfies the condition of the corresponding ECA rule; the rule reasoning engine (Rule Engine) matches the detected event (Event) against the rules in the ECA rule base by reasoning and finds a suitable rule to handle the detected fault; the action executor (Action Executor) executes the action of the selected ECA rule, according to the Rule Engine's reasoning result, to handle the fault; the ECA rule manager (ECA Rule Manager) manages the ECA rules, including modifying, adding and deleting rules; the ECA rule base (ECA Rules) mainly stores the rules required in the fault resolution process.
CN2012103356095A 2012-09-11 2012-09-11 Implement system for resource service optimization allocation fault-tolerant management Expired - Fee Related CN102916830B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012103356095A CN102916830B (en) 2012-09-11 2012-09-11 Implement system for resource service optimization allocation fault-tolerant management

Publications (2)

Publication Number Publication Date
CN102916830A true CN102916830A (en) 2013-02-06
CN102916830B CN102916830B (en) 2013-12-11

Family

ID=47615068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012103356095A Expired - Fee Related CN102916830B (en) 2012-09-11 2012-09-11 Implement system for resource service optimization allocation fault-tolerant management

Country Status (1)

Country Link
CN (1) CN102916830B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645036A (en) * 2009-09-11 2010-02-10 兰雨晴 Method for automatically distributing test tasks based on capability level of test executor
CN101958917A (en) * 2010-03-24 2011-01-26 北京航空航天大学 Cloud manufacturing system-oriented method for measuring and enhancing flexibility of resource service composition
CN102624870A (en) * 2012-02-01 2012-08-01 北京航空航天大学 Intelligent optimization algorithm based cloud manufacturing computing resource reconfigurable collocation method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106341281A (en) * 2016-11-10 2017-01-18 福州智永信息科技有限公司 Distributed fault detection and recovery method of linux server
CN107040406A (en) * 2017-03-14 2017-08-11 西安电子科技大学 A kind of end cloud cooperated computing system and its fault-tolerance approach
CN107040406B (en) * 2017-03-14 2020-08-11 西安电子科技大学 End cloud cooperative computing system and fault-tolerant method thereof
CN108289034A (en) * 2017-06-21 2018-07-17 新华三大数据技术有限公司 A kind of fault discovery method and apparatus
CN108021827A (en) * 2017-12-07 2018-05-11 中科开元信息技术(北京)有限公司 A kind of method and system based on area mechanism structure security system
CN114296983A (en) * 2021-12-30 2022-04-08 重庆允成互联网科技有限公司 Trigger operation record-based flow exception handling method and storage medium
CN114296983B (en) * 2021-12-30 2022-08-12 重庆允成互联网科技有限公司 Trigger operation record-based flow exception handling method and storage medium
CN114580911A (en) * 2022-03-04 2022-06-03 重庆大学 Site-factory hybrid service and resource scheduling method

Also Published As

Publication number Publication date
CN102916830B (en) 2013-12-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131211

Termination date: 20190911
