CN108628708A

CN108628708A - Cloud computing fault-tolerance method and device

Info

Publication number: CN108628708A
Application number: CN201710166422.XA
Authority: CN
Inventors: 童遥; 申光
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2017-03-20
Filing date: 2017-03-20
Publication date: 2018-10-09

Abstract

The present invention provides a cloud computing fault-tolerance method and device. The method includes: while the primary copy of a cloud computing task is running on a first processor, selecting multiple second processors to process the backup copy of the cloud computing task; and determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy. The invention solves the technical problem in the related art that platform reliability is low when only one processor is considered for processing the backup copy.

Description

Cloud computing fault-tolerance method and device

Technical field

The present invention relates to the field of communications, and in particular to a cloud computing fault-tolerance method and device.

Background art

With the development of the Internet and data centers, real-time systems are increasingly deployed in distributed environments, and the cloud computing that has emerged from them attracts growing attention. The main idea of cloud computing is to integrate the various computing resources on the Internet. Many of these resources are heterogeneous, so managing large-scale heterogeneous computing resources effectively is an urgent problem. At the same time, communication efficiency and high reliability are inherent requirements of cloud computing and important metrics of the quality of service (QoS) that a system provides to its users.

When large-scale resources fail or produce errors, for example when multiple processors fail, cloud computing research still needs to address how to keep the whole system running normally and producing correct results, how to select a suitable fault-tolerant scheduling strategy, and, especially for tasks with timing, communication, reliability, and fault-tolerance requirements, how to ensure that the system completes tasks as required. Fault tolerance refers to the ability of a computer system to keep working normally, without failing, in the presence of faults. According to their temporal characteristics, faults can be divided into permanent faults and transient faults. Traditional fault-tolerance methods include retry, N-version programming, and recovery blocks. These methods have some practical value for improving system reliability and extending system lifetime. However, they do not take into account issues such as real-time behavior and system overhead, and are therefore not suitable for distributed systems.

In the related art, the most common fault-tolerance technique in distributed systems is the primary/backup copy technique: each task has one primary copy and one backup copy, deployed on two different processors. A traditional backup copy has three execution modes:

1) Active backup copy. Fig. 1 is a schematic diagram of an active backup copy in the related art. As shown in Fig. 1, the horizontal axis is the scheduled execution time; the primary copy of the task executes on processor processor1, and the backup copy of the task executes on processor processor2.

2) Passive backup copy. Fig. 2 is a schematic diagram of a passive backup copy in the related art. As shown in Fig. 2, the primary copy of the task likewise executes on processor processor1, and the backup copy of the task executes on processor processor2.

3) Primary/backup overlap. Fig. 3 is a schematic diagram of the primary/backup overlap mode in the related art. As shown in Fig. 3, the executions of the primary copy and the backup copy overlap in time, so a backup copy in this mode combines the advantage of the active backup copy, namely freedom from execution time constraints, with the high efficiency of the passive backup copy.

An active backup copy runs at the same time as the primary copy, with no synchronization between the two. A passive backup copy starts executing only when execution of the primary copy fails. Its advantage is that no redundant task is executed while the system is fault-free, and it allows the backup copies of tasks under different fault states to overlap, which improves processor utilization.

However, much current research on fault-tolerant resource scheduling assumes that only a single machine fails. The resources used by large-scale cloud computing systems are highly dynamic and heterogeneous, and the inherent unreliability of the resource environment greatly increases the likelihood of large-scale resource failures in such systems. Assuming only single-machine failure therefore severely limits research on fault tolerance in cloud computing.

No effective solution to the above problem in the related art has yet been found.

Summary of the invention

Embodiments of the present invention provide a cloud computing fault-tolerance method and device, to at least solve the technical problem in the related art that platform reliability is low when only one processor is considered for processing the backup copy.

According to one embodiment of the present invention, a cloud computing fault-tolerance method is provided, including: while the primary copy of a cloud computing task is running on a first processor, selecting multiple second processors to process the backup copy of the cloud computing task; and determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

Optionally, selecting the multiple second processors to process the backup copy of the cloud computing task includes: obtaining the loads of all spare processors; and selecting, from all the spare processors, the multiple processors with the smallest loads as the second processors.

Optionally, determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy includes: pre-processing the backup copy on the multiple second processors; calculating the completion time of the pre-processing of the backup copy on each of the multiple second processors; and sorting the completion times in ascending order and determining, according to the sorting result, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

Optionally, the primary copy includes one or more sub-primary copies and the backup copy includes one or more sub-backup copies, where the sub-primary copies correspond one-to-one to the sub-backup copies, different sub-primary copies execute on different processors, and different sub-backup copies execute on different processors.

Optionally, after determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy, the method further includes: when the primary copy fails on the first processor, executing the backup copy on the second processors in the determined order; and when the second processor currently processing the backup copy fails, selecting in turn, according to the determined order, the next candidate second processor to execute the backup copy.

According to another embodiment of the present invention, a cloud computing fault-tolerance device is provided, including: a selecting module, configured to select, while the primary copy of a cloud computing task is running on a first processor, multiple second processors to process the backup copy of the cloud computing task; and a determining module, configured to determine, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

Optionally, the selecting module includes: an acquiring unit, configured to obtain the loads of all spare processors; and a determination unit, configured to select, from all the spare processors, the multiple processors with the smallest loads as the second processors.

Optionally, the determining module includes: a pre-processing unit, configured to pre-process the backup copy on the multiple second processors; a computing unit, configured to calculate the completion time of the pre-processing of the backup copy on each of the multiple second processors; and a sorting unit, configured to sort the completion times in ascending order and determine, according to the sorting result, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

Optionally, the device further includes: a first execution module, configured to execute, after the determining module has determined the order of the second processors according to the processing efficiency of the backup copy on each of the multiple second processors, the backup copy on the second processors in the determined order when the primary copy fails on the first processor; and a second execution module, configured to select in turn, according to the determined order, the next candidate second processor to execute the backup copy when the second processor currently processing the backup copy fails.

According to still another embodiment of the present invention, a storage medium is further provided. The storage medium is configured to store program code for executing the following steps:

while the primary copy of a cloud computing task is running on a first processor, selecting multiple second processors to process the backup copy of the cloud computing task;

determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

With the present invention, while the primary copy of a cloud computing task is running on a first processor, multiple second processors are selected to process the backup copy of the cloud computing task, and the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy is determined according to the processing efficiency of the backup copy on each of the multiple second processors. Because multiple second processors are selected to process the backup copy, and the backup copy is executed in an order based on the processing efficiency of the second processors, the execution time of the cloud computing task is optimized. This solves the technical problem in the related art that platform reliability is low when only one processor is considered for processing the backup copy, and ensures the reliability of the cloud platform.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of an active backup copy in the related art;

Fig. 2 is a schematic diagram of a passive backup copy in the related art;

Fig. 3 is a schematic diagram of the primary/backup overlap mode in the related art;

Fig. 4 is a flowchart of a cloud computing fault-tolerance method according to an embodiment of the present invention;

Fig. 5 is a structural block diagram of a cloud computing fault-tolerance device according to an embodiment of the present invention;

Fig. 6 is the physical network diagram of specific embodiment one of the present invention;

Fig. 7 is the logical network diagram of specific embodiment one of the present invention;

Fig. 8 is the physical network diagram of specific embodiment two of the present invention;

Fig. 9 is the logical network diagram of specific embodiment two of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, provided there is no conflict, the embodiments of this application and the features in them may be combined with one another.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects, not to describe a specific order or sequence.

Embodiment 1

This embodiment provides a cloud computing fault-tolerance method. Fig. 4 is a flowchart of the cloud computing fault-tolerance method according to an embodiment of the present invention. As shown in Fig. 4, the flow includes the following steps:

Step S402: while the primary copy of a cloud computing task is running on a first processor, select multiple second processors to process the backup copy of the cloud computing task;

Step S404: according to the processing efficiency of the backup copy on each of the multiple second processors, determine the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

Through the above steps, while the primary copy of a cloud computing task is running on a first processor, multiple second processors are selected to process the backup copy of the cloud computing task, and the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy is determined according to the processing efficiency of the backup copy on each of the multiple second processors. Because multiple second processors are selected to process the backup copy, and the backup copy is executed in an order based on the processing efficiency of the second processors, the execution time of the cloud computing task is optimized. This solves the technical problem in the related art that platform reliability is low when only one processor is considered for processing the backup copy, and ensures the reliability of the cloud platform.

Optionally, the above steps may be executed by, without limitation, a scheduling server or a management server of a cloud computing data center.

Optionally, the primary copy includes one or more sub-primary copies and the backup copy includes one or more sub-backup copies, where the sub-primary copies correspond one-to-one to the sub-backup copies, different sub-primary copies execute on different processors, and different sub-backup copies execute on different processors.

Optionally, selecting the multiple second processors to process the backup copy of the cloud computing task includes:

S11: obtain the loads of all spare processors. The load of a processor is related to parameters such as its CPU utilization, its memory, and the traffic volume it is currently processing;

S12: select, from all the spare processors, the multiple processors with the smallest loads as the second processors. The smaller the load, the lighter the work the processor is handling, and the higher its priority.
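Steps S11 and S12 can be illustrated with a minimal sketch. The text does not say how CPU utilization, memory, and traffic are combined into a single load value, so the equal-weight `load_score` below, the `Processor` fields, and the `max_requests` normalization constant are all assumptions made purely for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    cpu_utilization: float     # fraction in [0, 1]
    memory_utilization: float  # fraction in [0, 1]
    active_requests: int       # traffic currently being processed


def load_score(p: Processor, max_requests: int = 100) -> float:
    """Hypothetical load metric: equal-weight mean of three normalized signals."""
    traffic = min(p.active_requests / max_requests, 1.0)
    return (p.cpu_utilization + p.memory_utilization + traffic) / 3.0


def select_second_processors(spares, k):
    """S11 + S12: return the k spare processors with the smallest load, lightest first."""
    return heapq.nsmallest(k, spares, key=load_score)


spares = [
    Processor("P1", 0.70, 0.60, 50),
    Processor("P2", 0.20, 0.30, 10),
    Processor("P4", 0.10, 0.20, 5),
    Processor("P5", 0.90, 0.80, 90),
]
chosen = select_second_processors(spares, k=3)
print([p.name for p in chosen])  # ['P4', 'P2', 'P1']
```

Any monotone combination of the load parameters would serve equally well here; only the "smallest load first" selection comes from the text.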

In an optional implementation of this embodiment, determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy includes:

S21: pre-process the backup copy on the multiple second processors;

S22: calculate the completion time of the pre-processing of the backup copy on each of the multiple second processors;

S23: sort the completion times in ascending order and determine, according to the sorting result, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy. The faster the pre-processing of the backup copy completes on a processor, the earlier that processor is placed in the fault-tolerance order.
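Steps S21 to S23 amount to measuring one completion time per second processor and sorting in ascending order. The sketch below assumes a caller-supplied `preprocess` callback that returns the measured completion time for a processor; that callback, the simulated timings, and the name-based tie-break are hypothetical and not specified in the text.

```python
def measure_completion_times(processors, preprocess):
    """S21 + S22: run the pre-processing on each processor and record its completion time."""
    return {p: preprocess(p) for p in processors}


def failover_order(completion_times):
    """S23: sort processors by ascending completion time (ties broken by name)."""
    return sorted(completion_times, key=lambda p: (completion_times[p], p))


# Simulated measurements standing in for a real pre-processing run.
simulated = {"P1": 9.0, "P2": 6.5, "P4": 4.2}
times = measure_completion_times(["P1", "P2", "P4"], simulated.__getitem__)
order = failover_order(times)
print(order)  # ['P4', 'P2', 'P1']
```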

Optionally, after determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy, the method further includes:

S31: when the primary copy fails on the first processor, execute the backup copy on the second processors in the determined order;

S32: when the second processor currently processing the backup copy fails, select in turn, according to the determined order, the next candidate second processor to execute the backup copy.
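The failover behavior of S31 and S32 can be sketched as a loop over the determined order. How a processor failure is detected is not described in the text, so the `ProcessorFailure` exception and the `run_copy` callback below are assumptions used only to make the control flow concrete.

```python
class ProcessorFailure(Exception):
    """Raised by the hypothetical run_copy callback when a processor is down."""


def run_backup_with_failover(order, run_copy):
    """Try each second processor in the determined order until one succeeds."""
    attempted = []
    for processor in order:
        attempted.append(processor)
        try:
            return processor, run_copy(processor), attempted
        except ProcessorFailure:
            continue  # S32: fall through to the next candidate in the order
    raise RuntimeError(f"all second processors failed: {attempted}")


failed = {"P4"}  # simulate a failure of the first candidate


def run_copy(processor):
    if processor in failed:
        raise ProcessorFailure(processor)
    return f"backup copy completed on {processor}"


winner, result, attempted = run_backup_with_failover(["P4", "P2", "P1"], run_copy)
print(winner, attempted)  # P2 ['P4', 'P2']
```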

From the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiments may be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods of the embodiments of the present invention.

Embodiment 2

This embodiment further provides a cloud computing fault-tolerance device. The device is used to implement the above embodiment and its preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiment is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 5 is a structural block diagram of the cloud computing fault-tolerance device according to an embodiment of the present invention. As shown in Fig. 5, the device includes:

a selecting module 50, configured to select, while the primary copy of a cloud computing task is running on a first processor, multiple second processors to process the backup copy of the cloud computing task;

a determining module 52, configured to determine, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

Optionally, the selecting module includes: an acquiring unit, configured to obtain the loads of all spare processors; and a determination unit, configured to select, from all the spare processors, the multiple processors with the smallest loads as the second processors.

Optionally, the determining module includes: a pre-processing unit, configured to pre-process the backup copy on the multiple second processors; a computing unit, configured to calculate the completion time of the pre-processing of the backup copy on each of the multiple second processors;

and a sorting unit, configured to sort the completion times in ascending order and determine, according to the sorting result, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

Optionally, the device further includes: a first execution module, configured to execute, after the determining module has determined the order of the second processors according to the processing efficiency of the backup copy on each of the multiple second processors, the backup copy on the second processors in the determined order when the primary copy fails on the first processor; and a second execution module (optional), configured to select in turn, according to the determined order, the next candidate second processor to execute the backup copy when the second processor currently processing the backup copy fails.

It should be noted that each of the above modules may be implemented in software or hardware. In the latter case, this may be done, without limitation, as follows: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.

Embodiment 3

This embodiment proposes a fault-tolerant scheduling strategy based on a reliability QoS (quality of service) metric. It focuses on how to deploy tasks onto highly reliable processors, both when a single processor fails and when multiple processors fail, while also optimizing execution time.

The fault-tolerant scheduling algorithm proposed by the present invention is a reliability-based greedy heuristic scheduling method. The algorithm uses the following definitions:

1) The set of scheduled tasks is S

2) The set of unscheduled tasks is U

3) The set of free (idle) tasks is U_f, with U_f ⊆ U

4) Scheduling a task i ∈ S onto processor P_j yields its start time and its completion time

5) The maximum number of processor failures is denoted ε

6) The set of processors is denoted P = {P₁, P₂, ..., P_m}

7) The list of free tasks is denoted α

The fault-tolerant scheduling strategy of the algorithm includes the following steps:

First step:

Initialize the system: S = ∅, U = ∅, α = ∅

Second step:

Insert the start task i into α

Third step:

Sort the processors by load from low to high, and calculate the earliest completion time t_i of the start task i on the processors in the processor set

Fourth step:

If task i can reach its earliest completion time on the ε+1 processors P^(ε+1), schedule task i onto these ε+1 processors

Fifth step:

If there is no feasible processor for task i, return failure

Sixth step:

Put the free successor tasks of task i into α

Seventh step:

Update the unscheduled task set U by removing task i from U.
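The greedy strategy can be read as a list-scheduling loop that places every free task on the ε+1 processors giving the earliest completion times, so that up to ε processor failures still leave one copy. The sketch below is one possible reading, not the patented implementation. It assumes that U starts as the full task set, that `exec_time` gives a per-processor execution time, that a task may start only after every copy of each predecessor has finished (a conservative choice), and that processor availability time stands in for load. All function and parameter names are hypothetical.

```python
from collections import deque


def greedy_fault_tolerant_schedule(tasks, preds, exec_time, processors, eps):
    """Return {task: [(processor, start, finish), ...]} with eps+1 copies per task,
    or None if some task has fewer than eps+1 feasible processors."""
    scheduled = {}                                   # S
    unscheduled = set(tasks)                         # U (assumed to start full)
    available_at = {p: 0.0 for p in processors}      # when each processor is free
    free = deque(t for t in tasks if not preds.get(t, ()))  # alpha

    while free:
        i = free.popleft()
        # Assumption: wait for all copies of all predecessors to finish.
        data_ready = max(
            (finish for q in preds.get(i, ()) for _, _, finish in scheduled[q]),
            default=0.0,
        )
        candidates = []
        for p in processors:
            if (i, p) in exec_time:                  # p is feasible for task i
                start = max(available_at[p], data_ready)
                candidates.append((start + exec_time[(i, p)], start, p))
        if len(candidates) < eps + 1:
            return None                              # fifth step: failure
        candidates.sort()                            # earliest completion first
        copies = []
        for finish, start, p in candidates[: eps + 1]:
            available_at[p] = finish
            copies.append((p, start, finish))
        scheduled[i] = copies
        unscheduled.discard(i)                       # seventh step
        for t in tasks:                              # sixth step: newly free successors
            if (t in unscheduled and t not in free
                    and all(q in scheduled for q in preds.get(t, ()))):
                free.append(t)
    return scheduled if not unscheduled else None


tasks = ["a", "b"]
preds = {"b": {"a"}}
procs = ["P1", "P2", "P3"]
exec_time = {("a", "P1"): 2, ("a", "P2"): 3, ("a", "P3"): 5,
             ("b", "P1"): 4, ("b", "P2"): 1, ("b", "P3"): 2}
schedule = greedy_fault_tolerant_schedule(tasks, preds, exec_time, procs, eps=1)
print(schedule["a"])  # [('P1', 0.0, 2.0), ('P2', 0.0, 3.0)]
print(schedule["b"])  # [('P2', 3.0, 4.0), ('P3', 3.0, 5.0)]
```

With ε = 1 each task receives two copies on distinct processors; requesting more copies than there are feasible processors makes the function return `None`, which corresponds to the failure case in the fifth step.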

This embodiment includes the following two specific embodiments, which describe the application in detail in combination with different implementation scenarios:

Specific embodiment one

Fig. 6 is the physical network diagram of specific embodiment one of the present invention. As shown in Fig. 6, the system consists of a switch, a scheduling server, and 4 processors (P1, P2, P3, P4). The scheduling server is responsible for actually scheduling the execution of the primary copy or the backup copies of a task and for deciding which server they execute on.

Fig. 7 is the logical network diagram of specific embodiment one of the present invention. As shown in Fig. 7, the system has 4 processors P1, P2, P3, and P4, and the primary copy of the task is on processor P3. By calculating the completion time of the primary copy, the scheduling server decides to schedule 3 backup copies onto hosts P1, P2, and P4. The backup copies are ordered by earliest completion time, so when processor P3, which hosts the primary copy, fails, the scheduling order is P4->P2->P1: the backup copy on processor P4 is executed first; if processor P4 fails, the backup copy on processor P2 is scheduled for execution; and if processor P2 fails, the backup copy on processor P1 is scheduled.

Specific embodiment two

Fig. 8 is the physical network diagram of specific embodiment two of the present invention. As shown in Fig. 8, the system consists of a switch, a scheduling server, and 5 processors (P1, P2, P3, P4, P5). The scheduling server is responsible for actually scheduling the execution of the primary copy or the backup copies of a task and for deciding which server they execute on.

Fig. 9 is the logical network diagram of specific embodiment two of the present invention. As shown in Fig. 9, the system has 5 processors P1, P2, P3, P4, and P5, and primary copy A of a task is on processor P3. By calculating the completion time of primary copy A, the scheduling server decides to schedule 3 backup copies onto hosts P1, P2, and P4. The backup copies are ordered by earliest completion time, so when processor P3, which hosts primary copy A, fails, the scheduling order is P4->P2->P1: the backup copy on processor P4 is executed first; if processor P4 fails, the backup copy on processor P2 is scheduled for execution; and if processor P2 fails, the backup copy on processor P1 is scheduled. At the same time, primary copy B of a task is on processor P1, and by calculating the completion time of primary copy B, the scheduling server decides to schedule 3 backup copies onto hosts P2, P3, and P4.

When processor P3 fails, backup copy A1 of primary copy A is scheduled for execution on processor P4 and, at the same time, backup copy B2 is scheduled onto processor P5; primary copy B continues to run on processor P1.

Embodiment 4

An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program code for executing the following steps:

S1: while the primary copy of a cloud computing task is running on a first processor, select multiple second processors to process the backup copy of the cloud computing task;

S2: according to the processing efficiency of the backup copy on each of the multiple second processors, determine the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

Optionally, in this embodiment, the storage medium may include, but is not limited to, any medium that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.

Optionally, in this embodiment, a processor executes the following according to the program code stored in the storage medium: while the primary copy of a cloud computing task is running on a first processor, selecting multiple second processors to process the backup copy of the cloud computing task;

Optionally, in this embodiment, a processor executes the following according to the program code stored in the storage medium: determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

Optionally, for specific examples in this embodiment, refer to the examples described in the above embodiments and optional implementations; they are not repeated here.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device. In some cases, the steps shown or described may be executed in an order different from the one given here, or the modules or steps may each be fabricated as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.

The above is only a preferred embodiment of the present invention and is not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A cloud computing fault-tolerance method, characterized by including:

while the primary copy of a cloud computing task is running on a first processor, selecting multiple second processors to process the backup copy of the cloud computing task;

determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

2. The method according to claim 1, characterized in that selecting the multiple second processors to process the backup copy of the cloud computing task includes:

obtaining the loads of all spare processors;

selecting, from all the spare processors, the multiple processors with the smallest loads as the second processors.

3. The method according to claim 1, characterized in that determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy includes:

pre-processing the backup copy on the multiple second processors;

calculating the completion time of the pre-processing of the backup copy on each of the multiple second processors;

sorting the completion times in ascending order, and determining, according to the sorting result, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

4. The method according to claim 1, characterized in that the primary copy includes one or more sub-primary copies and the backup copy includes one or more sub-backup copies, where the sub-primary copies correspond one-to-one to the sub-backup copies, different sub-primary copies execute on different processors, and different sub-backup copies execute on different processors.

5. The method according to claim 1, characterized in that, after determining, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy, the method further includes:

when the primary copy fails on the first processor, executing the backup copy on the second processors in the determined order;

when the second processor currently processing the backup copy fails, selecting in turn, according to the determined order, the next candidate second processor to execute the backup copy.

6. A cloud computing fault-tolerance device, characterized by including:

a selecting module, configured to select, while the primary copy of a cloud computing task is running on a first processor, multiple second processors to process the backup copy of the cloud computing task;

a determining module, configured to determine, according to the processing efficiency of the backup copy on each of the multiple second processors, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

7. The device according to claim 6, characterized in that the selecting module includes:

an acquiring unit, configured to obtain the loads of all spare processors;

a determination unit, configured to select, from all the spare processors, the multiple processors with the smallest loads as the second processors.

8. The device according to claim 6, characterized in that the determining module includes:

a pre-processing unit, configured to pre-process the backup copy on the multiple second processors;

a computing unit, configured to calculate the completion time of the pre-processing of the backup copy on each of the multiple second processors;

a sorting unit, configured to sort the completion times in ascending order and determine, according to the sorting result, the order in which the second processors process the backup copy when fault tolerance is performed for the primary copy.

9. The device according to claim 6, characterized in that the primary copy includes one or more sub-primary copies and the backup copy includes one or more sub-backup copies, where the sub-primary copies correspond one-to-one to the sub-backup copies, different sub-primary copies execute on different processors, and different sub-backup copies execute on different processors.

10. The device according to claim 6, characterized in that the device further includes:

a first execution module, configured to execute, after the determining module has determined the order of the second processors according to the processing efficiency of the backup copy on each of the multiple second processors, the backup copy on the second processors in the determined order when the primary copy fails on the first processor;

a second execution module, configured to select in turn, according to the determined order, the next candidate second processor to execute the backup copy when the second processor currently processing the backup copy fails.