CN1744554A

CN1744554A - Expandable dynamic fault-tolerant method for cooperative system

Info

Publication number: CN1744554A
Application number: CN 200510019586
Authority: CN
Inventors: 金海; 王玎; 李胜利; 袁平鹏; 李昌清; 孙盛; 黎时才; 邝坪; 战治国; 王辉
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2005-10-13
Filing date: 2005-10-13
Publication date: 2006-03-08
Anticipated expiration: 2025-10-13
Also published as: CN100341298C

Abstract

The method improves on the primary-backup mode. The main service node receives user requests and processes them, and backup nodes are allocated dynamically according to the redundancy of each task. The backup process on a backup service node communicates periodically with the basic service process on the main service node to stay synchronized with it. When the basic service process fails, one of the backup service processes is selected as the new basic service process. Any service node can act as both a main service node and a backup node, so that system resources are used to the fullest. The invention reduces the time backup processes spend in the 'idle' state. It changes the service redundancy according to node capability and load state, which improves service efficiency and load balance.

Description

Extensible dynamic fault-tolerant method in a cooperative system

Technical field

The invention belongs to the field of computer supported cooperative work (Computer Supported Cooperative Work, CSCW) and is a novel extensible dynamic fault-tolerant method.

Background technology

Computer supported cooperative work uses computer and network technology to support multiple users who accomplish a task jointly through coordination and cooperation. With the rapid development of network technology, CSCW provides people who are dispersed in time and space with a "face-to-face", "what you see is what I see" cooperative working environment; it supports cooperation among group members who are separated in time, distributed in space, and dependent on one another in their work. Representative CSCW systems developed in China and abroad in recent years include: the multimedia collaborative work system developed by the Stanford Research Institute (SRI) (The Collaborative Environment for Concurrent Engineering Design, CECED); the SHASTRA collaborative multimedia design system developed by Purdue University; the BERKOM multimedia collaboration server jointly developed by the IBM European Networking Center, DEC and others; and other CSCW systems such as the "Capital Intervisibility" video conference system and the 3C (CAD/CAPP/CAM) intelligent negotiation system. Cooperative systems are now widely used to support the collaborative work of user groups. Because a collaborative work support system involves cooperation among a group of users, any system failure prevents the group collaboration from proceeding or causes its results to be lost, which greatly affects the efficiency and results of the collaboration. Therefore, in cooperative system design, how to guarantee the high reliability and performance of the service nodes is a key factor in users' confidence in the system.

Achieving high reliability requires not only hardware designed for high reliability but also a good fault-tolerance mechanism. Practice has shown that fault-tolerant design is highly effective in improving the reliability of computer application systems. Traditional fault-tolerant design generally adopts service redundancy: the service process on a service node (called the basic service process) is replicated into several copies (called backup service processes) that run on different nodes. According to how the redundant services operate, the approaches can be divided into active replication and primary-backup. In the active replication mode, the basic service process and all backup service processes receive and handle each client request at the same time and all return their results to the client, which selects among the returned results. Although the failure of a service process is transparent to the client in this mode, the communication overhead is large, and because system resources are limited and the redundant services all respond to requests simultaneously, the performance of the system as a whole is reduced. In the primary-backup mode, the basic service process receives and handles client requests, while the backup processes communicate with the basic service process periodically to stay synchronized with it. When the basic service process fails, one of the backup service processes is chosen as the new basic service process. Compared with active replication, this mode greatly reduces communication overhead, but the backup processes are "idle" while the basic process is working normally, which wastes system resources and leaves the system load unbalanced. When the number of collaborative tasks increases, the cooperative service node easily becomes a system bottleneck and performance drops. Both fault-tolerance approaches therefore have problems when applied in a cooperative environment. In addition, in a collaborative work support environment the number of collaborative tasks changes dynamically, which shows up as dynamic changes in service node performance, so the fault tolerance of service nodes needs to be dynamically extensible. As the analysis above shows, neither active replication nor primary-backup adapts well to the dynamic changes of a cooperative system.

Summary of the invention

The purpose of the present invention is to address the deficiencies of the prior art by providing an extensible dynamic fault-tolerant method for a cooperative system, one that gives the system good fault tolerance and load balancing.

The extensible dynamic fault-tolerant method in a cooperative system provided by the invention comprises the following steps:

(1) When a service request arrives, the service manager allocates a service node to it as follows:

(1.1) Determine whether the maximum node number RN on the ring is zero. If RN = 0, build an r-element basic service ring, set the token number to 0, and go to step (1.2); otherwise go directly to step (1.2);

(1.2) Assign cooperative task t to the service node N(i) that holds the token, increase its load by 1, and add task t to the task set of the token-holding node;

(1.3) Determine whether the token number is greater than the maximum node number. If so, add the token-holding node to the service ring, set RN = RN + 1, and go to step (1.4); otherwise go directly to step (1.4);

(1.4) Determine whether the redundancy r is greater than the maximum node number RN. If so, add new nodes from the secondary nodes to the service ring so that it expands into an r-element service ring, and go to step (1.5); otherwise go directly to step (1.5);

(1.5) Back up task t on the r-1 nodes of the service ring nearest to the service node holding the token;

(1.6) Determine whether the service ring has an effective node whose load is below the threshold. If not, pass the token to the next off-ring secondary node that is about to join the service ring, assign that node a node number equal to the maximum node number plus 1, and go to step (2). If so, pass the token to the next effective node on the ring whose load is below the threshold, and go to step (2);

(2) Service node N(i) executes the cooperative task. During execution, check whether the service ring contains a failed node N(i). If there is no failed node, go to step (3); otherwise the service manager reconstructs the service ring as follows:

(2.1) Determine whether the load of failed node N(i) is zero. If so, go to step (2.4); otherwise go to step (2.2);

(2.2) Take a task t from the task set of failed node N(i) and check whether the logical ring of task t contains an effective node N(j) whose load is below the load threshold. If not, take a node from the secondary nodes to replace node N(i) and go to step (3); otherwise go to step (2.3);

(2.3) Use node N(j) as the substitute for N(i): add task t to the task set of node N(j), decrease the load of node N(i) by 1, and return to step (2.1);

(2.4) Delete node N(i), and decrease by 1 every node number greater than i;

(2.5) Determine whether the load of the token-holding node equals the threshold. If not, go to step (3); otherwise go to step (2.6);

(2.6) Determine whether the service ring has an effective node whose load is below the threshold. If so, pass the token to the next node whose load is below the threshold and go to step (3); otherwise pass the token to an off-ring secondary node, assign it a node number equal to the maximum node number plus 1, and go to step (3);

(3) Determine whether task t has finished. If not, return to step (2); otherwise perform the following steps:

(3.1) Delete task t from the task set of node N(i), delete the logical ring T_C corresponding to task t in node N(i), and decrease the load of node N(i) by 1;

(3.2) Determine whether the token number is greater than the maximum node number. If so, pass the token to node N(i) and finish; otherwise finish directly.

In step (1.1), the r-element basic service ring is built as follows: according to the service redundancy r of task t, take r service nodes from the secondary nodes and number them 0, 1, ..., r-1, denoted N(0), N(1), ..., N(r-1); then assign node N(0) to the cooperative task as its main service node, with the other nodes as backup nodes of N(0); finally connect the r nodes N(0), N(1), ..., N(r-1) into a ring, forming the r-element basic service ring {N(0), N(1), ..., N(r-1)}.

In step (1.4), the service ring is extended as follows: first add r-RN new nodes, numbered N(r), N(r+1), ..., N(RN-1), and then rebuild the service ring in the same way the basic service ring is built.
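The two construction rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names `build_basic_ring` and `extend_ring` and the representation of the ring as a list (position k stands for node N(k), with ring connections implied by modular indexing) are assumptions. It also assumes that RN counts the nodes currently on the ring and that new nodes continue the numbering after the existing ones, which is how the worked example later in the document numbers them.

```python
from typing import List


def build_basic_ring(secondary: List[str], r: int) -> List[str]:
    """Take r nodes from the secondary pool; position k in the result is N(k)."""
    if r > len(secondary):
        raise ValueError("not enough secondary nodes for redundancy r")
    # N(0) is the main service node, N(1)..N(r-1) are its backups.
    return [secondary.pop(0) for _ in range(r)]


def extend_ring(ring: List[str], secondary: List[str], r: int) -> List[str]:
    """Grow the ring to r nodes by moving r - RN nodes out of the pool."""
    rn = len(ring)  # assumption: RN is the number of nodes now on the ring
    if r <= rn:
        return ring
    if r - rn > len(secondary):
        raise ValueError("not enough secondary nodes to extend the ring")
    # Assumption: new nodes continue the numbering after the existing ones.
    return ring + [secondary.pop(0) for _ in range(r - rn)]


pool = [f"server{n}" for n in range(1, 11)]
ring = build_basic_ring(pool, 4)
ring = extend_ring(ring, pool, 7)
print(len(ring), len(pool))  # 7 nodes on the ring, 3 left in the pool
```

Keeping the ring as an ordered list means that rebuilding it after an extension needs no extra bookkeeping: the neighbors of each node are always the adjacent positions modulo the list length.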

The present invention is an improved structure based on the primary-backup mode. The main service node receives and handles client requests, and backup nodes are allocated dynamically according to each task's redundancy, unlike the primary-backup structure, which requires every task to have the same backup nodes. The backup process on a backup service node communicates periodically with the basic service process on the main service node to stay synchronized with it. When the basic service process fails, one of the backup service processes is chosen as the new basic service process. At the same time, any service node can serve as both a main service node and a backup node for system tasks, so that system resources are used to the fullest. Compared with the primary-backup mode, this greatly reduces the time backup processes spend "idle" while the basic process is working normally, and uses system resources effectively. According to node performance and the load state of the cooperative system, the invention can dynamically set the load threshold of a service node, that is, the number of collaborative tasks it may carry, and can change the service redundancy; this not only improves service efficiency but also achieves load balance in a simple and effective way. Specifically, the present invention has the following main features:

(1) Dynamic: the invention uses the system's load information to define the load threshold of each service node dynamically. The load threshold can be determined by the size of the system load and by the number and performance of the service nodes the system can provide, and the service redundancy can be changed, which improves system reliability and the efficiency of message service without occupying excessive system resources.

(2) Extensible: the state replication strategy of the service nodes is separated from the communication protocol and does not involve the underlying communication mechanism, so it extends well.

(3) Fine-grained load balancing: the invention uses load-based scheduling, which can accurately locate the most lightly loaded node and achieves very good load balancing. A service node provides service for cooperative tasks while also acting as a backup node, which avoids wasted resources and the system "bottleneck" caused by a single node providing service.

(4) Transparent to users: the fault-tolerance mechanism is completely transparent to users; faults are handled promptly, the system recovers quickly, and the overhead is small.

(5) Good cost-performance: compared with dedicated high-availability servers, a service system using the invention has better fault tolerance and greater computing capability, and the system is economical and easy to put into practice.

Description of drawings

Fig. 1 is a schematic flow chart of the present invention;

Fig. 2 is a flow chart of service node allocation;

Fig. 3 is a structural diagram of the k-element service ring of the present invention;

Fig. 4 is a flow chart of service ring reconstruction after a node failure;

Fig. 5 is a flow chart of cooperative task deletion.

Embodiment

The present invention is explained in further detail below with reference to the accompanying drawings and an example.

As shown in Fig. 1, the present invention comprises the following steps:

(1) When a service request arrives, the service manager allocates a service node to it, and the service request becomes a cooperative task;

(2) The cooperative task is executed. During execution, the service ring is checked periodically for failed nodes. If there is a failed node, the service manager issues an instruction to reconstruct the service ring. If there is no failed node, step (3) is executed;

(3) Check whether the task has finished. If it has, the service node deletes the finished task and the procedure ends; otherwise return to step (2).
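The control flow of these three steps can be sketched as a simple loop. This is a minimal sketch under assumptions: the function `run_cooperative_task` and its five callback parameters are hypothetical names standing in for the operations the text describes, and the `max_checks` bound is added only so the sketch always terminates.

```python
from typing import Callable, Iterable


def run_cooperative_task(
    allocate: Callable[[], int],
    failed_nodes: Callable[[], Iterable[int]],
    reconstruct: Callable[[int], None],
    is_finished: Callable[[], bool],
    delete: Callable[[int], None],
    max_checks: int = 1000,
) -> bool:
    """Drive one task through steps (1) to (3); return True once it is deleted."""
    node = allocate()                  # step (1): allocate a service node
    for _ in range(max_checks):
        for failed in failed_nodes():  # step (2): periodic failure check
            reconstruct(failed)        # rebuild the ring around each failure
        if is_finished():              # step (3): has the task finished?
            delete(node)               # delete the finished task
            return True
    return False                       # bound reached without completion
```

Passing the operations in as callbacks keeps the loop independent of how allocation, reconstruction and deletion are actually carried out.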

For different service requests, the system can select a suitable load threshold for each node according to the node's performance and set the service redundancy r of the task. The load threshold of each service node, the number of cooperative tasks and their redundancy r together determine the number of nodes on the service ring. The nodes on the ring serve cooperative tasks in parallel, which both improves message-handling capacity and uses system resources effectively. When some nodes on the ring fail, the service ring structure ensures that effective service nodes can take over the work of the failed ones at any time and reconstruct the service ring automatically, providing cooperative tasks with continuous and reliable message service and guaranteeing that service is "never interrupted".

To better explain how the service ring structure works, we call the node whose node number equals the token number the token-holding node. The working principle and flow of the three steps above are described in detail below:

(1) Service node allocation:

The service redundancy r of the task is set for each service request, and a service node is then allocated to it. As shown in Fig. 2, the concrete steps are as follows:

(1.1) Determine whether the service ring exists, that is, whether the maximum node number RN on the ring is zero. If it is zero, the service ring does not exist; build an r-element basic service ring and set the token number to 0. If the service ring exists, go to step (1.2);

The r-element service ring is built as follows:

First, according to the service redundancy r of task t, take r service nodes from the secondary nodes and number them 0, 1, ..., r-1, denoted N(0), N(1), ..., N(r-1); then assign node N(0) to the cooperative task as its main service node, with the other nodes as backup nodes of N(0); finally connect the r nodes N(0), N(1), ..., N(r-1) into a ring. The connections of the r-element basic service ring {N(0), N(1), ..., N(r-1)} have the following features:

(a) When the number of service nodes is r (r is a positive integer, r ≥ 2) and the maximum service node number is RN = r-1, N(i) is connected to N(j) and N(l), where j = (i-1) mod r, l = (i+1) mod r, and j, l ≥ 0 are integers.

(b) The distance between any two nodes N(i) and N(j) in the r-element service ring is d(i, j) = (i-j) mod r. The composition of the r-element service ring is shown in Fig. 3.

(1.2) Assign cooperative task t to the service node that holds the token, increase its load by 1, and add task t to the task set of the token-holding node;

(1.3) Determine whether the token number is greater than the maximum node number. If so, the allocated node lies outside the service ring; add the token-holding node to the service ring and increase the maximum node number by 1;

(1.4) Determine whether the redundancy r is greater than the maximum node number RN. If so, the number of nodes on the service ring does not satisfy the redundancy requirement of task t, and new nodes must be added from the secondary nodes so that the ring expands into an r-element service ring. The service ring is extended as follows:

First add r-RN new nodes, numbered N(r), N(r+1), ..., N(RN-1), and then rebuild the service ring in the way described above;

(1.5) Back up task t on the r-1 nodes of the service ring nearest to the service node holding the token:

If the service ring is a k-element service ring, then when a task t with redundancy r is assigned to main service node N(i), the r-1 nodes on the service ring nearest to the main service node are chosen, nearest first, as the backup nodes of task t. The backup node numbers are (i-m) mod k and (i+n) mod k, wherein:

After the backup is finished, the token is passed to the next node whose load is below the threshold;

(1.6) Determine whether the service ring has an effective node whose load is below the threshold. If not, pass the token to the next off-ring secondary node that is about to join the service ring and assign it a node number equal to the maximum node number plus 1; if so, pass the token to the next effective node on the ring whose load is below the threshold.

Once the service nodes have been allocated, the logical ring T_C of task t is the set consisting of its main service node and its backup service nodes. The system uses token passing to assign cooperative tasks to lightly loaded service nodes on the service ring, which keeps the load of the nodes on the ring balanced. At the same time, the service ring adapts to dynamic changes in the number and redundancy of cooperative tasks and therefore extends well.
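Steps (1.1) to (1.6) can be put together in a short sketch. Everything named here (`Node`, `ServiceManager`, `nearest`, `pending`) is hypothetical and chosen for illustration. The sketch assumes that a node's position in the `ring` list is its node number, that the token holder is off-ring exactly when the token number equals the ring length, and that "nearest first" alternates between the successor and predecessor sides, since the ranges of m and n are not given in the text. If every ring node is full and no secondary node is left, the sketch simply leaves the token where it is.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    name: str
    threshold: int
    load: int = 0
    alive: bool = True
    tasks: Set[str] = field(default_factory=set)


def nearest(i: int, k: int, count: int) -> List[int]:
    """Pick `count` node numbers around N(i) on a k-node ring, nearest first."""
    chosen: List[int] = []
    for step in range(1, k):
        for cand in ((i + step) % k, (i - step) % k):
            if cand != i and cand not in chosen and len(chosen) < count:
                chosen.append(cand)
    return chosen


class ServiceManager:
    def __init__(self, secondary: List[Node]) -> None:
        self.secondary = secondary            # off-ring secondary nodes
        self.ring: List[Node] = []            # position k holds node N(k)
        self.token = 0                        # token number
        self.pending: Optional[Node] = None   # off-ring node holding the token
        self.logical_ring: Dict[str, List[int]] = {}

    def allocate(self, task: str, r: int) -> int:
        """Allocate `task` with redundancy r; return its main node number."""
        if not self.ring:                                     # (1.1)
            self.ring = [self.secondary.pop(0) for _ in range(r)]
            self.token = 0
        on_ring = self.token < len(self.ring)
        holder = self.ring[self.token] if on_ring else self.pending
        holder.load += 1                                      # (1.2)
        holder.tasks.add(task)
        if not on_ring:                                       # (1.3)
            self.ring.append(holder)
            self.pending = None
        while len(self.ring) < r:                             # (1.4)
            self.ring.append(self.secondary.pop(0))
        main = self.token
        backups = nearest(main, len(self.ring), r - 1)        # (1.5)
        self.logical_ring[task] = [main] + backups
        self._pass_token()                                    # (1.6)
        return main

    def _pass_token(self) -> None:
        k = len(self.ring)
        for step in range(1, k + 1):
            idx = (self.token + step) % k
            node = self.ring[idx]
            if node.alive and node.load < node.threshold:
                self.token = idx
                return
        if self.secondary:  # every ring node is full: hand the token off-ring
            self.pending = self.secondary.pop(0)
            self.token = k
```

With ten nodes whose thresholds are 2, 2, 5, 5, 5, 8, 8, 8, 8, 8, calling `allocate("t1", 4)`, `allocate("t2", 7)` and `allocate("t3", 3)` in turn places the three tasks on N(0), N(1) and N(2) and grows the ring from 4 to 7 nodes.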

(2) Service ring reconstruction:

After service node N(i) fails, substitute service nodes are first found for the tasks on N(i) and the service ring is reconstructed. The token-holding node is then checked to see whether its load is still below the threshold; if not, the token must be passed to the next effective node whose load is below the threshold.

The service ring reconstruction NodeFailure(N(i)) after node N(i) fails is shown in Fig. 4. The concrete steps are as follows:

(2.1) Determine whether the load of failed node N(i) is zero. If so, go to step (2.4); otherwise continue to the next step;

(2.2) Take a task t from the task set of node N(i) and check whether the logical ring of task t contains an effective node N(j) whose load is below the load threshold. If not, take a node from the secondary nodes to replace node N(i) and finish; otherwise continue to the next step;

(2.3) Use node N(j) as the substitute for N(i): add task t to the task set of node N(j), decrease the load of node N(i) by 1, and return to step (2.1);

(2.4) Delete node N(i), and decrease by 1 every node number greater than i;

(2.5) Determine whether the load of the token-holding node equals the threshold. If not, finish; otherwise continue to the next step;

(2.6) Determine whether the service ring has an effective node whose load is below the threshold. If so, pass the token to the next node whose load is below the threshold; otherwise every node on the ring has reached its maximum load, so pass the token to an off-ring secondary node and assign it a node number equal to the maximum node number plus 1.

Suppose node N(i) on the service ring fails. A substitute node is then sought for each cooperative task on N(i), that is, for every task in the cooperative task set of N(i). For a task t in the task set whose logical ring is T_C, after the main service node of cooperative task t fails, the effective node in T_C that is nearest to the failed node N(i) and whose load is below the load threshold is chosen as its substitute. If such a node exists, the failed node is deleted and the service ring is reconstructed; otherwise a node is taken from the secondary nodes as the substitute for N(i) and the service ring is reconstructed.
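The task-reassignment part of the reconstruction, steps (2.1) to (2.4), can be sketched as follows; the token check of steps (2.5) and (2.6) is left out. The names `Node` and `node_failure` are hypothetical, and logical rings are stored here as lists of node names so that deleting a node does not require renumbering them. Three points are assumptions not stated in the text: the substitute's load is increased by 1 when it takes over a task, a replacement secondary node inherits all remaining tasks of the failed node, and the failed node is also removed from every logical ring when it is deleted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    name: str
    threshold: int
    load: int = 0
    alive: bool = True
    tasks: Set[str] = field(default_factory=set)


def node_failure(ring: List[Node], logical_ring: Dict[str, List[str]],
                 secondary: List[Node], i: int) -> None:
    """Reassign the tasks of failed node N(i), then drop it from the ring."""
    failed = ring[i]
    failed.alive = False
    k = len(ring)

    def hops(n: Node) -> int:
        # Shortest number of ring hops between node n and the failed node.
        d = (ring.index(n) - i) % k
        return min(d, k - d)

    while failed.load > 0:                                    # (2.1)
        task = min(failed.tasks)                              # (2.2)
        candidates = [n for n in ring
                      if n.name in logical_ring[task]
                      and n.alive and n.load < n.threshold]
        if not candidates:
            # No usable member of T_C: a secondary node takes the place of N(i).
            spare = secondary.pop(0)
            spare.tasks, spare.load = failed.tasks, failed.load
            ring[i] = spare
            for names in logical_ring.values():
                names[:] = [spare.name if x == failed.name else x for x in names]
            return
        sub = min(candidates, key=hops)                       # (2.3)
        failed.tasks.remove(task)
        failed.load -= 1
        sub.tasks.add(task)
        sub.load += 1  # assumption: the text only says the load of N(i) drops
    ring.pop(i)                                               # (2.4)
    for names in logical_ring.values():
        if failed.name in names:
            names.remove(failed.name)
```

Because node numbers are list positions, removing the failed node with `ring.pop(i)` lowers the number of every later node by 1, which is the renumbering that step (2.4) asks for.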

(3) Cooperative task deletion:

The load of a service node is the number of cooperative tasks on that node, and it changes dynamically as cooperative tasks are created and deleted. The tasks on a node reflect the current state of the cooperative tasks on that service node, so when a task t with redundancy r on node N(i) ends, exits or is deleted, the load on the node changes accordingly.

The deletion TaskDeleting(t) of a cooperative task t with redundancy r on node N(i) is shown in the flow chart of Fig. 5. The concrete steps are as follows:

(3.1) Determine whether to delete task t of node N(i). If so, delete task t from the task set of node N(i); otherwise finish;

(3.2) Delete the logical ring T_C corresponding to task t in node N(i);

(3.3) Decrease the load of node N(i) by 1;

(3.4) Determine whether the token number is greater than the maximum node number. If so, the token-holding node lies outside the service ring, and the token is passed to node N(i).
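The four deletion steps map onto a few lines of code. In this sketch the function name `delete_task` and its data layout are assumptions: `tasks[i]` is the task set of N(i), `load[i]` its load, and node numbers start at 0, so "token number greater than the maximum node number" is written as `token >= len(tasks)`.

```python
from typing import Dict, List, Set


def delete_task(tasks: List[Set[str]], load: List[int],
                logical_ring: Dict[str, List[int]],
                token: int, i: int, t: str) -> int:
    """Delete task t from node N(i) and return the resulting token number."""
    if t not in tasks[i]:           # (3.1) nothing to delete on this node
        return token
    tasks[i].remove(t)              # (3.1) remove t from the task set
    logical_ring.pop(t, None)       # (3.2) drop the logical ring T_C of t
    load[i] -= 1                    # (3.3) N(i) now carries one task fewer
    if token >= len(tasks):         # (3.4) the token is held off-ring
        token = i                   # hand it back to N(i), which has room again
    return token
```

Returning the token number instead of mutating shared state keeps the sketch free of global variables; a caller would store the result back as the current token number.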

Example

Using the fault-tolerance method of the present invention, 10 physical servers were set up in the laboratory; all of these nodes can provide services such as service node allocation, cooperative task deletion and service ring reconstruction. The hardware configuration of the 10 physical servers and the load thresholds set according to machine performance are as follows:

Machine name	CPU	Memory	Hard disk	Load threshold
Server 1-2	PIII 550 MHz	256 MB	10.2 GB	2
Server 3-5	P4 1.4 GHz	256 MB	40 GB	5
Server 6-10	P4 1.8 GHz	512 MB	60 GB	8

When the first cooperative task t₁, with redundancy 4, is created, the system builds the basic service ring: 4 service nodes are taken from the 10 servers and denoted N(0), N(1), N(2), N(3). They are connected into a service ring as follows: node N(0) is connected to nodes N(3) and N(1); node N(1) to nodes N(0) and N(2); node N(2) to nodes N(1) and N(3); node N(3) to nodes N(2) and N(0). N(0) is the main service node of task t₁, and N(1), N(2) and N(3) are its backup service nodes. t₁ becomes a task on node N(0), with logical ring T_C = {N(0), N(1), N(2), N(3)}. Creating the basic ring takes 5 milliseconds.
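The four connections listed for this ring follow from the neighbor rule in feature (a) of the description, j = (i-1) mod r and l = (i+1) mod r, with r = 4. A small check, with the hypothetical function names `ring_neighbors` and `ring_distance`, might look like this; the distance function reproduces d(i, j) = (i-j) mod r from feature (b) exactly as written.

```python
from typing import Tuple


def ring_neighbors(i: int, r: int) -> Tuple[int, int]:
    """Return (j, l), the numbers of the two nodes that N(i) is connected to."""
    if r < 2:
        raise ValueError("the description requires r >= 2")
    return (i - 1) % r, (i + 1) % r


def ring_distance(i: int, j: int, r: int) -> int:
    """d(i, j) = (i - j) mod r, as given in feature (b)."""
    return (i - j) % r


for i in range(4):
    # Prints (3, 1), (0, 2), (1, 3), (2, 0): the same pairs as listed above.
    print(i, ring_neighbors(i, 4))
```

Python's `%` operator returns a non-negative result for a positive modulus, so the condition j, l ≥ 0 holds without extra handling.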

When the second cooperative task t₂, with redundancy 7, is created, the service ring has only 4 nodes and does not satisfy the redundancy requirement of cooperative task t₂, so 3 more nodes are taken from the secondary nodes and added to the service ring, denoted N(4), N(5), N(6). The service ring expands from a 4-element ring to a 7-element ring. Node N(1) is then assigned as the main service node of task t₂, with backup nodes N(0), N(2), N(3), N(4), N(5), N(6). t₂ becomes a task on node N(1), with logical ring T_C = {N(1), N(0), N(2), N(3), N(4), N(5), N(6)}.

When the third cooperative task t₃, with redundancy 3, is created, node N(2) is assigned as the main service node of task t₃, with backup nodes N(1) and N(3). t₃ becomes a task on node N(2), with logical ring T_C = {N(1), N(2), N(3)}.

In the same way, the system creates cooperative tasks t₄, t₅, ... in turn, assigning a lightly loaded node on the ring as the main service node of each and determining the task's logical ring. When the loads of all nodes on the ring reach their load thresholds, a new node is added to the service ring to guarantee service performance.
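The backup nodes chosen for the first three tasks in this example can be reproduced with a short nearest-first selection. The function name `nearest_backups` is hypothetical, and the rule of taking the successor before the predecessor at each distance is an assumption; the text fixes only the "nearest first" principle and the forms (i-m) mod k and (i+n) mod k.

```python
from typing import List


def nearest_backups(i: int, k: int, r: int) -> List[int]:
    """Return the r-1 backup node numbers for main node N(i) on a k-node ring."""
    if not 1 <= r <= k:
        raise ValueError("redundancy r must be between 1 and the ring size k")
    chosen: List[int] = []
    step = 1
    while len(chosen) < r - 1:
        # Assumption: at each distance, try the successor side first.
        for cand in ((i + step) % k, (i - step) % k):
            if cand not in chosen and len(chosen) < r - 1:
                chosen.append(cand)
        step += 1
    return sorted(chosen)


print(nearest_backups(0, 4, 4))  # task 1: [1, 2, 3]
print(nearest_backups(1, 7, 7))  # task 2: [0, 2, 3, 4, 5, 6]
print(nearest_backups(2, 7, 3))  # task 3: [1, 3]
```

The three printed lists match the backup nodes given in the example for the tasks with redundancy 4, 7 and 3.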

Repeated tests show that with the extensible service ring of the fault-tolerance method of the present invention, the service ring continues to work normally after a node fails, because the tasks on the failed node are redistributed to effective nodes and the service ring is reconstructed, so service requests in progress are not affected.

Claims

1. An extensible dynamic fault-tolerant method in a cooperative system, comprising the following steps:

2. The method according to claim 1, characterized in that the r-element basic service ring in step (1.1) is built as follows: according to the service redundancy r of task t, take r service nodes from the secondary nodes and number them 0, 1, ..., r-1, denoted N(0), N(1), ..., N(r-1); then assign node N(0) to the cooperative task as its main service node, with the other nodes as backup nodes of N(0); finally connect the r nodes N(0), N(1), ..., N(r-1) into a ring, forming the r-element basic service ring {N(0), N(1), ..., N(r-1)}.

3. The method according to claim 2, characterized in that the service ring in step (1.4) is extended as follows: first add r-RN new nodes, numbered N(r), N(r+1), ..., N(RN-1), and then rebuild the service ring in the same way the basic service ring is built.