WO1997022054A2 - Processor redundancy in a distributed system - Google Patents
- Publication number
- WO1997022054A2 (PCT/SE1996/001609)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- processor
- processors
- software
- catastrophe
- software objects
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Failover techniques
- G06F11/203—Failover techniques using migration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2035—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/22—Arrangements for supervision, monitoring or testing
- H04M3/24—Arrangements for supervision, monitoring or testing with provision for checking the normal operation
- H04M3/241—Arrangements for supervision, monitoring or testing with provision for checking the normal operation for stored program controlled exchanges
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q3/00—Selecting arrangements
- H04Q3/42—Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker
- H04Q3/54—Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker in which the logic circuitry controlling the exchange is centralised
- H04Q3/545—Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker in which the logic circuitry controlling the exchange is centralised using a stored programme
- H04Q3/54575—Software application
- H04Q3/54591—Supervision, e.g. fault localisation, traffic measurements, avoiding errors, failure recovery, monitoring, statistical analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2048—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share neither address space nor persistent storage
Definitions
- the present invention relates to a distributed, fault-tolerant reconfigurable processor system in a telecommunication network.
- In public telecommunication systems there are several processors performing many different kinds of tasks, such as monitoring activity on subscriber equipment lines, set-up and release of connections, traffic control, system management and taxation. Groups of processors are interconnected by way of a network which is separate from the telecommunication network or forms a part thereof. In modern telecommunication networks there are network elements, such as an exchange, a data base or a processor, that are distributed on several physical elements of the physical network making up the telecommunication network. To an application, such as POTS (Plain Old Telephony Service), GSM (Global System for Mobile communication), VLL (virtual leased lines) or BISDN (Broadband Integrated Services Digital Network),
- POTS Plain Old Telephony Service
- GSM Global System for Mobile communication
- VLL virtual leased lines
- BISDN Broadband Integrated Services Digital Network
- To such an application, a distributed processor or a distributed data base looks like a single unit.
- the distributed units are said to be transparent from the point of view of distribution.
- a main requirement for processor based control systems of public telecommunication systems concerns system availability. By this it is meant that the system should be available to serve its users. In for example the AXE-10 telephone system only 2 hours of unavailability of the system during 40 years was allowed. Converted into minutes per year this corresponds to about 3 min/year. Modern telecommunication systems have much higher availability demands. Nevertheless it is also required of modern telecommunication systems to allow for planned maintenance work, which may have long duration, at long term intervals, for example at intervals in the order of about 1 month.
- U.S. Patent Serial No. 4 710 926 relates to a fault recovery method for a distributed processing system using spare processors that take over the functions of failed processors.
- a spare processor acts as a stand-in for one or more active processors. When a spare processor is put into service it will no longer serve as a spare processor for any other processor in the system. During fault-recovery all functions executing on the faulty processor are transferred to the spare processor. The old spare processor's function of being a spare processor is transferred to a second spare processor in the system.
- the system requires two or more spare processors.
- When a spare processor is inactive it does not perform any job tasks. When it becomes active it starts processing job tasks, provided it is operative, i.e. is not impaired by any faults.
- While not described in said patent, it will be necessary to run test programs to verify that the spare processors are operative.
- the processing system uses spare sub-system elements that do not participate in the overall processing tasks.
- the method used for reconfiguration when a faulty element is detected and replaced by a spare element is one that uses distinct socket addresses for each element in the system.
- a socket address is assigned a virtual address which replaces the socket address when a faulty condition is detected.
- Molesky has found, as a side effect of the cache coherency protocol, that data can be recovered by providing a database log which is used to roll back the transactions associated with the failed processor's cache memory only. The transactions performed by the rest of the processors may continue and will not corrupt the data of the database.
- the problem addressed by Molesky is in no way related to reconfiguration of processor systems. A database log is not like a catastrophe plan.
- Deplance thus starts the reallocation process at the time a processor goes down and finishes it before the deadline expires.
- Deplance indicates that there are methods for computing task allocations off-line, but such methods are complex, require much processor work and produce allocation tables that are very large. This is so because the number of conceivable combinations of tasks and processors is very large even for moderately sized processor systems. Deplance is thus warning against the use of such off-line algorithms.
- the inventor of the present invention has realized this problem and his contribution to the art is to provide reallocation tables, not for all possible configurations of processors and tasks, but for one configuration only.
- One object of the invention is to provide a method for automatically recovering from multiple permanent failures of processors in a distributed processor system which is used in an application environment of a telecommunication system having high demands on availability while simultaneously allowing system maintenance, planned or unplanned.
- Another object of the invention is to utilize available processing resources while allowing for a heterogeneous processing environment due to evolving technology in a system that grows over time and due to particular needs of different parts of an application that runs on the processor system.
- Another object of the invention is to provide a method for quickly recovering from multiple permanent failures of processors in a distributed processor system used in a telecom system's environment by providing an initial configuration of all processors and by providing, for each processor in the system, a catastrophe plan to be used in case the corresponding processor goes down.
- a catastrophe plan is the means by which software objects installed on a faulty processor are distributed to generally several processors in the system, thus providing for load sharing among the processors.
- a further object of the invention is to have all of the catastrophe plans calculated and installed in memories associated with the processors so that they are available to the system instantaneously at the time a processor goes down.
- Still another object of the invention is to provide new catastrophe plans for the system of operating processors, some of which have installed thereon software objects from a failed processor, so as to prepare the system for a quick recovery should a further processor in the system go down.
- Still another object of the invention is to provide a method of the indicated kind which takes the system back to its initial configuration of processors and software objects when the system's faulty processor or processors, after repair or replacement, are inserted back into the system.
- Another object of the invention is to provide in a catastrophe plan associated with an individual processor an initial redistribution of software objects executing on said individual processor to other non-faulty processors prior to the finishing redistribution of software objects of a faulty processor, so as to free up memory for storage of a large software object which is running on the faulty processor and which in accordance with its catastrophe plan is to be transferred to said predefined processor; the memory which is freed up being the memory associated with said predefined processor.
- Another object of the invention is to include in the catastrophe plan redistribution of objects executing on non-faulty processors to other non-faulty processors, so as to free up processor resources, such as memory and CPU capacity, for large software objects which are running on the faulty processor and which in accordance with the faulty processor's catastrophe plan shall be transferred to the processors on which resources have been freed up.
- processor resources such as memory and CPU capacity
- An object of the invention is also to provide a software model that allows software objects to be transferred from a faulty processor to an operating processor by restarting the object on the operating processor.
- a software model will also allow for killing a software object installed on a processor and for restarting it on a repaired, previously faulty, processor which has been reinserted into the system. This latter objective is predominantly used when the system returns to its initial configuration and there are objects installed on operating processors, which objects should be given back to the repaired processors.
- a model of the telecommunication system comprising a hardware model of the control processors and the controlled hardware equipment as well as a software model that supports and fits into the hardware model of the telecommunication system.
- a first algorithm is used to calculate the catastrophe plans for each of the operating processors given either the initial configuration or any one of the actual configurations that will appear after a further processor has gone down.
- a second algorithm is used that, given an actual configuration, computes a delta configuration that, applied to the actual configuration, will give back the initial configuration of the system.
- Figure 1 is a block diagram showing a distributed processor system in an initial configuration
- Figure 2 is a block diagram showing the processor system of Figure 1 in a first actual configuration after failure of one processor
- Figure 3 is a block diagram of the processor system of Figure 1 in a second actual configuration after failure of two processors
- Figure 4 is a flow diagram of the method in accordance with the invention.
- Figure 5 is a block diagram of a distributed processor system some of the processors of which are controlling hardware equipment
- Figure 6A is a schematic view of a modularized software object
- Figures 6B-6D are block diagrams of three different types of software objects
- Figure 7 is a block diagram illustrating how the hardware and software models in accordance with the invention fit together in one single model of the telecommunication system in accordance with the invention
- Figure 8 is a block diagram showing the hardware model in accordance with the invention.
- Figure 9 is a block diagram of the distributed processor system showing a preparatory redistribution of software objects.
- In Figure 1 there is shown a number of distributed processors P1, P2, P3 and P4 which communicate over a network N1.
- the processors form part of a telecommunication network (not shown).
- the network N1 may form part of said telecommunication network.
- Each processor comprises a processor unit PU and memory M.
- Software objects 1, 2, 3... 18 are installed on the processors; objects 1, 2, 3 on processor P1, objects 4-7 on P2, objects 8-12 on P3 and objects 13-18 on P4.
- the software of an application that runs in the telecommunication network comprises software objects (Figs. 6B-D), which are contained in software modules (Fig. 6A).
- the modularized software objects are allocation independent objects that can be transferred freely between the processors.
- a modularized software object is independent of other modularized software objects.
- a software object typically comprises a process and persistent data. Persistent data is data that survives a restart of the software object.
- Software objects can communicate with each other.
- a task which is required by an application typically involves several software objects on different processors, and is executed by some or all of the processes of these objects. The actual distribution of software objects on different processors is unknown to the application.
- Modularized persistent data can be stored in a data base.
- the data base is also distributed over the memories M of several processors, preferably the memories of all of the processors P1, P2, P3 and P4.
- These data base partitions are labeled DB1, DB2, DB3 and DB4 and each comprises a random access memory (RAM).
- RAM random access memory
- a novel and preferred alternative is, however, to store a mirror copy of each modularized software object in a data base partition of another processor than the one on which said object is installed.
- each modularized software object is stored in the database partition on the processor given by the catastrophe plan for the processor on which the modularized software object, the original, is executing.
- copies of the modularized persistent data will thus remain safely stored on another processor even if the processor on which the original is installed crashes.
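The mirror-placement rule just described can be sketched as follows; the plan contents and the helper name `mirror_site` are illustrative assumptions, not the patent's notation.

```python
# Sketch: the mirror copy of an object's persistent data is stored in
# the data base partition of the processor that the owner's
# catastrophe plan designates as the object's target.  The plan
# contents below are illustrative assumptions.
catastrophe_plans = {1: {"obj_a": 2, "obj_b": 3}}  # if P1 fails: a -> P2, b -> P3

def mirror_site(owner, obj, plans):
    """Processor holding the mirror copy of `obj` installed on `owner`."""
    return plans[owner][obj]
```

This way the mirror copy already sits on the processor that will take over the object, so no extra data transfer is needed at recovery time.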
- the initial configuration must not disappear if any processor goes down. For this reason the initial configuration and the mirror copy thereof are stored as described above. Instead of implementing the initial configuration in the form of a table it can be implemented in so-called tuples. As an example, tuples (1,1), (1,2), (1,3) would correspond to the information given by the first row of Table 1.
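The tuple representation can be sketched as a set of (processor, object) pairs; the contents below mirror the example rows discussed in the text (objects 1-3 on P1) and are otherwise illustrative.

```python
# Sketch of a tuple-encoded configuration: a tuple (processor, object)
# records that the object is installed on that processor.
initial_configuration = {
    (1, 1), (1, 2), (1, 3),          # objects 1-3 installed on P1
    (2, 4), (2, 5), (2, 6), (2, 7),  # objects 4-7 installed on P2
}

def objects_on(config, processor):
    """List the objects installed on a given processor."""
    return sorted(obj for proc, obj in config if proc == processor)
```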
- a catastrophe plan contains directions regarding the processors to which the software objects of the faulty processor should be transferred.
- One catastrophe plan, shown in Table 2, indicates the processors to which the objects installed on processor P1 should be transferred in case processor P1 goes down.
- Another catastrophe plan contains information on where the objects installed on processor P2 should be transferred in case processor P2 goes down.
- The plan in Table 4 indicates the processors to which the objects installed on processor P1 should be transferred in case processor P1 goes down.
- new catastrophe plans must be established so that the system can quickly recover if another processor goes down.
- new catastrophe plans giving directions regarding the processors to which the software objects installed on a faulty processor should be transferred. Since one cannot foresee which one of the three operating processors P2-P4 that will go down, it will be necessary to create catastrophe plans for each one of the operating processors.
- Table 8 is the new catastrophe plan (CP-P2 1 ) for processor P2, Table 9 the new one (CP-P3') for processor P3 and Table 10 the new one (CP-P4') for the processor P .
- the new catastrophe plans and their mirror copies are stored in the above described manner.
- the system shall revert to the initial configuration. This can be done by killing all software objects in the actual configuration and then creating and starting all software objects anew on the processors of the system. In the preferred embodiment of the invention, however, only the objects transferred from the first processor, and which now execute on other processors, are killed at first and are then created and started on processor P1.
- a delta configuration table is created by subtracting the initial configuration from the actual configuration, excluding processor P1. By subtracting Table 1 from Table 6 the delta configuration shown in Table 7 is achieved. The row pertaining to the faulty processor P1 is not included in the subtraction.
- the delta configuration indicates that objects 1 and 3 at processor P2 and object 2 at processor P4 should be killed at the respective processors.
- they shall be created and started on the repaired processor P1. After said creation the system is running as it did in the initial configuration, and its recovery time was short.
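The delta-configuration step can be sketched with set subtraction on the tuple encoding; the tuples below are illustrative, not the patent's actual Tables 1 and 6.

```python
# Sketch of the delta-configuration computation: subtract the initial
# configuration from the actual one, excluding the faulty processor's
# rows.  The resulting entries are killed, then the objects are
# recreated on the repaired processor.
def delta_configuration(actual, initial, faulty):
    """(processor, object) entries to kill before recreating the
    corresponding objects on the repaired processor."""
    return {(p, o) for p, o in actual - initial if p != faulty}

# Illustrative data: after P1 failed, objects 1 and 3 migrated to P2
# and object 2 migrated to P4.
initial = {(1, 1), (1, 2), (1, 3), (2, 4), (4, 13)}
actual  = {(2, 1), (2, 3), (2, 4), (4, 2), (4, 13)}

to_kill = delta_configuration(actual, initial, faulty=1)
# Objects 1 and 3 at P2 and object 2 at P4 are killed; they are then
# created and started on the repaired processor P1.
```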
- Processor P1 goes down and then processor P2 goes down.
- the system is running with the same configuration as shown in Figure 1; catastrophe plans have been created for each one of the processors P1-P4; processor P1 crashes; the software objects installed on processor P1 are transferred to operating processors following the catastrophe plan of Table 2; the system recovers and is up and running; new catastrophe plans are created for processors P2, P3 and P4; and processor P1 is removed and brought to repair.
- processor P2 goes down.
- the new catastrophe plan associated with processor P2, i.e. the new catastrophe plan of Table 8, should be followed.
- According to this plan, objects 1, 3 and 4 should be transferred to processor P3 and objects 5-7 should be transferred to processor P4.
- the software objects on processor P2 are removed and are transferred to the processors P3 and P4.
- the system will now be up and running and will have a configuration of the kind shown in Figure 3 and Table 11.
- it will now be necessary to work out catastrophe plans for each one of the processors P3 and P4.
- processor system comprising four processors has been described.
- the inventive method is equally well applicable on processor systems that comprise two, three, five or more processors.
- a processor system tolerating two faulty processors was described.
- the inventive method is equally well applicable on processor systems that tolerate three or more faulty processors.
- the last example illustrates that a four processor system can operate with 50% of its processors faulty. The application will still run, but it will have a degraded performance. If the processor system is a switch in a local office, telephone traffic will still be running and congestion will start at a low traffic volume. This is a novel and unique feature that is not present in any of the above referenced US patents and, as far as applicant knows, no one else has achieved before.
- the same first algorithm is used for creating catastrophe plans, whether from the initial configuration in case a first processor goes down or from the actual configuration in case a further processor goes down.
- the first algorithm comprises parameters that pertain to the capacity of a processor, parameters that pertain to the size of the memory of a processor, parameters relating to how much processor capacity (machine cycles per process to execute) and memory the individual objects to be transferred require, and parameters relating to the quality of service.
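The patent lists these parameters but does not disclose the algorithm's internals; the greedy placement below is only one conceivable way to weigh them, and every name in it is a hypothetical illustration.

```python
# Hypothetical sketch of a catastrophe-plan computation over the
# parameters named in the text: spare processor capacity and memory,
# and per-object CPU and memory demand.
def plan_for(faulty, objects, processors):
    """Map each object on `faulty` to a target processor.
    `objects`: object -> (cpu, mem) demand.
    `processors`: operating processor -> (spare cpu, spare mem)."""
    plan = {}
    spare = {p: list(c) for p, c in processors.items() if p != faulty}
    # Place the largest memory demand first, onto the least-loaded
    # processor that can still satisfy both CPU and memory demand.
    for obj, (cpu, mem) in sorted(objects.items(), key=lambda kv: -kv[1][1]):
        fits = [p for p, (c, m) in spare.items() if c >= cpu and m >= mem]
        if not fits:
            raise RuntimeError("no processor can host object %r" % obj)
        target = max(fits, key=lambda p: tuple(spare[p]))
        spare[target][0] -= cpu
        spare[target][1] -= mem
        plan[obj] = target
    return plan

plan = plan_for(faulty=1,
                objects={1: (1, 4), 2: (1, 2), 3: (1, 2)},
                processors={2: (5, 5), 3: (5, 5), 4: (5, 5)})
```

Because one such plan is computed per processor ahead of time, only the plans, not the full space of task-to-processor combinations warned against by Deplance, need to be stored.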
- a second algorithm is used for returning the system to its initial configuration. This second algorithm has already been described above and has been referred to as a delta configuration.
- Various methods can be used to detect a faulty processor, for example the "heart beat" method in accordance with the US Patent Specification 4 710 926 referred to above.
- a preferred method in a typical telecom network is, however, to monitor the links by which processors are interconnected through the network N1.
- software objects installed on a faulty processor are transferred to two processors.
- the faulty processor's objects can also be distributed among three or more processors in the system.
- all software objects are transferred to a single processor in case the system comprises two processors that are in working order and one of these crashes.
- In Figure 4 the method steps performed in accordance with the invention are shown in a flow diagram.
- the initial configuration is created by a system vendor or system operator and is stored in the system. This is indicated in box 20.
- Next, catastrophe plans should be created in accordance with the first algorithm. There should be as many catastrophe plans as there are processors in the system. Further, mirror copies of persistent data base objects should be created.
- each processor creates its own catastrophe plan, i.e. the catastrophe plan to be used by the system in case that processor goes down. This ensures that the work of creating the catastrophe plans is totally distributed.
- a processor goes down, box 22.
- the software objects of the faulty processor should be transferred to operating processors using the catastrophe plan for the faulty processor.
- By transferring objects it is contemplated that new copies of the software objects of the crashed processor are created and started on the processors to which they should be transferred in accordance with the catastrophe plan.
- Box 23 accordingly represents the recovery of the system from the faulty processor.
- the system is now up and running and a new configuration, referred to as actual configuration, arises.
- the actual configuration is also stored in a memory of a distributed processor.
- new catastrophe plans for the operating processors are created, box 24. Now the system has recovered its ability to withstand a new processor failure. Also mirror copies of the new catastrophe plans are stored in the data base.
- the process returns to operation 22 as indicated by arrow 25.
- the faulty processor or faulty processors are repaired, box 26, and are inserted into the system, box 26. If two or more processors have crashed, it is assumed they are repaired and that they are inserted back into the system simultaneously. Theoretically it is of course possible to repair faulty processors one by one and insert them into the system one by one, but from a practical point of view this procedure is roundabout.
- the last step in the process, box 27, is to take the system back to its initial configuration using the second algorithm.
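The overall flow of boxes 20-27 can be condensed into a small, self-contained simulation; the tuple encoding, data and helper names are illustrative assumptions, not the patent's notation.

```python
# Minimal simulation of the method's flow: store the initial
# configuration, recover via a catastrophe plan (box 23), and later
# revert via the delta configuration (box 27).
initial = {(1, 1), (2, 2)}                 # (processor, object) tuples

def recover(config, plan, faulty):
    """Box 23: transfer the faulty processor's objects per its plan."""
    moved = {(plan[obj], obj) for proc, obj in config if proc == faulty}
    kept = {(proc, obj) for proc, obj in config if proc != faulty}
    return kept | moved

def revert(actual, initial, faulty):
    """Box 27: kill the delta objects and recreate them on the
    repaired processor, restoring the initial configuration."""
    delta = {(p, o) for p, o in actual - initial if p != faulty}
    return (actual - delta) | {(faulty, o) for _, o in delta}

actual = recover(initial, plan={1: 2}, faulty=1)   # P1 fails; object 1 -> P2
restored = revert(actual, initial, faulty=1)       # P1 repaired and reinserted
```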
- Examples of hardware equipment controlled by software modules are I/O devices, subscriber line interface devices, subscriber line processors, tone decoders, voice prompting devices and conference equipment. Hardware dependencies of this kind pose restrictions on the software modules.
- a software module involved in controlling hardware equipment that is connected to one or more processors cannot be transferred to an arbitrary processor in the system but must be transferred to a processor that has access to the very same hardware equipment. The catastrophe plans must be created with this in mind.
- a telecom system can usually continue to operate despite the loss of some devices, although the services it provides might be somewhat impaired.
- Figure 6B illustrates a software object which contains a function part (execution part) and a persistent data part (persistent part) .
- the software object in Figure 6C contains the function part of the software object shown in Figure 6B, and the software object shown in Figure 6D contains the persistent data part of the same software object shown in Figure 6B.
- the software objects in Figures 6C and 6D together form a pair and have the same key.
- the key is the logical address of the software object shown in Figure 6B in the data base, and the key will therefore also be the logical address of the two software objects of Figures 6C and 6D in the data base.
- blocking is provided by setting the persistent data part of the software object shown in Figure 6D in a blocked state.
- a blocked state is marked by setting a flag in the software object. Note that it is not the device-controlling software object of Figure 6C that is blocked but its persistent companion software object of Figure 6D.
- the software object is blocked in the address tables existing in the operating system of the respective processors.
- the operating system of a processor has address tables to its own software objects. The address tables are used when messages are sent to its objects.
- the processors of a processor system can be of two kinds, fault tolerant processors (FTP processors) and non-FTP processors.
- FTP processors fault tolerant processors
- An FTP processor, which usually comprises double processors, goes on executing its tasks even if a single hardware fault arises in the hardware equipment controlled by the FTP processor. When the fault occurs an alarm will be triggered but the program does not crash.
- It is possible to take an FTP processor out of service for repair in a controlled manner, using the catastrophe plans, so that the services the application delivers will not be interrupted. For example, in a telecom system no traffic disturbances will occur; ongoing calls will not be interrupted.
- In Figure 5 there is shown a processor system similar to that of Figure 1, where there is a network N1 to which processors P1-P4 have access and over which they can communicate with each other. Although not shown in Figure 5, it is supposed that the software objects 1-18 are distributed on processors P1-P4 in the same way as shown in Figure 1. Further there is a device D1 connected to processor P1. There is also a second network N2 to which processors P2 and P4 have access. Device processor D5 is used to connect devices D2 and D3 to the network N2. Device processor D5 thus controls devices D2 and D3. Another device processor D6 connects devices D4 and D5 to the network N2 and will thus control these.
- A typical example of a device of the D1 kind is processor P1 itself.
- Another example is some hardware device, like a UART device. If processor P1 goes down, or if the hardware device controlled by processor P1 goes down, then the software object which represents processor P1, or the software object that represents the faulty hardware, cannot be transferred to any of processors P2-P4, since none of these can gain control over the faulty hardware or over processor P1. Nevertheless, the processor system must tolerate that P1 goes down if the system is to be redundant. When processor P1 goes down, the software object that represents processor P1 must be blocked so it cannot be accessed by any other software objects.
- object 1 represents processor P1
- object 4 represents processor P2
- object 8 represents processor P3 and object 13 processor P4. All such hardware dependencies are exactly described by the model of the hardware with installed software as shown in Figure 7. Accordingly the model shows that none of the objects 1, 4, 8 and 13 can be transferred to any other processor in the system.
- the first algorithm operates on the model of the hardware with installed software and will thus take into account all hardware dependencies when it creates new catastrophe plans.
- the catastrophe table for processor P1 shown in Table 2 will therefore remain the same with the exception that object 1 disappears.
- In the catastrophe table for processor P2 object 4 will disappear, in the catastrophe table for processor P3 object 8 will disappear, and object 13 will disappear from the catastrophe table associated with processor P4. Accordingly fewer objects than described previously will remain to be transferred when a processor goes down.
- If for example processor P1 goes down it will be necessary to block object 1 from access by other software objects. This means that no other software objects are allowed to communicate with software object 1.
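The blocking mechanism described above can be sketched as a flag in the operating system's address table; the table layout and helper names are illustrative assumptions, not the patent's implementation.

```python
# Sketch of blocking a hardware-representing object: a flag in the
# address table stops other objects from sending messages to it.
address_table = {
    1: {"processor": 1, "blocked": False},  # object 1 represents P1
    4: {"processor": 2, "blocked": False},  # object 4 represents P2
}

def block_objects_of(table, proc):
    """Block every object representing hardware on the failed processor."""
    for entry in table.values():
        if entry["processor"] == proc:
            entry["blocked"] = True

def send(table, obj, message):
    """Deliver a message unless the destination object is blocked."""
    if table[obj]["blocked"]:
        raise RuntimeError("object %d is blocked" % obj)
    return "delivered"

block_objects_of(address_table, proc=1)   # P1 goes down; object 1 is blocked
```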
- the model in accordance with the invention will always pretend that the system will operate even if hardware equipment is lost and cannot be controlled by its software objects. If, however, at a higher level of the system it turns out that the system does not operate, not even with impaired services delivered, then the reason why the system does not operate is lack of redundancy and the model cannot change this fact.
- In Figure 6A the software model of the modularized software object is shown.
- the software model comprises a class named ConfigObj which has the operations construct() and destruct().
- construct() and destruct() are used to create and kill, respectively, a particular modular software object.
- the modular software object has been described above with reference to Figures 6B-6D.
- the hardware model 30 describes the physical devices and their sites in the system.
- the hardware model 30 comprises a class processor 31 which generates objects that represent processors P1-P4, a class CPPool 32 which generates objects that represent network N1, a class DevX 33 which generates objects that represent network N2, and a class ConfigObj 34 which generates objects that represent the software objects, shown in Figure 6B, of the devices referred to above, device processors inclusive.
- Class ConfigObj 35 generate software which represent the software objects of the modularized software object shown in Figure 6A but do not form part of the hardware model.
- the model shows that processors are connected to N1 and to N2.
- the model also shows that software which has no devices can be installed on all processors that can connect to N1, while software which controls devices must be installed on processors that can connect to N2.
- the model will define the constraints for each hardware dependent software object. By connecting the ConfigObj to OrgX, not only are the hardware devices included in the hardware model, but at the same time the software objects that control the devices are installed in the model.
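The installation constraint stated above can be expressed compactly. The connectivity table below is invented for illustration; the rule itself comes from the text: device-free software may go on any processor connected to N1, while device-controlling software must sit on a processor connected to N2.

```python
# Invented connectivity: which networks each processor can reach.
connections = {
    "P1": {"N1"},
    "P2": {"N1", "N2"},
    "P3": {"N1", "N2"},
    "P4": {"N1"},
}

def allowed_processors(controls_device, connections):
    """Return the processors on which a software object may be installed."""
    required_net = "N2" if controls_device else "N1"
    return {p for p, nets in connections.items() if required_net in nets}

# device-free software: any processor on N1
assert allowed_processors(False, connections) == {"P1", "P2", "P3", "P4"}
# device-controlling software: only processors that reach N2
assert allowed_processors(True, connections) == {"P2", "P3"}
```

A catastrophe-plan generator would consult such a predicate before assigning a software object a new home, which is how the hardware dependencies enter the plans.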
- suppose that processor P1 goes down and that the software objects executing thereon shall be redistributed in accordance with its catastrophe plan shown in Table 2.
- the catastrophe plan is stored in fragments distributed over processors P2 and P3.
- catastrophe plan fragment 40 is stored in the memory M of processor P2
- catastrophe plan fragment 41 is stored in the memory M of processor P3.
- in the memories of processors P2 and P3, software objects and data are stored, as exemplified by the various hatched layers. The memories are not completely filled, as exemplified by the non-hatched memory areas.
- memory M of processor P2 has a free, non-occupied memory area 42
- and memory M of processor P3 has a free, non-occupied memory area 43.
- the CPU capacity of the different processors is used to different extents (not necessarily in proportion to the memory usage of the respective processor).
- according to the catastrophe plan of processor P1, software object 1 shall be redistributed to processor P2.
- the free memory area 42 of processor P2, however, is not large enough to house object 01. Therefore the catastrophe plan of processor P1 contains an initial redistribution phase in order to make room for software object 01.
- software object 04 in the memory of processor P2 is removed and is transferred to the free memory area 43 in processor P3, leaving an enlarged free memory area in processor P2, large enough to house software object 01.
- the objects concerned executing on processor P2 are killed. Following this, the objects which, in accordance with the catastrophe plan, are to execute on processor P2 are created on processor P2. In this manner there will be processor resources (memory as well as CPU capacity) available for executing the new objects.
- the object 04 killed on processor P2 must not disappear and is therefore created on another non-faulty processor, for example processor P3.
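The two-phase redistribution just described can be sketched as follows. All memory sizes, and the choice of which resident object to evict, are invented for illustration; in the patent's scheme the plan is precomputed and stored in the distributed catastrophe plan fragments.

```python
free = {"P2": 30, "P3": 80}   # free memory units on the surviving processors
size = {"01": 50, "04": 40}   # memory footprint of each software object
location = {"04": "P2"}       # object 04 currently executes on processor P2

def make_room_and_place(obj, target, donor, free, size, location):
    """Phase 1: if the target's free area is too small, evict a resident
    object to the donor processor; phase 2: create obj on the target."""
    if free[target] < size[obj]:
        # kill a resident object on the target and recreate it on the
        # donor, so that it does not disappear from the system
        evicted = next(o for o, p in location.items() if p == target)
        assert free[donor] >= size[evicted], "no room on the donor either"
        free[target] += size[evicted]
        free[donor] -= size[evicted]
        location[evicted] = donor
    assert free[target] >= size[obj]
    free[target] -= size[obj]
    location[obj] = target

# object 01 from the failed P1 is placed on P2; object 04 moves to P3 first
make_room_and_place("01", "P2", "P3", free, size, location)
assert location == {"04": "P3", "01": "P2"}
```

The eviction keeps both memory and CPU resources consistent: the moved object is killed on its old processor only after a home with sufficient free memory has been found, so no object is lost.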
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Hardware Redundancy (AREA)
- Exchange Systems With Centralized Control (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU10488/97A AU1048897A (en) | 1995-12-08 | 1996-12-06 | Processor redundancy in a distributed system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SE9504396A SE515348C2 (sv) | 1995-12-08 | 1995-12-08 | Processor redundancy in a distributed system |
SE9504396-4 | 1995-12-08 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO1997022054A2 true WO1997022054A2 (en) | 1997-06-19 |
WO1997022054A3 WO1997022054A3 (en) | 1997-09-04 |
Family
ID=20400521
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/SE1996/001609 WO1997022054A2 (en) | 1995-12-08 | 1996-12-06 | Processor redundancy in a distributed system |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU1048897A (sv) |
SE (2) | SE515348C2 (sv) |
WO (1) | WO1997022054A2 (sv) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0959587A2 (en) * | 1998-04-02 | 1999-11-24 | Lucent Technologies Inc. | Method for creating and modifying similar and dissimilar databases for use in network configuration for use in telecommunication systems |
WO2000064199A2 (en) * | 1999-04-14 | 2000-10-26 | Telefonaktiebolaget Lm Ericsson (Publ) | Recovery in mobile communication systems |
WO2001013232A2 (en) * | 1999-08-17 | 2001-02-22 | Tricord Systems, Inc. | Self-healing computer system storage |
GB2359384A (en) * | 2000-02-16 | 2001-08-22 | Data Connection Ltd | Automatic reconnection of linked software processes in fault-tolerant computer systems |
US6438707B1 (en) | 1998-08-11 | 2002-08-20 | Telefonaktiebolaget Lm Ericsson (Publ) | Fault tolerant computer system |
US6449731B1 (en) | 1999-03-03 | 2002-09-10 | Tricord Systems, Inc. | Self-healing computer system storage |
US6725392B1 (en) | 1999-03-03 | 2004-04-20 | Adaptec, Inc. | Controller fault recovery system for a distributed file system |
WO2004062303A1 (en) * | 2002-12-30 | 2004-07-22 | At & T Corporation | System and method of disaster restoration |
- WO2005009058A1 (de) * | 2003-06-26 | 2005-01-27 | Deutsche Telekom Ag | Method and system for increasing the switching capacity in telecommunications networks by transmission or activation of software |
US6922688B1 (en) | 1998-01-23 | 2005-07-26 | Adaptec, Inc. | Computer system storage |
US7287179B2 (en) | 2003-05-15 | 2007-10-23 | International Business Machines Corporation | Autonomic failover of grid-based services |
US7715837B2 (en) * | 2000-02-18 | 2010-05-11 | Telefonaktiebolaget Lm Ericsson (Publ) | Method and apparatus for releasing connections in an access network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4371754A (en) * | 1980-11-19 | 1983-02-01 | Rockwell International Corporation | Automatic fault recovery system for a multiple processor telecommunications switching control |
US4710926A (en) * | 1985-12-27 | 1987-12-01 | American Telephone And Telegraph Company, At&T Bell Laboratories | Fault recovery in a distributed processing system |
- 1995
  - 1995-12-08 SE SE9504396A patent/SE515348C2/sv not_active IP Right Cessation
- 1996
  - 1996-12-06 AU AU10488/97A patent/AU1048897A/en not_active Abandoned
  - 1996-12-06 WO PCT/SE1996/001609 patent/WO1997022054A2/en active Application Filing
- 1997
  - 1997-08-29 SE SE9703132A patent/SE9703132A0/sv not_active Application Discontinuation
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4371754A (en) * | 1980-11-19 | 1983-02-01 | Rockwell International Corporation | Automatic fault recovery system for a multiple processor telecommunications switching control |
US4710926A (en) * | 1985-12-27 | 1987-12-01 | American Telephone And Telegraph Company, At&T Bell Laboratories | Fault recovery in a distributed processing system |
Non-Patent Citations (5)
Title |
---|
DISTRIBUTED PROCESSING - PROCEEDINGS OF THE IFIP WG 10.3 ...., October 1987, A-M. DEPLANCHE et al., "Task Redistribution with Allocation Constraints in a Fault-Tolerant Real-Time Multiprocessor System", pages 136-143. * |
IEEE TRANS. ON PARALLEL AND DISTRIBUTED SYSTEMS, Volume 4, No. 8, August 1993, N-F. TZENG, "Reconfiguration and Analysis of a Fault-Tolerant Circular Butterfly Parallel System", pages 855-863. * |
IEEE TRANS. ON RELIABILITY, Volume 38, No. 1, April 1989, C-M. CHEN et al., "Reliability Issues with Multiprocessor Distributed Database Systems: A Case Study", pages 153-155. * |
PATENT ABSTRACTS OF JAPAN, Vol. 96, No. 01; & JP,A,07 234 849 (HITACHI LTD), 5 Sept. 1995. * |
SPECIAL INTEREST GROUP ON MANAGEMENT OF DATA, No. 2, 1995, L.D. MOLESKY et al., "Recovery Protocols for Shared Memory Database Systems", pages 11-22. * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6922688B1 (en) | 1998-01-23 | 2005-07-26 | Adaptec, Inc. | Computer system storage |
EP0959587A3 (en) * | 1998-04-02 | 2000-05-10 | Lucent Technologies Inc. | Method for creating and modifying similar and dissimilar databases for use in network configuration for use in telecommunication systems |
EP0959587A2 (en) * | 1998-04-02 | 1999-11-24 | Lucent Technologies Inc. | Method for creating and modifying similar and dissimilar databases for use in network configuration for use in telecommunication systems |
US6438707B1 (en) | 1998-08-11 | 2002-08-20 | Telefonaktiebolaget Lm Ericsson (Publ) | Fault tolerant computer system |
US6725392B1 (en) | 1999-03-03 | 2004-04-20 | Adaptec, Inc. | Controller fault recovery system for a distributed file system |
US6449731B1 (en) | 1999-03-03 | 2002-09-10 | Tricord Systems, Inc. | Self-healing computer system storage |
AU770164B2 (en) * | 1999-04-14 | 2004-02-12 | Telefonaktiebolaget Lm Ericsson (Publ) | Recovery in mobile communication systems |
WO2000064199A2 (en) * | 1999-04-14 | 2000-10-26 | Telefonaktiebolaget Lm Ericsson (Publ) | Recovery in mobile communication systems |
WO2000064199A3 (en) * | 1999-04-14 | 2001-02-01 | Ericsson Telefon Ab L M | Recovery in mobile communication systems |
US6775542B1 (en) | 1999-04-14 | 2004-08-10 | Telefonaktiebolaget Lm Ericsson | Recovery in mobile communication systems |
WO2001013233A3 (en) * | 1999-08-17 | 2001-07-05 | Tricord Systems Inc | Self-healing computer system storage |
WO2001013233A2 (en) * | 1999-08-17 | 2001-02-22 | Tricord Systems, Inc. | Self-healing computer system storage |
US6530036B1 (en) | 1999-08-17 | 2003-03-04 | Tricord Systems, Inc. | Self-healing computer system storage |
WO2001013232A2 (en) * | 1999-08-17 | 2001-02-22 | Tricord Systems, Inc. | Self-healing computer system storage |
WO2001013232A3 (en) * | 1999-08-17 | 2001-07-12 | Tricord Systems Inc | Self-healing computer system storage |
GB2359384A (en) * | 2000-02-16 | 2001-08-22 | Data Connection Ltd | Automatic reconnection of linked software processes in fault-tolerant computer systems |
GB2359384B (en) * | 2000-02-16 | 2004-06-16 | Data Connection Ltd | Automatic reconnection of partner software processes in a fault-tolerant computer system |
US7715837B2 (en) * | 2000-02-18 | 2010-05-11 | Telefonaktiebolaget Lm Ericsson (Publ) | Method and apparatus for releasing connections in an access network |
WO2004062303A1 (en) * | 2002-12-30 | 2004-07-22 | At & T Corporation | System and method of disaster restoration |
US7058847B1 (en) | 2002-12-30 | 2006-06-06 | At&T Corporation | Concept of zero network element mirroring and disaster restoration process |
US7373544B2 (en) | 2002-12-30 | 2008-05-13 | At&T Corporation | Concept of zero network element mirroring and disaster restoration process |
US7287179B2 (en) | 2003-05-15 | 2007-10-23 | International Business Machines Corporation | Autonomic failover of grid-based services |
WO2005009058A1 (de) * | 2003-06-26 | 2005-01-27 | Deutsche Telekom Ag | Method and system for increasing the switching capacity in telecommunications networks by transmission or activation of software |
US8345708B2 (en) | 2003-06-26 | 2013-01-01 | Deutsche Telekom Ag | Method and system for increasing the switching capacity in telecommunications networks by transmission or activation of software |
Also Published As
Publication number | Publication date |
---|---|
AU1048897A (en) | 1997-07-03 |
SE9703132D0 (sv) | 1997-08-29 |
SE9703132L (sv) | |
SE515348C2 (sv) | 2001-07-16 |
SE9504396D0 (sv) | 1995-12-08 |
WO1997022054A3 (en) | 1997-09-04 |
SE9504396L (sv) | 1997-06-09 |
SE9703132A0 (sv) | 1997-08-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR100326982B1 (ko) | High-availability cluster system with high scalability and method of managing the same | |
EP0717355B1 (en) | Parallel processing system and method | |
US6854069B2 (en) | Method and system for achieving high availability in a networked computer system | |
EP1617331B1 (en) | Efficient changing of replica sets in distributed fault-tolerant computing system | |
US7870235B2 (en) | Highly scalable and highly available cluster system management scheme | |
EP2643771B1 (en) | Real time database system | |
US7302609B2 (en) | Method and apparatus for executing applications on a distributed computer system | |
WO1997022054A2 (en) | Processor redundancy in a distributed system | |
JP2000112911A (ja) | System and method for automatically redistributing tasks in a database management system in a computer network | |
JP2008210412A (ja) | Method for managing remotely accessible resources in a multi-node distributed data processing system | |
WO1998032074A1 (en) | Data partitioning and duplication in a distributed data processing system | |
Babaoğlu et al. | System support for partition-aware network applications | |
CN108984320A (zh) | Split-brain prevention method and device for a message queue cluster | |
US11544162B2 (en) | Computer cluster using expiring recovery rules | |
CN114338670B (zh) | Edge cloud platform and three-level networked-traffic cloud control platform comprising the same | |
WO1997049034A1 (fr) | Task handling system | |
Corsava et al. | Intelligent architecture for automatic resource allocation in computer clusters | |
CN115291891A (zh) | Cluster management method and apparatus, and electronic device | |
JP2004094681A (ja) | Distributed database control device, control method, and control program | |
Pimentel et al. | A fault management protocol for TTP/C | |
CN109995560A (zh) | Cloud resource pool management system and method | |
CN208299812U (zh) | Active-standby switchover system based on a ZooKeeper cluster | |
JP3183216B2 (ja) | Duplicated MO management system | |
CN118626098A (zh) | Cluster deployment method and system | |
CN117714386A (zh) | Distributed system deployment method, configuration method, system, device, and medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU TJ TM |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
NENP | Non-entry into the national phase |
Ref country code: JP Ref document number: 97521980 Format of ref document f/p: F |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase |