WO1997022054A2 - Processor redundancy in a distributed system - Google Patents

Processor redundancy in a distributed system

Info

Publication number
WO1997022054A2
WO1997022054A2 (PCT/SE1996/001609)
Authority
WO
WIPO (PCT)
Prior art keywords
processor
processors
software
catastrophe
software objects
Prior art date
Application number
PCT/SE1996/001609
Other languages
English (en)
French (fr)
Other versions
WO1997022054A3 (en)
Inventor
Lars Ulrik Jensen
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to AU10488/97A priority Critical patent/AU1048897A/en
Publication of WO1997022054A2 publication Critical patent/WO1997022054A2/en
Publication of WO1997022054A3 publication Critical patent/WO1997022054A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/203Failover techniques using migration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2035Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/22Arrangements for supervision, monitoring or testing
    • H04M3/24Arrangements for supervision, monitoring or testing with provision for checking the normal operation
    • H04M3/241Arrangements for supervision, monitoring or testing with provision for checking the normal operation for stored program controlled exchanges
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04QSELECTING
    • H04Q3/00Selecting arrangements
    • H04Q3/42Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker
    • H04Q3/54Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker in which the logic circuitry controlling the exchange is centralised
    • H04Q3/545Circuit arrangements for indirect selecting controlled by common circuits, e.g. register controller, marker in which the logic circuitry controlling the exchange is centralised using a stored programme
    • H04Q3/54575Software application
    • H04Q3/54591Supervision, e.g. fault localisation, traffic measurements, avoiding errors, failure recovery, monitoring, statistical analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2048Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share neither address space nor persistent storage

Definitions

  • the present invention relates to a distributed, fault-tolerant reconfigurable processor system in a telecommunication network.
  • processors In public telecommunication systems there are several processors performing many different kinds of tasks, such as monitoring activity on subscriber equipment lines, set up and release of connections, traffic control, system management and taxation. Groups of processors are interconnected by way of a network which is separate from the telecommunication network or forms a part thereof. In modern telecommunication networks there are network elements, such as an exchange, a data base, a processor, that are distributed on several physical elements of the physical network making up the telecommunication network. To an application, such as POTS (Plain Old Telephony Service), GSM (Global System for Mobile communication), VLL (virtual leased lines), BISDN (Broadband),
  • POTS Plain Old Telephony Service
  • GSM Global System for Mobile communication
  • VLL virtual leased lines
  • BISDN Broadband Integrated Services Digital Network
  • To such an application, a distributed processor or a distributed data base looks like a single unit.
  • the distributed units are said to be transparent from the point of view of distribution.
  • a main requirement for processor based control systems of public telecommunication systems concerns system availability. By this is meant that the system should be available to serve its users. In for example the AXE-10 telephone system only 2 hours of unavailability during 40 years was allowed. Converted into minutes per year this corresponds to about 3 min/year. Modern telecommunication systems have much higher availability demands. Nevertheless modern telecommunication systems are also required to allow for planned maintenance work, which may have long duration, at long term intervals, for example at intervals in the order of about 1 month.
  • U.S. Patent No. 4,710,926 relates to a fault recovery method for a distributed processing system using spare processors that take over the functions of failed processors.
  • a spare processor acts as a stand-in for one or more active processors. When a spare processor is put into service it will no longer serve as a spare processor for any other processor in the system. During fault-recovery all functions executing on the faulty processor are transferred to the spare processor. The old spare processor's function of being a spare processor is transferred to a second spare processor in the system.
  • the system requires two or more spare processors.
  • a spare processor When a spare processor is inactive it does not perform any job tasks. When it becomes active it starts processing job tasks - provided it is operative, i.e. is not impaired by any faults.
  • While not described in said patent, it will be necessary to run test programs to verify that the spare processors are operative.
  • the processing system uses spare sub-system elements that do not participate in the overall processing tasks.
  • the method used for reconfiguration when a faulty element is detected and replaced by a spare element is one that uses distinct socket addresses for each element in the system.
  • a socket address is assigned a virtual address which replaces the socket address when a faulty condition is detected.
  • Molesky has found, as a side effect of the cache coherency protocol, that data can be recovered by providing a database log which is used to roll back only the transactions associated with the failed processor's cache memory. The transactions performed by the rest of the processors may continue and will not corrupt the data of the database.
  • the problem addressed by Molesky is in no way related to reconfiguration of processor systems. A database log is not like a catastrophe plan.
  • Deplance thus starts the reallocation process at the time a processor goes down and finishes it before the deadline expires.
  • Deplance indicates that there are methods for computing task allocations off-line, but such methods are complex, require much processor work and produce allocation tables that are very large. This is so because the number of conceivable combinations of tasks and processors is very large even for moderately sized processor systems. Deplance thus warns against the use of such off-line algorithms.
  • the inventor of the present invention has realized this problem and his contribution to the art is to provide reallocation tables, not for all possible configurations of processors and tasks, but for one configuration only.
  • One object of the invention is to provide a method for automatically recovering from multiple permanent failures of processors in a distributed processor system which is used in an application environment of a telecommunication system having high demands on availability, while simultaneously allowing system maintenance, planned or unplanned.
  • Another object of the invention is to utilize available processing resources while allowing for a heterogeneous processing environment due to evolving technology in a system that grows over time and due to particular needs of different parts of an application that runs on the processor system.
  • Another object of the invention is to provide a method for quickly recovering from multiple permanent failures of processors in a distributed processor system used in a telecom system's environment by providing an initial configuration of all processors and by providing, for each processor in the system, a catastrophe plan to be used in case the corresponding processor goes down.
  • a catastrophe plan is the means by which software objects installed on a faulty processor are distributed to, generally, several processors in the system, thus providing for load sharing among the processors.
  • a further object of the invention is to have all of the catastrophe plans calculated and installed in memories associated with the processors so that they are available to the system instantaneously at the time a processor goes down.
  • Still another object of the invention is to provide new catastrophe plans for the system of operating processors, some of which have installed thereon software objects from a failed processor, so as to prepare the system for a quick recovery should a further processor in the system go down.
  • Still another object of the invention is to provide a method of the indicated kind which takes the system back to its initial configuration of processors and software objects when the system's faulty processor or processors are, after repair or replacement, inserted back into the system.
  • Another object of the invention is to provide, in a catastrophe plan associated with an individual processor, an initial redistribution of software objects executing on said individual processor to other non-faulty processors prior to the finishing redistribution of software objects of a faulty processor, so as to free up memory for storage of a large software object which is running on the faulty processor and which in accordance with its catastrophe plan is to be transferred to said predefined processor; the memory which is freed up being the memory associated with said predefined processor.
  • Another object of the invention is to include in the catastrophe plan redistribution of objects executing on non-faulty processors to other non-faulty processors, so as to free up processor resources, such as memory and CPU capacity, for large software objects which are running on the faulty processor and which in accordance with the faulty processor's catastrophe plan shall be transferred to the processors on which resources have been freed up.
  • processor resources such as memory and CPU capacity
  • An object of the invention is also to provide a software model that allows software objects to be transferred from a faulty processor to an operating processor by restarting the object on the operating processor.
  • a software model will also allow for killing a software object installed on a processor and for restarting it on a repaired, previously faulty, processor which has been reinserted into the system. This latter objective is predominantly used when the system returns to its initial configuration and there are objects installed on operating processors, which objects should be given back to the repaired processors.
  • a model of the telecommunication system comprising a hardware model of the control processors and the controlled hardware equipment as well as a software model that supports and fits into the hardware model of the telecommunication system.
  • a first algorithm is used to calculate the catastrophe plans for each of the operating processors given either the initial configuration or any one of the actual configurations that will appear after a further processor has gone down.
  • a second algorithm is used that, given an actual configuration, computes a delta configuration that, applied to the actual configuration, will give back the initial configuration of the system.
  • Figure 1 is a block diagram showing a distributed processor system in an initial configuration
  • FIG. 2 is a block diagram showing the processor system of Figure 1 in a first actual configuration after failure of one processor
  • Figure 3 is a block diagram of the processor system of Figure 1 in a second actual configuration after failure of two processors
  • FIG. 4 is a flow diagram of the method in accordance with the invention.
  • Figure 5 is a block diagram of a distributed processor system some of the processors of which are controlling hardware equipment
  • Figure 6A is a schematic view of a modularized software object
  • FIGS 6B-D are block diagrams of three different types of software objects
  • Figure 7 is a block diagram illustrating how the hardware and software models in accordance with the invention fit together in one single model of the telecommunication system in accordance with the invention
  • FIG. 8 is a block diagram showing the hardware model in accordance with the invention.
  • Figure 9 is a block diagram of the distributed processor system showing a preparatory redistribution of software objects.
  • In FIG. 1 there is shown a number of distributed processors P1, P2, P3 and P4 which communicate over a network N1.
  • the processors form part of a telecommunication network, not shown.
  • the network N1 may form part of said telecommunication network.
  • Each processor comprises a processor unit PU and memory M.
  • Software objects 1, 2, 3 ... 18 are installed on the processors; objects 1, 2, 3 on processor P1, objects 4-7 on P2, objects 8-12 on P3 and objects 13-18 on P4.
  • the software of an application that runs in the telecommunication network comprises software objects (Figs. 6B-D), which are contained in software modules (Fig. 6A).
  • the modularized software objects are allocation independent objects that can be transferred freely between the processors.
  • a modularized software object is independent of other modularized software objects.
  • a software object typically comprises a process and persistent data. Persistent data is data that survives a restart of the software object.
  • Software objects can communicate with each other.
  • a task which is required by an application typically involves several software objects on different processors, and is executed by some or all of the processes of these objects. The actual distribution of software objects on different processors is unknown to the application.
  • Modularized persistent data can be stored in a data base.
  • the data base is also distributed over the memories M of several processors, preferably the memories of all of the processors P1, P2, P3 and P4.
  • These data base partitions are labeled DB1, DB2, DB3 and DB4 and each comprises a random access memory (RAM).
  • RAM random access memory
  • a novel and preferred alternative is, however, to store a mirror copy of each modularized software object in a data base partition of another processor than the one on which said object is installed.
  • each modularized software object is stored in the data base partition on the processor given by the catastrophe plan for the processor on which the modularized software object, the original, is executing.
  • copies of the modularized persistent data will thus be safely stored on another processor even if the processor on which the original is installed crashes.
  • the initial configuration must not disappear if any processor goes down. For this reason the initial configuration and the mirror copy thereof are stored as described above. Instead of implementing the initial configuration in the form of a table it can be implemented in so called tuples. As an example, the tuples (1,1), (1,2), (1,3) would correspond to the information given by the first row of Table 1.
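As an illustration only (the patent describes the tuple form of Table 1 but gives no code), the configuration could be sketched as a set of (processor, object) tuples; all names below are ours:

```python
# Sketch: the initial configuration of Figure 1 / Table 1 stored as
# (processor, object) tuples; the numbers follow the patent's example.
INITIAL_CONFIG = {
    (1, 1), (1, 2), (1, 3),                                # P1: objects 1-3
    (2, 4), (2, 5), (2, 6), (2, 7),                        # P2: objects 4-7
    (3, 8), (3, 9), (3, 10), (3, 11), (3, 12),             # P3: objects 8-12
    (4, 13), (4, 14), (4, 15), (4, 16), (4, 17), (4, 18),  # P4: objects 13-18
}

def objects_on(config, processor):
    """Return the set of software objects installed on one processor."""
    return {obj for (proc, obj) in config if proc == processor}
```

The tuples (1,1), (1,2), (1,3) here carry exactly the information of the first row of Table 1: objects 1, 2 and 3 reside on processor P1.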
  • a catastrophe plan contains directions regarding the processors to which the software objects of the faulty processor should be transferred.
  • One catastrophe plan, shown in Table 2, indicates the processors to which the objects installed on processor P1 should be transferred in case processor P1 goes down.
  • Another catastrophe plan contains information on where the objects installed on processor P2 should be transferred in case processor P2 goes down.
  • The catastrophe plan shown in Table 4 indicates the processors to which the objects installed on processor P1 should be transferred in case processor P1 goes down.
  • new catastrophe plans must be established so that the system can quickly recover if another processor goes down.
  • new catastrophe plans giving directions regarding the processors to which the software objects installed on a faulty processor should be transferred. Since one cannot foresee which one of the three operating processors P2-P4 will go down, it will be necessary to create catastrophe plans for each one of the operating processors.
  • Table 8 is the new catastrophe plan (CP-P2') for processor P2, Table 9 the new one (CP-P3') for processor P3 and Table 10 the new one (CP-P4') for processor P4.
  • the new catastrophe plans and their mirror copies are stored in the above described manner.
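A hedged sketch of how such a plan might be applied when its processor goes down; the helper names are ours, and the P1 plan shown follows the patent's example, in which objects 1 and 3 go to P2 and object 2 goes to P4:

```python
# Sketch (helper names are ours, not the patent's): applying a catastrophe
# plan. CP_P1 maps each of P1's objects to the processor that shall
# restart it, per the patent's example (Tables 2, 6 and 7).
CP_P1 = {1: 2, 2: 4, 3: 2}  # object -> target processor

def apply_plan(config, faulty, plan):
    """Restart every object of the faulty processor on its planned target."""
    kept = {(p, o) for (p, o) in config if p != faulty}
    moved = {(plan[o], o) for (p, o) in config if p == faulty}
    return kept | moved

# A reduced configuration, enough to show the transfer:
initial = {(1, 1), (1, 2), (1, 3), (2, 4), (3, 8), (4, 13)}
actual = apply_plan(initial, 1, CP_P1)  # the "actual configuration"
```

Because the plan is precomputed and stored (with a mirror copy), this lookup-and-restart is all that must happen at failure time.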
  • the system shall revert to the initial configuration. This can be done by killing all software objects in the actual configuration and by creating and starting all software objects on the processors of the system. In the preferred embodiment of the invention only the objects transferred from the first processor, and which now execute on other processors, are killed at first and are then created and started on processor P1.
  • a delta configuration table is created by subtracting the initial configuration from the actual configuration, excluding processor P1. By subtracting Table 1 from Table 6 the delta configuration shown in Table 7 is achieved. The row pertaining to the faulty processor P1 is not included in the subtraction.
  • the delta configuration indicates that objects 1 and 3 at processor P2 and object 2 at processor P4 should be killed at the respective processors.
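The subtraction can be illustrated as a set difference over (processor, object) tuples; the function name is ours, and the example entries reproduce the ones the text describes (objects 1 and 3 at P2, object 2 at P4):

```python
# Sketch of the second algorithm: the delta configuration is the set
# difference between the actual and the initial configuration, with the
# faulty processor's own row excluded. Its entries are the objects to
# kill before restarting them on the repaired processor.
def delta_configuration(actual, initial, faulty):
    """Entries present in the actual but not the initial configuration."""
    return {(p, o) for (p, o) in (actual - initial) if p != faulty}

initial = {(1, 1), (1, 2), (1, 3), (2, 4), (4, 13)}      # cf. Table 1
actual  = {(2, 1), (2, 3), (4, 2), (2, 4), (4, 13)}      # cf. Table 6
delta   = delta_configuration(actual, initial, faulty=1)  # cf. Table 7
```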
  • they shall be created and started on the repaired processor P1. After said creation the system is running as it did in the initial configuration, and its recovery time was short.
  • Processor P1 goes down and then processor P2 goes down.
  • Suppose that the system is running with the same configuration as shown in Figure 1, that catastrophe plans have been created for each one of the processors P1-P4, that processor P1 crashes, that the software objects installed on processor P1 are transferred to operating processors following the catastrophe plan of Table 2, that the system recovers and is up and running, that new catastrophe plans are created for processors P2, P3 and P4, and that processor P1 is removed and brought to repair.
  • processor P2 goes down.
  • the new catastrophe plan associated with processor P2, i.e. the new catastrophe plan of Table 8, should be followed.
  • According to this plan, objects 1, 3 and 4 should be transferred to processor P3 and objects 5-7 should be transferred to processor P4.
  • the software objects on processor P2 are removed and are transferred to the processors P3 and P4.
  • the system will now be up and running and will have a configuration of the kind shown in Figure 3 and Table 11.
  • it will now be necessary to work out catastrophe plans for each one of the processors P3 and P4.
  • processor system comprising four processors has been described.
  • the inventive method is equally well applicable to processor systems that comprise two, three, five or more processors.
  • a processor system tolerating two faulty processors was described.
  • the inventive method is equally well applicable to processor systems that tolerate three or more faulty processors.
  • the last example illustrates that a four processor system can operate with 50% of its processors faulty. The application will still run, but it will have a degraded performance. If the processor system is a switch in a local office, telephone traffic will still be running and congestion will start at a low traffic volume. This is a novel and unique feature that is not present in any of the above referenced US patents, and, as far as applicant knows, no one else has achieved it before.
  • the first algorithm used for creating catastrophe plans is the same whether it starts from the initial configuration, in case a first processor goes down, or from the actual configuration, in case a further processor goes down.
  • the first algorithm comprises parameters that pertain to the capacity of a processor, parameters that pertain to the size of the memory of a processor, parameters relating to how much processor capacity (machine cycles per process to execute) and memory the individual objects to be transferred require, and parameters relating to the quality of service.
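The patent names these parameters but does not disclose the algorithm itself; the following greedy best-fit placement is merely one plausible realization under those parameters, with every name invented here:

```python
# Illustrative sketch only: a greedy placement that respects the named
# parameters (processor capacity, memory size, per-object CPU and memory
# needs). This is NOT the patent's first algorithm, just one possibility.
def make_catastrophe_plan(faulty_objects, free):
    """faulty_objects: {obj: (cpu_need, mem_need)};
    free: {processor: [cpu_free, mem_free]} for the operating processors.
    Returns {obj: target processor}; raises if an object cannot fit."""
    plan = {}
    # place memory-hungry objects first so large objects still find room
    for obj, (cpu, mem) in sorted(faulty_objects.items(),
                                  key=lambda kv: -kv[1][1]):
        fits = [p for p, (c, m) in free.items() if c >= cpu and m >= mem]
        if not fits:
            raise RuntimeError(f"no processor can host object {obj}")
        target = max(fits, key=lambda p: free[p][1])  # most free memory
        free[target][0] -= cpu
        free[target][1] -= mem
        plan[obj] = target
    return plan

# Three objects of a failed processor, two operating processors:
plan = make_catastrophe_plan({1: (1, 4), 2: (1, 2), 3: (1, 1)},
                             {2: [10, 5], 3: [10, 4]})
```

A real implementation would additionally weigh the quality-of-service parameters the patent mentions; they are omitted here for brevity.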
  • a second algorithm is used for returning the system to its initial configuration. This second algorithm has already been described above and has been referred to as a delta configuration.
  • Various methods can be used to detect a faulty processor, for example the "heart beat" method in accordance with U.S. Patent No. 4,710,926 referred to above.
  • a preferred method in a typical telecom network is, however, to monitor the links by which processors are interconnected through the network N1.
  • software objects installed on a faulty processor are transferred to two processors.
  • the faulty processor's objects can also be distributed among three or more processors in the system.
  • all software objects are transferred to a single processor in case the system comprises two processors that are in working order and one of these crashes.
  • In FIG. 4 the method steps performed in accordance with the invention are shown in a flow diagram.
  • the initial configuration is created by a system vendor or system operator and is stored in the system. This is indicated in box 20.
  • Next, catastrophe plans should be created in accordance with the first algorithm. There should be as many catastrophe plans as there are processors in the system. Further, mirror copies of persistent data base objects should be created.
  • each processor creates its own catastrophe plan, i.e. the catastrophe plan to be used by the system in case that processor goes down. This will ensure that the work of creating the catastrophe plans is totally distributed.
  • a processor goes down, box 22.
  • the software objects of the faulty processor should be transferred to operating processors using the catastrophe plan for the faulty processor.
  • By transferring objects is meant that new copies of the software objects of the crashed processor are created and started on the processors to which they should be transferred in accordance with the catastrophe plan.
  • Box 23 accordingly represents the recovery of the system from the faulty processor.
  • the system is now up and running and a new configuration, referred to as actual configuration, arises.
  • the actual configuration is also stored in a memory of a distributed processor.
  • new catastrophe plans for the operating processors are created, box 24. Now the system has recovered its ability to withstand a new processor failure. Also, mirror copies of the new catastrophe plans are stored in the data base.
  • the process returns to operation 22 as indicated by arrow 25.
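The cycle of boxes 20-27 can be illustrated with a minimal, self-contained simulation; the round-robin "plan" below is a placeholder for the first algorithm, and all names are ours:

```python
# Minimal simulation of the loop of Figure 4 (boxes 20-27); configurations
# are sets of (processor, object) tuples.
def make_plans(config, up):
    """Boxes 21 and 24: one plan per operating processor, spreading its
    objects round-robin over the remaining operating processors."""
    plans = {}
    for p in up:
        others = sorted(q for q in up if q != p)
        objs = sorted(o for (q, o) in config if q == p)
        plans[p] = {o: others[i % len(others)] for i, o in enumerate(objs)}
    return plans

def fail(config, plans, p):
    """Boxes 22-23: transfer the faulty processor's objects per its plan."""
    return ({(q, o) for (q, o) in config if q != p}
            | {(plans[p][o], o) for (q, o) in config if q == p})

def revert(config, initial):
    """Boxes 26-27: kill the delta objects (actual minus initial) and
    restart the missing objects on the repaired processor(s)."""
    delta = config - initial
    return (config - delta) | (initial - config)

initial = {(1, 1), (1, 2), (2, 3), (3, 4)}
plans = make_plans(initial, {1, 2, 3})     # box 21
after_crash = fail(initial, plans, 1)      # boxes 22-23: P1 goes down
restored = revert(after_crash, initial)    # boxes 26-27: P1 reinserted
```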
  • the faulty processor or faulty processors are repaired and are inserted into the system, box 26. If two or more processors have crashed, it is assumed they are repaired and inserted back into the system simultaneously. Theoretically it is of course possible to repair faulty processors one by one and insert them into the system one by one, but from a practical point of view this procedure is roundabout.
  • the last step in the process, box 27, is to take the system back to its initial configuration using the second algorithm.
  • Examples of hardware equipment controlled by software modules are I/O devices, subscriber line interface devices, subscriber line processors, tone decoders, voice prompting devices and conference equipment. Hardware dependencies of this kind pose restrictions on the software modules.
  • a software module involved in controlling hardware equipment that is connected to one or more processors cannot be transferred to an arbitrary processor in the system but must be transferred to a processor that has access to the very same hardware equipment. The catastrophe plans must be created with this in mind.
  • a telecom system can usually continue to operate despite the loss of some devices, although the services it provides might be somewhat impaired.
  • Figure 6B illustrates a software object which contains a function part (execution part) and a persistent data part (persistent part) .
  • the software object in Figure 6C contains the function part of the software object shown in Figure 6B and the software object in Figure 6D contains the persistent data part of the same software object shown in Figure 6B.
  • the software objects in Figures 6C and 6D together form a pair and have the same key.
  • the key is the logical address of the software object shown in Figure 6B in the data base, and the key will therefore also be the logical address of the two software objects of Figures 6C and 6D in the data base.
  • blocking is provided by setting the persistent data part of the software object shown in Figure 6D in a blocked state.
  • a blocked state is marked by setting a flag in the software object. Note that it is not the device controlling software object of Figure 6C that is blocked but its persistent companion software object of Figure 6D.
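A sketch of the flag mechanism with invented class names: the flag lives in the persistent companion (the Figure 6D part), and communication with the pair is refused while it is set:

```python
# Sketch (class names are ours): blocking a hardware-dependent object by
# flagging its persistent companion (the Figure 6D part), not the device
# controlling part itself (the Figure 6C part).
class PersistentPart:
    def __init__(self, key):
        self.key = key        # the shared logical address (same key as 6C)
        self.blocked = False  # the flag that marks the blocked state

class FunctionPart:
    def __init__(self, persistent):
        self.persistent = persistent  # its Figure 6D companion

    def receive(self, message):
        """Other objects may communicate only while the pair is unblocked."""
        if self.persistent.blocked:
            raise RuntimeError(f"object {self.persistent.key} is blocked")
        return f"handled {message}"

pers = PersistentPart("uart-1")  # hypothetical key
dev = FunctionPart(pers)
```

Setting `pers.blocked = True` then makes every `dev.receive(...)` call fail, which is the effect the blocked state is meant to have.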
  • the software object is blocked in the address tables existing in the operating system of the respective processors.
  • the operating system of a processor has address tables to its own software objects. The address tables are used when messages are sent to its objects.
  • the processors of a processor system can be of two kinds, fault tolerant processors (FTP processors) and non-FTP processors.
  • FTP processors fault tolerant processors
  • An FTP processor, which usually comprises double processors, goes on executing its tasks even if a simple hardware fault arises in the hardware equipment controlled by the FTP processor. When the fault occurs an alarm will be triggered but the program does not crash.
  • It is possible to take an FTP processor out of service for repair in a controlled manner, using the catastrophe plans, so that the services the application delivers will not be interrupted. For example, in a telecom system no traffic disturbances will occur; ongoing calls will not be interrupted.
  • In FIG. 5 there is shown a processor system similar to that of Figure 1, where there is a network N1 to which processors P1-P4 have access and over which they can communicate with each other. Although not shown in Figure 5, it is supposed that the software objects 1-18 are distributed on processors P1-P4 in the same way as shown in Figure 1. Further there is a device D1 connected to processor P1. There is also a second network N2 to which processors P2 and P4 have access. Device processor D5 is a device that is used to connect devices D2 and D3 to the network N2. Device processor D5 thus controls devices D2 and D3. Another device processor D6 connects devices D4 and D5 to the network N2 and will thus control these.
  • A typical example of a device of the D1 kind is processor P1 itself.
  • Another example is some hardware device, like a UART device. If the processor P1 goes down, or if the hardware device controlled by processor P1 goes down, then the software object which represents processor P1, or the software object that represents the faulty hardware, cannot be transferred to any of processors P2-P4, since none of these can gain control over the faulty hardware or over processor P1. Nevertheless, the processor system must tolerate that P1 goes down if the system is to be redundant. When processor P1 goes down, the software object that represents processor P1 must be blocked so it cannot be accessed by any other software objects.
  • object 1 represents processor Pl
  • object 4 represents processor P2
  • object 8 represents processor P3 and object 13 processor P4. All such hardware dependencies are exactly described by the model of the hardware with installed software as shown in Figure 7. Accordingly the model shows that none of the objects 1, 4, 8 and 13 can be transferred to any other processor in the system.
  • the first algorithm operates on the model of the hardware with installed software and will thus take into account all hardware dependencies when it creates new catastrophe plans.
  • the catastrophe table for processor P1 shown in Table 2 will therefore remain the same, with the exception that object 1 disappears.
  • In the catastrophe table for processor P2, object 4 will disappear; in the catastrophe table for processor P3, object 8 will disappear; and object 13 will disappear from the catastrophe table associated with processor P4. Accordingly, fewer objects than described previously will remain on the respective processors when a processor goes down.
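Illustratively (all names ours), dropping the hardware-bound objects from a plan can be sketched as a filter, reproducing the example in which object 1 disappears from P1's Table 2 plan:

```python
# Sketch: hardware-bound objects (1, 4, 8 and 13 in the example, each
# representing its own processor) cannot leave their processors, so they
# are simply dropped from the catastrophe plans; when their host fails
# they are blocked instead of transferred.
HARDWARE_BOUND = {1, 4, 8, 13}  # objects that represent processors P1-P4

def prune_plan(plan):
    """Remove non-transferable objects from a catastrophe plan."""
    return {obj: target for obj, target in plan.items()
            if obj not in HARDWARE_BOUND}

# P1's plan per the earlier example (objects 1, 3 -> P2; object 2 -> P4)
# loses object 1:
cp_p1 = prune_plan({1: 2, 2: 4, 3: 2})
```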
  • If for example processor P1 goes down it will be necessary to block object 1 from access by other software objects. This means that no other software objects are allowed to communicate with software object 1.
  • the model in accordance with the invention will always pretend that the system will operate even if hardware equipment is lost and cannot be controlled by its software objects. If, however, at a higher level of the system it turns out that the system does not operate, not even with impaired services delivered, then the reason why the system does not operate is lack of redundancy and the model cannot change this fact.
  • in Figure 6A the software model of the modularized software object is shown.
  • the software model comprises a class named ConfigObj which has the operations construct() and destruct().
  • construct() and destruct() are used to create and kill, respectively, a particular modular software object.
  • the modular software object has been described above with reference to Figures 6B-6D.
  • the hardware model 30 describes the physical devices and their sites in the system.
  • the hardware model 30 comprises a class Processor 31 which generates objects that represent processors P1-P4, a class CPPool 32 which generates objects that represent network N1, a class DevX 33 which generates objects that represent network N2, and a class ConfigObj 34 which generates objects that represent the software objects, shown in Figure 6B, of the devices referred to above, the device processors included.
  • class ConfigObj 35 generates objects which represent the software objects of the modularized software object shown in Figure 6A but which do not form part of the hardware model.
  • the model shows that the processors are connected to N1 and to N2.
  • the model also shows that software which has no devices can be installed on all processors that can connect to N1, while software which controls devices must be installed on processors that can connect to N2.
  • the model will define the constraints for each hardware-dependent software object. By connecting the ConfigObj to OrgX, not only are the hardware devices included in the hardware model, but at the same time the software objects that control the devices are installed in the model.
  • suppose that processor P1 goes down and that the software objects executing thereon shall be redistributed in accordance with its catastrophe plan shown in Table 2.
  • the catastrophe plan is stored in fragments, distributed over processors P2 and P3.
  • catastrophe plan fragment 40 is stored in the memory M of processor P2
  • catastrophe plan fragment 41 is stored in memory M of processor P3.
  • in the memories of P2 and P3, software objects and data are stored, as exemplified by the various hatched layers. The memories are not completely filled, as exemplified by the non-hatched memory areas.
  • memory M of processor P2 has a free, non-occupied memory area 42
  • and memory M of processor P3 has a free, non-occupied memory area 43.
  • the CPU capacities of the different processors are used to different extents (not necessarily in proportion to the memory usage of the respective processor).
  • in accordance with the catastrophe plan of processor P1, software object 1 shall be redistributed to processor P2.
  • the free memory area 42 of processor P2, however, is not large enough to house object 01. Therefore the catastrophe plan of processor P1 contains an initial redistribution phase in order to make room for software object 01.
  • a software object stored in the memory of processor P2 is removed and is transferred to the free memory area 43 in processor P3, leaving an enlarged free memory area in processor P2 that is large enough to house software object 01.
  • the objects executing on processor P2 are killed. Following this, the objects which, in accordance with the catastrophe plan, are to execute on processor P2 are created on processor P2. In this manner there will be processor resources (memory as well as CPU capacity) available for executing the new objects.
  • the object 04 killed on processor P2 must not disappear, and is therefore created on another non-faulty processor, for example processor P3.
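The hardware-dependency handling described above — objects 1, 4, 8 and 13 represent processors P1-P4 themselves, so they disappear from the catastrophe tables and are blocked against access when their processor fails — can be sketched as follows. This is an illustration only, not part of the patent; all names (HW_BOUND, prune_plan, handle_failure) and the example plan are hypothetical.

```python
# Illustrative sketch (not from the patent text): hardware-bound software
# objects cannot be transferred, so they are removed from the failed
# processor's catastrophe plan and blocked against further access.

# Objects 1, 4, 8 and 13 represent processors P1-P4 themselves and can
# therefore never be moved to another processor.
HW_BOUND = {"P1": {1}, "P2": {4}, "P3": {8}, "P4": {13}}

blocked = set()  # objects no other software object may communicate with

def prune_plan(plan, failed):
    """Drop the failed processor's hardware-bound objects from its plan."""
    bound = HW_BOUND.get(failed, set())
    return {obj: target for obj, target in plan.items() if obj not in bound}

def handle_failure(plan, failed):
    """Block the hardware-bound objects, then return the pruned plan."""
    blocked.update(HW_BOUND.get(failed, set()))
    return prune_plan(plan, failed)

# Hypothetical catastrophe plan for P1: object id -> target processor.
plan_p1 = {1: None, 2: "P2", 3: "P3"}
surviving = handle_failure(plan_p1, "P1")
print(surviving)      # {2: 'P2', 3: 'P3'} -- object 1 has disappeared
print(1 in blocked)   # True -- object 1 may no longer be accessed
```

As in the description, the first algorithm would derive HW_BOUND from the model of the hardware with installed software rather than from a hard-coded table.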
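The installation constraints expressed by the hardware model — device-free software may be installed on any processor that can connect to N1, while device-controlling software must sit on a processor that can connect to N2 — amount to a simple placement check, sketched below. The connectivity map is hypothetical (the patent only states that the processors connect to N1 and N2).

```python
# Illustrative sketch (hypothetical connectivity): which processors a
# software object may be installed on, derived from the hardware model.
CONNECTED = {
    "P1": {"N1", "N2"},
    "P2": {"N1", "N2"},
    "P3": {"N1"},        # assume, for illustration, P3 lacks an N2 link
    "P4": {"N1", "N2"},
}

def allowed_processors(controls_device):
    """Device-controlling software needs N2; device-free software needs N1."""
    needed = "N2" if controls_device else "N1"
    return {p for p, nets in CONNECTED.items() if needed in nets}

print(sorted(allowed_processors(False)))  # ['P1', 'P2', 'P3', 'P4']
print(sorted(allowed_processors(True)))   # ['P1', 'P2', 'P4']
```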
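The two-phase redistribution just described — an initial phase makes room on the target processor by moving one of its objects to another processor, after which the incoming object is created — can be sketched as follows. All sizes, capacities and names are invented for illustration and do not come from the patent.

```python
# Illustrative sketch (hypothetical sizes and names) of the initial
# redistribution phase: if the target processor's free memory area is too
# small, objects on it are killed and re-created on a spare processor
# (so they do not disappear), enlarging the free area for the new object.

class Proc:
    def __init__(self, name, capacity):
        self.name, self.capacity = name, capacity
        self.objects = {}                     # object id -> memory size

    @property
    def free(self):
        return self.capacity - sum(self.objects.values())

def redistribute(obj_id, size, target, spare):
    # Initial phase: evict objects from the target until the object fits.
    while target.free < size and target.objects:
        victim, vsize = next(iter(target.objects.items()))
        del target.objects[victim]            # "killed" on the target ...
        spare.objects[victim] = vsize         # ... re-created on the spare
    target.objects[obj_id] = size             # create the incoming object

p2, p3 = Proc("P2", 100), Proc("P3", 100)
p2.objects = {4: 60, 5: 30}   # free area of only 10 units on P2
redistribute(1, 40, p2, p3)   # object 01 from the failed P1 needs 40 units
print(p2.objects)             # {5: 30, 1: 40}
print(p3.objects)             # {4: 60}
```

A real catastrophe plan would also account for CPU capacity, not just memory, since the description notes that the two need not be used in proportion.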

PCT/SE1996/001609 1995-12-08 1996-12-06 Processor redundancy in a distributed system WO1997022054A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU10488/97A AU1048897A (en) 1995-12-08 1996-12-06 Processor redundancy in a distributed system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SE9504396A SE515348C2 (sv) 1995-12-08 1995-12-08 Processorredundans i ett distribuerat system
SE9504396-4 1995-12-08

Publications (2)

Publication Number Publication Date
WO1997022054A2 true WO1997022054A2 (en) 1997-06-19
WO1997022054A3 WO1997022054A3 (en) 1997-09-04

Family

ID=20400521

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE1996/001609 WO1997022054A2 (en) 1995-12-08 1996-12-06 Processor redundancy in a distributed system

Country Status (3)

Country Link
AU (1) AU1048897A (sv)
SE (2) SE515348C2 (sv)
WO (1) WO1997022054A2 (sv)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0959587A2 (en) * 1998-04-02 1999-11-24 Lucent Technologies Inc. Method for creating and modifying similar and dissimilar databases for use in network configuration for use in telecommunication systems
WO2000064199A2 (en) * 1999-04-14 2000-10-26 Telefonaktiebolaget Lm Ericsson (Publ) Recovery in mobile communication systems
WO2001013232A2 (en) * 1999-08-17 2001-02-22 Tricord Systems, Inc. Self-healing computer system storage
GB2359384A (en) * 2000-02-16 2001-08-22 Data Connection Ltd Automatic reconnection of linked software processes in fault-tolerant computer systems
US6438707B1 (en) 1998-08-11 2002-08-20 Telefonaktiebolaget Lm Ericsson (Publ) Fault tolerant computer system
US6449731B1 (en) 1999-03-03 2002-09-10 Tricord Systems, Inc. Self-healing computer system storage
US6725392B1 (en) 1999-03-03 2004-04-20 Adaptec, Inc. Controller fault recovery system for a distributed file system
WO2004062303A1 (en) * 2002-12-30 2004-07-22 At & T Corporation System and method of disaster restoration
WO2005009058A1 (de) * 2003-06-26 2005-01-27 Deutsche Telekom Ag Verfahren und system zur erhöhung der vermittlungskapazität in telekommunikationsnetzwerken durch übertragung oder aktivierung von software
US6922688B1 (en) 1998-01-23 2005-07-26 Adaptec, Inc. Computer system storage
US7287179B2 (en) 2003-05-15 2007-10-23 International Business Machines Corporation Autonomic failover of grid-based services
US7715837B2 (en) * 2000-02-18 2010-05-11 Telefonaktiebolaget Lm Ericsson (Publ) Method and apparatus for releasing connections in an access network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4371754A (en) * 1980-11-19 1983-02-01 Rockwell International Corporation Automatic fault recovery system for a multiple processor telecommunications switching control
US4710926A (en) * 1985-12-27 1987-12-01 American Telephone And Telegraph Company, At&T Bell Laboratories Fault recovery in a distributed processing system


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DISTRIBUTED PROCESSING - PROCEEDINGS OF THE IFIP WG 10.3 ...., October 1987, A-M. DEPLANCHE et al., "Task Redistribution with Allocation Constraints in a Fault-Tolerant Real-Time Multiprocessor System", pages 136-143. *
IEEE TRANS. ON PARALLEL AND DISTRIBUTED SYSTEMS, Volume 4, No. 8, August 1993, N-F. TZENG, "Reconfiguration and Analysis of a Fault-Tolerant Circular Butterfly Parallel System", pages 855-863. *
IEEE TRANS. ON RELIABILITY, Volume 38, No. 1, April 1989, C-M. CHEN et al., "Reliability Issues with Multiprocessor Distributed Database Systems: A Case Study", pages 153-155. *
PATENT ABSTRACTS OF JAPAN, Vol. 96, No. 01; & JP,A,07 234 849 (HITACHI LTD), 5 Sept. 1995. *
SPECIAL INTEREST GROUP ON MANAGEMENT OF DATA, No. 2, 1995, L.D. MOLESKY et al., "Recovery Protocols for Shared Memory Database Systems", pages 11-22. *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6922688B1 (en) 1998-01-23 2005-07-26 Adaptec, Inc. Computer system storage
EP0959587A3 (en) * 1998-04-02 2000-05-10 Lucent Technologies Inc. Method for creating and modifying similar and dissimilar databases for use in network configuration for use in telecommunication systems
EP0959587A2 (en) * 1998-04-02 1999-11-24 Lucent Technologies Inc. Method for creating and modifying similar and dissimilar databases for use in network configuration for use in telecommunication systems
US6438707B1 (en) 1998-08-11 2002-08-20 Telefonaktiebolaget Lm Ericsson (Publ) Fault tolerant computer system
US6725392B1 (en) 1999-03-03 2004-04-20 Adaptec, Inc. Controller fault recovery system for a distributed file system
US6449731B1 (en) 1999-03-03 2002-09-10 Tricord Systems, Inc. Self-healing computer system storage
AU770164B2 (en) * 1999-04-14 2004-02-12 Telefonaktiebolaget Lm Ericsson (Publ) Recovery in mobile communication systems
WO2000064199A2 (en) * 1999-04-14 2000-10-26 Telefonaktiebolaget Lm Ericsson (Publ) Recovery in mobile communication systems
WO2000064199A3 (en) * 1999-04-14 2001-02-01 Ericsson Telefon Ab L M Recovery in mobile communication systems
US6775542B1 (en) 1999-04-14 2004-08-10 Telefonaktiebolaget Lm Ericsson Recovery in mobile communication systems
WO2001013233A3 (en) * 1999-08-17 2001-07-05 Tricord Systems Inc Self-healing computer system storage
WO2001013233A2 (en) * 1999-08-17 2001-02-22 Tricord Systems, Inc. Self-healing computer system storage
US6530036B1 (en) 1999-08-17 2003-03-04 Tricord Systems, Inc. Self-healing computer system storage
WO2001013232A2 (en) * 1999-08-17 2001-02-22 Tricord Systems, Inc. Self-healing computer system storage
WO2001013232A3 (en) * 1999-08-17 2001-07-12 Tricord Systems Inc Self-healing computer system storage
GB2359384A (en) * 2000-02-16 2001-08-22 Data Connection Ltd Automatic reconnection of linked software processes in fault-tolerant computer systems
GB2359384B (en) * 2000-02-16 2004-06-16 Data Connection Ltd Automatic reconnection of partner software processes in a fault-tolerant computer system
US7715837B2 (en) * 2000-02-18 2010-05-11 Telefonaktiebolaget Lm Ericsson (Publ) Method and apparatus for releasing connections in an access network
WO2004062303A1 (en) * 2002-12-30 2004-07-22 At & T Corporation System and method of disaster restoration
US7058847B1 (en) 2002-12-30 2006-06-06 At&T Corporation Concept of zero network element mirroring and disaster restoration process
US7373544B2 (en) 2002-12-30 2008-05-13 At&T Corporation Concept of zero network element mirroring and disaster restoration process
US7287179B2 (en) 2003-05-15 2007-10-23 International Business Machines Corporation Autonomic failover of grid-based services
WO2005009058A1 (de) * 2003-06-26 2005-01-27 Deutsche Telekom Ag Verfahren und system zur erhöhung der vermittlungskapazität in telekommunikationsnetzwerken durch übertragung oder aktivierung von software
US8345708B2 (en) 2003-06-26 2013-01-01 Deutsche Telekom Ag Method and system for increasing the switching capacity in telecommunications networks by transmission or activation of software

Also Published As

Publication number Publication date
AU1048897A (en) 1997-07-03
SE9703132D0 (sv) 1997-08-29
SE9703132L (sv)
SE515348C2 (sv) 2001-07-16
SE9504396D0 (sv) 1995-12-08
WO1997022054A3 (en) 1997-09-04
SE9504396L (sv) 1997-06-09
SE9703132A0 (sv) 1997-08-29

Similar Documents

Publication Publication Date Title
KR100326982B1 Highly available cluster system having high scalability and method of managing the same
EP0717355B1 (en) Parallel processing system and method
US6854069B2 (en) Method and system for achieving high availability in a networked computer system
EP1617331B1 (en) Efficient changing of replica sets in distributed fault-tolerant computing system
US7870235B2 (en) Highly scalable and highly available cluster system management scheme
EP2643771B1 (en) Real time database system
US7302609B2 (en) Method and apparatus for executing applications on a distributed computer system
WO1997022054A2 (en) Processor redundancy in a distributed system
JP2000112911A System and method for automatically redistributing tasks in a database management system in a computer network
JP2008210412A Method for managing remotely accessible resources in a multi-node distributed data processing system
WO1998032074A1 (en) Data partitioning and duplication in a distributed data processing system
Babaoğlu et al. System support for partition-aware network applications
CN108984320A Method and device for preventing split-brain in a message queue cluster
US11544162B2 (en) Computer cluster using expiring recovery rules
CN114338670B Edge cloud platform, and three-level cloud control platform for networked traffic comprising the same
WO1997049034A1 (fr) Task handling system
Corsava et al. Intelligent architecture for automatic resource allocation in computer clusters
CN115291891A Cluster management method, apparatus and electronic device
JP2004094681A Distributed database control device, control method and control program
Pimentel et al. A fault management protocol for TTP/C
CN109995560A Cloud resource pool management system and method
CN208299812U Master-standby switching system based on a ZooKeeper cluster
JP3183216B2 Duplexed MO management system
CN118626098A Cluster deployment method and system therefor
CN117714386A Distributed system deployment method, configuration method, system, device and medium

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE HU IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG US UZ VN AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: JP

Ref document number: 97521980

Format of ref document f/p: F

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase