EP0536375A1 - Fault tolerant network file system - Google Patents

Fault tolerant network file system

Info

Publication number
EP0536375A1
Authority
EP
European Patent Office
Prior art keywords
primary
fileserver
backup
register
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP19920909636
Other languages
English (en)
French (fr)
Inventor
Gordon Vinther
James W. Mcgrath
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Eastman Kodak Co
Original Assignee
Eastman Kodak Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eastman Kodak Co filed Critical Eastman Kodak Co
Publication of EP0536375A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • G06F11/2071Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring using a plurality of controllers
    • G06F11/2074Asynchronous techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • G06F11/2071Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring using a plurality of controllers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2089Redundant storage control functionality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2094Redundant storage or storage space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2002Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant
    • G06F11/2007Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication media
    • G06F11/201Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication media between storage system components

Definitions

  • This invention relates to a fault tolerant network file system having a primary fileserver and a backup fileserver which mirrors the primary.
  • Upon failure of the primary, the backup assumes the role of the primary on the network in a manner transparent to users whose files are stored on the primary.
  • a network generally includes a group of nodes which communicate with each other over a high speed communications link. Examples of nodes include single user personal computers and workstations, multiple user computers, and peripheral devices such as image scanners, printers and display devices.
  • Many networks include a fileserver node which operates as a central storage facility for data and software used by many nodes on the network.
  • the fileserver includes a disk storage device, for storing the data and software, and a central processing unit (CPU) for controlling the fileserver.
  • Files arriving at the fileserver over the communication link are typically first stored in a cache memory within the central processing unit and later copied to the disk storage device for permanent storage.
  • If the fileserver fails before cached files are written to disk, the files may be irretrievably lost. Accordingly, a client node sending certain critical files to the fileserver may instruct the fileserver to immediately write specified files to disk, thereby reducing the likelihood that the specified files will be lost in the event of such a failure.
  • a backup fileserver may be used to maintain a copy of the primary's files. Upon failure of the primary, an exact copy of the primary's files can be automatically accessed through the backup fileserver.
  • One object of the invention is to provide a backup fileserver which promptly mirrors each change of the primary's files.
  • the invention relates to an improved fault tolerant network fileserver system.
  • the network fileserver system includes a network communication link connected to a plurality of nodes.
  • a primary fileserver and a backup fileserver are also connected to the network communication link for storing files from the nodes.
  • the primary fileserver includes a primary computer processor, a primary storage disk, a first network interface connected to the network communication link, and a first independent interface connected to the backup fileserver.
  • the first independent interface is responsive to commands from the primary processor for communicating information to the backup fileserver.
  • the backup fileserver includes a backup computer processor, a backup storage disk, a backup network interface connected to the network communication link, and a second independent interface connected to the first independent interface.
  • the second independent interface includes a dual ported memory. It receives information from the primary fileserver and stores the information in the dual ported memory. In response to commands from the backup processor, the second independent interface provides information from the dual ported memory.
  • the second independent interface includes a means for interrupting the backup computer processor to notify it that the dual ported memory contains information received from the primary fileserver.
  • a portion of the dual ported memory is arranged as a group of data memory blocks, each for storing data to be copied to the backup storage disk. Other portions store an entry pointer identifying a next available data block, and a removal pointer identifying a next data block to be emptied.
  • Other portions of the dual ported memory may be arranged as a first control message block, for storing control messages from the primary fileserver to the backup fileserver, and a second control message block, for storing control messages from the backup fileserver to the primary fileserver.
  • the second independent interface includes at least one common register accessible to both the primary and backup processors for indicating the entry and removal of data blocks from the data block portion.
  • a preferred embodiment includes a count register.
  • the second interface includes a means responsive to signals from the first independent interface for modifying the contents of the count register to indicate the entry of data into a memory block.
  • the second independent interface is further responsive to signals from the backup computer processor for modifying the count register to indicate the removal of data from a memory block.
  • the dual ported memory includes a semaphore register for arbitrating access to the count register.
  • the semaphore register has a first and a second state of operation. When read by a processor while in the first state, the semaphore register provides a register available code and automatically enters the second state.
  • When read by a processor while in the second state, the semaphore register provides a register unavailable code and remains in the second state. When written to by a processor while in the second state, the semaphore register returns to the first state.
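The read-and-set behavior of the semaphore register can be sketched with a small software model (a hypothetical simulation for illustration only; the class and method names are not from the patent, which describes a hardware register):

```python
class SemaphoreRegister:
    """Model of the two-state semaphore register described above.

    Reading while the register is in the first (available) state returns
    the "register available" code and automatically moves the register
    to the second (taken) state; reading while taken returns the
    "register unavailable" code; any write returns it to the first state.
    """
    AVAILABLE = 0    # "register available" code (assumed encoding)
    UNAVAILABLE = 1  # "register unavailable" code (assumed encoding)

    def __init__(self):
        self._state = self.AVAILABLE

    def read(self):
        # Atomic read-and-set: a successful read grants ownership.
        if self._state == self.AVAILABLE:
            self._state = self.UNAVAILABLE
            return self.AVAILABLE
        return self.UNAVAILABLE

    def write(self, _value=0):
        # Any write releases the register back to the available state.
        self._state = self.AVAILABLE
```

Because the read itself transitions the state, two processors can never both observe the "available" code for the same acquisition, which is what lets the two CPUs arbitrate without any other handshake.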
  • the primary fileserver further includes an improved Unix operating system for controlling the copying of files received by the first network interface to the primary storage disk and to the dual ported memory.
  • Conventional Unix operating systems receive both secure disk write instructions and unsecure disk write instructions directing the operating system to write specified files to the primary storage disk.
  • an unsecure write procedure writes to the primary storage disk using an efficiency algorithm which temporarily defers copying the files to disk.
  • In response to a secure write instruction, a secure procedure writes files to disk promptly without the benefit of the efficiency algorithms.
  • the improved operating system responds to both secure and unsecure instructions by promptly writing the specified files to the dual ported memory, and subsequently writing the specified files to the primary disk storage device using the unsecure write procedure.
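The modified write path can be sketched as follows (a minimal sketch under assumed names; the patent describes this behavior inside a modified Unix kernel, not as these functions):

```python
def handle_write(file_id, data, dual_ported_memory, delayed_queue):
    """Sketch of the improved write path: both secure and unsecure write
    instructions are handled the same way. The file is first copied
    promptly to the backup's dual ported memory, and the write to the
    primary disk is deferred to the efficient delayed (unsecure)
    procedure. All names here are illustrative."""
    # Prompt copy to the backup-powered dual ported memory: the data is
    # now safe against a failure of the primary fileserver.
    dual_ported_memory.append((file_id, data))
    # Defer the local disk write so the efficiency algorithms can batch it.
    delayed_queue.append((file_id, data))

def flush_delayed_writes(delayed_queue, disk):
    """Later, the delayed write procedure drains the queue to disk."""
    while delayed_queue:
        file_id, data = delayed_queue.pop(0)
        disk[file_id] = data
```

The point of the scheme is visible in the sketch: the mirror copy happens before the acknowledgment-relevant step, while the slow disk write never blocks the caller.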
  • Figure 1 is a block diagram of a computer network having a primary fileserver and a backup fileserver.
  • Figure 2 is a block diagram of a pair of interface boards for connecting the primary fileserver to the backup fileserver.
  • Figure 3 is a diagram of several registers on the interface boards shown in Figure 2.
  • Figure 4 is a diagram showing the organization of a dual ported memory according to a preferred embodiment of the invention.
  • Figures 5(a) and 5(b) are a flow chart of a method for copying new files of the primary fileserver to the dual ported memory.
  • Figures 6(a) and 6(b) are a flow chart of a method for copying data from the dual ported memory to a backup fileserver.
  • Figures 7(a) through 7(d) are a flow chart of a procedure by which a backup fileserver temporarily replaces the primary fileserver.
  • Figures 8(a) and 8(b) are a flow chart of a procedure by which a failed primary fileserver resumes its role on the network after recovering from a failure.
  • a network 10 includes a plurality of nodes 12, each for performing specific tasks in the design and production of publications such as newspapers and magazines.
  • node 12(a) could be a computer workstation which allows a journalist to draft an article to be included in the publication
  • node 12 (b) could be a scanner for digitizing an image, such as a photograph, to be printed with the article prepared at node 12 (a)
  • node 12(c) could be a computer workstation which allows a user to lay out the entire publication by selecting and arranging articles, images and advertisements prepared by other nodes on the network.
  • the network may include other nodes for producing hard copies of the publication.
  • printer nodes prepare hard copies of images called "proofs".
  • Other nodes prepare printing plates used in high volume printing of the publication.
  • Each node 12 communicates with other nodes via a high speed communication link 14.
  • link 14 is a high speed serial communication channel over which a node can broadcast messages to one or more other nodes.
  • Many nodes include disks for storage of information used locally on the node.
  • workstations 12(a) and 12(c) typically include disks for storing software used in performing their respective tasks, e.g., text editing and document layout.
  • Some nodes are diskless and therefore require access to a central fileserver 15 for storing information and accessing software. Further, even nodes having disks often store data on the central fileserver. Requiring nodes to store data files in a central fileserver provides several advantages, including the centralized control of a common file system which is accessible to many nodes on the network.
  • the primary fileserver includes a first network interface 20 for receiving messages (e.g., files) from communications link 14 and storing them in a buffer memory 21.
  • a central processing unit (CPU) 24 responds to instructions from operating system software 25 to transfer the buffer contents over a parallel bus 22, through a first interface 28, across a cable 19 to a second interface 30 within backup fileserver 16.
  • Second interface 30 includes the dual ported memory 31 which temporarily stores the buffer data. After storing data in memory 31, CPU 24 instructs interface 30 to interrupt a CPU 32 within backup fileserver 16, thereby notifying CPU 32 that memory 31 includes data. In response to the interrupt, CPU 32 moves the buffer data contents of dual ported memory 31 to a cache memory 33. An operating system 35 later instructs CPU 32 to copy the data from cache 33 to the backup disk 34, thereby providing disk 34 with a copy of each file arriving from the network.
  • the operating system 25 instructs CPU 24 to move the data to a data cache memory 27.
  • Operating system 25 next instructs CPU 24 to copy the contents of cache memory 27 to a primary disk 26.
  • the operating system provides three types of write procedures for copying the files from cache 27 to disk 26.
  • operating system 25 is a Unix operating system, modified to include code for interacting with backup fileserver 16 to copy files from buffer memory 21 to dual ported memory 31.
  • the filesystem code includes several types of write procedures for copying files from cache 27 to disk.
  • the filesystem code typically employs a delayed write procedure.
  • the delayed write procedure implements efficiency algorithms which manage the writing of data to disk to avoid unnecessary delays.
  • the delayed write procedure searches the cache for files which are assigned to the same disk "cylinder". It then instructs the CPU to initialize a direct memory access (DMA) controller to successively transfer the selected files to disk. After this initialization is complete, the CPU returns to other tasks while the DMA controller attends to copying the selected data files to disk. Since the files are stored in the same disk cylinder, a time consuming "cylinder seek" operation is avoided by writing these files successively.
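The cylinder-grouping idea behind the delayed write procedure can be illustrated with a short sketch (the entry format and function name are assumptions for illustration, not from the patent):

```python
from collections import defaultdict

def batch_writes_by_cylinder(cache_entries):
    """Group pending cache entries by their target disk cylinder so the
    DMA controller can transfer each cylinder's files successively,
    paying for at most one cylinder seek per batch instead of one per
    file. Each entry is assumed to be a (cylinder, data) pair."""
    batches = defaultdict(list)
    for cylinder, data in cache_entries:
        batches[cylinder].append(data)
    # One seek per distinct cylinder; files within a batch are written
    # back to back without repositioning the disk heads.
    return dict(batches)
```

A synchronous write, by contrast, would force each file to disk as it arrives, forfeiting exactly this batching opportunity.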
  • For critical files, a synchronous write procedure is typically used.
  • the synchronous write procedure promptly writes the file to disk without employing the efficiency algorithms described above.
  • the synchronous write procedure reduces the likelihood that the file contents will be lost in the event of a failure of the fileserver.
  • this procedure degrades disk performance by preventing the efficiency algorithms from reducing the number of time consuming cylinder seek operations.
  • the client node may also operate in a synchronized fashion, waiting for the fileserver to confirm that the file has been written to disk before proceeding to execute the application program which prompted the write to the filesystem.
  • the relatively unsecure delayed disk write procedure is the least taxing on the system performance. However, this procedure risks the loss of crucial data. In the event of a failure of the fileserver, contents of the volatile cache memory are destroyed. Accordingly, delayed writes are typically used for relatively unimportant data while synchronous and asynchronous writes are used for relatively critical data.
  • Backup fileserver 16 provides a high degree of security against a failure without the need for the costly synchronous disk writes. More specifically, each file is initially copied to the dual ported memory 31 which is powered and controlled by the backup fileserver. Since the backup fileserver is thus independent of the primary, there is no need to immediately copy the volatile cache memory 27 to the primary's nonvolatile disk 26. Accordingly, operating system 25 is a modified Unix operating system wherein all synchronous disk writes are converted to delayed writes. By eliminating synchronous writes, the performance of the fileserver is dramatically improved since the efficiency algorithms coordinate all disk writes to optimize disk performance.
  • client nodes running application programs which call for many remote synchronous writes receive an acknowledgment from the fileserver as soon as the file has been written to the relatively fast dual ported memory.
  • dual ported memory 31 includes a nineteen bit address bus ADDR which is accessible to both CPU 24 and CPU 32.
  • To access a given location in memory 31, CPU 24 first loads a pair of address registers 40, 42 with the address of the location to be accessed. As shown in Fig. 3, register 42 contains the low order sixteen bits of the address (i.e., bits A0 - A15) and register 40 contains the upper three bits of the address (bits A16 - A18).
  • the contents of registers 40, 42 are automatically applied to the address bus ADDR, thereby pointing to a specific location within memory 31.
  • CPU 24 performs a write cycle to a predetermined address assigned to the register.
  • An address decoder 44 within interface 28 decodes the address from parallel bus 22. Upon recognizing the predetermined register address, decoder 44 asserts a pair of encoded control signals "REG" which indicate that CPU 24 is requesting access to register 40.
  • a second address decoder 46 within interface 30 decodes the control signals and asserts a register enable signal R1 which selects register 40.
  • the read/write control signal from bus 22, which indicates that a write is being performed, is forwarded by address decoder 44 to interface 30. Decoder 46 receives the forwarded R/W signal and provides a corresponding buffered signal "R/Wl" to register 40.
  • register enable signal R1 causes register 40 to load the data from data bus DB into the register cells.
  • the data on bus DB is provided by CPU 24 through transceivers 50, 52. More specifically, address decoder 44 monitors the read/write control signal from bus 22. Upon recognizing a write cycle to a location on interface 30, decoder 44 enables transceiver 50 to drive data from bus 22 across cable 19. Similarly, when decoder 46 recognizes an access to interface 30, it enables transceiver 52 to forward the data from cable 19 to data bus DB.
  • CPU 24 loads register 42 by performing a write to an address assigned to register 42.
  • decoder 46 asserts a second register enable signal R2 causing the data from bus DB to be loaded into register 42.
  • CPU 24 writes data to the addressed location in memory 31 by performing a write cycle to a predetermined address assigned to memory 31.
  • decoder 46 asserts memory register signal R3 which instructs registers 40, 42 to assert the stored address on the memory address bus ADDR.
  • the memory register signal R3 is further provided to a multiplexer 54 which in response, applies a memory access signal "MEM-Select" to memory 31.
  • the buffered read/write control signal R/Wl is also applied to multiplexer 54 which in response asserts a memory read/write control signal "MEM-R/W" .
  • memory select signal "MEM-Select" instructs memory 31 to load the data D0-D15 (see Fig. 3) from data bus DB to the location provided on the address bus ADDR.
  • the predetermined address assigned to memory 31 behaves as a sixteen bit register 43 whose contents are determined by the contents of the memory location pointed to by address registers 40, 42.
  • register 40 is a sixteen bit register. However, only five bits in the register are used. As explained above, the low order three bits store the high order address bits (A16 - A18). Bit nine is a write increment bit "WI1". If this bit is set, the address in registers 40, 42 is incremented each time CPU 24 writes to a location in memory 31. Thus, CPU 24 can write to a block of locations in memory 31 by loading registers 40, 42 with the base address of the block and setting the write increment bit. CPU 24 then simply repeatedly writes data to the address dedicated to memory 31. Registers 40, 42 increment the memory address with each write, thereby loading successive locations in memory 31. CPU 24 can similarly read a block of memory by setting read increment bit RI1 in register 40 and successively reading from memory 31.
  • CPU 32 reads and writes data from memory 31 in an analogous manner. For example, to read a block of data from memory 31, CPU 32 loads a pair of address registers 56, 58 with the address of the first location to be read, and sets read increment bit RI2 in register 56. It next reads data from the predetermined address dedicated to memory 31.
  • An address decoder 60 decodes the address from bus 62 and upon recognizing the address of memory 31, asserts a memory access signal R7 instructing address registers 56, 58 to assert the stored address on address bus ADDR. Access signal R7 is also applied to multiplexer 54 which responds by asserting memory access signal "MEM-Select".
  • In response to access signal MEM-Select, memory 31 provides data D0-D15 from the memory location identified by the address on bus ADDR to data bus DB.
  • the predetermined address assigned to memory 31 behaves as a sixteen bit register 43 whose contents are determined by the contents of the memory location pointed to by address registers 56, 58.
  • Upon recognizing a read cycle from CPU 32 directed to a location on interface 30, decoder 60 asserts an "enable" signal causing a data transceiver 64 to assert the contents of data bus DB onto bus 62, thereby providing CPU 32 with the desired data.
  • registers 56, 58 automatically increment the stored address.
  • CPU 32 reads the next location in memory 31 by again reading from the address assigned to memory 31.
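The address-register scheme above can be modeled in a few lines (a simulation under assumed names; the real interface is hardware): a CPU loads a base address, optionally sets an increment bit, then streams data through the single address assigned to memory 31.

```python
class DualPortWindow:
    """Software model of the auto-incrementing address-register window
    used to access the dual ported memory. Names are illustrative."""

    def __init__(self, size=2**19):  # nineteen address bits, per the patent
        self.mem = [0] * size
        self.addr = 0
        self.wi = False  # write increment bit (e.g. WI1)
        self.ri = False  # read increment bit (e.g. RI1/RI2)

    def load_address(self, base, wi=False, ri=False):
        # Corresponds to loading the pair of address registers.
        self.addr, self.wi, self.ri = base, wi, ri

    def write(self, value):
        # A write cycle to the memory's dedicated address stores the
        # value at the pointed-to location, then auto-increments if set.
        self.mem[self.addr] = value
        if self.wi:
            self.addr += 1

    def read(self):
        # A read cycle returns the pointed-to location, then increments.
        value = self.mem[self.addr]
        if self.ri:
            self.addr += 1
        return value
```

With the increment bit set, a block transfer reduces to one address load followed by repeated accesses to a single fixed address, which is what lets each CPU copy whole data blocks cheaply.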
  • Fig. 4 illustrates the memory map for dual ported memory 31 of the preferred embodiment.
  • Memory 31 includes a semaphore block 70 at the beginning of memory (i.e., the first 512 locations) which contains semaphores used in controlling access to certain registers described below.
  • the semaphore block is followed by a control register block 72 containing various registers used by CPU 24 and CPU 32 in passing data and control messages as described below.
  • Control register block 72 is followed by a pair of control message blocks 74, 76.
  • the first control message block 74 is used by CPU 24 to send control messages to CPU 32. Each message is defined by three words 75. To send a message, CPU 24 writes the corresponding three word code to a first container 78 within the message block 74. Subsequent messages are written to adjacent containers. CPU 32 removes the messages on a first-in-first-out basis. By the time CPU 24 loads the last container 80 in the block, CPU 32 has already emptied the first. Thus, after loading the last container, CPU 24 returns to the first container.
  • the control message block 74 thus operates as a circular buffer.
  • Control message block 76 is used in the same manner to transfer control messages from CPU 32 to CPU 24.
  • CPU 32 loads messages into the circular buffer and CPU 24 removes them on a first-in-first-out basis.
  • Control message blocks 74, 76 are followed by a group of data memory blocks 82.
  • the data memory blocks are used to transmit data between CPU 24 and CPU 32 in the same manner that control message blocks 74, 76 are used to transmit control messages. More specifically, CPU 24 loads a first block of data into a first memory block 84. As each new block of data arrives, CPU 24 loads the data into the next memory block.
  • CPU 32 removes them on a first-in-first-out basis. By the time CPU 24 loads the last memory block 86, CPU 32 has already emptied the first memory block 84. Thus, once CPU 24 loads the last memory block 86, it returns to the first memory block 84. Memory blocks 82 therefore operate as a circular buffer.
  • Control register block 72 includes a next entry pointer 88 which points to the next available data block in memory 31. It also includes a removal pointer 90, which points to the next data block to be emptied, and a count register 92 which specifies the number of data blocks currently stored in memory 31.
  • When CPU 24 desires to load a block of data into memory 31, it first reads count register 92 to determine whether the circular buffer is full. (Step 210)
  • CPU 32 typically removes blocks quickly enough that the circular data buffer should never become full. However, if the buffer becomes full, CPU 24 repeatedly reads the count register 92 until CPU 32 frees a block and decrements the count. (Step 212) .
  • CPU 24 reads the entry pointer 88 to determine the location of the next available data block. (Step 214) It then initializes registers 40, 42 with the address of the first location in the selected block and sets the write increment bit WI1 in register 40. (Step 216) CPU 24 then loads data into the selected memory block by performing successive writes to memory 31. (Step 218) When the block of data is loaded, CPU 24 updates the entry pointer 88 to point to the next available block. (Step 220)
  • CPU 24 must increment the count to indicate that a new block has been added.
  • CPU 32 may also attempt to modify the count at the same time, i.e., to decrement the count to reflect the removal of a data block. For example, assume both CPU 24 and CPU 32 read the current value of the count, which is five. CPU 24, having just loaded a new data block, seeks to increment the count to six. CPU 32, having just removed a block, seeks to decrement the count to four. If no mechanism is provided to synchronize access, CPU 24 may write a six to the count register 92 and CPU 32 will overwrite this value with a four. Yet the count should remain at five since five blocks remain in memory 31. To avoid this type of error, a count semaphore word 96 is stored in semaphore block 70. (Fig. 4) Before CPU 24 reads the count register 92, it first reads count semaphore 96. (Step 222) If the semaphore is zero, (Step 224) CPU 24 assumes it has control over the count register.
  • When CPU 24 reads a zero from semaphore 96, memory 31 automatically sets the semaphore to one, thereby indicating to CPU 32 that count register 92 is under the exclusive control of CPU 24. After CPU 24 reads the count and increments it (Steps 226, 228), CPU 24 writes a zero to semaphore 96, thereby freeing count register 92 for use by CPU 32. (Step 230) If the count read in step 226 was zero, thereby indicating that the circular buffer was empty before CPU 24 added the last block, CPU 24 interrupts CPU 32 to notify it that the buffer now contains data. (Steps 232, 234) (The mechanism by which CPU 24 interrupts CPU 32 will be described in detail below.) If the count is greater than zero, an interrupt should already be pending due to the previously loaded data block which has not yet been removed. Accordingly, CPU 24 returns to other operations. (Step 236)
  • CPU 32 responds to an interrupt by first clearing the interrupt as described below.
  • (Step 310) It proceeds to remove a block of data from memory 31 by first reading the removal pointer 90 to determine the location of the next block in the buffer.
  • (Step 311) CPU 32 next initializes registers 56, 58 with the address of the first location in the block to be emptied and sets the read increment bit RI2 in register 56.
  • (Step 312) It then empties the block by successively reading from memory 31 and writing the data to cache 33.
  • (Steps 314, 316) When the entire block is emptied, CPU 32 updates the removal pointer 90 to point to the next block in the buffer.
  • CPU 32 updates the count register 92 to reflect the removal of a block from the buffer. Toward this end, it first reads the count semaphore 96. (Step 318). If the semaphore is set to one, indicating that CPU 24 has control of the count, CPU 32 repeatedly reads the semaphore until it returns to zero. (Step 320). Once the semaphore clears, CPU 32 reads the count from register 92 and decrements register 92 to reflect the removal of a block. (Steps 322, 324). After decrementing the count, CPU 32 releases the count register 92 by clearing the semaphore 96. (Step 326). CPU 32 then examines the updated count to determine if the buffer is empty. (Step 328). If the buffer contains another data block, CPU 32 continues to read blocks until the buffer is emptied. (Step 330). Once the buffer is empty, CPU 32 returns to other tasks until a new interrupt appears on bus 62 indicating that new data has been loaded into the data buffer. (Step 332).
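The consumer side (Steps 310-332) can be modeled similarly. The buffer depth, the field names, and the use of plain C memory accesses in place of the actual register reads through interface 30 are assumptions for illustration.

```c
#include <stdint.h>

#define NBLOCKS 8   /* illustrative buffer depth; the patent does not fix one */

/* Software model of the consumer (CPU 32) side of the circular buffer.
 * The fields mirror removal pointer 90, count register 92 and count
 * semaphore 96 in control register block 72. */
typedef struct {
    int blocks[NBLOCKS];       /* stand-in for the data blocks in memory 31 */
    volatile int removal;      /* removal pointer 90 */
    volatile int count;        /* count register 92  */
    volatile uint8_t sem;      /* count semaphore 96 */
} circ_buf;

/* Drain one block; returns the count remaining after the removal. */
static int remove_block(circ_buf *b, int *out)
{
    *out = b->blocks[b->removal];            /* Steps 311-314: empty the block */
    b->removal = (b->removal + 1) % NBLOCKS; /* Step 316: advance the pointer  */

    while (b->sem != 0)                      /* Steps 318-320: wait for semaphore */
        ;
    b->sem = 1;                              /* claim count register 92 */
    int remaining = --b->count;              /* Steps 322-324: decrement count */
    b->sem = 0;                              /* Step 326: release */
    return remaining;                        /* Step 328: zero means buffer empty */
}

/* Interrupt handler body: keep removing blocks until the buffer empties
 * (Steps 330-332), then return to other tasks. Returns blocks drained. */
static int drain_all(circ_buf *b, int *sink)
{
    int n = 0;
    while (b->count > 0)
        remove_block(b, &sink[n++]);
    return n;
}
```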
  • Control message blocks 74, 76 operate in essentially the same manner to transfer control messages. More specifically, control register block 72 includes an entry pointer 98 indicating the location of the next available message container, and a removal pointer 100 indicating the location of the next message container to be emptied. (Fig. 4). It further includes a count register 102 indicating the number of control messages stored in message block 74. Semaphore block 70 includes an associated semaphore 104 for use in arbitrating competing requests for access to the count register 102.
  • Similarly, control register block 72 includes a removal pointer 105, an entry pointer 106, and a count register 107 for use in controlling the passing of messages through control message block 76.
  • Semaphore block 70 includes a semaphore 108 for arbitrating between competing requests for access to count register 107.
  • When CPU 24 loads the first data or message block to memory 31 (that is, memory 31 was previously empty), it notifies CPU 32 that the block is available by interrupting CPU 32.
  • Interface 30 includes a control status register 66 used by CPU 24 to generate an interrupt to CPU 32. More specifically, CPU 24 writes to register 66 to set certain bits in the register which request interface 30 to interrupt CPU 32. To write to register 66, CPU 24 asserts an address dedicated to the register 66. Decoders 44 and 46 decode the address, causing decoder 46 to assert control register access signal R4. Upon receipt of register signal R4, register 66 loads data from data bus DB into its register cells.
  • CPU 24 requests interface 30 to interrupt CPU 32 by setting interrupt bit IG1 of control status register 66 and loading bits zero through five (MID0-MID5) with the identification code ID1 of CPU 32. The identification code ID1 is applied to a comparator 110, which compares ID1 with an identification code "Code1" assigned to CPU 32. The comparator output and the interrupt generation bit IG1 are applied to a three-input AND gate 112, thereby requesting an interrupt.
  • Interface 30 includes a second control status register 68, virtually identical to register 66, which is used by CPU 32 to enable and disable interrupt requests from CPU 24 by setting or clearing an interrupt enable bit IE2.
  • The interrupt enable bit IE2 is applied to the third input of AND gate 112. If IE2 is set, activation of the other two inputs causes the output of AND gate 112 to be asserted, triggering interrupt latch 114 to set. Once set, interrupt latch 114 asserts an interrupt signal "Interrupt2" on bus 62. CPU 32 clears the interrupt by setting the interrupt pending bit IP2 in its status register 68, thereby causing latch 114 to clear.
  • CPU 32 interrupts CPU 24 in the same manner. It sets interrupt bit IG2 in register 68 and loads bits MID0-MID5 with the identification code ID2.
  • The bit IG2 is applied to an input of an AND gate 113, and the identification code bits ID2 are applied to an input of a comparator 111. Comparator 111 compares ID2 to "Code2". If ID2 and Code2 are identical, the comparator supplies a "match" signal to AND gate 113. Finally, interrupt enable bit IE1 from status register 66 is applied to the third input of AND gate 113. If all three inputs to AND gate 113 are asserted, the output of AND gate 113 sets an interrupt latch 115. Once set, interrupt latch 115 asserts an interrupt signal "Interrupt1" across cable 19. Interface 28 forwards the interrupt to bus 22, thereby interrupting CPU 24. CPU 24 clears the interrupt by setting interrupt pending bit IP1 in its status register 66.
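The combinational path through the comparator, the three-input AND gate, and the interrupt latch can be mimicked in C as follows. The struct and function names are invented for this sketch, and a single byte stands in for the six MID bits.

```c
#include <stdint.h>

/* Sketch of one interrupt path: comparator 110, AND gate 112 and latch
 * 114 (or symmetrically 111/113/115). The fields mirror bits of control
 * status registers 66 and 68. */
typedef struct {
    uint8_t ig;        /* interrupt generation bit (IG1/IG2)           */
    uint8_t mid;       /* destination identification bits MID0-MID5    */
    uint8_t ie_peer;   /* interrupt enable bit in the peer's register  */
} csr_bits;

typedef struct {
    uint8_t set;       /* latch output: interrupt line asserted when 1 */
} irq_latch;

/* The latch sets only when the requesting CPU's IG bit is set, the
 * broadcast ID matches the peer's assigned code, and the peer has
 * interrupts enabled; otherwise the latch holds its state. */
static void eval_irq(const csr_bits *r, uint8_t peer_code, irq_latch *l)
{
    uint8_t match = (r->mid == peer_code);   /* comparator 110/111      */
    if (r->ig && match && r->ie_peer)        /* three-input AND 112/113 */
        l->set = 1;                          /* latch 114/115 sets      */
}

/* Writing the interrupt pending bit (IP1/IP2) clears the latch. */
static void clear_irq(irq_latch *l) { l->set = 0; }
```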
  • The interrupt mechanism also provides the means by which the backup fileserver determines when the primary has failed. More specifically, in normal operation, CPU 24 will regularly interrupt CPU 32 with new data to be loaded to the backup disk 34. If CPU 32 does not receive an interrupt within a specified period of time, it assumes the primary has failed and takes over the primary's responsibilities.
  • The primary also monitors the length of time since it last interrupted CPU 32. If the specified period of time is about to expire, CPU 24 sends an "Alive" message to control block 74 and interrupts CPU 32 to notify it that the buffer contains a message. CPU 32 will respond to the interrupt, read the Alive message and return to its normal operation. In this manner, CPU 24 notifies CPU 32 that it is operational even during moments when no data needs to be copied to backup disk 34. Similarly, CPU 32 regularly sends Alive messages to CPU 24 to notify it that CPU 32 remains operational.
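The keep-alive logic on both sides reduces to two timeout comparisons, sketched below. The tick units, the `margin` field, and the function names are illustrative assumptions, since the patent specifies only "a specified period of time".

```c
/* Sketch of the keep-alive bookkeeping. Times are in arbitrary ticks. */
typedef struct {
    long last_interrupt;  /* time the peer last interrupted us             */
    long last_sent;       /* time we last interrupted the peer             */
    long timeout;         /* specified period before failover is declared  */
    long margin;          /* send an Alive this long before expiry         */
} alive_state;

/* Backup side: the primary is presumed failed once no interrupt has
 * arrived within the specified period. */
static int primary_presumed_failed(const alive_state *s, long now)
{
    return (now - s->last_interrupt) > s->timeout;
}

/* Primary side: when the period is about to expire with no data traffic,
 * an "Alive" message must be queued to control block 74 and an interrupt
 * raised. Returns 1 when such a message is due. */
static int alive_due(const alive_state *s, long now)
{
    return (now - s->last_sent) >= s->timeout - s->margin;
}
```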
  • As shown in Figs. 7(a)-7(d), if CPU 24 fails to interrupt CPU 32 for the specified length of time, the backup fileserver assumes responsibility for the primary.
  • The backup fileserver 16 sends a "shut down" message to control message block 76 to indicate to the primary that it is taking over. (Step 410).
  • Backup fileserver 16 then mounts the backup filesystem as a local filesystem (Step 412) and activates its network interface 116. (Fig. 1) (Step 416) . It then broadcasts an "Address Resolution Protocol" packet (herein "ARP" packet) over link 14 via network interface 116 indicating that it will now handle all traffic formerly directed to the primary.
  • Each client node of the primary maintains a Node ID table containing the network interface address used by each node recognized by the client's operating system. The "ARP" packet instructs each client node to modify its table by replacing the interface address of network interface 20 in the primary with the address of network interface 116 in the backup. (Step 418).
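A client's handling of the ARP broadcast amounts to rewriting one entry of its Node ID table, roughly as below. The table layout and function name are assumptions for illustration; the patent only says that each client maps nodes to network interface addresses.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative Node ID table entry: a protocol-level node identifier
 * paired with a 6-byte network interface hardware address. */
typedef struct {
    uint32_t node_id;
    uint8_t  hw_addr[6];
} node_entry;

/* Replace the hardware address recorded for node_id, e.g. swapping the
 * primary's interface 20 for the backup's interface 116 on receipt of
 * the ARP packet (Step 418). Returns 1 if an entry was updated. */
static int arp_update(node_entry *tbl, size_t n, uint32_t node_id,
                      const uint8_t new_addr[6])
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].node_id == node_id) {
            memcpy(tbl[i].hw_addr, new_addr, 6);
            return 1;
        }
    return 0;   /* unknown node: nothing to rewrite */
}
```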
  • After sending the ARP packet, the backup fileserver begins maintaining a record of all disk data blocks which are amended, so that the primary's outdated data blocks can be replaced at a future time. More specifically, the backup fileserver allocates a block of CPU memory for storage of a journal bit map. (Step 420). Each bit in the map corresponds to a single disk data block. If a given block is written to or otherwise modified, the corresponding bit in the journal bit map is set.
  • To update the newly created journal bit map, the backup fileserver first reads a journal file which describes all changes to the filesystem within the last fifteen seconds. (Step 422). The backup fileserver examines the journal and, for each data block modified in the last fifteen seconds, sets a corresponding bit in the journal bit map. (Step 424). The backup fileserver continues to monitor the journal file, updating the bit map with each modification of the filesystem. The primary fileserver will eventually be restarted after the failure condition is remedied. Upon being restarted, the primary sends an "Alive" message to control message block 74. The backup fileserver reads the "Alive" message from the control message block and begins returning control to the primary. (Step 426).
  • The backup fileserver first deactivates its network interface 116 from receiving any more packets. (Step 428). At this point, the backup fileserver may have already received packets from the network which have not yet been incorporated into the filesystem, and it may have already begun preparing packets for transmission over the network. Accordingly, during a fifteen second waiting period (Step 430), the backup fileserver updates the filesystem to reflect any required changes and transmits all pending packets.
  • The backup fileserver then sends a disk data block allocation bit map to the primary through the dual ported memory 31. (Step 432). Each bit of the block allocation bit map corresponds to a disk data block: a bit set to one indicates that the corresponding disk block is in use; a zero indicates that the disk block is free.
  • The backup fileserver then examines the journal bit map to determine if it reflects the most recent changes to the filesystem. (Step 434). If not, the backup fileserver updates the journal bit map to reflect the most recent changes. (Steps 436, 438).
  • The backup fileserver scans the journal bit map to identify each disk data block which has been modified since fifteen seconds prior to the failure of the primary. (Step 440). It then sends each modified data block to the primary through dual ported memory 31 using the data block transfer procedure described above. (Step 442). After completing this transfer, it deallocates the CPU memory block which contains the journal bit map (Step 444) and sends a "Journal Done" message to the primary through control message block 76. (Step 446).
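The journal bit map of Steps 420-442 can be sketched as an ordinary bit vector: one bit per disk data block, set on modification, and scanned at failback to find the blocks to resend. The map size and helper names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_BLOCKS 4096   /* illustrative filesystem size in disk blocks */

/* Journal bit map (Step 420): one bit per disk data block. */
typedef struct { uint8_t bits[NUM_BLOCKS / 8]; } journal_map;

/* Set the bit for a block that was written to or otherwise modified. */
static void mark_modified(journal_map *m, size_t blk)
{
    m->bits[blk / 8] |= (uint8_t)(1u << (blk % 8));
}

static int is_modified(const journal_map *m, size_t blk)
{
    return (m->bits[blk / 8] >> (blk % 8)) & 1;
}

/* Step 440: scan the map and report each modified block number so the
 * block can be resent to the primary (Step 442). Returns blocks found. */
static size_t scan_modified(const journal_map *m, size_t *out, size_t max)
{
    size_t n = 0;
    for (size_t blk = 0; blk < NUM_BLOCKS && n < max; blk++)
        if (is_modified(m, blk))
            out[n++] = blk;
    return n;
}
```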
  • The primary first sends an "Alive" message to the backup fileserver through the control message block 74 of the dual ported memory 31. (Step 510). When the backup fileserver responds by sending the disk data block allocation bit map, the primary replaces its old data block allocation bit map with the newly arrived allocation bit map. (Step 512). It then proceeds to read each arriving modified data block from the dual ported interface and to write the block to its disk 26. (Step 514).
  • When the backup fileserver sends a "Journal Done" message, indicating that all blocks have been sent (Step 516), the primary activates its network interface (Step 518) and mounts the filesystem (Step 520). It then broadcasts an ARP packet instructing all client nodes to amend their Node ID tables to indicate that the primary has returned to service. (Step 522). It finally sends an "on-line" message to the backup through control message block 76, instructing the backup to return to its role as a backup. (Step 524).
  • The semaphore block 70 includes a power up semaphore 109 to provide a mechanism for notifying CPU 24 and CPU 32 that the control entries are not valid. Both CPU 32 and CPU 24 read semaphore 109 when their respective fileservers are first powered on. Semaphore 109 is initially set to zero when the backup fileserver is powered up, to indicate that control register block 72 is not initialized. The first of CPU 24 and CPU 32 to read a zero in semaphore 109 assumes responsibility for initializing these locations. Upon being read by either CPU, the semaphore is automatically set to one.
  • Backup fileserver 16 may operate both as a backup fileserver and as a second primary fileserver. For this purpose, fileserver 16 includes two network interfaces 116, 118. Interface 116 is used to access communication link 14 when fileserver 16 has assumed responsibility for the failed primary. Interface 118 is used by fileserver 16 to communicate over link 14 in its capacity as a second primary fileserver.

EP19920909636 1991-04-23 1992-04-14 Fehlertolerantes netwerkdateisystem Withdrawn EP0536375A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US69006691A 1991-04-23 1991-04-23
US690066 1991-04-23

Publications (1)

Publication Number Publication Date
EP0536375A1 true EP0536375A1 (de) 1993-04-14

Country Status (3)

Country Link
EP (1) EP0536375A1 (de)
JP (1) JPH05508506A (de)
WO (1) WO1992018931A1 (de)


Also Published As

Publication number Publication date
WO1992018931A1 (en) 1992-10-29
JPH05508506A (ja) 1993-11-25


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE FR GB

17P Request for examination filed

Effective date: 19930327

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Withdrawal date: 19950620