WO1993014455A1 - Computer memory array control - Google Patents

Computer memory array control

Publication number
WO1993014455A1
WO1993014455A1 PCT/GB1992/002291 GB9202291W WO9314455A1 WO 1993014455 A1 WO1993014455 A1 WO 1993014455A1 GB 9202291 W GB9202291 W GB 9202291W WO 9314455 A1 WO9314455 A1 WO 9314455A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
memory
buffer
memory units
host computer
Prior art date
Application number
PCT/GB1992/002291
Other languages
French (fr)
Inventor
Andrew James William Hill
Original Assignee
Array Data Limited
Priority date
Filing date
Publication date
Application filed by Array Data Limited filed Critical Array Data Limited
Priority to JP5511984A priority Critical patent/JPH08501643A/en
Priority to EP92924811A priority patent/EP0620934A1/en
Priority to AU30915/92A priority patent/AU662376B2/en
Publication of WO1993014455A1 publication Critical patent/WO1993014455A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10Digital recording or reproducing
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10Digital recording or reproducing
    • G11B20/18Error detection or correction; Testing, e.g. of drop-outs
    • G11B20/1833Error detection or correction; Testing, e.g. of drop-outs by adding special lists or symbols to the coded information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10Digital recording or reproducing
    • G11B2020/10916Seeking data on the record carrier for preparing an access to a specific address

Definitions

  • This invention relates to computer memories, and in particular to a controller for controlling and a method of controlling an array of memory units in a computer.
  • An ideal computer memory would have no requirement to "seek" the data. Such a memory would have instantaneous access to all data areas and could be provided by a RAM disk. This would provide access to data regardless of whether it was sequential or random in its distribution in the memory.
  • RAM is disadvantageous compared to the use of conventional magnetic disk drive storage media in view of the high cost of RAM and especially due to the additional high cost of providing "redundancy" to compensate for failure of memory units.
  • non-volatile computer memories are magnetic disk drives.
  • these disk drives suffer from the disadvantage that they require a period of time to position the head or heads with the correct part of the disk corresponding to the location of the data. This is termed the seek and rotation delay. This delay becomes a significant portion of the data access time when only a small amount of data is to be read or written to or from the disk.
  • This document describes two types of arrangements.
  • the first of these arrangements is particularly adapted for large scale data transfers and is termed "RAID-3".
  • In the RAID-3 arrangement, at least three disk drives are provided in which sequential bytes of information are stored in the same logical block positions on the drives. One drive has written to it a check byte created by a controller, which enables any one of the other bytes on the disk drives to be determined from the check byte and the remaining bytes.
  • RAID-3 as used hereinafter is as defined by the foregoing passage.
  • In the RAID-3 arrangement there are preferably at least five disk drives, with four bytes being written to the first four drives and the check byte being written to the fifth drive, in the same logical block position as the data bytes on the other drives.
  • each byte stored on it can be reconstructed by reading the other drives.
  • Not only can the computer be arranged to continue to operate despite failure of a disk drive, but the failed disk drive can also be replaced and rebuilt without the need to restore its contents from probably out-of-date backup copies.
  • a disk drive storage system having the RAID-3 arrangement is described in EP-A-0320107, the content of which is incorporated herein by reference.
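  • The check-byte scheme described above can be illustrated with a minimal Python sketch. It assumes the check byte is the bitwise XOR of the four data bytes in a group, which is a common parity choice but is not stated in this passage; the function names `stripe` and `rebuild` are hypothetical.

```python
from functools import reduce

DATA_DRIVES = 4  # four data drives plus one check drive, as in the five-drive example


def stripe(data: bytes) -> list:
    """Distribute sequential bytes over four data drives and add a check drive."""
    if len(data) % DATA_DRIVES:
        raise ValueError("data length must be a multiple of 4 in this sketch")
    drives = [bytearray() for _ in range(DATA_DRIVES + 1)]
    for pos in range(0, len(data), DATA_DRIVES):
        group = data[pos:pos + DATA_DRIVES]
        # zip stops after four items, so only the data drives receive data bytes.
        for drive, byte in zip(drives, group):
            drive.append(byte)
        # Assumed check byte: XOR of the four data bytes in the group.
        drives[DATA_DRIVES].append(reduce(lambda a, b: a ^ b, group))
    return drives


def rebuild(drives: list, failed: int) -> bytearray:
    """Recompute the contents of one failed drive from the four surviving drives."""
    survivors = [d for i, d in enumerate(drives) if i != failed]
    return bytearray(reduce(lambda a, b: a ^ b, column) for column in zip(*survivors))
```

  • Under the XOR assumption, the same `rebuild` routine serves both for a failed data drive and for a failed check drive, because XOR-ing any four of the five bytes in a group yields the fifth.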
  • The second type of storage system, which is particularly adapted for multi-user applications, is termed "RAID-5".
  • In the RAID-5 arrangement there are preferably at least five disk drives in which four sectors of each disk drive are arranged to store data and one sector stores check information.
  • the check information is derived not from the data in the four sectors on the disk, but from designated sectors on each of the other four disks. Consequently each disk can be rebuilt from the data and check information on the remaining disks.
  • RAID-5 is seen to be advantageous, at least in theory, because it allows multi-user access, albeit with equivalent transfer performance of a single disk drive.
  • a write of one sector of information involves writing to two disks, that is to say writing the information to one sector on one disk drive and writing check information to a check sector on a second disk drive.
  • writing the check sector is a read modify write operation, that is, a read of the existing data and check sectors first, because the old contents of those sectors must be known before the correct check information, based on the new data to be written, can be generated and written to disk.
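  • The read-modify-write sequence can be sketched as follows. This is an illustration only: it assumes the check information is XOR parity, and the function name is hypothetical. It shows why the old data sector and old check sector must both be read before the new check sector can be written.

```python
def read_modify_write(old_data: bytes, old_check: bytes, new_data: bytes) -> bytes:
    """Return the new check sector for a single-sector RAID-5 style write."""
    if not (len(old_data) == len(old_check) == len(new_data)):
        raise ValueError("sectors must all be the same length")
    # XOR out the old data's contribution to the check sector, then XOR in
    # the new data's contribution. The other data sectors are never read.
    return bytes(c ^ o ^ n for c, o, n in zip(old_check, old_data, new_data))
```

  • The cost is two reads followed by two writes for every small write, which is the performance penalty the surrounding text refers to.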
  • RAID-5 does allow simultaneous reads by multiple users from all disks in the system which RAID-3 cannot support.
  • RAID-5 cannot match the rate of data transfer achievable with RAID-3, because with RAID-3, both read and write operations involve a transfer to each of the five disks (in five disk systems) of only a quarter of the total amount of information transferred. Since each transfer can be accomplished simultaneously, the process is much faster than reading or writing to a single disk, particularly where large scale transfers are involved. This is because most of the time taken to effect a read or write in respect of a given disk drive is the time taken for the read/write heads to be positioned with respect to the disk, and for the disk to rotate to the correct angular position. Clearly, this is as long for one disk as it is for all four. But once in the correct position, transfers of large amounts of sequential information can be effected relatively quickly.
  • RAID-5 only offers multiple user access in theory, rather than in practice, because requests for sequential information by the same user may involve reading several disks in turn, thereby occupying those disks so that they are not available to other users.
  • RAID-5 which makes it theoretically more advantageous than RAID-3; but, in fact, it is the data transfer rate and continued performance in the event of drive failure in RAID-3 format which gives the latter much greater potential. So it is an object of the present invention to provide a system which exhibits the same multi-user capability of a RAID-5 disk array, or indeed better capability in that respect.
  • the inventor has previously developed a system which has been termed RAID-35 and which is disclosed in the specification of PCT/GB90/01557. This system offers performance the same as, if not better than, RAID-3 and RAID-5.
  • the RAID-35 system is thus highly efficient for applications where users are likely to request sequential data. On the other hand if the data requests are random, the advantages of the RAID-35 system cannot be realised.
  • the present invention provides a computer memory controller for interfacing to a host computer comprising a buffer means for interfacing to a plurality of memory units and for holding data read thereto and therefrom; and control means operative to control the transfer of data to and from said host computer and said memory units; said buffer means being controlled to form a plurality of buffer segments for addressably storing data read from or written to said memory units; said control means being operative to allocate a buffer segment for a read or write request from the host computer, of a size sufficient for the data; said control means being further operative in response to data requests from said host computer to control said memory units to seek data stored in different memory units simultaneously.
  • the present invention also provides a method of controlling a plurality of memory units for use with a host computer comprising the steps of repeatedly receiving from said host computer a read request for data stored in said memory units and allocating a buffer segment of sufficient size for the data to be read; and seeking data in said plurality of memory units simultaneously.
  • the present invention further provides a computer memory controller for a host computer comprising buffer means for interfacing to at least three memory channels arranged in parallel, each memory channel comprising a plurality of memory units connected by a bus such that each memory unit of said memory channel is independently accessible; respective memory units of said memory channels forming a memory bank; a logic circuit connected to said buffer means to split data input from said host computer into a plurality of portions such that said portions are temporarily stored in a buffer segment before being applied to ones of a group of said memory channels for storage in a memory bank; said logic circuit being further operative to recombine portions of data successively read from successive ones of a group of said memory units of a memory bank and into said buffer means; said logic circuit including parity means operative to generate a check byte or group of bits from said data for temporary storage in said buffer means before being stored in at least one said memory unit of said memory bank, and operative to use said check byte to regenerate said data read from said group of memory units of a memory bank if one of said group of memory units
  • the present invention still further provides a computer storage system comprising a plurality of memory units arranged into a two dimensional array having at least three memory channels arranged in parallel, each said memory channel comprising a plurality of memory units connected by a bus such that each memory unit is independently accessible; respective memory units of said memory channels forming a memory bank; and a controller comprising buffer means interfaced to said memory units and for holding information read from said memory channels; said buffer means being controlled to form a plurality of buffer segments for addressably storing data read from or written to said memory units; a logic circuit connected to said buffer means to recombine bytes or groups of bits read from ones of a group of said memory units in a memory bank, parity means operative to use a check byte or group of bits read from one of said memory units in said memory bank to regenerate information read from said group of memory units if one of said group of memory units fails; and control means for controlling the transfer of data to and from said host computer and said memory units, including allocating a buffer segment for a read or write request
  • the system of the present invention can be termed RAID-53 since it utilises a combination of RAID-3 and RAID-5 to provide for fast random access.
  • RAID-53 like RAID-5 allows for simultaneous reads by multiple users from all the disk banks in the system whilst also reducing the read time since the data is split between a number of disks which are read simultaneously.
  • the disk banks can be addressably segmented such that respective segments on sequential banks have a sequential address. This allows sequential data to be written to segments on sequential banks and thus distributes or "stripes" the data across the memory banks. This technique is termed hereinafter "overlay bank stripping".
  • This organisation of data on the disk array is controlled by the controller and not the host computer.
  • the controller assigns addresses to segments of the disk banks in such a way that when data is written to the disk array it is striped across the banks.
  • This striping of the data is also applicable to RAID-35 and will allow data to be read or stored on different banks simultaneously.
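  • One way to picture the address assignment is the sketch below. It assumes a simple round-robin mapping, in which consecutive segment addresses fall on consecutive banks; the source only requires that respective segments on sequential banks have sequential addresses, so the exact formula, the function name `locate` and the default of seven banks (taken from the Figure 1 embodiment) are illustrative assumptions.

```python
def locate(segment_address: int, num_banks: int = 7) -> tuple:
    """Map a logical segment address to (bank, segment index within that bank)."""
    if segment_address < 0 or num_banks < 1:
        raise ValueError("address must be non-negative and num_banks positive")
    # Consecutive addresses land on consecutive banks, wrapping after the last bank.
    return segment_address % num_banks, segment_address // num_banks
```

  • With this mapping, a run of sequential segments touches every bank once before any bank is visited a second time, so the banks can seek at the same time.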
  • the memory units are disk drives and there are five per memory bank, i.e. five memory channels, one disk containing the check information, four disks containing the data.
  • SCSI-1 Small Computer Systems Interface
  • the currently standard disk drive interface SCSI-1 Small Computer Systems Interface
  • SCSI-2 Small Computer Systems Interface
  • 15 banks can be used.
  • the present invention is not however limited to the use of such an interface and any number of memory banks could be used. In fact the more memory banks that are present, the more that can be simultaneously undertaking a seek operation, thus reducing data access time for the host computer.
  • the disk drives of a memory bank have their spindles synchronised.
  • RAID-3 and RAID-5 provide a simultaneous random access facility with a performance in excess of the theoretical maximum performance of RAID-5 systems with five slave -tous drives. In addition the performance penalties of Read-Modify-Write characteristics of RAID-5 systems are avoided. What is provided is a fast and simple RAID-3 type Read/Write facility.
  • the RAID-53 system also sustains maximum transfer rate under a "single" disk drive failure condition per "bank” of disk drives.
  • control means can queue host data requests for memory banks and carry out the data seek and transfer when the memory bank containing the requested data is not busy.
  • the order in which these seeks take place is optimised.
  • when a write request is received by the controller, it can effect the immediate writing of the data to a memory bank to the detriment of any pending read or write requests. This prevents important data being lost, for instance through a power failure, when the data would normally be held in a buffer segment.
  • a number of buffer means, logic circuits and parity means are provided together with a number of associated two dimensional arrays of memory units.
  • the control means is operative to control the transfer of data to and from the host computer and the three dimensional array of memory units formed of layers of the two dimensional arrays.
  • the hardware utilised for the RAID-35 system of PCT/GB90/01557 can be the same as that used for the RAID-53.
  • RAID-35 and RAID-53 as options for the same hardware or they can be provided together and will share the hardware.
  • a first portion of the buffer means is allocated for RAID-53.
  • the remaining buffer memory is allocated for RAID-35 use.
  • the memory banks can be shared or a number of them can be allocated for RAID-35 and the rest for RAID-53.
  • the RAID-35 operation is as follows.
  • the transfer of sequential data to the host computer in response to requests therefrom is controlled by first addressing the buffer segments in the allocated part of the buffer means to establish whether the requested data is contained therein and if so supplying said data to said host computer. If the requested sequential data is not contained in the buffer segments of the allocated portion of the buffer means, data is read from the memory units and supplied to the host computer. Further data is read from the memory units which is logically sequential to the data requested by the host computer and the further data is stored in a buffer segment in the allocated portion of the buffer means.
  • the control means also controls the size and number of buffer segments in the portion of the buffer means allocated for RAID-35 usage.
  • the array of disk drives provided by the RAID-35 and RAID-53 systems provide redundancy in the event of disk drive failure.
  • the present invention also provides a plurality of buffer means each for interfacing a plurality of memory units arranged into a two dimensional array having at least three memory channels, each memory channel comprising a plurality of memory units connected by a bus such that each memory unit is independently accessible; respective memory units of said memory channels forming a memory bank; a plurality of logic circuits connected to respective said buffer means to recombine bytes or groups of bits read from ones of a group of said memory units of a memory bank and stored in said buffer segments to generate the requested data; said logic circuits each including parity means operative to use a check byte or group of bits read from one of said memory units of said memory bank to regenerate data read from said group of memory units if one of said group of memory units fails; said buffer means being divided into a number of channels corresponding to the number of memory channels, each channel being divided into associated portion of buffer segments; and control means operative to control the transfer of data from a three dimensional array of memory units formed from a plurality of said two dimensional arrays to said
  • the present invention is not limited to the use of such disk drives.
  • the present invention is equally applicable to the use of any memory device which has a long seek time for data compared to the data transfer rate once the data is located.
  • Such media could, for instance, be an optical compact disk.
  • Such an array provides large scale storage of information together with the faster data transfer rates and better performance with regard to multi-user applications, and security in the event of any one drive failure (per bank).
  • the mean time between failures (MTBF) of such an array (meaning the mean time between two simultaneous drive failures per bank, which is what is required for information to be lost beyond recall) is measured in many thousands of years with presently available disk drives each having individual MTBFs of many thousands of hours.
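  • The order of magnitude claimed here can be checked with a widely used approximation for a single-parity group, which assumes independent drive failures at a constant rate and a fixed time to replace and rebuild a failed drive. Neither the formula nor the example figures below (100,000 hours drive MTBF, 8 hours repair time) come from the source; they are assumptions chosen for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def mean_time_to_data_loss(drive_mtbf_hours: float, drives_per_bank: int,
                           repair_hours: float) -> float:
    """Approximate hours until a second drive in a bank fails during a repair."""
    n = drives_per_bank
    # First failure rate is n / MTBF; the chance that one of the remaining
    # n - 1 drives fails during the repair window is (n - 1) * repair / MTBF.
    return drive_mtbf_hours ** 2 / (n * (n - 1) * repair_hours)


if __name__ == "__main__":
    hours = mean_time_to_data_loss(100_000, drives_per_bank=5, repair_hours=8)
    print(f"{hours / HOURS_PER_YEAR:.0f} years per bank")
```

  • For the assumed figures the estimate is 62,500,000 hours per bank, or roughly 7,000 years, which is consistent with the "many thousands of years" stated above. A system with several banks would have a proportionally shorter figure.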
  • Figure 1 is a block diagram of the controller architecture of a disk array system according to one embodiment of the present invention.
  • Figure 2 illustrates the operation of the data splitting hardware.
  • Figure 3 illustrates the read/write data cell matrix.
  • Figure 4 illustrates a write data cell.
  • Figure 5 illustrates a read data cell.
  • Figure 6 is a flow diagram illustrating the software steps in write operations for RAID-35 operation.
  • Figure 7 is a flow diagram illustrating the software steps in read operations for RAID-35 operation.
  • Figures 8 and 9 are flow diagrams illustrating the software steps for read ahead and write behind for
  • Figure 10 is a flow diagram illustrating the software steps involved to restart suspended transfers for RAID-35 operation.
  • Figure 11 is a flow diagram illustrating the software steps involved in cleaning up segments for
  • Figures 12 and 13 are flow diagrams illustrating the steps involved for input/output control for RAID-35 operation.
  • Figures 14 and 15 are flow diagrams illustrating the software steps performed by the 80376 central controller of Figure 1 during RAID-53 operation.
  • Figures 16 to 19 are flow diagrams illustrating the software steps performed by the slave bus controllers of Figure 1 during RAID-53 operation.
  • Figure 20 is a block diagram of an embodiment of the present invention illustrating the access points for RAID-53 operation.
  • Figure 21 illustrates a block diagram of a three dimensional memory array according to one embodiment of the present invention.
  • Figure 22 illustrates the use of a redundant controller according to one embodiment of the present invention.
  • Figure 23 illustrates the distribution of data in segments within the array using the technique of overlay bank stripping.
  • Figure 1 illustrates the architecture of the RAID-35 and RAID-53 disk array controller, and initially both systems will be considered together.
  • the internal interface of the computer memory controller 10 is termed the ESP data bus interface and the interface to the host computer is termed the SCSI interface. These are provided in interface 12.
  • the SCSI bus interface communicates with the host computer (not shown) and the ESP interface communicates with a high performance direct memory access (DMA) unit 14 in a host interface section 11 of the computer memory controller 10.
  • DMA direct memory access
  • the ESP interface is 16 bits (one word) wide.
  • the host interface section communicates with a central buffer management (CBM) section 20 which comprises a central controller 22, in the form of a suitable microprocessor such as the Intel 80376 Microprocessor, and data splitting and parity control (DSPC) logic circuit 24.
  • CBM central buffer management
  • DSPC data splitting and parity control
  • the DSPC 24 also combines the information on the first four channels and, after checking against the parity channel, transmits the combined information to the host computer. Furthermore, the DSPC 24 is able to reconstruct the information from any one channel, should that be necessary, on the basis of the information from the other four channels.
  • the DSPC 24 is connected to a central buffer 26 which is divided into five channels A to E, each of which is divisible into buffer segments 28.
  • Each central buffer channel 26,A through 26,E has the capacity to store up to half a megabyte of data, for example, depending on the application required.
  • each segment may be as small as 128 kilobytes for example so that up to 16 segments can be formed in the buffer.
  • each segment will be as small as the minimum data request from the host computer.
  • the central buffer 26 communicates with five slave bus controllers 32 in a slave bus interface (SBI) section 30 of the memory controller 10.
  • SBI slave bus interface
  • Each slave bus controller 32,A through 32,E communicates with up to seven disk drives 42,0 to 42,6 along SCSI-1 buses 44,A through 44,E, so that the drives 42,0,A through 42,0,E form a bank 0 of five disk drives, as do drives 42,1,A through 42,1,E, and so on up to drives 42,6,A through 42,6,E.
  • the seven banks of five drives effectively each constitute a single disk drive, each individually and independently accessible. This is made possible by the use of SCSI-1 buses, which allow for eight device addresses. One address is taken up by the slave bus controller 32 whilst the seven remaining addresses are available for seven disk drives.
  • The storage capacity of each channel can therefore be increased sevenfold and the slave bus controller 32 is able to access any one of the disk drives 42 in the channel independently.
  • the use of more than one bank of disk drives is essential for the realisation of the advantage of RAID-53 operation.
  • This arrangement of banks of disk drives is not only applicable to the arrangement shown in Figure 1, but is also applicable to the RAID-3 arrangement.
  • Information stored in the disk drives of one bank can be accessed virtually simultaneously with information being accessed from the disk drives of another bank. This arrangement therefore gives an enhancement in access speed to data stored in an array of disk drives.
  • one of the functions of the central controller 22 is to store data on the various disk drives efficiently. Moreover each sector in so far as the host is concerned, is split between four disk drives in the known RAID-3 format. Under RAID-35 operation, the central controller 22 arranges to store sectors of information passed to it by the host computer, in an ordered fashion so that a sector on any given disk drive is likely to contain information which logically follows from a previous adjacent sector.
  • the read request is received by the central controller 22 which passes the request to the slave bus interface (SBI) controller 32.
  • the slave bus controller 32 reads the disk banks 40 and selects the appropriate data from the appropriate banks of disks.
  • the DSPC circuit 24 receives the requested data and checks it is accurate against the check data in channel E.
  • the controller may automatically try to re-read the data. If a parity error is still detected, the controller may return an error to the host computer. If there is a faulty drive, this can be isolated and the system arranged to continue working employing the four good channels, in the same way and with no loss of performance, until the faulty drive is replaced and rebuilt with the appropriate information.
  • the central controller 22 first responds to the data read request by transferring the information to the SCSI-1 interface 12. However, it also instructs further information logically sequential to the requested information to be read. This is termed "read ahead information”. Read ahead information up to the capacity presently allocated by the central controller 22 to any one of the data buffer segments 28 is then stored in one buffer segment 28.
  • When the host computer makes a further request for information, it is likely that the information requested will follow on from the information previously requested. Consequently, when the central controller 22 receives a read request, it first interrogates those buffer segments 28 to determine if the required information is already in the buffer. If the information is there, then the central controller 22 can respond to the user request immediately, without having to read the disk drives. This is obviously a much faster procedure and avoids the seek delay.
  • the central controller 22 will have allocated at least as many buffer segments 28 as there are application programs, up to the maximum number of segments available. Each buffer segment will be kept full by the central controller 22 ordering the disk drive seek commands in the most efficient manner, only over-riding that ordering when a buffer segment has been, say 50% emptied by host requests or when a host request cannot be satisfied from existing buffer segments 28. Thus all buffer segments are kept as full as possible with read ahead data.
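  • The read-ahead behaviour of RAID-35 operation can be modelled with the simplified sketch below. It is not the controller's actual software: the class name, the sector-based sizes and the policy of discarding the oldest segment are assumptions, and the refill of a segment once it is about 50% emptied is deliberately left out.

```python
class ReadAheadBuffer:
    """Toy model of RAID-35 style read-ahead; sizes are counted in sectors."""

    def __init__(self, disk, segment_sectors=8, max_segments=4):
        self.disk = disk                    # any sequence indexed by sector number
        self.segment_sectors = segment_sectors
        self.max_segments = max_segments
        self.segments = []                  # (first_sector, [sector data]), oldest first
        self.disk_reads = 0                 # number of times the disk had to be read

    def read(self, sector):
        # 1. Interrogate the buffer segments first.
        for first, data in self.segments:
            if first <= sector < first + len(data):
                return data[sector - first]
        # 2. Miss: read the requested sector from the disk.
        self.disk_reads += 1
        value = self.disk[sector]
        # 3. Also fetch the logically following sectors as read-ahead data.
        start = sector + 1
        ahead = list(self.disk[start:start + self.segment_sectors])
        if ahead:
            if len(self.segments) >= self.max_segments:
                self.segments.pop(0)        # assumed policy: discard the oldest segment
            self.segments.append((start, ahead))
        return value
```

  • In this model a host that reads sectors in sequence pays for one disk access per segment rather than one per request, which is the effect the read-ahead scheme is intended to achieve.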
  • a hardware switch can be provided to ensure that all write instructions are effected immediately, with write information only being stored in the buffer segments transiently before being written to disk. This removes the fear that a power loss might result in data being lost which was thought to have been written to disk although not actually effected by the memory system. There is still however, the unlikely exception that information may be lost when a power loss occurs very shortly after a user has sent a write command, but in that event, the user is likely to be conscious of the problem. If this alternative is utilised, it does of course affect the performance of the computer.
  • Operation under RAID-53
  • the read request is received by the central controller 22 which passes the request to the slave bus controller 32.
  • the slave bus controller 32 reads the disk banks 40 and selects the appropriate data from the appropriate banks of disks.
  • the DSPC circuit 24 receives the requested data and checks it is accurate against the check data in channel E.
  • the controller may automatically retry to read the data. If a parity error is still detected the controller may return an error to the host computer. If there is a faulty drive this can be isolated and the system arranged to continue working employing the four good channels, in the same way and with no loss of performance, until the faulty drive is replaced and rebuilt with the appropriate information.
  • the central controller 22 responds to the data read request by transferring the data to the SCSI-1 interface 12, and then de-allocating the buffer segment.
  • the disk bank is then free to accept another read request and can commence a seek operation under the command of the central controller 22.
  • the size of the buffer segments is determined by the size of the data requested by the host computer. No data is read ahead from the disk drives.
  • the central controller 22 is thus able to receive the read requests and determine in which disk bank that data lies. If the disk bank is idle then the disk bank can be instructed to seek the data. Simultaneously the other disk banks may be seeking data requested by the host computer at an earlier date, and once this has been located the central controller 22 can read the disk bank and pass the data to the buffer segments for reconstruction, from where it is passed to the SCSI-1 interface 12.
  • Figure 14 illustrates the seven access points to the seven disk banks.
  • Each disk drive of each bank has a unique bus (SCSI) address and can thus be accessed independently by the computer memory controller 100.
  • SCSI bus
  • up to seven disk banks can be operating simultaneously to seek data requested by the host computer. While a disk bank is seeking it is disconnected from the SCSI-1 interface. When the data is located this is indicated to the central controller 22 which can then read the data.
  • the central controller 22 can queue these requests.
  • the queued read requests may not necessarily be performed in the order in which the host computer issued the commands. Such queuing of read requests could also be performed on the slave bus controllers 32.
  • the central controller 22 is provided with the capability of "forcing" the incoming data to be "immediately" written to the required bank of disk drives, rather than being queued with pending Read/Write commands. This ensures that data thought by the host computer to be written to disk is so written, guarding against, for instance, a power failure in which any data held in the buffer memory 26 and still to be written to the disks would be lost.
  • the controller's internal interface to the host system hardware interface is 16 bits (one word) wide. This is the ESP data bus. For every four words of sequential host data, one 64 bit wide slice of internal buffer data is formed. At the same time, an additional word (16 bits) of parity data is formed by the controller: one parity bit for every four host data bits. Thus the internal width of the controller's central data bus is 80 bits. This is made up of 64 bits of host data and 16 bits of parity data.
  • the data splitting and parity logic 24 is split up into 16 identical read/write data cells within the customised ASIC (application specific integrated circuit) design of the controller.
  • the matrix of these data cells is shown in Figure 3.
  • Each of these data cells handles the same data bit from the ESP bus for the complete sequence of four ESP 16 bit data words. That is, with reference to Figure 2, each data cell handles the same bit from each ESP bus word 0,1,2 and 3. At the same time, each data cell generates/reads the associated parity bit for these four 16 bit ESP bus data words.
  • Data bits DB1 through DB15 will be identical in operation and description.
  • each of these four bits is temporarily stored/latched in devices G38 through G41. As each bit appears on the ESP bus, it is steered through the multiplexor under the control of the two select lines to the relevant D-type latches G33 through G36, commencing with G33. At the end of this initial operation, the four host 16 bit words (64 data bits) will have been stored in the relevant gates G38 through G41 within all 16 data cells.
  • the four DB0 data bits are now called DB0-A through DB0-D.
  • the RMW (buffer read modify write) control signal is set to select input A from all devices G38 through G42. Under these situations, the rebuild line is not used (don't care).
  • the corresponding parity data bit is generated via G31, G32, and G37.
  • the resultant parity bit will have been generated and stored on device G42. This is accomplished as follows. As the first bit-0 (DB0-A) appears on the signal DB0, the INIT line is driven high/true and the output from the gate G31 is driven low/off. Whatever value is present on DB0 will appear on the output of gate G32, and at the correct time will be clocked into the D-type G37. The value of DB0 will now appear on the Q output of G37.
  • the INIT signal will now be driven low/off, and will now aid the flow of data through G31 for the next incoming three data bits on DB0.
  • Whatever value was stored as DB0-A on the output of gate G37 will now appear on the output of gate G31, and as the second DB0 bit (DB0-B) appears on the signal DB0, an Exclusive OR value of these two bits will appear on the output of gate G32.
  • this new value will be clocked into the device G37.
  • the resultant Q output of G37 will now be the Exclusive OR function of DB0-A and DB0-B. This value will now be stored on device G42.
  • the accumulative Exclusive OR (XOR) value of DB0-A through DB0-D is generated in this manner so as to preserve buffer timing and synchronisation procedures.
  • the five outputs DB0-A through DB0-E are present for all data bits 0 through 15 of the four host data words.
  • the total of 80 bits are now stored in the central buffer memory (DRAM).
  • the whole procedure is repeated for each sequence of four host data words (8 host data bytes).
  • As each "sector" of slave disk drive data is assembled in the central buffer, it is written to the slave disk drives (to channel A through channel E) within the same bank of disk drives.
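The slice-forming and parity steps described above can be sketched in software as follows. This is only an illustration under stated assumptions, not the ASIC logic itself: the function name `make_slice` and the placement of host word 0 in the least significant 16 bits of the slice are hypothetical choices made for the example.

```python
def make_slice(words):
    """Form one buffer slice from four 16-bit host words.

    Returns (data, parity): 64 bits of host data plus the 16-bit parity
    word, i.e. the 80 bits that are stored in the central buffer.
    """
    if len(words) != 4 or any(not 0 <= w <= 0xFFFF for w in words):
        raise ValueError("expected four 16-bit words")
    data = 0
    parity = 0  # plays the role of the accumulator that INIT resets
    for i, word in enumerate(words):
        parity ^= word            # cumulative XOR, one parity bit per bit position
        data |= word << (16 * i)  # assumed layout: word i in bits 16*i .. 16*i+15
    return data, parity


data, parity = make_slice([0x1234, 0xABCD, 0x0F0F, 0xFFFF])
```

Each bit of `parity` is the XOR of the same bit position in the four host words, which mirrors the one-cell-per-bit arrangement of the sixteen data cells.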
  • the parity data bit is regenerated by the Exclusive OR gate G4 and compared at gate G2 with the parity data read from the slave disk drives at device G14. If a difference is detected, an NMI ("non-maskable interrupt") is generated to the master processor device via gate G3. All read operations will terminate immediately or the controller may automatically perform read re-try procedures.
  • Gate G5 suppresses the effect of the parity bit DB0-E from the generation of the new parity bit.
  • Gate G1 will suppress NMI operations if any slave disk drive has failed and the resultant mask bit has been set high/true. Also, gate G1, in conjunction with gate G5, will allow the read parity bit DB0-E to be utilised in the regeneration process at gate G4, should any channel have failed.
  • the single failed disk drive/channel will have its mask bit set high/true under the direction of the controller software.
  • the relevant gates within G6 through G9 and G10 through G14 for the failed channel/drives will have their outputs determined by their "B" inputs, not their "A" inputs.
  • G1 will suppress all NMI generation, and together with gate G5, will allow parity bit DB0-E to be utilised at gate G4.
  • the four valid bits from gates G10 through G14 will "regenerate" the "missing" data at gate G4, and the output of gate G4 will be fed to the correct ESP bus data bit DB0 via a "B" input at the relevant gate G6 through G9.
  • gate G12 will be driven low and will not contribute to the output of gate G4.
  • the output of gate G1 will be driven low/false and will both suppress NMIs and allow signal DB0-E to be fed by gate G5 to gate G4.
  • Gate G4 will have all correct inputs from which to regenerate the missing data and feed the data to the output of device G8 via its "B" input. At the correct time, this bit will be fed through the multiplexor to DB0.
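The read-side check and the regeneration of a masked channel can be sketched in the same way. The names `read_slice` and `ParityError` are hypothetical, and the exception merely stands in for the NMI raised by the hardware; the logic is the usual XOR identity that any one of five words can be recovered from the other four.

```python
class ParityError(Exception):
    """Raised where the hardware would generate an NMI."""


def read_slice(words, parity, failed=None):
    """Check or regenerate the four data words of one slice.

    words  -- the four 16-bit words read from channels A to D
    parity -- the 16-bit word read from channel E
    failed -- index 0-3 of a masked data channel, 4 for the parity
              channel, or None when all channels are healthy
    """
    if failed is None:
        regenerated = 0
        for word in words:
            regenerated ^= word
        if regenerated != parity:
            raise ParityError("regenerated parity differs from channel E")
        return list(words)
    if failed == 4:
        return list(words)  # data channels intact, nothing to regenerate
    missing = parity  # with a mask bit set, channel E takes part in the XOR
    for i, word in enumerate(words):
        if i != failed:
            missing ^= word  # the masked channel contributes nothing
    result = list(words)
    result[failed] = missing
    return result
```

For example, `read_slice([0x1234, 0xABCD, 0, 0xFFFF], 0x4909, failed=2)` recovers the third word from the three good data channels and the parity channel, with no parity comparison performed while the mask is set.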
  • the memory controller must first read the data from the functioning four disk drives, regenerate the missing drive's data, and finally write the data to the failed disk drive after it has been replaced with a new disk drive.
  • All channels of the central buffer memory 26 will have their data set to the regenerated data, but only the single replaced channel data will be written to the new disk drive under software control.
  • the master 80376 processor detects an 80186 channel (array controller electronics) failure due to an "interprocessor" command protocol failure.
  • An 80186 processor detects a disk drive problem i.e. a SCSI bus protocol violation.
  • An 80186 processor detects a SCSI bus hardware error. This is a complete channel failure situation, not just a single disk drive on that SCSI bus.
  • the channel/drive "masking" function is performed by the master 80376 microprocessor.
  • the masked out channel/drive is not written to or read from by the associated 80186 channel processor.
  • Figure 6 through to 13 are diagrams illustrating the operation of the software run by the central controller 22.
  • Figure 6 illustrates the steps undertaken during the writing of data to the banks of disk drives. Initially the software is operating in "background" mode and is awaiting instructions. Once an instruction from the host is received indicating that data is to be sent, it is determined whether this is sequential within an existing segment. If data is sequential then this data is stored in the segment to form sequential data. If no sequential data exists in a buffer segment then either a new segment is opened (the write behind procedure illustrated in Figure 8) and data is accepted from the host, or the data is accepted into a transit buffer and queued ready to write into a segment. If there is no room for a new segment then the segment is found which has been idle for the most time. If there are no such segments then the host write request is entered into a suspended request list.
  • If a segment is available, it is determined whether this is a read or write segment. If it is a write segment then if it is empty it is de-allocated. If it is not empty then the segment is removed from consideration for de-allocation. If the segment is a read segment then the segment is de-allocated and opened ready to accept the host data.
  • Figure 7 illustrates the steps undertaken during read operations.
  • the controller is in a "background" mode.
  • When a request for data is received from the host computer, if the start of the data requested is already in a read segment then data can be transferred from the central buffer 26 to the host computer. If the data is not already in the central buffer 26, then it is ascertained whether it is acceptable to read ahead information. If it is not acceptable then a read request is queued. If data is to be read ahead then it is determined whether there is room for a new segment. If there is then a new segment is opened and data is read from the drives to the buffer segment and is then transferred to the host computer. If there is no room for a new segment then the segment is found for which the longest time has elapsed since it was last accessed, and this segment is de-allocated and opened to accept the data read from the disk drives.
  • the read ahead procedure illustrated in Figure 9 is performed. It is determined whether there are any read segments open which require a data refresh. If there is such a segment then a read request for the I/O handler for the segment is queued.
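The read-segment handling of Figure 7 amounts to a least-recently-accessed replacement policy. A minimal sketch follows, assuming a hypothetical `ReadBuffer` class in which a simple counter stands in for elapsed time and each segment is identified only by its starting block; the actual disk transfers and read-ahead are omitted.

```python
class ReadBuffer:
    """Toy model of read-segment allocation in the central buffer."""

    def __init__(self, max_segments):
        self.max_segments = max_segments
        self.segments = {}  # start block -> tick of last access
        self.tick = 0       # stand-in for a clock

    def request(self, start_block):
        self.tick += 1
        if start_block in self.segments:
            # Start of the requested data is already in a read segment.
            self.segments[start_block] = self.tick
            return "hit"
        if len(self.segments) >= self.max_segments:
            # No room: de-allocate the segment idle for the longest time.
            oldest = min(self.segments, key=self.segments.get)
            del self.segments[oldest]
        self.segments[start_block] = self.tick  # open a new segment
        return "miss"
```

With two segments available, requests for blocks 10, 20, 10 and 30 leave segments 10 and 30 open, because segment 20 is the one that has gone longest without access when room is needed.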
  • Figure 10 illustrates the software steps undertaken to restart suspended transfers. It is first determined whether there are suspended host write requests in the list. If there are, it is determined whether there is room for allocation of a segment for suspended host write requests. A new segment for the host transfer is opened and the host request which has been suspended longest is determined and data is accepted from the host computer into the buffer segment.
  • Figure 11 illustrates a form of "housekeeping" undertaken by the software in order to clean up the segments in the central buffer 26. It is determined at a point that it is time to clean up the buffer segments. All the read segments which have times since the last access time larger than a predetermined limit termed the "geriatric limit" are found and reallocated. Also it is determined whether there are any such write segments and if so write operations are tidied up.
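The housekeeping pass can be pictured as a single sweep over the open segments. In this sketch the function name `clean_up`, the dictionary layout of a segment and the three returned lists are all assumptions made for illustration; the source only states that over-age read segments are reclaimed and that over-age write segments have their write operations tidied up.

```python
def clean_up(segments, now, geriatric_limit):
    """Split segments into those kept, reclaimed, and needing a flush.

    Each segment is a dict with 'kind' ('read' or 'write') and
    'last_access' (time of last access, same units as now).
    """
    keep, reclaimed, to_flush = [], [], []
    for seg in segments:
        idle = now - seg["last_access"]
        if idle <= geriatric_limit:
            keep.append(seg)       # still young enough to stay open
        elif seg["kind"] == "read":
            reclaimed.append(seg)  # stale read-ahead data can simply go
        else:
            to_flush.append(seg)   # stale write data must reach disk first
    return keep, reclaimed, to_flush
```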
  • Figure 12 illustrates the operation of the input/output handler
  • Figure 13 illustrates the operation of the input/output sub system
  • Figures 14 through to 19 are diagrams illustrating the operation of the software run by the central controller 22 and the slave controllers 32 during RAID-53 operation.
  • Figure 14 illustrates the steps undertaken by the central controller 22 when selected as the SCSI target.
  • a command from the initiator or host computer
  • syntax checked. If a fault is detected the command is terminated by a check command status and the controller returns to background processing. If the syntax check indicates no errors then it is determined whether a queue tag message has been received to assign a queue position. If not and a command is already running a busy status is generated and the controller returns to background processing. If a command is not already running or if a queue tag message has been received it is determined whether data is required with the command. If data is required then a buffer segment is allocated for the data and if the command is to write data then data is received from the initiator into the allocated buffer segment.
  • a queue full status is generated and the controller returns to background processing. If the command is to read data or the command is to write data and data is received from the initiator into the allocated buffer then a command control block is allocated. If there is no space for this a queue full status is generated and the controller returns to background processing. If a command control block can be successfully allocated the appropriate command is issued to the slave bus controller 32 (an 80186 processor) and the command control tag pointer is passed as a tag. A disconnect message is then sent to the initiator and the controller returns to background processing.
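The decision sequence taken when the central controller is selected as the SCSI target can be summarised as a chain of checks. The sketch below is an interpretation rather than the patented software: the function name, its boolean parameters and the returned status strings are hypothetical, and the assumption that a failed buffer allocation produces the queue full status is inferred from the surrounding text.

```python
def accept_command(syntax_ok, queue_tag, command_running,
                   needs_data, buffer_available, ccb_available):
    """Return the status the target reports for an incoming command."""
    if not syntax_ok:
        return "check status"  # command terminated after failed syntax check
    if not queue_tag and command_running:
        return "busy"          # untagged command while another is running
    if needs_data and not buffer_available:
        return "queue full"    # assumed: no buffer segment could be allocated
    if not ccb_available:
        return "queue full"    # no space for a command control block
    # Command issued to the slave bus controllers; initiator is released.
    return "disconnect"
```

Every path other than the last returns the controller to background processing without issuing anything to the slave bus controllers.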
  • this diagram illustrates the operation of the software in the central controller when the slave bus controller responds to commands.
  • Data can be read from the slave bus controller when the response available interrupt is generated.
  • the response information is read from the dual port RAMs (DPRAM) and the tag from this response is used to look up the correct command control block.
  • the receipt of a response from the particular slave bus controller is recorded in the command control block completion flags. It is then determined whether all of the slave bus controllers in the channels have responded and if not whether the command overall time-out has elapsed. If the command overall time-out has not elapsed then the central controller returns to background processing to read the channels which have not responded when they are available. If the command overall time-out has elapsed then a channel fault is recorded.
  • If the completion of the command does require a data transfer then it is determined whether there is a faulty disk in the bank of disks being accessed. If so, then the appropriate channel is masked to cause a reconstruction of the missing data.
  • the initiator that gave the command is reselected and passed LUN identity and queue tag message.
  • the central processor then goes into background processing until an interrupt is received whereupon a data in bus phase is asserted and data is transferred. The central processor then returns to background processing awaiting interrupt whereupon a good status is returned.
  • Figures 16a and 16b illustrate the operation of the software by the slave bus controllers upon receipt of commands from the central controller.
  • When the slave bus controller receives a command from the central controller, the command is read from the DPRAM. The command is decoded and syntax checked and if faulty is rejected. Otherwise, it is determined whether the command is a data read or write request. If it is not then the command is analysed to determine if a memory buffer is required and if so it is allocated. If there is no buffer space then the process is suspended to allow the reading of data to continue. The process is resumed when space is available. Then an input/output queue element is constructed and set up according to command requirements. The queue element is then put into the input/output queue and linked onto the destination targets list.
  • If the command is a data read or write request then it is determined which targets are to be used.
  • the array block address is then converted to the target block address. It is then determined if the data received is to be diverted (or dumped) or a read modify write is required.
  • If the command is a read data request then it is determined whether the transfer crosses bank boundaries. If not, then the input/output queue element is constructed and set up for the single read. If the transfer crosses bank boundaries then an input/output link block is allocated and it is recorded that two reads are to be performed for this command. If it is determined that there is no space then the process is suspended to allow the background to continue and resume when space is available. Otherwise the input/output queue element is constructed and set up to read the target and queue request. The input/output queue is also constructed and set up to read the target plus one and the request is queued. The slave bus controller then returns to background processing.
  • If the command is a data write request then as shown in Figure 16b it is determined whether the transfer crosses bank boundaries. If not, it is determined whether any read modify writes are required. If so, an I/O link block is allocated or the operation suspended until space is available. I/O queue elements for each of the reads of one or two read modify write sequences are constructed as required. An I/O queue element for the aligned part of the write is then constructed if required and the request is queued. The slave bus controller then enters background processing.
  • If the transfer of data does cross bank boundaries then it is determined whether the write to the lower target requires a front read modify write. If so, the I/O queue element for the read part of the read modify write is constructed (lower target) and a request is queued. The I/O queue element for the aligned write part of the transfer is then constructed (lower target) and the request is queued. It is then determined whether the write to the higher target requires a back read modify write and if so an I/O queue element for the read part of the read modify write is constructed (higher target) and the request is queued. The I/O queue element for the aligned part of the write is then constructed (higher target) and the request queued. The slave bus controller then enters background processing.
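The conversion of an array block address to a target block address, and the split of a transfer that crosses a bank boundary into an access to the target and an access to the target plus one, can be illustrated as follows. The function name and the assumption that every target holds the same number of blocks, laid out sequentially, are hypothetical; read modify write handling is left out.

```python
def split_transfer(start_block, count, blocks_per_target):
    """Split an array transfer into per-target (target, block, count) parts."""
    target, offset = divmod(start_block, blocks_per_target)
    first = min(count, blocks_per_target - offset)
    parts = [(target, offset, first)]
    remaining = count - first
    if remaining > blocks_per_target:
        # The text only describes transfers touching at most two targets.
        raise ValueError("transfer spans more than two targets")
    if remaining:
        parts.append((target + 1, 0, remaining))  # the "target plus one" access
    return parts
```

A 20-block transfer starting at block 90 with 100 blocks per target therefore becomes two queued requests, `(0, 90, 10)` and `(1, 0, 10)`, which is the case in which an I/O link block would be needed to tie the two together.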
  • Figure 17 illustrates the operation of the input/output handling by the slave bus controllers.
  • the SCSI bus phases are handled to perform a required I/O for the specified target. If a target was disconnected it is determined whether the command complete message has been received. If not, a warning is generated and a target fault is logged.
  • the SCSI I/O queue element of command just completed is examined to determine if command completion function can be executed at this current interrupt level. If so, then the last SCSI I/O command completion function is executed as specified in I/O queue element. Also the I/O queue element is unlinked from the SCSI I/O queue and is marked as being free for other uses.
  • the last SCSI I/O command completion function and pointer to I/O queue element is entered onto the background process queue. Also the I/O queue element from the SCSI I/O queue is unlinked and the element is not marked as free. It remains in use until it is freed by the command completion function which will be executed from the background queue.
  • the next I/O request from the SCSI I/O queue is extracted using the I/O request from the target with the lowest average throughput. If several have a low figure, the lowest target is used. A select target command is then issued to the SCSI and an I/O is queued before the processor returns to background processing. If the I/O queue is empty a flag is set to show that the SCSI I/O has stopped.
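The rule for choosing the next request, namely the target with the lowest average throughput and, among equals, the lowest target, can be expressed compactly. The data layout and the name `next_target` are assumptions, as is treating "several have a low figure" as an exact tie.

```python
def next_target(queue):
    """Pick the target whose pending I/O should be started next.

    queue maps target id -> (average throughput, list of pending requests).
    Returns None when nothing is pending (SCSI I/O has stopped).
    """
    candidates = [
        (average, target)
        for target, (average, pending) in queue.items()
        if pending
    ]
    if not candidates:
        return None
    # Tuples compare by throughput first, then by target id on a tie.
    return min(candidates)[1]
```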
  • Figure 18 illustrates a simple input/output completion function by a slave bus controller. This is executed by the SCSI I/O handler from the SCSI interrupt level. The SCSI I/O queue element is examined and the queue tag is extracted. The queue tag is given by the central controller when the command was issued to the slave bus controller. If the SCSI I/O was unsuccessfully executed then the queue tag and a fault response is sent to the central controller. If the SCSI I/O is executed successfully then the queue tag and an "acknowledge" response is sent to the central controller to inform it of command completion.
  • Figure 19 illustrates the operation of a complex I/O completion function by a slave bus controller. This is executed in the background from the background queue.
  • the I/O queue element is accessed with the pointer queued along with the completion function.
  • the I/O link block associated with this I/O is then accessed and in the I/O link block it is recorded that the I/O has completed. If the I/O was unsuccessfully completed then the fault details from the SCSI I/O queue element are stored in the I/O link block error information area.
  • the queue tag, fault response and the fault information is sent to the central controller.
  • the I/O link block and all attached buffers are freed, as well as the SCSI I/O queue element.
  • the "tidy-up" referred to hereinabove forms the final operation of the slave bus controllers when all associated SCSI I/O has completed successfully.
  • disk drives above the 500 megabyte level can typically only be formatted to a minimum of 256 bytes per sector. Further, new disk drives above the 1 gigabyte capacity, can typically only support a minimum of 512 byte sectors. This would mean that the controller would only be able to support host sector sizes of two kilobytes.
  • each slave disk sector contains four host sectors in what is termed "virtual" slave sectors of 128 bytes.
  • the controller has to extract an individual sector of 128 bytes from within the larger actual 512 bytes slave disk drive sector.
  • the controller has first to read the required overall sector, then modify the data for the actual part of the sector that is necessary, and then finally write the overall slave disk sector back to the disk drive.
  • This is a form of read modify write operation and can slow down the transfer of data to the disk drives but this is not normally a problem. Also, for large transfers of data to or from the disk drives, the effect of this problem is minimal and is not noticed by the host computer.
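The virtual slave sector scheme can be illustrated with a small model in which a list of 512-byte `bytes` objects stands in for one slave disk drive. The constant and function names are hypothetical; only the sizes (512-byte actual sectors holding four 128-byte virtual sectors) come from the text.

```python
SLAVE_SECTOR = 512    # actual sector size of the slave disk drive
VIRTUAL_SECTOR = 128  # virtual slave sector size
PER_SLAVE = SLAVE_SECTOR // VIRTUAL_SECTOR  # four virtual sectors per sector


def locate(virtual_index):
    """Return (actual sector number, slot within that sector)."""
    return divmod(virtual_index, PER_SLAVE)


def read_virtual(disk, virtual_index):
    sector, slot = locate(virtual_index)
    start = slot * VIRTUAL_SECTOR
    return disk[sector][start:start + VIRTUAL_SECTOR]


def write_virtual(disk, virtual_index, payload):
    if len(payload) != VIRTUAL_SECTOR:
        raise ValueError("payload must be exactly one virtual sector")
    sector, slot = locate(virtual_index)
    buf = bytearray(disk[sector])                 # read the whole sector
    start = slot * VIRTUAL_SECTOR
    buf[start:start + VIRTUAL_SECTOR] = payload   # modify the relevant part
    disk[sector] = bytes(buf)                     # write the whole sector back
```

Writing virtual sector 5 touches only bytes 128 to 255 of actual sector 1, but the whole 512-byte sector is read and rewritten, which is the cost referred to above.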
  • the hardware shown in Figure 1 can be expanded so that the host computer has access to a three dimensional array of disk drives. This is applicable to both RAID-35 and RAID-53 systems.
  • Figure 21 illustrates an arrangement of the disk drives in three dimensions with respect to the computer memory controller 100.
  • Each plane of disk drives corresponds to the two dimensional array illustrated in Figure 1 (42,0. , ...42,6,E).
  • the number of buffer memories 26 and data splitting and parity logic 24 is increased to five, one for each two dimensional array (or plane) of disk drives.
  • the central controller 22 then controls each buffer memory 26 and its associated slave controllers 32 independently.
  • Each data splitting and parity logic 24 is connected to its associated buffer memory 26 and to the SCSI-1 interface 12.
  • Figure 16 illustrates the use of a second computer memory controller 100B.
  • the second computer memory controller 100B is provided in case of failure of the main computer memory controller 100A.
  • the second computer memory controller 100B is connected to each of the SCSI-1 buses at a different address to the main computer memory controller 100A. This reduces the number of banks of disk drives which can be provided to six since two of the SCSI-1 addresses are taken up by the controllers 100A and 100B.
  • This arrangement provides for controller redundancy where it is not acceptable to have to shut down to repair a fault.
  • the hardware shown in Figures 1, 21 and 22 can operate both RAID-35 and RAID-53.
  • the hardware can operate both systems by sharing the hardware. For instance at start-up a portion of the buffer memory 26 could be allocated to RAID-53, the remainder being allocated for RAID-35.
  • a buffer segment is opened in the portion of the buffer memory allocated for RAID-53 and data read thereto. If sequential data is detected by the central controller 22 then a buffer segment in the appropriate buffer portion is allocated and data read from the disk banks, together with read ahead information in the normal RAID-35 operation.
  • the disk banks can either be shared or a number of disk banks could be allocated for use by RAID-53 and the remainder for use by RAID-35.
  • This apportionment of the hardware can take place selectably by a user or it could take place automatically dependent on the sequential and non sequential data ratios.
  • the system could initially be set up in RAID-53 mode upon start-up and the size of the portion of the buffer memory 26 and the number of disk banks allocated for RAID-35 will depend on the number of sequential data requests.
  • Overlay bank stripping is the term used hereinafter for the distribution of data amongst the memory banks and is applicable to both RAID-35 and RAID-53.
  • the data is stored in the banks sequentially. That is the logical blocks of user data are arranged sequentially across the disk surface and then sequentially across each additional bank. This does not fully utilise the ability of the system to read and write to banks simultaneously. If the data is better distributed over the banks of disk drives it is possible to simultaneously read and write to banks even using the RAID-35 arrangement.
  • the overlay bank stripping technique operates by writing data received from the host computer onto a predefined segment of the first bank. Once this segment is full the data is then written onto a segment having the same logical position in the next bank. This is repeated until the same logical segment in each bank is full whereupon data is written to the next logical segment in the first bank. This is repeated until the array is full.
  • This process has the advantage of evenly distributing the data over the banks of the array, therefore increasing the likelihood that data required by the host computer is located on different banks which can be read simultaneously to increase the speed of data retrieval. Further, since the controller allocates addresses for each segment data can be written to different banks simultaneously to increase the speed of data storage.
  • Figure 23 illustrates the distribution of data in segments within the array.
  • a segment can be defined as a data area that contains at least one block of disk data, e.g. 512 bytes, but more likely many multiples of disk data blocks, e.g. 64K bytes as shown in Figure 23.
  • If a host data block is 512 bytes, this is segmented using the RAID-35 or RAID-53 technique to apply 128 bytes to each channel.
  • a 64K byte segment on each disk drive of each bank can contain 512 of these host data block segments.
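The mapping implied by overlay bank stripping can be written down directly. In this sketch the function name and the zero-based numbering of banks, segment positions and blocks are assumptions; the figures of seven banks, 64K byte segments and 128 bytes per channel per host block are those given in the text.

```python
def locate_block(host_block, banks=7, segment_bytes=64 * 1024,
                 bytes_per_channel=128):
    """Map a host block number to (bank, segment position, block in segment)."""
    blocks_per_segment = segment_bytes // bytes_per_channel  # 512 here
    segment, offset = divmod(host_block, blocks_per_segment)
    # Consecutive logical segments rotate through the banks before the
    # next segment position in the first bank is used.
    position, bank = divmod(segment, banks)
    return bank, position, offset
```

Host blocks 0 to 511 fall in the first segment of bank 0, block 512 starts the same logical segment in bank 1, and block 3584 (7 × 512) returns to bank 0 at the next segment position, so a long sequential transfer is spread across all seven banks.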
  • Although the size of the segment described hereinabove is 64K bytes, the segment size can be user selectable to allow tailoring to suit the performance optimisation required for different applications.
  • the ability to read sequential data may be penalised using overlay bank stripping.
  • overlay bank stripping enhances the performance since it allows the data on different banks to be simultaneously read.
  • This technique can increase the rate of data transfer to and from the array and can overcome a limitation caused by the limited access speed provided by each individual disk. If the data is distributed in a segment on each bank in the arrangement shown in Figure 23 then the transfer rate is increased by a factor of seven.
  • the segment size may need to be of sufficient size, e.g. 64K bytes.
  • overlay bank stripping can be used with either the RAID-35 or RAID-53 techniques and where the computer memory controller is arranged to operate both by appropriately assigning the banks for the two techniques, overlay bank stripping can be used by both techniques if the disk banks are shared or only one of RAID-35 or RAID-53 if the disk banks are appropriately allocated.
  • the controller of the present invention provides for large scale sequential data transfers from memory units for multiple users of a host computer and/or random requests for small amounts of data from a multitude of users.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A computer memory controller for interfacing to a host computer comprises a buffer memory (26) for interfacing to a plurality of memory units (42) and for holding data read thereto and therefrom. A central controller (22) is operative to control the transfer of data to and from the host computer and the memory units (42). The buffer memory (26) is controlled to form a plurality of buffer segments for addressably storing data read from or written to the memory units (42). The central controller (22) is operative to allocate a buffer segment for a read or write request from the host computer, of a size sufficient for the data. The central controller (22) is also operative in response to data requests from the host computer to control the memory units (42) to seek data stored in different memory units (42) simultaneously.

Description

COMPUTER MEMORY ARRAY CONTROL
This invention relates to computer memories, and in particular to a controller for controlling and a method of controlling an array of memory units in a computer.
For high performance Operating Systems and Fileservers, an idealistic computer memory would be a memory having no requirement to "seek" the data. Such a memory would have instantaneous access to all data areas. Such a memory could be provided by a RAM disk. This would provide for access to data regardless of whether it was sequential or random in its distribution in the memory. However, the use of RAM is disadvantageous compared to the use of conventional magnetic disk drive storage media in view of the high cost of RAM and especially due to the additional high cost of providing "redundancy" to compensate for failure of memory units.
Thus the most commonly used non-volatile computer memories are magnetic disk drives. However, these disk drives suffer from the disadvantage that they require a period of time to position the head or heads with the correct part of the disk corresponding to the location of the data. This is termed the seek and rotation delay. This delay becomes a significant portion of the data access time when only a small amount of data is to be read or written to or from the disk.
For disk drives, the seek and rotational latency times can considerably limit the operating speed of a computer. The input/output (I/O) speed of disk drives has not kept pace with the development of microprocessors and therefore memory access time can severely restrain the performance of modern computers. In order to reduce the data access time for a large memory, a number of industry standard relatively inexpensive disk drives have been used. Since a large array of these is used, some redundancy must be incorporated in the array to compensate for disk drive failure.
It is known to provide disk drives in an array of drives in such a way that the contents of any one drive can, should that drive fail, be reconstructed in a replacement drive from the information stored in the other drives.
Various classifications of arrangements that can perform this are described in more detail in a paper by D.A. Patterson, G. Gibson and R.H. Katz under the title "A Case for Redundant Arrays of Inexpensive Disks (RAID)", Report No. UCB/CSD 87/391 12/1987, Computer Science Division, University of California, U.S.A.
This document describes two types of arrangements. The first of these arrangements is particularly adapted for large scale data transfers and is termed "RAID-3". In this arrangement at least three disk drives are provided in which sequential bytes of information are stored in the same logical block positions on the drives, one drive having a check byte created by a controller written thereto, which enables any one of the other bytes on the disk drives to be determined from the check byte and the other bytes. The term "RAID-3" as used hereinafter is as defined by the foregoing passage.
In the RAID-3 arrangement there are preferably at least five disk drives, with four bytes being written to the first four drives and the check byte being written to the fifth drive, in the same logical block position as the data bytes on the other drives. Thus, if any drive fails, each byte stored on it can be reconstructed by reading the other drives. Not only can the computer be arranged to continue to operate despite failure of a disk drive, but also the failed disk drive can be replaced and rebuilt without the need to restore its contents from probably out-of-date backup copies. Moreover, even if one drive should fail, there is no loss of performance of the computer while the failed disk drive remains inactive and while it is replaced. A disk drive storage system having the RAID-3 arrangement is described in EP-A-0320107, the content of which is incorporated herein by reference.
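The check byte scheme described above can be illustrated with a minimal sketch. It assumes the check byte is a bitwise XOR of the four data bytes, which is the usual parity choice but is not stated in this passage; the function names `check_byte` and `rebuild_missing` are hypothetical.

```python
from functools import reduce

def check_byte(data_bytes):
    """XOR of the data bytes in one stripe (assumed parity scheme)."""
    return reduce(lambda a, b: a ^ b, data_bytes, 0)

def rebuild_missing(surviving_bytes, check):
    """Recover the single missing data byte from the survivors and the check byte."""
    return check_byte(surviving_bytes) ^ check

stripe = [0x52, 0x41, 0x49, 0x44]   # four bytes, one per data drive
parity = check_byte(stripe)         # byte written to the fifth drive

failed = 2                          # pretend the third data drive has failed
survivors = stripe[:failed] + stripe[failed + 1:]
recovered = rebuild_missing(survivors, parity)
```

Under this assumption, rebuilding a replacement drive amounts to applying `rebuild_missing` to every logical block position in turn.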
The second type of storage system, which is particularly adapted for multi-user applications, is termed "RAID-5". In the RAID-5 arrangement there are preferably at least five disk drives in which four sectors of each disk drive are arranged to store data and one sector stores check information. The check information is derived not from the data in the four sectors on the disk, but from designated sectors on each of the other four disks. Consequently each disk can be rebuilt from the data and check information on the remaining disks.
RAID-5 is seen to be advantageous, at least in theory, because it allows multi-user access, albeit with equivalent transfer performance of a single disk drive.
However, a write of one sector of information involves writing to two disks, that is to say writing the information to one sector on one disk drive and writing check information to a check sector on a second disk drive. However, writing the check sector is a read modify write operation, that is, a read of the existing data and check sectors first, because the old contents of those sectors must be known before the correct check information, based on the new data to be written, can be generated and written to disk. Nevertheless, RAID-5 does allow simultaneous reads by multiple users from all disks in the system which RAID-3 cannot support.
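The read modify write penalty described above can be sketched as follows, again assuming XOR check information (an assumption, not something the text specifies). The helper names `xor_bytes` and `updated_check` are hypothetical. The old data sector and the old check sector must both be read before the new check sector can be computed.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))

def updated_check(old_data, old_check, new_data):
    """New check sector from the old data, old check and new data sectors."""
    return xor_bytes(xor_bytes(old_check, old_data), new_data)

# Four tiny 4-byte "sectors" on four disks, plus their check sector.
data = [bytes([i] * 4) for i in (1, 2, 3, 4)]
check = bytes(4)
for sector in data:
    check = xor_bytes(check, sector)

# Overwrite the second sector: two reads (old data, old check), two writes.
new_sector = bytes([9] * 4)
new_check = updated_check(data[1], check, new_sector)
data[1] = new_sector

# Recomputing the check from scratch gives the same result.
recomputed = bytes(4)
for sector in data:
    recomputed = xor_bytes(recomputed, sector)
```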
On the other hand, RAID-5 cannot match the rate of data transfer achievable with RAID-3, because with RAID-3, both read and write operations involve a transfer to each of the five disks (in five disk systems) of only a quarter of the total amount of information transferred. Since each transfer can be accomplished simultaneously the process is much faster than reading or writing to a single disk, particularly where large scale transfers are involved. This is because most of the time taken to effect a read or write in respect of a given disk drive is the time taken for the read/write heads to be positioned with respect to the disk, and for the disk to rotate to the correct angular position. Clearly, this is as long for one disk as it is for all four. But once in the correct position, transfers of large amounts of sequential information can be effected relatively quickly.
Moreover, with the current trend for sequential information to be requested by the user, RAID-5 only offers multiple user access in theory, rather than in practice, because requests for sequential information by the same user may involve reading several disks in turn, thereby occupying those disks so that they are not available to other users.
Furthermore, when a drive fails in RAID-5 format, the performance of the computer is severely retarded. When reading, if the required information is on a sector in the failed drive, it must be derived by reading all four of the other disks. Similarly, when writing either check or information data to a working drive, the four working disks must first be read before the appropriate information sector is written and before the appropriate check information is determined and written.
A further problem with RAID-3 is that disk drives are presently made to read or write minimum amounts of information on each given occasion. This is the formatted sector size of the disk drive and is usually a minimum of 256 Bytes. In RAID-3 format this means that the minimum block length on any read or write is 1,024 Bytes. With growing disk drive capacities the tendency is towards even larger minimum block sizes such as 512 Bytes, so that RAID-3 effectively quadruples that minimum to 2,048 Bytes. However, many applications for computers, for example those employing UNIX version 5.3, require a minimum block size of only 512 Bytes and in this event, the known RAID-3 technique is not easily available to such systems. RAID-5 on the other hand does not increase the minimum data block size.
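The block size arithmetic above can be checked with a short sketch. The function name `raid3_min_block` is hypothetical; the figures of 256 and 512 Bytes and the four data drives are taken from the text.

```python
DATA_DRIVES = 4  # four data drives plus one check drive, as in the five-drive example

def raid3_min_block(sector_bytes, data_drives=DATA_DRIVES):
    """Smallest host-visible block when every data drive transfers one sector."""
    return sector_bytes * data_drives

small = raid3_min_block(256)   # 256-Byte sectors
large = raid3_min_block(512)   # 512-Byte sectors
```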
Nevertheless, it is the multi-user capability of RAID-5 which makes it theoretically more advantageous than RAID-3; but, in fact, it is the data transfer rate and continued performance in the event of drive failure in RAID-3 format which gives the latter much greater potential. So it is an object of the present invention to provide a system which exhibits the same multi-user capability of a RAID-5 disk array, or indeed better capability in that respect.
The inventor has previously developed a system which has been termed RAID-35 and which is disclosed in the specification of PCT/GB90/01557. This system offers performance equal to, if not better than, that of RAID-3 and RAID-5. This system recognises that with modern operating systems, data files tend to be sequential in the nature of their storage on the disk drive surface and read and write operations tend to be sequential or at least partially sequential in nature. Thus even with multi-user access to a disk storage medium, each user may require some sequential data in sequential requests. The RAID-35 system vastly reduces the delay in a host computer receiving data requested from the disk array since sequential data is read ahead and stored in buffer segments. Thus if the requested data is sequential to a previous request then there is no seek delay, since the data is present in the buffer segment.
The RAID-35 system is thus highly efficient for applications where users are likely to request sequential data. On the other hand if the data requests are random, the advantages of the RAID-35 system cannot be realised.
It is an object of the present invention to provide a computer memory controller capable of providing a host computer with random data in a fast and efficient manner.
It is also an object of the present invention to provide a computer memory controller capable of operating the RAID-35 arrangement and capable of being interfaced to a three dimensional array of memory units.
It is also an object of the present invention to provide a computer memory controller capable of operating the RAID-35 arrangement as well as providing a host computer with random data in a fast and efficient manner.
The present invention provides a computer memory controller for interfacing to a host computer comprising a buffer means for interfacing to a plurality of memory units and for holding data read thereto and therefrom; and control means operative to control the transfer of data to and from said host computer and said memory units; said buffer means being controlled to form a plurality of buffer segments for addressably storing data read from or written to said memory units; said control means being operative to allocate a buffer segment for a read or write request from the host computer, of a size sufficient for the data; said control means being further operative in response to data requests from said host computer to control said memory units to seek data stored in different memory units simultaneously.
The present invention also provides a method of controlling a plurality of memory units for use with a host computer comprising the steps of repeatedly receiving from said host computer a read request for data stored in said memory units and allocating a buffer segment of sufficient size for the data to be read; and seeking data in said plurality of memory units simultaneously.
The present invention further provides a computer memory controller for a host computer comprising buffer means for interfacing to at least three memory channels arranged in parallel, each memory channel comprising a plurality of memory units connected by a bus such that each memory unit of said memory channel is independently accessible; respective memory units of said memory channels forming a memory bank; a logic circuit connected to said buffer means to split data input from said host computer into a plurality of portions such that said portions are temporarily stored in a buffer segment before being applied to ones of a group of said memory channels for storage in a memory bank; said logic circuit being further operative to recombine portions of data successively read from successive ones of a group of said memory units of a memory bank and into said buffer means; said logic circuit including parity means operative to generate a check byte or group of bits from said data for temporary storage in said buffer means before being stored in at least one said memory unit of said memory bank, and operative to use said check byte to regenerate said data read from said group of memory units of a memory bank if one of said group of memory units fails; said buffer means being divided into a number of channels corresponding to the number of memory channels, each said channel being divided into associated portions of buffer segments; and a control means operative to control the transfer of data and check bytes or groups of bits to and from said memory banks, including allocating a buffer segment for a read or write request from the host computer of a sufficient size for the data, and controlling said memory banks to seek data stored in different memory banks simultaneously.
The present invention still further provides a computer storage system comprising a plurality of memory units arranged into a two dimensional array having at least three memory channels arranged in parallel, each said memory channel comprising a plurality of memory units connected by a bus such that each memory unit is independently accessible; respective memory units of said memory channels forming a memory bank; and a controller comprising buffer means interfaced to said memory units and for holding information read from said memory channels; said buffer means being controlled to form a plurality of buffer segments for addressably storing data read from or written to said memory units; a logic circuit connected to said buffer means to recombine bytes or groups of bits read from ones of a group of said memory units in a memory bank, parity means operative to use a check byte or group of bits read from one of said memory units in said memory bank to regenerate information read from said group of memory units if one of said group of memory units fails; and control means for controlling the transfer of data to and from said host computer and said memory units, including allocating a buffer segment for a read or write request from the host computer of a sufficient size for the data, and controlling said memory banks to seek data stored in different memory banks simultaneously.
Conveniently the system of the present invention can be termed RAID-53 since it utilises a combination of RAID-3 and RAID-5 to provide for fast random access. RAID-53 like RAID-5 allows for simultaneous reads by multiple users from all the disk banks in the system whilst also reducing the read time since the data is split between a number of disks which are read simultaneously.
In order to increase the speed of access to data stored in the disk array using RAID-53 the disk banks can be addressably segmented such that respective segments on sequential banks have a sequential address. This allows sequential data to be written to segments on sequential banks and thus distributes or "stripes" the data across the memory banks. This technique is termed hereinafter "overlay bank stripping".
This organisation of data on the disk array is controlled by the controller and not the host computer. The controller assigns addresses to segments of the disk banks in such a way that when data is written to the disk array it is striped across the banks.
This stripping of the data is also applicable to RAID-35 and will allow data to be read or stored on different banks simultaneously.
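One way the controller might assign sequential segment addresses to sequential banks is a simple round-robin mapping. The sketch below is an assumption for illustration only; the text does not give the actual address calculation, and the name `locate_segment` is hypothetical. Seven banks are used, matching the SCSI-1 example given later.

```python
def locate_segment(segment_address, num_banks):
    """Map a logical segment address to (bank, segment index within that bank).

    Consecutive addresses fall on consecutive banks, so sequential data is
    spread across all banks before any bank receives a second segment.
    """
    bank = segment_address % num_banks
    segment_in_bank = segment_address // num_banks
    return bank, segment_in_bank

NUM_BANKS = 7
layout = [locate_segment(addr, NUM_BANKS) for addr in range(9)]
# Addresses 0..6 land on banks 0..6; address 7 wraps to bank 0, second segment.
```

With such a mapping, a run of sequential requests touches different banks, so their seeks can overlap.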
Preferably the memory units are disk drives and there are five per memory bank, i.e. five memory channels, one disk containing the check information, four disks containing the data. If the currently standard disk drive interface SCSI-1 (Small Computer Systems Interface) is used then since this has an eight address limit, one of which will be used by a controller, seven memory banks can be used. Alternatively if SCSI-2 is used then 15 banks can be used. The present invention is not however limited to the use of such an interface and any number of memory banks could be used. In fact the more memory banks that are present, the more that can be simultaneously undertaking a seek operation, thus reducing data access time for the host computer. Preferably for optimum performance, the disk drives of a memory bank have their spindles synchronised.
This combination of RAID-3 and RAID-5 provides a simultaneous random access facility with a performance in excess of the theoretical maximum performance of RAID-5 systems with five slave bus drives. In addition the performance penalties of Read-Modify-Write characteristics of RAID-5 systems are avoided. What is provided is a fast and simple RAID-3 type Read/Write facility.
The RAID-53 system also sustains maximum transfer rate under a "single" disk drive failure condition per "bank" of disk drives.
During busy I/O requests the control means can queue host data requests for memory banks and carry out the data seek and transfer when the memory bank containing the requested data is not busy. Preferably the order in which these seeks take place is optimised to provide optimised seek ordering.
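The queueing behaviour described above can be sketched as one pending list per bank, with a seek started on every idle bank at each dispatch. The class `BankQueues`, its methods and the nearest-block-first ordering are all assumptions made for illustration; the text does not specify the seek ordering policy.

```python
from collections import defaultdict

class BankQueues:
    """Hypothetical per-bank request queue with a simple seek ordering."""

    def __init__(self):
        self.pending = defaultdict(list)  # bank -> queued block addresses
        self.busy = set()                 # banks with a seek in progress
        self.head = defaultdict(int)      # last block accessed on each bank

    def submit(self, bank, block):
        self.pending[bank].append(block)

    def dispatch(self):
        """Start one seek on every idle bank that has queued work."""
        started = {}
        for bank, blocks in self.pending.items():
            if bank in self.busy or not blocks:
                continue
            # Nearest block first: one possible seek ordering (an assumption).
            nxt = min(blocks, key=lambda b: abs(b - self.head[bank]))
            blocks.remove(nxt)
            self.busy.add(bank)
            self.head[bank] = nxt
            started[bank] = nxt
        return started

    def complete(self, bank):
        self.busy.discard(bank)

q = BankQueues()
q.submit(0, 900)
q.submit(0, 10)
q.submit(3, 500)
first = q.dispatch()    # banks 0 and 3 both start seeking at once
second = q.dispatch()   # nothing starts: both banks are still busy
q.complete(0)
third = q.dispatch()    # bank 0 is free again and takes its queued request
```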
Preferably, when a write request is received by the controller, it can effect the immediate writing of the data to a memory bank to the detriment of any pending read or write requests. This prevents any important data being lost due to power failure for instance when the data normally would be held in a buffer segment.
In a preferred embodiment which increases the number of memory banks considerably, a number of buffer means, logic circuits and parity means are provided together with a number of associated two dimensional arrays of memory units. In this arrangement the control means is operative to control the transfer of data to and from the host computer and the three dimensional array of memory units formed of layers of the two dimensional arrays.
The hardware utilised for the RAID-35 system of PCT/GB90/01557 can be the same as that used for the RAID-53. Thus it is possible to provide RAID-35 and RAID-53 as options for the same hardware or they can be provided together and will share the hardware. In one shared system, a first portion of the buffer means is allocated for RAID-53. The remaining buffer memory is allocated for RAID-35 use. The memory banks can be shared or a number of them can be allocated for RAID-35 and the rest for RAID-53.
The RAID-35 operation is as follows. The transfer of sequential data to the host computer in response to requests therefrom is controlled by first addressing the buffer segments in the allocated part of the buffer means to establish whether the requested data is contained therein and if so supplying said data to said host computer. If the requested sequential data is not contained in the buffer segments of the allocated portion of the buffer means, data is read from the memory units and supplied to the host computer. Further data is read from the memory units which is logically sequential to the data requested by the host computer and the further data is stored in a buffer segment in the allocated portion of the buffer means. The control means also controls the size and number of buffer segments in the portion of the buffer means allocated for RAID-35 usage.
The array of disk drives provided by the RAID-35 and RAID-53 systems provides redundancy in the event of disk drive failure. In one embodiment of the invention there can also be provided redundancy in controllers. If a second controller is provided at a different address on the buses of the array then in the event of a failure of the main controller, the auxiliary controller can be activated with little or no down time of the system. The controller can then be repaired or replaced whilst the system is still running.
The present invention also provides a plurality of buffer means each for interfacing a plurality of memory units arranged into a two dimensional array having at least three memory channels, each memory channel comprising a plurality of memory units connected by a bus such that each memory unit is independently accessible; respective memory units of said memory channels forming a memory bank; a plurality of logic circuits connected to respective said buffer means to recombine bytes or groups of bits read from ones of a group of said memory units of a memory bank and stored in said buffer segments to generate the requested data; said logic circuits each including parity means operative to use a check byte or group of bits read from one of said memory units of said memory bank to regenerate data read from said group of memory units if one of said group of memory units fails; said buffer means being divided into a number of channels corresponding to the number of memory channels, each channel being divided into associated portions of buffer segments; and control means operative to control the transfer of data from a three dimensional array of memory units formed from a plurality of said two dimensional arrays to said host computer in response to requests therefrom by first addressing said buffer segments to establish whether the requested data is contained therein and if so supplying said data to said host computer, and if the requested data is not contained in the buffer segments, reading said data from the memory units, supplying said data to said host computer, reading from said memory units further data which is logically sequential to the data requested by said host computer and storing said further data in a buffer segment; said control means further controlling said buffer means to control the number and size of said buffer segments.
In this RAID-35 arrangement a three dimensional array of disk drives is provided to increase storage capacity.
Although at present the most common form of redundant array of inexpensive disks utilises magnetic disk drives, the present invention is not limited to the use of such disk drives. The present invention is equally applicable to the use of any memory device which has a long seek time for data compared to the data transfer rate once the data is located. Such media could, for instance, be an optical compact disk.
Thus such an array, according to the present invention, provides large scale storage of information together with the faster data transfer rates and better performance with regard to multi-user applications, and security in the event of any one drive failure (per bank). Indeed, the mean time between failures (MTBF) of such an array (meaning here the mean time between two simultaneous drive failures per bank, which is what is required for information to be lost beyond recall) is measured in many thousands of years with presently available disk drives each having individual MTBFs of many thousands of hours.
Examples of the present invention will now be described with reference to the accompanying drawings in which:
Figure 1 is a block diagram of the controller architecture of a disk array system according to one embodiment of the present invention.
Figure 2 illustrates the operation of the data splitting hardware.
Figure 3 illustrates the read/write data cell matrix.
Figure 4 illustrates a write data cell.
Figure 5 illustrates a read data cell.
Figure 6 is a flow diagram illustrating the software steps in write operations for RAID-35 operation.
Figure 7 is a flow diagram illustrating the software steps in read operations for RAID-35 operation.
Figures 8 and 9 are flow diagrams illustrating the software steps for read ahead and write behind for RAID-35 operation.
Figure 10 is a flow diagram illustrating the software steps involved to restart suspended transfers for RAID-35 operation.
Figure 11 is a flow diagram illustrating the software steps involved in cleaning up segments for RAID-35 operation.
Figures 12 and 13 are flow diagrams illustrating the steps involved for input/output control for RAID-35 operation.
Figures 14 and 15 are flow diagrams illustrating the software steps performed by the 80376 central controller of Figure 1 during RAID-53 operation.
Figures 16 to 19 are flow diagrams illustrating the software steps performed by the slave bus controllers of Figure 1 during RAID-53 operation.
Figure 20 is a block diagram of an embodiment of the present invention illustrating the access points for RAID-53 operation.
Figure 21 illustrates a block diagram of a three dimensional memory array according to one embodiment of the present invention.
Figure 22 illustrates the use of a redundant controller according to one embodiment of the present invention.
Figure 23 illustrates the distribution of data in segments within the array using the technique of overlay bank stripping.
Figure 1 illustrates the architecture of the RAID-35 and RAID-53 disk array controller, and initially both systems will be considered together.
In Figure 1 of the drawings the internal interface of the computer memory controller 10 is termed the ESP data bus interface and the interface to the host computer is termed the SCSI interface. These are provided in interface 12. The SCSI bus interface communicates with the host computer (not shown) and the ESP interface communicates with a high performance direct memory access (DMA) unit 14 in a host interface section 11 of the computer memory controller 10. The ESP interface is 16 bits (one word) wide.
The host interface section communicates with a central buffer management (CBM) section 20 which comprises a central controller 22, in the form of a suitable microprocessor such as the Intel 80376 Microprocessor, and data splitting and parity control (DSPC) logic circuit 24. These perform the function of splitting information received from the host computer into four channels, and generating parity information for the fifth channel. The DSPC 24 also combines the information on the first four channels and, after checking against the parity channel, transmits the combined information to the host computer. Furthermore, the DSPC 24 is able to reconstruct the information from any one channel, should that be necessary, on the basis of the information from the other four channels. The DSPC 24 is connected to a central buffer 26 which is divided into five channels A to E, each of which is divisible into buffer segments 28. Each central buffer channel 26,A through 26,E has the capacity to store up to half a megabyte of data for example, depending on the application required. For RAID-35, each segment may be as small as 128 kilobytes for example so that up to 16 segments can be formed in the buffer. For RAID-53 each segment will be as small as the minimum data request from the host computer.
The central buffer 26 communicates with five slave bus controllers 32 in a slave bus interface (SBI) section 30 of the memory controller 10.
Each slave bus controller 32,A through 32,E communicates with up to seven disk drives 42,0 to 42,6 along SCSI-1 buses 44,A through 44,E so that the drives 42,0,A through 42,0,E form a bank 0 of five disk drives and so also do drives 42,1,A through 42,1,E etc. to 42,6,A through 42,6,E. The seven banks of five drives effectively each constitute a single disk drive, each individually and independently accessible. This is made possible by the use of SCSI-1 buses, which allow for eight device addresses. One address is taken up by the slave bus controller 32 whilst the seven remaining addresses are available for seven disk drives. Thus for the RAID-35 system the storage capacity of each channel can be increased sevenfold and the slave bus controller 32 is able to access any one of the disk drives 42 in the channel independently. The use of more than one bank of disk drives is essential for the realisation of the advantage of RAID-53 operation.
This arrangement of banks of disk drives is not only applicable to the arrangement shown in Figure 1, but is also applicable to the RAID-3 arrangement. Information stored in the disk drives of one bank can be accessed virtually simultaneously with information being accessed from the disk drives of another bank. This arrangement therefore gives an enhancement in access speed to data stored in an array of disk drives.
In so far as the host computer is concerned, its memory consists of a number of sectors each identified by a unique address number. Where or how these sectors are stored on the various disk drives of the memory 40 is a matter of no concern to the host computer, it must merely remember the address of the data sectors it requires. Of course, addresses themselves may form part of the data stored in the memory.
On the other hand, one of the functions of the central controller 22 is to store data on the various disk drives efficiently. Moreover each sector in so far as the host is concerned, is split between four disk drives in the known RAID-3 format. Under RAID-35 operation, the central controller 22 arranges to store sectors of information passed to it by the host computer, in an ordered fashion so that a sector on any given disk drive is likely to contain information which logically follows from a previous adjacent sector.
To optimise performance, the disk drives of a bank should have their spindles synchronised.
Operation under RAID-35
When the host computer requires data, the read request is received by the central controller 22 which passes the request to the slave bus interface (SBI) controller 32. The slave bus controller 32 reads the disk banks 40 and selects the appropriate data from the appropriate banks of disks. The DSPC circuit 24 receives the requested data and checks it is accurate against the check data in channel E.
If there is any error detected by the parity check the controller may automatically try to re-read the data. If a parity error is still detected the controller may return an error to the host computer. If there is a faulty drive this can be isolated and the system arranged to continue working employing the four good channels, in the same way and with no loss of performance, until the faulty drive is replaced and rebuilt with the appropriate information.
Assuming however that the data is good, the central controller 22 first responds to the data read request by transferring the information to the SCSI-1 interface 12. However, it also instructs further information logically sequential to the requested information to be read. This is termed "read ahead information". Read ahead information up to the capacity presently allocated by the central controller 22 to any one of the data buffer segments 28 is then stored in one buffer segment 28.
When the host computer makes a further request for information, it is likely that the information requested will follow on from the information previously requested. Consequently, when the central controller 22 receives a read request, it first interrogates those buffer segments 28 to determine if the required information is already in the buffer. If the information is there, then the central controller 22 can respond to the user request immediately, without having to read the disk drives. This is obviously a much faster procedure and avoids the seek delay.
On those occasions when the required information is not already in the buffer, then a new read of the disk drives is required. Again, the requested information is passed on and sequential read ahead information is fed to another buffer segment. This process continues until all the buffer segments are filled and the system is maintained with its segments permanently filled. Of course, there comes a point when all the segments are filled, but still the disk drives must be read. It is only at this point that a buffer segment is finally deallocated by the central controller 22, by keeping note of which buffer segments 28 are or have been used most frequently, and dumping the most infrequently used one.
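The read ahead behaviour described in the preceding paragraphs can be sketched as a small buffer of segments that is checked before the disks are read and that evicts the least used segment when full. The class `ReadAheadBuffer`, its fields and the simple use counter are hypothetical; a Python list stands in for the disk array, and sizes are counted in sectors for brevity.

```python
class ReadAheadBuffer:
    """Hypothetical model of read ahead segments with least-used eviction."""

    def __init__(self, disk, max_segments, segment_sectors):
        self.disk = disk                      # list standing in for the disk array
        self.max_segments = max_segments
        self.segment_sectors = segment_sectors
        self.segments = {}                    # start sector -> read ahead sectors
        self.uses = {}                        # start sector -> hit count
        self.disk_reads = 0

    def _find(self, sector):
        for start, data in self.segments.items():
            if start <= sector < start + len(data):
                return start
        return None

    def read(self, sector):
        start = self._find(sector)
        if start is not None:                 # served from the buffer: no seek
            self.uses[start] += 1
            return self.segments[start][sector - start]
        self.disk_reads += 1                  # one seek covers request and read ahead
        if len(self.segments) >= self.max_segments:
            victim = min(self.uses, key=self.uses.get)   # least used segment
            del self.segments[victim], self.uses[victim]
        ahead = sector + 1                    # data logically following the request
        self.segments[ahead] = self.disk[ahead:ahead + self.segment_sectors]
        self.uses[ahead] = 0
        return self.disk[sector]

disk = [f"sector-{i}" for i in range(64)]
buf = ReadAheadBuffer(disk, max_segments=2, segment_sectors=4)
buf.read(10)            # miss: reads the disks and buffers sectors 11 to 14
hit = buf.read(11)      # hit: answered from the buffer segment
buf.read(12)            # hit
buf.read(40)            # miss: second segment allocated
buf.read(50)            # miss with buffer full: least used segment is dumped
```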
During the normal busy operation of the host computer, the central controller 22 will have allocated at least as many buffer segments 28 as there are application programs, up to the maximum number of segments available. Each buffer segment will be kept full by the central controller 22 ordering the disk drive seek commands in the most efficient manner, only over-riding that ordering when a buffer segment has been, say 50% emptied by host requests or when a host request cannot be satisfied from existing buffer segments 28. Thus all buffer segments are kept as full as possible with read ahead data.
To write information to the disk drives, a similar procedure is followed. When a write instruction is received by the central controller 22 information is split by DSPC circuits 24 and appropriate check information created. The five resulting components are placed in allocated write buffer segments. The number of write buffer segments may be preselected, or may be dynamically allocated as and when required. In any event, write buffer segments are protected against de-allocating until its information has been written to disk. Actual writing to disk is only effected under instruction from the host computer, if and when a segment becomes full and the system cannot wait any longer, or, more likely, when the system is idle and not performing any read operations.
In any event, simultaneous writes appear to be happening in so far as the host computer is concerned, because the central controller 22 is capable of handling commands very rapidly and storing writes in buffers while waiting for an opportunity* for the more time consuming actual writing to disk drives.
This does not mean however, that in the event of power failure, some writes, which the user will think have been recorded on disk, may in fact have been lost by virtue of its temporary location in the random access buffer at the time of power failure. In that event a restored disk drive system from back-up copies is required.
Alternatively, a hardware switch can be provided to ensure that all write instructions are effected immediately, with write information only being stored in the buffer segments transiently before being written to disk. This removes the fear that a power loss might result in data being lost which was thought to have been written to disk although not actually effected by the memory system. There is still, however, the unlikely exception that information may be lost when a power loss occurs very shortly after a user has sent a write command, but in that event, the user is likely to be conscious of the problem. If this alternative is utilised, it does of course affect the performance of the computer.
Operation under RAID-53
When the host computer requires data a request is received and a buffer segment allocated for that data. The read request is received by the central controller 22 which passes the request to the slave bus controller 32. The slave bus controller 32 reads the disk banks 40 and selects the appropriate data from the appropriate banks of disks. The DSPC circuit 24 receives the requested data and checks it is accurate against the check data in channel E.
If there is any error detected by the parity check the controller may automatically retry to read the data. If a parity error is still detected the controller may return an error to the host computer. If there is a faulty drive this can be isolated and the system arranged to continue working employing the four good channels, in the same way and with no loss of performance, until the faulty drive is replaced and rebuilt with the appropriate information.
Assuming that the data is good the central controller 22 responds to the data read request by transferring the data to the SCSI-1 interface 12, and then de-allocating the buffer segment. The disk bank is then free to accept another read request and can commence a seek operation under the command of the central controller 22.
The size of the buffer segments is determined by the size of the data requested by the host computer. No data is read ahead from the disk drives.
The central controller 22 is thus able to receive the read requests and determine in which disk bank that data lies. If the disk bank is idle then the disk bank can be instructed to seek the data. Simultaneously the other disk banks may be seeking data requested by the host computer at an earlier date, and once this has been located the central controller 22 can read the disk bank and pass the data to the buffer segments for reconstruction, from where it is passed to the SCSI-1 interface 12.
Figure 14 illustrates the seven access points to the seven disk banks. Each disk drive of each bank has a unique bus (SCSI) address and can thus be accessed independently by the computer memory controller 100. Thus up to seven disk banks can be operating simultaneously to seek data requested by the host computer. While a disk bank is seeking it is disconnected from the SCSI-1 interface. When the data is located this is indicated to the central controller 22 which can then read the data.
If a disk bank is busy when a new read request is received then the central controller 22 can queue these requests. To provide an optimised seek ordering, the queued read requests may not necessarily be performed in the order in which the host computer issued the commands. Such queuing of read requests could also be performed on the slave bus controllers 32.
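The reordering of queued read requests can be illustrated with a minimal sketch. The text does not state which ordering rule the controller uses; the one-directional sweep below, and the names `order_seeks` and `head_position`, are assumptions made purely for illustration.

```python
def order_seeks(pending_blocks, head_position):
    """Return pending block addresses in a one-directional sweep order.

    Assumed policy (not taken from the source): serve every request at or
    beyond the current head position in ascending order, then wrap round
    to the lowest outstanding address.
    """
    ahead = sorted(b for b in pending_blocks if b >= head_position)
    behind = sorted(b for b in pending_blocks if b < head_position)
    return ahead + behind


queue = [50, 10, 90, 30]  # order in which the host issued the reads
print(order_seeks(queue, head_position=40))
```

The point of the sketch is only that the service order may differ from the issue order; any other seek-minimising policy would serve equally well.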
For write operations very much the same thing happens. However the central controller 22 is provided with the capability of "forcing" the incoming data to be "immediately" written to the required bank of disk drives, rather than being queued with pending Read/Write commands. This ensures that data thought by the host computer to be written to disk is so written, in case of, for instance, power failure, where any data to be written to the disks that is stored in the buffer memory 26 would be lost.
Detailed Operation of Hardware for both RAID-35 and RAID-53
The detailed operation of the hardware data splitting, parity generation and checking logic, and buffer interface logic will now be described with reference to Figures 2 to 5 for both RAID-35 and RAID-53.
Referring to Figure 2, the controller's internal interface to the host system hardware interface is 16 bits (one word) wide. This is the ESP data bus. For every four words of sequential host data, one 64 bit wide slice of internal buffer data is formed. At the same time, an additional word or 16 bits of parity data is formed by the controller; one parity bit for every four host data bits. Thus the internal width of the controller's central data bus is 80 bits. This is made up of 64 bits of host data and 16 bits of parity data.
The data splitting and parity logic 24 is split up into 16 identical read/write data cells within the customised ASIC (application specific integrated circuit) design of the controller. The matrix of these data cells is shown in Figure 3. Each of these data cells handles the same data bit from the ESP bus for the complete sequence of four ESP 16 bit data words. That is, with reference to Figure 2, each data cell handles the same bit from each ESP bus word 0, 1, 2 and 3. At the same time, each data cell generates/reads the associated parity bit for these four 16 bit ESP bus data words.
For explanation purposes, only the first data bit 0 (DB0) will be described. Data bits DB1 through DB15 will be identical in operation and description.
Four basic operations are performed, namely
1. Writing host data
2. Reading of data to the host
3. Regeneration of "single failed channel" data during host read operations.
4. Rebuilding of data on a failed disk drive unit.
Writing of host data to the disk drive array
Referring now to Figure 4, as the corresponding data bit from each host 16 bit word is received on the ESP data bus, each of these four bits is temporarily stored/latched in devices G38 through G41. As each bit appears on the ESP bus, it is steered through the multiplexor under the control of the two select lines to the relevant D-type latches G33 through G36, commencing with G33. At the end of this initial operation, the four host 16 bit words (64 data bits) will have been stored in the relevant gates G38 through G41 within all 16 data cells. The four DB0 data bits are now called DB0-A through DB0-D.
During the write operations, the RMW (buffer read modify write) control signal is set to select input A from all devices G38 through G42. Under these conditions, the rebuild line is not used (don't care).
As each bit is clocked into the data cell, the corresponding parity data bit is generated via G31, G32, and G37. At the end of the sequence of the four bit 0s from each of the four incoming ESP bus host data words, the resultant parity bit will have been generated and stored on device G42. This is accomplished as follows. As the first bit 0 (DB0-A) appears on the signal DB0, the INIT line is driven high/true and the output from the gate G31 is driven low/off. Whatever value is present on DB0 will appear on the output of gate G32, and at the correct time will be clocked into the D-type G37. The value of DB0 will now appear on the Q output of G37. The INIT signal will now be driven low/off, and will now aid the flow of data through G31 for the next incoming three data bits on DB0. Whatever value was stored as DB0-A on the output of gate G37 will now appear on the output of gate G31, and as the second DB0 bit (DB0-B) appears on the signal DB0, an Exclusive OR value of these two bits will appear on the output of gate G32. At the appropriate time, this new value will be clocked into the device G37. At the end of the clock cycle, the resultant Q output of G37 will now be the Exclusive OR function of DB0-A and DB0-B. This value will now be stored on device G42. The above operation will continue as the remaining two DB0 bits (DB0-C and DB0-D) appear on the signal DB0. At the end of this operation, the accumulative Exclusive OR function of all bits DB0-A through DB0-D will be stored on device G42, and at the same time, bits DB0-A through DB0-D will be stored on devices G38 through G41 respectively.
The accumulative Exclusive OR (XOR) value of DB0-A through DB0-D is generated in this manner so as to preserve buffer timing and synchronisation procedures.
The five outputs DB0-A through DB0-E are present for all data bits 0 through 15 of the four host data words. The total of 80 bits are now stored in the central buffer memory (DRAM). The whole procedure is repeated for each sequence of four host data words (8 host data bytes).
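The write-path behaviour of the sixteen data cells can be modelled in software as a word-wide XOR accumulation. The following is a minimal sketch rather than a description of the ASIC: the function name `split_with_parity` is hypothetical, and it is assumed that the first host word goes to channel A, the second to channel B, and so on, with the parity word going to channel E.

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1


def split_with_parity(words):
    """Form one 80-bit buffer slice from four 16-bit host words.

    Returns a dict mapping channel names A-D to the host words (assumed
    ordering) and E to the parity word. Each bit of E is the XOR of the
    same bit position in the four host words, as one data cell would
    accumulate it.
    """
    if len(words) != 4:
        raise ValueError("expected exactly four host words")
    if any(w < 0 or w > WORD_MASK for w in words):
        raise ValueError("each host word must fit in 16 bits")
    parity = 0
    for w in words:  # accumulate as each word arrives, like G31/G32/G37
        parity ^= w
    return {"A": words[0], "B": words[1], "C": words[2], "D": words[3], "E": parity}


slice80 = split_with_parity([0x1234, 0xABCD, 0x00FF, 0xF0F0])
print({name: hex(value) for name, value in slice80.items()})
```

A useful property of the resulting slice is that the XOR of all five channel words is zero, which is what the read-side check relies on.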
As each "sector" of slave disk drive data is assembled in the central buffer, it is written to the slave disk drives (to channel A through channel E) within the same bank of disk drives.
If a failed slave channel or disk drive exists, then the controller will mask out that drive's data and no data will be written to that channel/disk drive. However, the data will be assembled in the central buffer in the normal manner.
Reading of array disk drive data to the host system
Referring now to Figure 5, in response to a host request, data is read from the disk array and placed in the central buffer memory 26. Also, in the reverse procedure to that for write operations, the 80 bits of central buffer data are loaded into devices G10 through G14 for each bit (4 data bits and 1 parity bit). Again we will only consider DB0. The resulting five bits are DB0-A through DB0-E. All read operations are checked for correct parity by regenerating a new parity bit and comparing this bit with the bit read from the slave disk drives.
Initially, the case of a fully functioning array will be considered with no faulty slave disk drives. In this case all mask bits (mask-A through mask-E) will be low/false, and all bits from the central buffer 26 will appear on the outputs of devices G10 through G14 via "A" inputs. Also, all data bits will appear on the outputs of devices G6 through G9 via their "A" inputs. After the central buffer read operation, the four data bits will simultaneously appear on the outputs of devices G6 through G9. In the reverse procedure to that for write operations, all data bits DB0-A through DB0-D will be reassembled on the ESP data bus through the multiplexor under the control of the two select lines. As the data bits are read from the central buffer 26, the parity data bit is regenerated by the Exclusive OR gate G4 and compared at gate G2 with the parity data read from the slave disk drives at device G14. If a difference is detected, an NMI ("non-maskable interrupt") is generated to the master processor device via gate G3. All read operations will then terminate immediately, or the controller may automatically perform read re-try procedures.
Gate G5 suppresses the effect of the parity bit DB0-E from the generation of the new parity bit. Gate G1 will suppress NMI operations if any slave disk drive has failed and the resultant mask bit has been set high/true. Also, gate G1, in conjunction with gate G5, will allow the read parity bit DB0-E to be utilised in the regeneration process at gate G4, should any channel have failed.
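The read-side parity check for a fully functioning array can be sketched in the same word-wide model. The exception class `ParityMismatch` is a hypothetical stand-in for the NMI raised to the master processor, and `read_slice` is an illustrative name, not part of the described hardware.

```python
class ParityMismatch(Exception):
    """Illustrative stand-in for the NMI generated via gate G3."""


def read_slice(a, b, c, d, e):
    """Check one 80-bit slice and return the four host words.

    The parity word is regenerated from channels A-D (the role of gate
    G4 with the stored parity suppressed) and compared with channel E
    (the role of gate G2).
    """
    regenerated = a ^ b ^ c ^ d
    if regenerated != e:
        raise ParityMismatch(
            f"parity read {e:#06x}, regenerated {regenerated:#06x}"
        )
    return [a, b, c, d]


print([hex(w) for w in read_slice(0x1234, 0xABCD, 0x00FF, 0xF0F0, 0x49F6)])
```

On a mismatch the caller would either abort the read or retry it, corresponding to the two options described above.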
Regeneration of "single failed channel" data during host read operations
Referring to Figure 5, the single failed disk drive/channel will have its mask bit set high/true under the direction of the controller software. The relevant gates within G6 through G9 and G10 through G14 for the failed channel/drive will have their outputs determined by their "B" inputs, not their "A" inputs. Also, G1 will suppress all NMI generation and, together with gate G5, will allow parity bit DB0-E to be utilised at gate G4. In this situation, the four valid bits from gates G10 through G14 will "regenerate" the "missing" data at gate G4, and the output of gate G4 will be fed to the correct ESP bus data bit DB0 via a "B" input at the relevant gate G6 through G9.
For example, consider the channel 2 disk drive to be faulty, so that mask bit mask-C is driven high/true. The output of gate G12 will be driven low and will not contribute to the output of gate G4. Also, the output of gate G1 will be driven low/false and will both suppress NMIs and allow signal DB0-E to be fed by gate G5 to gate G4. Gate G4 will have all correct inputs from which to regenerate the missing data and feed the data to the output of device G8 via its "B" input. At the correct time, this bit will be fed through the multiplexor to DB0.
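The regeneration of a masked channel reduces to XORing the four surviving channel words. The sketch below reproduces the mask-C example; the function names and the dictionary representation of a slice are assumptions for illustration only.

```python
def regenerate_channel(slice80, failed):
    """Recompute the word of the masked channel from the other four.

    slice80 maps channel names A-E to 16-bit words; the value stored
    under the failed channel is ignored, as a masked input would be.
    """
    value = 0
    for name in "ABCDE":
        if name != failed:
            value ^= slice80[name]
    return value


def read_with_mask(slice80, failed):
    """Return the four host words, substituting regenerated data."""
    data = dict(slice80)
    data[failed] = regenerate_channel(slice80, failed)
    return [data[name] for name in "ABCD"]


# Channel C has failed, so its stored value is meaningless.
damaged = {"A": 0x1234, "B": 0xABCD, "C": 0xDEAD, "D": 0xF0F0, "E": 0x49F6}
print([hex(w) for w in read_with_mask(damaged, failed="C")])
```

No parity check is possible in this mode, since the parity word has been consumed by the regeneration; this matches the suppression of NMIs while a mask bit is set.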
Rebuilding of data on a failed disk drive unit
Referring now to Figures 4 and 5, to rebuild data, the memory controller must first read the data from the functioning four disk drives, regenerate the missing drive's data, and finally write the data to the failed disk drive after it has been replaced with a new disk drive.
With reference to Figure 5 and the example given above for "regeneration of single failed channel data during host read operations", under rebuild conditions the outputs from gates G6 through G9 will not be fed to the ESP data bus. However, the regenerated data at the output of gate G4 will be fed to the "B" inputs of gates G38 through G42 of the write data cell in Figure 4. Under rebuild conditions, the RMW signal will be set high/true and the outputs of devices G38 through G42 will be determined by the value of the rebuild data on signal rebuild.
All channels of the central buffer memory 26 will have their data set to the regenerated data, but only the single replaced channel data will be written to the new disk drive under software control.
Detection of faulty channel/disk drive
The detection of a faulty channel/slave disk drive is as per the following three main criteria:-
1. The master 80376 processor detects an 80186 channel (array controller electronics) failure due to an "interprocessor" command protocol failure.
2. An 80186 processor detects a disk drive problem i.e. a SCSI bus protocol violation.
3. An 80186 processor detects a SCSI bus hardware error. This is a complete channel failure situation, not just a single disk drive on that SCSI bus.
After detection of the fault condition, the channel/drive "masking" function is performed by the master 80376 microprocessor.
Under fault conditions, the masked out channel/drive is not written to or read from by the associated 80186 channel processor.
Operation of Software for RAID-35 Operation
Figures 6 through to 13 are diagrams illustrating the operation of the software run by the central controller 22.
Figure 6 illustrates the steps undertaken during the writing of data to the banks of disk drives. Initially the software is operating in "background" mode and is awaiting instructions. Once an instruction from the host is received indicating that data is to be sent, it is determined whether this is sequential within an existing segment. If data is sequential then this data is stored in the segment to form sequential data. If no sequential data exists in a buffer segment then either a new segment is opened (the write behind procedure illustrated in Figure 8) and data is accepted from the host, or the data is accepted into a transit buffer and queued ready to write into a segment. If there is no room for a new segment then the segment is found which has been idle for the most time. If there are no such segments then the host write request is entered into a suspended request list. If a segment is available it is determined whether this is a read or write segment. If it is a write segment then if it is empty it is de-allocated. If it is not empty then the segment is removed from consideration for de-allocation. If the segment is a read segment then the segment is de-allocated and opened ready to accept the host data.
The write behind procedure is illustrated in Figure 8 and if there are any write segments open which need to be emptied, then a write request is queued for the I/O handler for each open segment with data in it.
Figure 7 illustrates the steps undertaken during read operations. Initially, the controller is in a "background" mode. When a request for data is received from the host computer, if the start of the data requested is already in a read segment then data can be transferred from the central buffer 26 to the host computer. If the data is not already in the central buffer 26, then it is ascertained whether it is acceptable to read ahead information. If it is not acceptable then a read request is queued. If data is to be read ahead then it is determined whether there is room for a new segment. If there is then a new segment is opened and data is read from the drives to the buffer segment and is then transferred to the host computer. If there is no room for a new segment then the segment is found for which the largest time has elapsed since it was last accessed, and this segment is de-allocated and opened to accept the data read from the disk drives.
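The read-path decisions of Figure 7 can be sketched as follows. This is a simplified model under stated assumptions: a segment is identified only by its starting block, a logical counter stands in for elapsed time, and the class and method names are hypothetical.

```python
import itertools


class ReadSegments:
    """Toy model of read segment allocation with oldest-first reuse."""

    def __init__(self, max_segments):
        self.max_segments = max_segments
        self.last_access = {}            # segment start block -> logical time
        self.clock = itertools.count()   # logical clock, assumed for the sketch

    def handle_read(self, start_block, read_ahead_ok=True):
        now = next(self.clock)
        if start_block in self.last_access:
            self.last_access[start_block] = now
            return ("hit", None)         # served from the central buffer
        if not read_ahead_ok:
            return ("queued", None)      # plain read request is queued
        evicted = None
        if len(self.last_access) >= self.max_segments:
            # Reuse the segment with the longest time since last access.
            evicted = min(self.last_access, key=self.last_access.get)
            del self.last_access[evicted]
        self.last_access[start_block] = now
        return ("opened", evicted)


buf = ReadSegments(max_segments=2)
for block in (0, 100, 0, 200):
    print(block, buf.handle_read(block))
```

In the final request the buffer is full, and the segment starting at block 100 is reused because block 0 was accessed more recently.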
In order to keep the buffer segments 28 full, the read ahead procedure illustrated in Figure 9 is performed. It is determined whether there are any read segments open which require a data refresh. If there is such a segment then a read request for the I/O handler for the segment is queued.
Figure 10 illustrates the software steps undertaken to restart suspended transfers. It is first determined whether there are suspended host write requests in the list. If there are, it is determined whether there is room for allocation of a segment for suspended host write requests. If so, a new segment for the host transfer is opened, the host request which has been suspended longest is determined, and data is accepted from the host computer into the buffer segment.
Figure 11 illustrates a form of "housekeeping" undertaken by the software in order to clean up the segments in the central buffer 26. It is determined at a point that it is time to clean up the buffer segments. All the read segments which have times since the last access time larger than a predetermined limit termed the "geriatric limit" are found and reallocated. Also it is determined whether there are any such write segments and if so write operations are tidied up.
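The selection of read segments that have exceeded the "geriatric limit" amounts to a simple age filter. In this sketch the function name, the segment identifiers and the time units are all assumed for illustration.

```python
def find_geriatric(last_access, now, geriatric_limit):
    """Return the segments idle for longer than the geriatric limit.

    last_access maps a segment identifier to the time of its last access,
    in the same (arbitrary) units as now and geriatric_limit.
    """
    return sorted(
        seg for seg, accessed in last_access.items()
        if now - accessed > geriatric_limit
    )


segments = {"s0": 5, "s1": 90, "s2": 40}
print(find_geriatric(segments, now=100, geriatric_limit=50))
```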
Figure 12 illustrates the operation of the input/output handler, whilst Figure 13 illustrates the operation of the input/output sub system.
All these procedures are performed by software which may be run on the central (80376) controller 22 in order to control and efficiently manage the transfer of data in the buffer segments 28, in order that the buffer 26 is kept as full as possible with data sequential to data requested by the host computer.
Operation of Software for RAID-53
Figures 14 through to 19 are diagrams illustrating the operation of the software run by the central controller 22 and the slave controllers 32 during RAID-53 operation.
Figure 14 illustrates the steps undertaken by the central controller 22 when selected as the SCSI target. Once selected a command from the initiator (or host computer) is decoded and syntax checked. If a fault is detected the command is terminated by a check command status and the controller returns to background processing. If the syntax check indicates no errors then it is determined whether a queue tag message has been received to assign a queue position. If not and a command is already running a busy status is generated and the controller returns to background processing. If a command is not already running or if a queue tag message has been received it is determined whether data is required with the command. If data is required then a buffer segment is allocated for the data and if the command is to write data then data is received from the initiator into the allocated buffer segment. If there is no space available then a queue full status is generated and the controller returns to background processing. If the command is to read data or the command is to write data and data is received from the initiator into the allocated buffer then a command control block is allocated. If there is no space for this a queue full status is generated and the controller returns to background processing. If a command control block can be successfully allocated the appropriate command is issued to the slave bus controller 32 (an 80186 processor) and the command control tag pointer is passed as a tag. A disconnect message is then sent to the initiator and the controller returns to background processing.
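The admission decisions taken when the controller is selected as a target can be condensed into one function. This is a sketch of the decision order only; the parameter names and the returned status strings are illustrative labels, not SCSI status codes taken from the source.

```python
def admit_command(syntax_ok, has_queue_tag, command_running,
                  needs_data, free_segments, free_control_blocks):
    """Return the outcome of selecting the controller with one command."""
    if not syntax_ok:
        return "check status"
    if not has_queue_tag and command_running:
        return "busy"
    if needs_data and free_segments == 0:
        return "queue full"       # no buffer segment for the data
    if free_control_blocks == 0:
        return "queue full"       # no command control block available
    return "issued, disconnect"   # command passed to the slave bus controllers


print(admit_command(True, False, True, True, 4, 4))
print(admit_command(True, True, True, True, 4, 4))
```

The first call is refused as busy because an untagged command arrives while another is running; the second is accepted because a queue tag was supplied.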
Referring now to Figure 15, this diagram illustrates the operation of the software in the central controller when the slave bus controller responds to commands. Data can be read from the slave bus controller when the response available interrupt is generated. The response information is read from the dual port RAMs (DPRAM) and the tag from this response is used to look up the correct command control block. The receipt of a response from the particular slave bus controller is recorded in the command control block completion flags. It is then determined whether all of the slave bus controllers in the channels have responded and if not whether the command overall time-out has elapsed. If the command overall time-out has not elapsed then the central controller returns to background processing to read the channels which have not responded when they are available. If the command overall time-out has elapsed then a channel fault is recorded. It is then determined whether the command can be completed. If the command cannot be completed then a fatal error is reported and the processor returns to background processing. If the command can be completed or if all the channels have responded then it is determined whether the completion of the command requires a data transfer. If not, then the initiator that gave the command is reselected and passed the logical unit number (LUN) identity and queue tag message. The central controller then returns to background processing awaiting an interrupt whereupon it returns a good status and then returns to background processing.
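The bookkeeping of completion flags and the overall time-out can be sketched as below. The source does not state the rule for deciding whether a command "can be completed"; the sketch assumes that one missing channel is tolerable, because parity can reconstruct a single channel, and that more than one is fatal. All names are hypothetical.

```python
class CommandControlBlock:
    """Toy model of per-command completion flags with a time-out."""

    def __init__(self, channels, deadline):
        self.responded = {ch: False for ch in channels}  # completion flags
        self.deadline = deadline                          # overall time-out
        self.faulted = []                                 # channels logged as faulty

    def record_response(self, channel):
        self.responded[channel] = True

    def poll(self, now):
        missing = [ch for ch, done in self.responded.items() if not done]
        if not missing:
            return "complete"
        if now < self.deadline:
            return "waiting"
        self.faulted = missing
        # Assumed rule: a single silent channel can be masked and rebuilt.
        return "complete" if len(missing) == 1 else "fatal error"


ccb = CommandControlBlock("ABCDE", deadline=10)
for ch in "ABCD":
    ccb.record_response(ch)
print(ccb.poll(now=5), ccb.poll(now=10), ccb.faulted)
```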
If the completion of the command does require a data transfer then it is determined whether there is a faulty disk in the bank of disks being accessed. If so, then the appropriate channel is masked to cause a reconstruction of the missing data. The initiator that gave the command is reselected and passed LUN identity and queue tag message. The central processor then goes into background processing until an interrupt is received whereupon a data in bus phase is asserted and data is transferred. The central processor then returns to background processing awaiting interrupt whereupon a good status is returned.
Figures 16a and 16b illustrate the operation of the software by the slave bus controllers upon receipt of commands from the central controller. When the slave bus controller receives a command from the central controller, the command is read from the DPRAM. The command is decoded and syntax checked and if faulty is rejected. Otherwise, it is determined whether the command is a data read or write request. If it is not then the command is analysed to determine if a memory buffer is required and if so it is allocated. If there is no buffer space then the process is suspended to allow the reading of data to continue. The process is resumed when space is available. Then an input/output queue element is constructed and set up according to command requirements. The queue element is then put into the input/output queue and linked onto the destination targets list.
If the command is a data read or write request then it is determined which targets are to be used. The array block address is then converted to the target block address. It is then determined if the data received is to be diverted (or dumped) or a read modify write is required. If the command is a read data request then it is determined whether the transfer crosses bank boundaries. If not, then the input/output queue element is constructed and set up for the single read. If the transfer crosses bank boundaries then an input/output link block is allocated and it is recorded that two reads are to be performed for this command. If it is determined that there is no space then the process is suspended to allow the background to continue, and resumes when space is available. Otherwise the input/output queue element is constructed and set up to read the target and queue request. The input/output queue element is also constructed and set up to read the target plus one and the request is queued. The slave bus controller then returns to background processing.
If the command is a data write request then, as shown in Figure 16b, it is determined whether the transfer crosses bank boundaries. If not, it is determined whether any read modify writes are required. If so, an I/O link block is allocated or the operation suspended until space is available. I/O queue elements for each of the reads of one or two read modify write sequences are constructed as required. An I/O queue element for the aligned part of the write is then constructed if required and the request is queued. The slave bus controller then enters background processing.
If the transfer of data does cross bank boundaries then it is determined whether the writes to the lower target requires a front read modify write. If so, the I/O queue element for the read part of the read modify write is constructed (lower target) and a request is queued. The I/O queue element for the aligned write part of the transfer is then constructed (lower target) and the request is queued. It is then determined whether the write to the higher target requires a back read modify write and if so an I/O queue element for the read part of the read modify write is constructed (higher target) and the request is queued. The I/O queue element for the aligned part of the write is then constructed (higher target) and the request queued. The slave bus controller then enters background processing.
Figure 17 illustrates the operation of the input/output handling by the slave bus controllers. The SCSI bus phases are handled to perform a required I/O for the specified target. If a target was disconnected it is determined whether the command complete message has been received. If not, a warning is generated and a target fault is logged. The SCSI I/O queue element of command just completed is examined to determine if command completion function can be executed at this current interrupt level. If so, then the last SCSI I/O command completion function is executed as specified in I/O queue element. Also the I/O queue element is unlinked from the SCSI I/O queue and is marked as being free for other uses.
If it is determined that the command completion function cannot be executed at this current interrupt level then the last SCSI I/O command completion function and pointer to I/O queue element is entered onto the background process queue. Also the I/O queue element from the SCSI I/O queue is unlinked and the element is not marked as free. It remains in use until it is freed by the command completion function which will be executed from the background queue.
The next I/O request from the SCSI I/O queue is extracted using the I/O request from the target with the lowest average throughput. If several have a low figure, the lowest target is used. A select target command is then issued to the SCSI and an I/O is queued before the processor returns to background processing. If the I/O queue is empty a flag is set to show that the SCSI I/O has stopped.
Figure 18 illustrates a simple input/output completion function by a slave bus controller. This is executed by the SCSI I/O handler from the SCSI interrupt level. The SCSI I/O queue element is examined and the queue tag is extracted. The queue tag is given by the central controller when the command was issued to the slave bus controller. If the SCSI I/O was unsuccessfully executed then the queue tag and a fault response is sent to the central controller. If the SCSI I/O is executed successfully then the queue tag and an "acknowledge" response is sent to the central controller to signal command completion.
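The rule given above for extracting the next I/O request from the SCSI I/O queue (lowest average throughput first, with ties going to the lowest target) can be sketched as a single keyed minimum. The data structures and the name `pick_next_target` are assumptions for illustration.

```python
def pick_next_target(avg_throughput, queued):
    """Choose the target whose queued request is served next.

    avg_throughput maps target id -> average throughput figure;
    queued maps target id -> list of pending requests for that target.
    Returns None when nothing is queued (SCSI I/O would then stop).
    """
    candidates = [t for t, requests in queued.items() if requests]
    if not candidates:
        return None
    # Lowest throughput wins; equal figures fall back to the lowest target id.
    return min(candidates, key=lambda t: (avg_throughput[t], t))


throughput = {0: 120.0, 1: 80.0, 2: 80.0, 3: 10.0}
pending = {0: ["r0"], 1: ["r1"], 2: ["r2"], 3: []}
print(pick_next_target(throughput, pending))
```

Target 3 has the lowest figure but nothing queued, so the choice falls between targets 1 and 2, and the lower id is taken.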
Figure 19 illustrates the operation of a complex I/O completion function by a slave bus controller. This is executed in the background from the background queue.
The I/O queue element is accessed with the pointer queued along with the completion function. The I/O link block associated with this I/O is then accessed and in the I/O link block it is recorded that the I/O has completed. If the I/O was unsuccessfully completed then the fault details from the SCSI I/O queue element are stored in the I/O link block error information area.
It is then determined whether the I/O link through the current I/O link block has been completed. If so, it is determined whether there are any faults recorded in the I/O link block error information area. If not, a "tidy-up" routine is executed which is particular to the original command from the central controller. A queue tag and acknowledged response is then sent to the central controller.
If there are faults recorded in the I/O link block error information area then the queue tag, fault response and the fault information are sent to the central controller. The I/O link block and all attached buffers are freed, as is the SCSI I/O queue element.
The "tidy-up" referred to hereinabove forms the final operation of the slave bus controllers when all associated SCSI I/O has completed successfully.
Sector Translation
A problem has been experienced with the disk drives available to form the slave disk drive banks 40. As mentioned above host data arriving in "sectors" is split into four. This arrangement relies upon the slave disk drives of the array being able to be formatted with sector sizes exactly one quarter of that used by the host. A current standard sector size is 512 bytes, with a resultant slave disk sector size requirement of 128 bytes.
Until recently this has not been a problem, but due to the speed and complexity of electronics, disk drives above the 500 megabyte level can typically only be formatted to a minimum of 256 bytes per sector. Further, new disk drives above the 1 gigabyte capacity, can typically only support a minimum of 512 byte sectors. This would mean that the controller would only be able to support host sector sizes of two kilobytes.
This problem has been overcome by applying a technique termed "sector translation". In this technique each slave disk sector contains four host sectors in what is termed "virtual" slave sectors of 128 bytes. In this technique if the host requires a single sector of 512 bytes, then the controller has to extract an individual sector of 128 bytes from within the larger actual 512 byte slave disk drive sector. When writing data, for individual writes of a single sector, or less than four correctly grouped sectors, the controller has first to read the required overall sector, then modify the data for the actual part of the sector that is necessary, and then finally write the overall slave disk sector back to the disk drive. This is a form of read modify write operation and can slow down the transfer of data to the disk drives but this is not normally a problem. Also, for large transfers of data to or from the disk drives, the effect of this problem is minimal and is not noticed by the host computer.
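Sector translation can be illustrated with a short sketch of the address arithmetic and the read modify write on one data channel. The mapping used here (four consecutive host sectors share one 512 byte slave sector, in ascending order) is an assumption consistent with the description but not stated in it, and the function names are hypothetical.

```python
HOST_SECTOR = 512                       # bytes per host sector
DATA_CHANNELS = 4                       # channels A-D
VIRTUAL = HOST_SECTOR // DATA_CHANNELS  # 128-byte virtual slave sector
SLAVE_SECTOR = 512                      # actual slave sector size
PER_SLAVE = SLAVE_SECTOR // VIRTUAL     # virtual sectors per slave sector


def locate(host_sector):
    """Map a host sector number to (slave sector, byte offset) on a channel."""
    return host_sector // PER_SLAVE, (host_sector % PER_SLAVE) * VIRTUAL


def write_virtual_sector(slave_data, host_sector, piece):
    """Read modify write of one 128-byte piece within a slave sector.

    slave_data is the slave sector as read from the drive; the returned
    bytes are what would be written back.
    """
    if len(slave_data) != SLAVE_SECTOR or len(piece) != VIRTUAL:
        raise ValueError("unexpected sector or piece size")
    _, offset = locate(host_sector)
    modified = bytearray(slave_data)
    modified[offset:offset + VIRTUAL] = piece
    return bytes(modified)


print(locate(5))
```

A write of four correctly grouped host sectors would replace the whole slave sector and so would not need the preceding read.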
Three Dimensional Memory Array
The hardware shown in Figure 1 can be expanded so that the host computer has access to a three dimensional array of disk drives. This is applicable to both RAID-35 and RAID-53 systems.
Figure 21 illustrates an arrangement of the disk drives in three dimensions with respect to the computer memory controller 100. Each plane of disk drives corresponds to the two dimensional array illustrated in Figure 1 (42,0. , ...42,6,E) . In this arrangement the buffer memories 26 and the data splitting and parity logic 24 are each increased in number to five, one for each two dimensional array (or plane) of disk drives. The central controller 22 then controls each buffer memory 26 and its associated slave controllers 32 independently. Each data splitting and parity logic 24 is connected to its associated buffer memory 26 and to the SCSI-1 interface 12.
For RAID-35 operation this vastly increases the memory capacity and increases the number of read ahead segments by a factor of five, whilst for RAID-53 operation a vast increase in access speed for data is obtained, since five times the number of seek operations can be carried out simultaneously compared to the two dimensional arrangement of Figure 1. What is described hereinabove is a schematic arrangement. In a practical arrangement five separate array controllers may be used, one per plane of disk drives.
Controller Redundancy
Figure 16 illustrates the use of a second computer memory controller 100B. The second computer memory controller 100B is provided in case of failure of the main computer memory controller 100A. The second computer memory controller 100B is connected to each of the SCSI-1 buses at a different address to the main computer memory controller 100A. This reduces the number of banks of disk drives which can be provided to six since two of the SCSI-1 addresses are taken up by the controllers 100A and 100B.
This arrangement provides for controller redundancy where it is not acceptable to have to shut down to repair a fault.
Combined RAID-35 and RAID-53
The hardware shown in Figures 1, 21 and 22 can operate both RAID-35 and RAID-53. In addition the hardware can operate both systems by sharing the hardware. For instance at start-up a portion of the buffer memory 26 could be allocated to RAID-53, the remainder being allocated for RAID-35. When the system detects non sequential data requests then a buffer segment is opened in the portion of the buffer memory allocated for RAID-53 and data read thereto. If sequential data is detected by the central controller 22 then a buffer segment in the appropriate buffer portion is allocated and data read from the disk banks, together with read ahead information in the normal RAID-35 operation.
The disk banks can either be shared or a number of disk banks could be allocated for use by RAID-53 and the remainder for use by RAID-35.
This apportionment of the hardware can take place selectably by a user or it could take place automatically dependent on the sequential and non sequential data ratios. Thus for instance the system could initially be set up in RAID-53 mode upon start-up and the size of the portion of the buffer memory 26 and the number of disk banks allocated for RAID-35 will depend on the number of sequential data requests.
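The routing of requests between the two buffer portions might be sketched as below. The specification does not say how sequential requests are detected, so the rule used here (a request is sequential if it starts where the previous one ended) is an assumption, as are the `RequestRouter` class and its method names. A multi-user system would presumably track more than one stream.

```python
class RequestRouter:
    """Hypothetical router that picks a buffer portion for each host read."""

    def __init__(self):
        self.next_expected = None   # block address that would continue the last request
        self.sequential = 0
        self.non_sequential = 0

    def route(self, start_block, n_blocks):
        """Return 'RAID-35' for a sequential request, 'RAID-53' otherwise."""
        is_seq = self.next_expected is not None and start_block == self.next_expected
        self.next_expected = start_block + n_blocks
        if is_seq:
            self.sequential += 1
            return "RAID-35"
        self.non_sequential += 1
        return "RAID-53"

    def sequential_ratio(self):
        """Fraction of requests seen so far that were classed as sequential."""
        total = self.sequential + self.non_sequential
        return self.sequential / total if total else 0.0
```

The ratio returned by `sequential_ratio` is the kind of figure that could drive the automatic apportionment of buffer memory and disk banks between the two modes.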
Overlay Bank Stripping
Overlay bank stripping is the term used hereinafter for the distribution of data amongst the memory banks and is applicable to both RAID-35 and RAID-53.
In the embodiments described hereinabove the data is stored in the banks sequentially. That is the logical blocks of user data are arranged sequentially across the disk surface and then sequentially across each additional bank. This does not fully utilise the ability of the system to read and write to banks simultaneously. If the data is better distributed over the banks of disk drives it is possible to simultaneously read and write to banks even using the RAID-35 arrangement.
The overlay bank stripping technique operates by writing data received from the host computer onto a predefined segment of the first bank. Once this segment is full the data is then written onto a segment having the same logical position in the next bank. This is repeated until the same logical segment in each bank is full, whereupon data is written to the next logical segment in the first bank. This is repeated until the array is full. This process has the advantage of evenly distributing the data over the banks of the array, therefore increasing the likelihood that data required by the host computer is located on different banks which can be read simultaneously to increase the speed of data retrieval. Further, since the controller allocates addresses for each segment, data can be written to different banks simultaneously to increase the speed of data storage.
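The segment ordering described above amounts to a simple address mapping, sketched here for comparison with the sequential layout of the earlier embodiments. The function names are hypothetical, and the bank count of seven used in the usage example is an assumption taken from the arrangement of Figure 23.

```python
def overlay_location(logical_segment, n_banks):
    """Overlay layout: map a logical segment number to (bank, position in bank)."""
    position, bank = divmod(logical_segment, n_banks)
    return bank, position


def sequential_location(logical_segment, segments_per_bank):
    """Layout without overlay: fill each bank completely before the next."""
    bank, position = divmod(logical_segment, segments_per_bank)
    return bank, position
```

With seven banks, logical segments 0 to 6 land on position 0 of banks 0 to 6, and segment 7 returns to bank 0 at position 1. In the sequential layout the same eight segments would all sit on bank 0.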
Figure 23 illustrates the distribution of data in segments within the array. A segment can be defined as a data area that contains at least one block of disk data, e.g. 512 bytes, but more likely many multiples of disk data blocks, e.g. 64K bytes as shown in Figure 23.
If a host data block is 512 bytes this is segmented using the RAID-35 or RAID-53 technique to apply 128 bytes to each channel. Thus a 64K byte segment on each disk drive of each bank can contain 512 of these host data block segments.
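The figures in this paragraph can be checked with a few lines of arithmetic. The count of four data channels is inferred from the 512 byte to 128 byte split; the constant names are illustrative only.

```python
HOST_BLOCK = 512        # bytes per host data block
DATA_CHANNELS = 4       # inferred: 512 bytes split into 128 bytes per channel
SEGMENT = 64 * 1024     # 64K byte segment on each disk drive

# Bytes of each host block that land on one drive of the bank.
per_channel = HOST_BLOCK // DATA_CHANNELS

# Number of host data block portions that fit in one drive segment.
blocks_per_segment = SEGMENT // per_channel

# Host data held by one segment position across the data drives of a bank.
host_bytes_per_bank_segment = blocks_per_segment * HOST_BLOCK
```

On these assumptions one segment position in a bank holds 512 host blocks, that is 256K bytes of host data excluding the check channel.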
Although the size of the segment described hereinabove is 64K bytes, the segment size can be user selectable to allow tailoring to suit the performance optimisation required for different applications.
When overlay bank stripping is used with the RAID-53 arrangement and host data requests are truly random there is no advantage in using overlay bank stripping. However, where host data requests (read or write) appear simultaneously for data which would have previously been on the same bank (but not within the same segment), then a considerable performance improvement will be achieved since the requests are distributed across a number of banks, thus allowing simultaneous read/write operations. In the arrangement shown in Figure 23 the performance improvement is a factor of seven.
For the RAID-35 arrangement on the face of it the ability to read sequential data may be penalised using overlay bank stripping. However, the use of overlay bank stripping enhances the performance since it allows the data on different banks to be simultaneously read. Thus for sequential data greater than a segment, whereas without overlay bank stripping the full length of the data is read or written to a bank, with overlay bank stripping the data can be simultaneously read from or written to one or more banks. This technique can increase the rate of data transfer to and from the array and can overcome a limitation caused by the limited access speed provided by each individual disk. If the data is distributed in a segment on each bank in the arrangement shown in Figure 23 then the transfer rate is increased by a factor of seven. However, in order to optimise the data transfer rate provided by the SCSI interface the segment size may need to be of sufficient size, e.g. 64K bytes.
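The factor of seven can be seen by counting the banks that a run of consecutive segments touches. The sketch below assumes seven banks, as in Figure 23, and treats the speed-up as an idealised upper bound that ignores SCSI bus and buffer limits; the function names are hypothetical.

```python
def banks_touched(start_segment, n_segments, n_banks=7):
    """Return the set of banks holding a run of consecutive logical segments."""
    return {(start_segment + i) % n_banks for i in range(n_segments)}


def ideal_speedup(n_segments, n_banks=7):
    """Upper bound on parallel transfers: one per distinct bank involved."""
    return min(n_segments, n_banks)
```

A transfer spanning three segments can use at most three banks at once, while any transfer of seven or more segments reaches all seven banks.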
The technique of overlay bank stripping can be used with either the RAID-35 or RAID-53 techniques and where the computer memory controller is arranged to operate both by appropriately assigning the banks for the two techniques, overlay bank stripping can be used by both techniques if the disk banks are shared or only one of RAID-35 or RAID-53 if the disk banks are appropriately allocated.
From the embodiments hereinabove described it can be seen that the controller of the present invention provides for large scale sequential data transfers from memory units for multi-users of a host computer and/or random requests for small amounts of data from a multitude of users.
While the invention has been described with reference to specific elements and combinations of elements, it is envisaged that each element may be combined with other or any combination of other elements. It is not intended to limit the invention to the particular combinations of elements suggested. Furthermore, the foregoing description is not intended to suggest that any element mentioned is indispensable to the invention, or that alternatives may not be employed. What is defined as invention should not be construed as limiting the extent of the disclosure of this specification.

Claims

1. A computer memory controller for interfacing to a host computer comprising a buffer means for interfacing to a plurality of memory units and for holding data read thereto and therefrom; and control means operative to control the transfer of data to and from said host computer and said memory units; said buffer means being controlled to form a plurality of buffer segments for addressably storing data read from or written to said memory units; said control means being operative to allocate a buffer segment for a read or write request from the host computer, of a size sufficient for the data; said control means being further operative in response to data requests from said host computer to control said memory units to seek data stored in different memory units simultaneously.
2. A computer memory controller as claimed in Claim 1 for interfacing to a plurality of memory units arranged into a two dimensional array having at least three memory channels, each memory channel comprising a plurality of memory units connected by a bus such that each memory unit of said memory channel is independently accessible; respective memory units of said memory channels forming a memory bank; wherein said control means is operative in response to data requests from said host computer to store in said buffer segments bytes or groups of bits read from a memory bank; said controller comprising a logic circuit connected to said buffer means to recombine bytes or groups of bits read from a group of said memory units of a memory bank and stored in said buffer segments to generate the requested data; said logic circuit including parity means operative to use a check byte or group of bits read from one of said memory units of said memory bank to regenerate data read from said group of memory units if one of said group of memory units fails; said buffer means being divided into a number of channels corresponding to the number of memory channels, each channel being divided into associated portions of buffer segments; said control means being further operative to control said memory units to seek data stored in different memory banks simultaneously.
3. A computer memory controller as claimed in Claim 2, wherein said control means is adapted to queue host data requests which it is unable to carry out at the time of the request, until the memory bank containing the requested data is not busy.
4. A computer memory controller as claimed in Claim 3, wherein if a memory bank has more than one data request said control means is adapted to control the order in which the data is sought in order to optimise the time taken to accomplish the read operation.
5. A computer memory controller as claimed in any of Claims 2 to 4, wherein said logic circuit is operative to split data input from said host computer into a plurality of portions such that said portions are temporarily stored in a buffer segment before being applied to ones of a group of said memory channels for storage in a memory bank; and said parity means is operative to generate a check byte or group of bits from said data for temporary storage in a buffer segment before being stored in at least one memory unit of a memory bank.
6. A computer memory controller as claimed in Claim 5, wherein said control means is operative to effect the writing of data to a memory bank substantially immediately, to the detriment of any pending read or write requests.
7. A computer memory controller as claimed in any of Claims 2 to 6, wherein said controller is adapted for interfacing to an array of memory units having five memory channels, one said memory channel holding said check byte or groups of bits.
8. A computer memory controller as claimed in any of Claims 2 to 7, comprising a plurality of said buffer means, and said logic circuits, each said buffer means being adapted for interface to one said two dimensional array of memory units; said control means being operative to control the transfer of data to and from said host computer and a three dimensional array of memory units formed of a plurality of said two dimensional arrays of memory units.
9. A computer memory controller as claimed in any preceding claim, wherein said controller is adapted for interfacing to magnetic disk drives.
10. A computer memory controller as claimed in any preceding claim, wherein said buffer means is adapted to hold data requested by said host computer and further data logically sequential thereto; said control means being further operative to control the transfer of data to said host computer in response to requests therefrom by first addressing said buffer segments to establish whether the requested data is contained therein and if so supplying said data to said host computer, and if the requested data is not contained in the buffer segments reading said data from said memory units, supplying said data to said host computer, reading from said memory units further data which is logically sequential to the data requested by said host computer and storing said further data in a buffer segment; said control means being further operative to control the buffer means to control the number and size of said buffer segments.
11. A computer memory controller as claimed in Claim 10, wherein said control means is operative to reduce the size of existing buffer segments on each occasion that a request for data from said host computer cannot be complied with from the further data stored in existing ones of said buffer segments, to dynamically allocate a new segment of said buffer means for further data to the data requested, and to continue this process until the size of each buffer segment is some predetermined minimum, whereupon, at the next request for data not available in a buffer segment, the buffer segment least frequently utilised is employed.
12. A computer memory controller as claimed in any preceding claim, wherein said control means is further operative to addressably segment a plurality of said memory units such that respective segments on sequential memory units have a sequential address, and to write sequential data to sequentially addressed segments on sequential segmented memory units.
13. A computer memory controller as claimed in any of Claims 2 to 11, wherein said control means is further operative to addressably segment a plurality of said memory banks into sequential bank segments on sequential memory banks such that respective segments on sequential banks have sequential address, and to write sequential data to sequentially addressed bank segments on sequential segmented memory banks.
14. A method of controlling a plurality of memory units for use with a host computer comprising the steps of repeatedly receiving from said host computer a read request for data stored in said memory units and allocating a buffer segment of sufficient size for the data to be read; and seeking data in said plurality of memory units simultaneously.
15. A method as claimed in Claim 14, wherein said memory units are arranged into a two dimensional array having at least three memory channels, each memory channel comprising a plurality of respective memory units connected by a bus such that each memory unit of said memory channel is independently accessible; respective memory units of said memory channels forming a memory bank; said method including the steps of storing bytes or groups of bits read from a memory bank in said buffer segments in response to data requests from said host computer; recombining bytes or groups of bits read from a group of said memory units of a memory bank and stored in said buffer segments to generate the requested data; reading a check byte or group of bits from one of said memory units of said memory bank; regenerating data read from said group of memory units using said check byte if one of said group of said memory units fails; and seeking data stored in different memory banks simultaneously.
16. A method as claimed in Claim 15, wherein host data requests which cannot be carried out at the time of request, are queued to be carried out at a time when the memory bank containing the requested data is not busy.
17. A method as claimed in Claim 16, wherein if a memory bank has more than one data request, the order in which the data is sought is controlled in order to optimise the time taken to accomplish the read operation.
18. A method as claimed in any one of Claims 15 to 17 including the steps of splitting data output from said host computer into a plurality of portions; storing said portions in a buffer segment; applying said split data to ones of a group of said memory units of a memory bank; generating a check byte or group of bits from said data; storing said check byte or group of bits in a buffer segment; applying said check byte or group of bits to at least one memory unit of a memory bank.
19. A method as claimed in Claim 18, wherein data is written to a memory bank substantially immediately to the detriment of any pending read or write request.
20. A method as claimed in any of Claims 15 to 19 including the step of controlling the transfer of data to and from said host computer and a three dimensional array of memory units formed of a plurality of said two dimensional arrays of memory units.
21. A method as claimed in any of Claims 14 to 20, including the steps of checking a plurality of buffer segments to establish whether the requested data is in said buffer segments, either complying with said request by transferring the data in said buffer segments to said host computer, or first reading said data from said memory units into one buffer segment and then complying with said request, reading from said memory units further data logically sequential to the data requested and storing said data in said buffer segment.
22. A method as claimed in Claim 21, further including the steps of reducing the size of existing buffer segments on each occasion that a request for data from said host computer cannot be complied with from the further data stored in existing ones of said buffer segments, dynamically allocating a new segment of said buffer for further data to the data requested, and continuing this process until the size of each buffer segment is some predetermined minimum, whereupon at the next request for data not available in a buffer segment the buffer segment least frequently utilised is employed.
23. A method as claimed in any of Claims 14 to 22, including the steps of addressably segmenting a plurality of said memory units such that respective segments on sequential memory units have a sequential address, and writing sequential data to sequentially addressed segments of sequential segmented memory units.
24. A method as claimed in any of Claims 15 to 22, including the steps of addressably segmenting a plurality of said memory banks such that respective bank segments on sequential memory banks have a sequential address, and writing sequential data to sequentially addressed bank segments of sequential segmented memory banks.
25. A computer memory controller for a host computer comprising buffer means for interfacing to at least three memory channels arranged in parallel, each memory channel comprising a plurality of memory units connected by a bus such that each memory unit of said memory channel is independently accessible; respective memory units of said memory channels forming a memory bank; a logic circuit connected to said buffer means to split data input from said host computer into a plurality of portions such that said portions are temporarily stored in a buffer segment before being applied to ones of a group of said memory channels for storage in a memory bank; said logic circuit being further operative to recombine portions of data successively read from successive ones of a group of said memory units of a memory bank and into said buffer means; said logic circuit including parity means operative to generate a check byte or group of bits from said data for temporary storage in said buffer means before being stored in at least one said memory unit of said memory bank, and operative to use said check byte to regenerate said data read from said group of memory units of a memory bank if one of said group of memory units fails; said buffer means being divided into a number of channels corresponding to the number of memory channels, each said channel being divided into associated portion of buffer segments; and a control means operative to control the transfer of data and check bytes or groups of bits to and from said memory banks, including allocating a buffer segment for a read or write request from the host computer of a sufficient size for the data, and controlling said memory banks to seek requested data stored in different memory banks simultaneously.
26. A computer memory controller as claimed in Claim 25, wherein said control means is adapted to queue host data requests which it is unable to carry out at the time of the request, until the memory bank containing the requested data is not busy.
27. A computer memory controller as claimed in Claim 26, wherein if a memory bank has more than one data request, said control means is adapted to control the order in which the data is sought in order to optimise the time taken to accomplish the read operation.
28. A computer memory controller as claimed in any of Claims 25 to 27, wherein said control means is operative to effect the writing of data to a memory bank substantially immediately, to the detriment of any pending read or write requests.
29. A computer memory controller as claimed in any of Claims 25 to 28, wherein said controller is adapted for interfacing to an array of memory units having five memory channels, one said memory channel holding said check byte or groups of bits.
30. A computer memory controller as claimed in any of Claims 25 to 29, comprising a plurality of said buffer means, and said logic circuits, each said buffer means being adapted for interface to one said two dimensional array of memory units; said control means being operative to control the transfer of data to and from said host computer and a three dimensional array of memory units formed of a plurality of said two dimensional arrays of memory units.
31. A computer memory controller as claimed in any of Claims 25 to 30, wherein said controller is adapted for interfacing to magnetic disk drives.
32. A computer memory controller as claimed in any of Claims 25 to 31, wherein said control means is operative to allocate a first portion of said buffer means for non sequential data, said control means being operative to allocate a buffer segment which is of sufficient size for the data, in said first portion for a read or write request from the host computer which is not sequential, to control said memory banks to seek a plurality of requested data stored in different ones of said memory banks simultaneously, to allocate a second portion of said buffer means for sequential data, to control the transfer of sequential data to said host computer in response to requests therefrom by first addressing said buffer segments of the second portion to establish whether the requested data is contained therein and if so supplying said data to said host computer, and if the requested sequential data is not contained in the buffer segments of said second portion, reading said data from the memory units of said memory banks, supplying said data to said host computer, reading from said memory units further data which is logically sequential to the data requested by the host computer and storing said further data in a buffer segment in said second portion; said control means being further operative to control the second portion of said buffer means to control the number and size of said buffer segments.
33. A computer memory controller as claimed in any of Claims 25 to 31, wherein said control means is operative to allocate a first portion of said buffer means and a number of memory banks for non sequential data, said control means being operative to allocate a buffer segment which is of sufficient size for the data, in said first portion for a read or write request from the host computer which is not sequential, to control said memory banks to seek requested data stored in different ones of said number of memory banks simultaneously, to allocate a second portion of said buffer means and the remaining memory banks for sequential data, to control the transfer of sequential data to said host computer in response to requests therefrom by first addressing said buffer segments of the second portion to establish whether the requested data is contained therein and if so supplying said data to said host computer, and if the requested sequential data is not contained in the buffer segments of said second portion, reading said data from the memory units of the remaining memory banks, supplying said data to said host computer, reading from the memory units further data which is logically sequential to the data requested by the host computer and storing said further data in a buffer segment in said second portion; said control means being further operative to control the second portion of said buffer means to control the number and size of said buffer segments.
34. A computer memory controller as claimed in Claim 33, wherein said control means is further operative to addressably segment a plurality of said memory banks such that respective bank segments on sequential memory banks have a sequential address, to write sequential data to sequentially addressed bank segments of sequential segmented memory banks, and to seek and read requested data as well as any data sequential thereto stored in sequential bank segments in sequential ones of said plurality of memory banks simultaneously.
35. A computer storage system comprising a plurality of memory units arranged into a two dimensional array having at least three memory channels arranged in parallel, each said memory channel comprising a plurality of memory units connected by a bus such that each memory unit is independently accessible; respective memory units of said memory channels forming a memory bank; and a controller comprising buffer means interfaced to said memory units and for holding information read from said memory channels; said buffer means being controlled to form a plurality of buffer segments for addressably storing data read from or written to said memory units; a logic circuit connected to said buffer means to recombine bytes or groups of bits read from ones of a group of said memory units in a memory bank, parity means operative to use a check byte or group of bits read from one of said memory units in said memory bank to regenerate information read from said group of memory units if one of said group of memory units fails; and control means for controlling the transfer of data to and from said host computer and said memory units, including allocating a buffer segment for a read or write request from the host computer of a sufficient size for the data, and controlling said memory banks to seek data stored in different memory banks simultaneously.
36. A computer storage system as claimed in Claim 35 comprising five memory channels arranged in parallel, one said memory channel holding said check byte or groups of bits.
37. A computer storage system as claimed in Claim 35 or Claim 36, wherein each memory channel comprises seven memory units thus forming seven memory banks.
38. A computer storage system as claimed in any of Claims 35 to 37, wherein said memory units are magnetic disk drives and the rotation of magnetic disk drives of a memory bank is synchronised.
39. A computer storage system as claimed in Claim 35, wherein said controller includes a plurality of said buffer means, said logic circuits and said parity means, each said buffer means being adapted for interface to one said two dimensional array of memory units; said control means being operative to control the transfer of data to and from said host computer and a three dimensional array of memory units formed of a plurality of said two dimensional arrays of memory units.
40. A computer storage system as claimed in any of Claims 35 to 39, further including a second controller interfaced to said memory channels in a like manner to the first controller and having a different address on said bus.
41. A computer memory controller for a host computer comprising a plurality of buffer means each for interfacing a plurality of memory units arranged into a two dimensional array having at least three memory channels, each memory channel comprising a plurality of memory units connected by a bus such that each memory unit is independently accessible; respective memory units of said memory channels forming a memory bank; a plurality of logic circuits connected to respective said buffer means to recombine bytes or groups of bits read from ones of a group of said memory units of a memory bank and stored in said buffer segments to generate the requested data; said logic circuits each including parity means operative to use a check byte or group of bits read from one of said memory units of said memory bank to regenerate data read from said group of memory units if one of said group of memory units fails; said buffer means being divided into a number of channels corresponding to the number of memory channels, each channel being divided into associated portion of buffer segments; and control means operative to control the transfer of data from a three dimensional array of memory units formed from a plurality of said two dimensional arrays to said host computer in response to requests therefrom by first addressing said buffer segments to establish whether the requested data is contained therein and if so supplying said data to said host computer, and if the requested data is not contained in the buffer segments, reading said data from the memory units, supplying said data to said host computer, reading from said memory units further data which is logically sequential to the data requested by said host computer and storing said further data in a buffer segment; said control means being further operative in response to data requests from said host computer to control said memory units to seek data stored in different memory banks simultaneously; said control means further controlling said buffer means to control the number and size of said buffer segments.
PCT/GB1992/002291 1992-01-06 1992-12-10 Computer memory array control WO1993014455A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP5511984A JPH08501643A (en) 1992-01-06 1992-12-10 Computer memory array control
EP92924811A EP0620934A1 (en) 1992-01-06 1992-12-10 Computer memory array control
AU30915/92A AU662376B2 (en) 1992-01-06 1992-12-10 Computer memory array control

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB9200207.0 1992-01-06
GB929200207A GB9200207D0 (en) 1992-01-06 1992-01-06 Computer memory array control

Publications (1)

Publication Number Publication Date
WO1993014455A1 true WO1993014455A1 (en) 1993-07-22

Family

ID=10708187

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1992/002291 WO1993014455A1 (en) 1992-01-06 1992-12-10 Computer memory array control

Country Status (6)

Country Link
EP (1) EP0620934A1 (en)
JP (1) JPH08501643A (en)
AU (1) AU662376B2 (en)
CA (1) CA2127380A1 (en)
GB (1) GB9200207D0 (en)
WO (1) WO1993014455A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1989009468A1 (en) * 1988-04-01 1989-10-05 Unisys Corporation High capacity multiple-disk storage method and apparatus
WO1989010594A1 (en) * 1988-04-22 1989-11-02 Amdahl Corporation A file system for a plurality of storage classes
EP0369707A2 (en) * 1988-11-14 1990-05-23 Emc Corporation Arrayed disk drive system and method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5375084A (en) * 1993-11-08 1994-12-20 International Business Machines Corporation Selectable interface between memory controller and memory simms
EP0727750A2 (en) * 1995-02-17 1996-08-21 Kabushiki Kaisha Toshiba Continuous data server apparatus and data transfer scheme enabling multiple simultaneous data accesses
EP0727750A3 (en) * 1995-02-17 1997-07-23 Toshiba Kk Continuous data server apparatus and data transfer scheme enabling multiple simultaneous data accesses
US5862403A (en) * 1995-02-17 1999-01-19 Kabushiki Kaisha Toshiba Continuous data server apparatus and data transfer scheme enabling multiple simultaneous data accesses
WO1998000776A1 (en) * 1996-06-28 1998-01-08 Lsi Logic Corporation Cache memory controller in a raid interface
US5881254A (en) * 1996-06-28 1999-03-09 Lsi Logic Corporation Inter-bus bridge circuit with integrated memory port
US5937174A (en) * 1996-06-28 1999-08-10 Lsi Logic Corporation Scalable hierarchical memory structure for high data bandwidth raid applications
US5983306A (en) * 1996-06-28 1999-11-09 Lsi Logic Corporation PCI bridge with upstream memory prefetch and buffered memory write disable address ranges

Also Published As

Publication number Publication date
AU3091592A (en) 1993-08-03
EP0620934A1 (en) 1994-10-26
AU662376B2 (en) 1995-08-31
GB9200207D0 (en) 1992-02-26
JPH08501643A (en) 1996-02-20
CA2127380A1 (en) 1993-07-22

Similar Documents

Publication Publication Date Title
US5526507A (en) Computer memory array control for accessing different memory banks simultaneously
US6058489A (en) On-line disk array reconfiguration
US6009481A (en) Mass storage system using internal system-level mirroring
EP0572564B1 (en) Parity calculation in an efficient array of mass storage devices
US7228381B2 (en) Storage system using fast storage device for storing redundant data
US5893919A (en) Apparatus and method for storing data with selectable data protection using mirroring and selectable parity inhibition
US5608891A (en) Recording system having a redundant array of storage devices and having read and write circuits with memory buffers
US5875456A (en) Storage device array and methods for striping and unstriping data and for adding and removing disks online to/from a raid storage array
US5101492A (en) Data redundancy and recovery protection
US5446855A (en) System and method for disk array data transfer
US6772310B2 (en) Method and apparatus for zeroing a transfer buffer memory as a background task
EP1376329A2 (en) Method of utilizing storage disks of differing capacity in a single storage volume in a hierarchic disk array
EP0850448A1 (en) Method and apparatus for improving performance in a redundant array of independent disks
JP2000099282A (en) File management system
US6571314B1 (en) Method for changing raid-level in disk array subsystem
US6425053B1 (en) System and method for zeroing data storage blocks in a raid storage implementation
EP0657801A1 (en) System and method for supporting reproduction of full motion video on a plurality of playback platforms
US6611897B2 (en) Method and apparatus for implementing redundancy on data stored in a disk array subsystem based on use frequency or importance of the data
WO1992004674A1 (en) Computer memory array control
AU662376B2 (en) Computer memory array control
US6934803B2 (en) Methods and structure for multi-drive mirroring in a resource constrained raid controller
JP2854471B2 (en) Disk array device
GB2298306A (en) A disk array and tasking means
JPH0736633A (en) Magnetic disk array

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU CA GB JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
WWE WIPO information: entry into national phase

Ref document number: 1992924811

Country of ref document: EP

WWE WIPO information: entry into national phase

Ref document number: 2127380

Country of ref document: CA

WWP WIPO information: published in national office

Ref document number: 1992924811

Country of ref document: EP

WWR WIPO information: refused in national office

Ref document number: 1992924811

Country of ref document: EP

WWW WIPO information: withdrawn in national office

Ref document number: 1992924811

Country of ref document: EP