CA2382929A1 - Shared memory disk - Google Patents

Shared memory disk

Info

Publication number
CA2382929A1
CA2382929A1 (application CA002382929A)
Authority
CA
Canada
Prior art keywords
memory
shared
memdisk
multiplicity
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002382929A
Other languages
French (fr)
Inventor
Chris Miller
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Times N Systems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Times N Systems Inc filed Critical Times N Systems Inc
Publication of CA2382929A1 publication Critical patent/CA2382929A1/en
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/526Mutual exclusion algorithms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/0284Multiple user address space allocation, e.g. using different base addresses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815Cache consistency protocols
    • G06F12/0817Cache consistency protocols using directory methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/457Communication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815Cache consistency protocols
    • G06F12/0837Cache consistency protocols with software control, e.g. non-cacheable data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/52Indexing scheme relating to G06F9/52
    • G06F2209/523Mode

Abstract

Methods, systems and devices are described for a shared memory disk (MEMDISK).
A method includes setting aside a particular range of a shared memory as a MEMDISK; and providing control for each of several operating systems that compose processing nodes coupled to said shared memory such that none of the processing nodes will attempt to utilize pages within said particular region for non-MEMDISK purposes. The methods, systems and devices provide advantages because the speed and scalability of parallel processor systems are enhanced.

Description

SHARED MEMORY DISK
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates generally to the field of computing systems in which multiple processors share memory but in which each is provided with separate access to input-output (I/O) devices such as disks. More particularly, the invention relates to computer science techniques that utilize a shared memory disk (MEMDISK).
2. Discussion of the Related Art
The clustering of workstations is a well-known art. In the most common cases, the clustering involves workstations that operate almost totally independently, utilizing the network only to share such services as a printer, license-limited applications, or shared files.
In more-closely-coupled environments, some software packages (such as NQS) allow a cluster of workstations to share work. In such cases the work arrives, typically as batch jobs, at an entry point to the cluster where it is queued and dispatched to the workstations on the basis of load.
In both of these cases, and all other known cases of clustering, the operating system and cluster subsystem are built around the concept of message-passing. The term message-passing means that a given workstation operates on some portion of a job until communications (to send or receive data, typically) with another workstation is necessary. Then, the first workstation prepares and communicates with the other workstation.
Another well-known art is that of clustering processors within a machine, usually called a Massively Parallel Processor or MPP, in which the techniques are essentially identical to those of clustered workstations.
Usually, the bandwidth and latency of the interconnect network of an MPP are more highly optimized, but the system operation is the same.
In the general case, the passing of a message is an extremely expensive operation; expensive in the sense that many CPU cycles in the sender and receiver are consumed by the process of sending, receiving, bracketing, verifying, and routing the message, CPU cycles that are therefore not available for other operations. A highly streamlined message-passing subsystem can typically require 10,000 to 20,000 CPU cycles or more.
There are specific cases wherein the passing of a message requires significantly less overhead. However, none of these specific cases is adaptable to a general-purpose computer system.
Message-passing parallel processor systems have been offered commercially for years but have failed to capture significant market share because of poor performance and difficulty of programming for typical parallel applications. Message-passing parallel processor systems do have some advantages. In particular, because they share no resources, message-passing parallel processor systems are easier to provide with high-availability features.
What is needed is a better approach to parallel processor systems.
There are alternatives to the passing of messages for closely-coupled cluster work. One such alternative is the use of shared memory for inter-processor communication.
Shared-memory systems have been much more successful at capturing market share than message-passing systems because of their dramatically superior performance, up to about four-processor systems. In Search of Clusters by Gregory F. Pfister, 2nd ed. (Prentice Hall, January 1998, ISBN 0138997098), describes a computing system with multiple processing nodes in which each processing node is provided with private, local memory and also has access to a range of memory which is shared with other processing nodes. The disclosure of this publication in its entirety is hereby expressly incorporated herein by reference for the purpose of indicating the background of the invention and illustrating the state of the art.
However, providing high availability for traditional shared-memory systems has proved to be an elusive goal. The nature of these systems, which share all code and all data, including that data which controls the shared operating systems, is incompatible with the separation normally required for high availability. What is needed is an approach to shared-memory systems that improves availability.
Although the use of shared memory for inter-processor communication is a well-known art, prior to the teachings of U.S. Ser. No. 09/273,430, filed March 19, 1999, entitled Shared Memory Apparatus and Method for Multiprocessing Systems, the processors shared a single copy of the operating system. The problem with such systems is that they cannot be efficiently scaled beyond four- to eight-way systems except in unusual circumstances. All known cases of said unusual circumstances are such that the systems are not good price-performance systems for general-purpose computing.
The entire contents of U.S. Patent Applications 09/273,430, filed March 19, 1999 and PCT/US00/01262, filed January 18, 2000 are hereby expressly incorporated by reference herein for all purposes. U.S. Ser. No. 09/273,430 improved upon the concept of shared memory by teaching the concept which will herein be referred to as a tight cluster. The concept of a tight cluster is that of individual computers, each with its own CPU(s), memory, I/O, and operating system, but for which collection of computers there is a portion of memory which is shared by all the computers and via which they can exchange information. U.S. Ser. No. 09/273,430 describes a system in which each processing node is provided with its own private copy of an operating system and in which the connection to shared memory is via a standard bus. The advantage of a tight cluster in comparison to an SMP is "scalability," which means that a much larger number of computers can be attached together via a tight cluster than in an SMP with little loss of processing efficiency.
What is needed are improvements to the concept of the tight cluster.
What is also needed is an expansion of the concept of the tight cluster.
Another well-known art is the use of memory caches to improve performance. Caches provide such a significant performance boost that most modern computers use them. At the very top of the performance (and price) range all of memory is constructed using cache-memory technologies.
However, this is such an expensive approach that few manufacturers use it. All manufacturers of personal computers (PCs) and workstations use caches except for the very low end of the PC business where caches are omitted for price reasons and performance is, therefore, poor.
Caches, however, present a problem for shared-memory computing systems; the problem of coherence. As a particular processor reads or writes a word of shared memory, that word and usually a number of surrounding words are transferred to that particular processor's cache memory transparently by cache-memory hardware. That word and the surrounding words (if any) are transferred into a portion of the particular processor's cache memory that is called a cache line or cache block.
If the transferred cache line is modified by the particular processor, the representation in the cache memory will become different from the value in shared memory. That cache line within that particular processor's cache memory is, at that point, called a "dirty" line. The particular processor with the dirty line, when accessing that memory address will see the new (modified) value. Other processors, accessing that memory address will see the old (unmodified) value in shared memory. This lack of coherence between such accesses will lead to incorrect results.
Modern computers, workstations, and PCs which provide for multiple processors and shared memory, therefore, also provide high-speed, transparent cache coherence hardware to assure that if a line in one cache changes and another processor subsequently accesses a value which is in that address range, the new values will be transferred back to memory or at least to the requesting processor.
Caches can be maintained coherent by software provided that sufficient cache-management instructions are supplied by the manufacturer. However, in many cases, an adequate arsenal of such instructions is not provided.
Moreover, even in cases where the instruction set is adequate, the software overhead is so great that no examples are known of commercially successful machines which use software-managed coherence.
SUMMARY OF THE INVENTION
A goal of the invention is to simultaneously satisfy the above-discussed requirements of improving and expanding the tight cluster concept which, in the case of the prior art, are not satisfied.
One embodiment of the invention is based on a method, comprising: setting aside a particular range of a shared memory as a MEMDISK; and providing control for each of several operating systems that compose processing nodes coupled to said shared memory such that none of the processing nodes will attempt to utilize pages within said particular region for non-MEMDISK purposes.
Another embodiment of the invention is based on a system, comprising a multiplicity of processors, each with some private memory and the multiplicity with some shared memory, interconnected and arranged such that memory accesses to a first set of address ranges will be to local, private memory whereas memory accesses to a second set of address ranges will be to shared memory, and arranged such that at least some of said processors are provided with input-output subsystems and that said input-output (I/O) traffic started by one processor for an I/O device attached to another processor will be started by inter-processor signals but continued via use of a portion of shared memory accessed via I/O driver emulation means.
Another embodiment of the invention is based on a computer system which provides operating system extensions to perform disk input-output (I/O) functions in a shared-memory environment, where said extensions perform the functions with direct Load and Store operations.
Another embodiment of the invention is based on a computer system that provides system-wide registration of shared-memory disk partitions at all of a multiplicity of processing nodes within the system.
Another embodiment of the invention is based on a computer system that provides system-wide registration of shared-memory disk access methodologies at all of a multiplicity of processing nodes within the system.
Another embodiment of the invention is based on a computer system that provides system-wide status of shared-memory disk operations at all of a multiplicity of processing nodes within the system.
Another embodiment of the invention is based on a computer system that provides for multiple instantiations in a shared-memory environment of a disk to satisfy disk I/O operations for all system members, transparent to the Operating System.
Another embodiment of the invention is based on a computer system that provides for caching of data system-wide in a shared-memory environment to satisfy disk I/O functions for all system members, transparent to the Operating System.
Another embodiment of the invention is based on a computer system that provides application appropriate access methodologies based on system-wide partitioning of a data store in a shared-memory environment.
These, and other goals and embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating preferred embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the invention without departing from the spirit thereof, and the invention includes all such modifications.
BRIEF DESCRIPTION OF THE DRAWINGS
A clear conception of the advantages and features constituting the invention, and of the components and operation of model systems provided with the invention, will become more readily apparent by referring to the exemplary, and therefore nonlimiting, embodiments illustrated in the drawings accompanying and forming a part of this specification, wherein like reference characters (if they occur in more than one view) designate the same parts. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale.
FIG. 1 illustrates a block schematic view of a system, representing an embodiment of the invention.
FIG. 2 illustrates a block schematic view of another system, representing an embodiment of the invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
The invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawings and detailed in the following description of preferred embodiments. Descriptions of well known components and processing techniques are omitted so as not to unnecessarily obscure the invention in detail.
The teachings of U.S. Ser. No. 09/273,430 include a system which is a single entity; one large supercomputer. The invention is also applicable to a cluster of workstations, or even a network.
The invention is applicable to systems of the type of Pfister or the type of U.S. Ser. No. 09/273,430 in which each processing node has its own copy of an operating system. The invention is also applicable to other types of multiple processing node systems.
The context of the invention can include a tight cluster as described in U.S. Ser. No. 09/273,430. A tight cluster is defined as a cluster of workstations or an arrangement within a single, multiple-processor machine in which the processors are connected by a high-speed, low-latency interconnection, and in which some but not all memory is shared among the processors. Within the scope of a given processor, accesses to a first set of ranges of memory addresses will be to local, private memory but accesses to a second set of memory address ranges will be to shared memory. The significant advantage to a tight cluster in comparison to a message-passing cluster is that, assuming the environment has been appropriately established, the exchange of information involves a single STORE instruction by the sending processor and a subsequent single LOAD
instruction by the receiving processor.
The establishment of the environment, taught by U.S. Ser. No.
09/273,430 and more fully by companion disclosures (U.S. Provisional Application Ser. No. 60/220,794, filed July 26, 2000; U.S. Provisional Application Ser. No. 60/220,748, filed July 26, 2000; WSGR 15245-711;
WSGR 15245-712; WSGR 15245-713; WSGR 15245-715; WSGR 15245-716;
WSGR 15245-717; WSGR 15245-719; and WSGR 15245-720, the entire contents of all which are hereby expressly incorporated herein by reference for all purposes) can be performed in such a way as to require relatively little system overhead, and to be done once for many, many information exchanges.
Therefore, a comparison of 10,000 instructions for message-passing to a pair of instructions for tight-clustering is valid.
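As an illustration of this exchange, the following minimal C sketch (not taken from the patent; the mailbox layout and function names are hypothetical) shows a sender publishing a word in a mapped shared-memory region with a single data store plus a flag store, and a receiver picking it up with loads:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical view of a shared-memory mailbox; in a tight cluster the
 * region would be mapped at an agreed shared address range on every node. */
typedef struct {
    _Atomic uint32_t ready;   /* 0 = empty, 1 = data valid */
    uint64_t         payload; /* the word being exchanged  */
} shm_mailbox_t;

/* Sender side: one store of the data, one store to publish it. */
static void shm_send(shm_mailbox_t *mb, uint64_t value)
{
    mb->payload = value;
    atomic_store_explicit(&mb->ready, 1, memory_order_release);
}

/* Receiver side: wait until published, then a single load of the data. */
static uint64_t shm_receive(shm_mailbox_t *mb)
{
    while (atomic_load_explicit(&mb->ready, memory_order_acquire) == 0)
        ;                     /* once set up, no interrupts or messages needed */
    return mb->payload;
}
```

Once the mapping has been established, the steady-state cost of an exchange is essentially these few memory operations, which is the basis of the comparison with the 10,000-plus cycles of a message-passing path.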
The invention can include providing highly-efficient operating system control for a tight cluster system. Among the means of controlling shared memory in such a tight cluster for improved performance is the provision of a shared memory disk (MEMDISK) which can be shared among the various processes and processors of the cluster.
In the context of a computing system in which multiple processing nodes share some memory, and where each node has access to separate input-output (I/O), the invention can include the utilization of shared memory to achieve OS-transparent high-speed access to disk. The invention can also provide the ability to substitute memory accesses for disk accesses under certain circumstances, while maintaining OS-transparency to the substitution.
The invention is applicable to the kind of systems taught by U.S. Ser.
No. 09/273,430 and is also applicable to other architectures such as NUMA, CC-NUMA, and other machines in which each processor or processor aggregation is provided with separate I/O. In the case of NUMA and CC-NUMA machines, all of memory is shared and there is one copy of the operating system. In U.S. Ser. No. 09/273,430 only a portion of the memory is shared and each node is provided with a separate copy of the operating system.
In a computing system which is provided with multiple nodes, and in which several of the multiple processing nodes are provided with separate I/O
paths, the paths from each of the processing nodes to non-local I/O is generally one of several types. These types include: (1) each of the separate processing nodes restricted from reaching I/O attached to other nodes; (2) each of the processing nodes acting as a surrogate for another node, providing I/O
capability to the requesting node in a proxy fashion; (3) in schemes such as NUMA and CC-NUMA, where a processing node may be provided with a path, via directory and cache-coherence means, whereby the single common operating system can reach I/O on the other nodes; or (4) a system external provision providing a common I/O resource, such as Fibre Channel I/O, a twin-tailed SCSI disk facility, or standard networking facility.
The present invention can be used in the context of the environment described in U.S. Ser. No. 09/273,430 where multiple computers are provided with means to selectively address a first set of memory address ranges which will be to private memory and a second set of memory ranges which will be to shared memory. The invention can include: setting aside a particular range of shared memory as a MEMDISK; providing control means for each of the several operating systems such that none will attempt to utilize pages within this region for normal shared-memory purposes.
A device driver interface can be utilized that is responsive to disk I/O operations to a particular disk and which will translate said I/O operations to LOAD and STORE operations to said region of memory. I/O operations which may otherwise be directed to a physical disk can be intercepted and redirected to said region of memory.
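A minimal sketch of such an intercept-and-redirect path is shown below; the block-request structure, the `memdisk_try_handle` entry point, and the global region pointers are illustrative assumptions rather than the patent's actual driver interface:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MEMDISK_BLOCK_SIZE 4096u

/* Hypothetical block request, as a generic OS might hand to a disk driver. */
typedef struct {
    uint64_t block;      /* logical block number on the "disk"   */
    void    *buffer;     /* caller's buffer                      */
    size_t   nblocks;    /* number of blocks to transfer         */
    int      is_write;   /* nonzero for WRITE, zero for READ     */
} blk_req_t;

static uint8_t *memdisk_base;    /* start of the reserved shared range (assumed mapped) */
static uint64_t memdisk_blocks;  /* capacity of the MEMDISK in blocks                   */

/* Intercept a disk request: if it falls inside the MEMDISK, satisfy it with
 * plain loads/stores (memcpy); otherwise report "not handled" so the caller
 * can forward it to the physical disk owner as before.                      */
int memdisk_try_handle(blk_req_t *req)
{
    if (req->block + req->nblocks > memdisk_blocks)
        return 0;                                  /* not ours */

    uint8_t *region = memdisk_base + req->block * MEMDISK_BLOCK_SIZE;
    size_t   bytes  = req->nblocks * MEMDISK_BLOCK_SIZE;

    if (req->is_write)
        memcpy(region, req->buffer, bytes);        /* STOREs into shared memory */
    else
        memcpy(req->buffer, region, bytes);        /* LOADs from shared memory  */
    return 1;                                      /* request satisfied         */
}
```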
In a tight cluster, each processor is provided with its own memory and its own operating system or micro-kernel and may be provided with its own I/O
subsystem. The present invention is applicable in such a system where more than one such processor is provided with I/O and in which the file system is visible to all or a multiplicity of processors.
When a particular processor develops traffic for an I/O device not on its I/O subsystem, it requests that the processor owning the particular I/O
subsystem satisfy the request. In this way, the processors serve as I/O
processors to each other.
However, in this invention the requesting process sends to the service process a request sufficient to initiate and completely satisfy the particular I/O
block request. Rather than the two processors thereafter interrupting or polling each other during the block transfer at each disk READ or WRITE, the movement is accomplished via the shared MEMDISK.
The MEMDISK, as described here, teaches three major new concepts.
First, the purpose is not to minimize access time, although that minimization occurs, but rather is to minimize interference between a pair of processors so that a first process on a first processor may deliver data to the MEMDISK as it becomes available from the file system and a second process on a second processor may read the data from the shared memory region without interrupting the first processor after the initial START I/O process is begun.
MEMDISK WRITES are protected by semaphores to eliminate data corruption, and MEMDISK semaphores are applied on a region-by-region basis so that artificial interference is also eliminated.
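Region-by-region protection of MEMDISK WRITES might be realized along the lines of the following sketch, which assumes a fixed region size and uses process-local POSIX mutexes purely to keep the example short (a real tight cluster would use semaphores that themselves live in shared memory); the names and layout are hypothetical:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MEMDISK_REGIONS     64
#define MEMDISK_REGION_SIZE (1u << 20)      /* 1 MiB per protected region */

static uint8_t        *memdisk_base;        /* mapped shared MEMDISK range */
static pthread_mutex_t region_lock[MEMDISK_REGIONS];

void memdisk_locks_init(void)
{
    for (int r = 0; r < MEMDISK_REGIONS; r++)
        pthread_mutex_init(&region_lock[r], NULL);
}

/* A write acquires only the lock(s) of the region(s) it touches, so two
 * writers working in different regions never interfere with each other. */
void memdisk_write(uint64_t offset, const void *src, size_t len)
{
    size_t first = offset / MEMDISK_REGION_SIZE;
    size_t last  = (offset + len - 1) / MEMDISK_REGION_SIZE;

    for (size_t r = first; r <= last; r++)
        pthread_mutex_lock(&region_lock[r]);

    memcpy(memdisk_base + offset, src, len);   /* the protected STOREs */

    for (size_t r = first; r <= last; r++)
        pthread_mutex_unlock(&region_lock[r]);
}
```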
The second major concept is that of disk-to-memory dynamic redirection. The initial START I/O to a particular file is delivered, by the device driver, to the processor owning the I/O device. Subsequent I/O to that file is dynamically redirected to the MEMDISK by the device driver and is managed transparent to all other processes including the remainder of the operating system. When the particular file access is completed, subsequent I/O
operations to other files will be satisfied by MEMDISK if present there, otherwise subsequent I/O operations to other files will be directed to the appropriate I/O
owner by the file subsystem.
A third major teaching of this invention is that of "aging" of data within the MEMDISK region. Once I/O data is placed in the region, it stays there and is available to any process on any processor until the MEMDISK region becomes full, at which time older (least-recently-used) information is replaced by newer information. The portion of the MEMDISK reflecting data on a particular I/O device is kept coherent by the owner of the I/O device.
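The aging behavior can be pictured as an ordinary least-recently-used policy over fixed-size MEMDISK slots, as in the sketch below; the slot structure and function names are illustrative assumptions, not the patent's data layout:

```c
#include <stdint.h>

#define MEMDISK_SLOTS 1024

/* One cached disk extent living in the MEMDISK region. */
typedef struct {
    uint64_t device_id;   /* which I/O device the data came from */
    uint64_t block;       /* starting block on that device       */
    uint64_t last_used;   /* logical timestamp for LRU aging     */
    int      valid;
} memdisk_slot_t;

static memdisk_slot_t slots[MEMDISK_SLOTS];
static uint64_t       clock_tick;

/* Find a slot for (device, block): reuse a hit, fill a free slot, or
 * replace the least-recently-used slot once the region is full.      */
int memdisk_slot_for(uint64_t device_id, uint64_t block)
{
    int free_idx = -1, lru_idx = 0;

    for (int i = 0; i < MEMDISK_SLOTS; i++) {
        if (slots[i].valid && slots[i].device_id == device_id &&
            slots[i].block == block) {
            slots[i].last_used = ++clock_tick;       /* refresh age on hit */
            return i;
        }
        if (!slots[i].valid && free_idx < 0)
            free_idx = i;
        if (slots[i].last_used < slots[lru_idx].last_used)
            lru_idx = i;
    }

    int idx = (free_idx >= 0) ? free_idx : lru_idx;   /* age out oldest data */
    slots[idx] = (memdisk_slot_t){ device_id, block, ++clock_tick, 1 };
    return idx;
}
```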
This invention describes a means outside the operating system but within shared memory to provide any node in a shared-memory computing system access to memory, which is physically attached to another node. In a preferred embodiment, each system is provided with some local, private memory and a separate copy of the operating system. In this preferred embodiment, each of several nodes is provided with its own I/O channel to disks and other I/O units.
Referring to FIGS. 1-2, in a preferred embodiment, the operating system in each node is augmented with external extensions, not part of the operating system, which provide means by which said extensions have capabilities to reach shared memory and to communicate to other said nodes via Load and Store instructions to shared memory.
In this embodiment, the invention includes other extensions, called shared-memory-disk (SMD) extensions, which make use of these primitives and by which disk I/O functions which originate in applications and are then passed to the operating system are processed. Said disk I/O functions, arriving at the Operating System, are processed by said SMD extensions and are translated to shared-memory Load and Store instructions. This converts disk I/O transactions issued by the Operating System into Load and Store transfers via shared memory, and so satisfies the Operating System I/O request transparently.
An additional key teaching of this invention is the system-wide registration of all shared-memory disk partitions at all processing nodes within the system, with an access methodology dependent upon performance requirements. An additional key teaching of this invention is the use of a means herein called "software RAID" to assure high-availability of the disk I/O
portion of the computing system. An additional key feature of this system is the provision, in shared memory, for retention of data which passes through shared memory so that subsequent accesses to said retained images can be satisfied solely by memory transfers thus achieving dramatic performance improvements.
Each of the computing nodes that participates in constructing an SMD instantiation contributes a node-local data store, hereafter referred to as a SmdBlock, to the instantiation. The data store is not limited as to type, and may be private memory, paged virtual memory, local disk I/O, etc. A preferred implementation provides this data store with commodity disk drives. These SmdBlocks are collected and logically bound into a SmdDisk in a repeatable order. This order is subsequently used as the basis for the access methodology most appropriate to the intended use (e.g., ordinally-arranged contiguity for best random access, interleaved contiguity for best sequential access, distributed contiguity for de-clustered failover or software RAID applications, etc.). The SmdDisk and SmdBlocks are identified within the immediate community of computing nodes by a community-unique signature.
Upon instantiation of a SmdDisk, a known address (or known key to address translation) is searched for a single data structure (hereafter referred to as the SmdAnchor) containing identifying patterns, validity and version information, and a shared-memory pointer to a linked list of SmdDisk control structures. If the SmdAnchor is not yet present, one is created. If the SmdAnchor is present, the SmdDisk list is searched for a matching signature, which informs the instantiation that the supporting shared-memory data structures have been created by another computing node, and it may now access same as appropriate. If the SmdDisk structure is not present, it is created and linked into the SmdAnchor list, for future SmdDisk and SmdBlock instantiations to find. The passive nature of the preferred implementation avoids inter-node messaging and communications overhead, while simplifying state transitions.
Upon instantiation of a SmdBlock, the SmdAnchor, and its corresponding SmdDisk list, is searched for a matching signature. If one is not found, the search is repeated ad infinitum, with an appropriate delay to allow for another computing node or computing process to create the SmdDisk structure.
If and when one is found, the SmdBlock inserts its computing node-unique information into the SmdDisk structure. When all SmdBlocks that contribute to a SmdDisk have inserted their information into the control structure, the SmdDisk is ready for use, and thus describes a distributed data store.
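The find-or-create walk over the SmdAnchor and SmdDisk list described above might be sketched as follows; the structure layouts, field names, and retry delay are illustrative assumptions, and the shared-memory mutual exclusion that a real implementation would need around anchor and list updates is omitted for brevity:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SMD_MAGIC      0x534D4441u  /* identifying pattern for a valid anchor */
#define SMD_MAX_BLOCKS 16

typedef struct SmdDisk {
    char            signature[32];          /* community-unique name     */
    int             blocks_expected;
    int             blocks_registered;
    uint64_t        block_node_id[SMD_MAX_BLOCKS];
    struct SmdDisk *next;                   /* shared-memory linked list */
} SmdDisk;

typedef struct {
    uint32_t magic;                         /* identifying pattern       */
    uint32_t version;                       /* validity/version info     */
    SmdDisk *disks;                         /* head of SmdDisk list      */
} SmdAnchor;

/* The anchor lives at a known shared-memory address on every node;
 * it is assumed to be mapped before these routines are called.       */
static SmdAnchor *anchor;

SmdDisk *smd_disk_find_or_create(const char *signature, int blocks_expected,
                                 SmdDisk *fresh /* node-allocated shared struct */)
{
    if (anchor->magic != SMD_MAGIC) {       /* first node creates the anchor */
        anchor->version = 1;
        anchor->disks   = NULL;
        anchor->magic   = SMD_MAGIC;
    }
    for (SmdDisk *d = anchor->disks; d; d = d->next)
        if (strcmp(d->signature, signature) == 0)
            return d;                       /* another node already made it  */

    memset(fresh->signature, 0, sizeof fresh->signature);
    strncpy(fresh->signature, signature, sizeof fresh->signature - 1);
    fresh->blocks_expected   = blocks_expected;
    fresh->blocks_registered = 0;
    fresh->next              = anchor->disks;
    anchor->disks            = fresh;       /* link for later nodes to find  */
    return fresh;
}

/* A SmdBlock waits for the SmdDisk to appear, then registers itself. */
void smd_block_register(const char *signature, uint64_t node_id)
{
    SmdDisk *d = NULL;
    while (d == NULL) {                     /* passive retry, no messaging */
        for (SmdDisk *p = anchor->disks; p; p = p->next)
            if (strcmp(p->signature, signature) == 0) { d = p; break; }
        if (d == NULL)
            usleep(1000);                   /* allow another node to create it */
    }
    d->block_node_id[d->blocks_registered++] = node_id;
}
```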
When an I/O from the Operating System arrives, the SmdDisk structure can be examined and the operation broken into a list of transactions matching the SmdBlock node-locality and requirements of the SmdDisk data store access methodology. These transactions are then sent to each of the SmdBlock nodes for fulfillment using shared-memory Load and Store operations to transfer the data between nodes.
Referring to FIG. 2, on read transactions, the SmdDisk allocates shared-memory data areas for each of the transactions and sends read commands to the SmdBlock nodes referencing said data areas. The SmdBlock nodes perform the node-local data store reads, moving the data into said areas, then return status to the original node. When the originating node collects all SmdBlock transaction statuses (either passively or via the mechanisms described in [2]), the operation is considered complete and the data contained in the shared-memory data areas can be used to fulfill the original request.
Still referring to FIG. 2, on write transactions, the SmdDisk allocates shared-memory data areas for each of the transactions and moves the write data into said areas, according to the access methodology, then sends write commands to the appropriate SmdBlock nodes. The SmdBlock nodes perform the data store writes, moving the data from said areas into their data store, then return status to the originating node. When the originating node collects all SmdBlock transaction statuses, the operation is complete.
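A read transaction of this kind can be pictured as the loop below, where `shm_alloc`, `smd_send_read`, and `smd_wait_status` are hypothetical stand-ins for the shared-memory allocator and the inter-node command and status primitives described above:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed primitives provided by the shared-memory layer (illustrative only). */
void *shm_alloc(size_t bytes);
void  smd_send_read(uint64_t node_id, uint64_t local_block,
                    void *shared_area, size_t bytes);
int   smd_wait_status(uint64_t node_id);          /* 0 on success */

#define SMD_BLOCK_BYTES 4096u

/* Satisfy a read of `nblocks` starting at logical block `first` by splitting
 * it into one transaction per owning SmdBlock node, staged through shared
 * memory. The staging areas are deliberately kept resident so later reads
 * can be satisfied from this shared cache without touching the data store. */
int smd_read(uint64_t first, size_t nblocks, void *dst,
             const uint64_t *owner_node, const uint64_t *owner_block)
{
    void **areas = shm_alloc(nblocks * sizeof *areas);

    for (size_t i = 0; i < nblocks; i++) {
        areas[i] = shm_alloc(SMD_BLOCK_BYTES);     /* shared staging area  */
        smd_send_read(owner_node[first + i], owner_block[first + i],
                      areas[i], SMD_BLOCK_BYTES);
    }
    for (size_t i = 0; i < nblocks; i++)           /* collect all statuses */
        if (smd_wait_status(owner_node[first + i]) != 0)
            return -1;

    for (size_t i = 0; i < nblocks; i++)           /* data arrives via shared
                                                      memory, not messages  */
        memcpy((uint8_t *)dst + i * SMD_BLOCK_BYTES, areas[i], SMD_BLOCK_BYTES);
    return 0;
}
```

The write path mirrors this: the originating node copies the write data into the shared areas first, then sends write commands and collects statuses.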
On both read and write transactions, the shared-memory areas used to transfer the data can be kept resident in shared memory, and subsequently used to satisfy read requests by any node connected to the shared-memory system.
This implements a shared-memory cache of the shared and distributed disk. The management of a shared-memory cache requires the use of shared-memory mutual exclusion mechanisms to maintain coherency. This shared nature allows multiple nodes to access, and contribute to, the shared cache, making it possible to satisfy disk I/O operations entirely with Load and Store operations.
The control structures to manage said shared-memory cache can be kept within the SmdDisk control structure, allowing the above mentioned search and access methods to be used. This cache can result in significant performance enhancements, as the latency and transfer time for a node to deliver data into shared memory, as well as the access of the physical data store, are eliminated.
An extension of the invention allows different access methodologies to be implemented for differing requirements, without affecting the Operating System perceived implementation. In a preferred implementation for optimal random access, the data stores contributed by all SmdBlock nodes can be ordinally arranged as contiguous data stores using commodity disk drives. This allows the head movement latency to be mitigated across the SmdBlock contributors, providing greatly reduced access latency and improved performance. In a preferred implementation for optimal sequential data access, the data stores can be arranged in a striping, or RAID0, configuration, thus improving throughput by effecting concurrent media access. In a preferred implementation for high availability, the SmdBlocks can implement a "software RAID" by striping in a Chained Declustering methodology, allowing head movement mitigation and concurrent access (at the expense of duplicate data store space). In another preferred implementation for high availability, the SmdBlocks can implement another "software RAID" by striping the SmdBlock nodes' contributions in a RAID5 methodology, balancing the computing cost of data parity generation and checking with the improvements provided by head movement mitigation and concurrent access. In the preferred embodiment, the shared-memory caching is used in conjunction with a "software RAID"
methodology, for a complete high performance fault tolerant implementation.
The Operating System perceived implementation of the data store remains, in all cases, transparent.
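The practical difference between the ordinally-contiguous and striped (RAID0-style) arrangements comes down to how a logical block number is mapped to a (SmdBlock node, local block) pair; the following sketch shows the two mappings with hypothetical parameter names:

```c
#include <stdint.h>

typedef struct {
    uint32_t node;          /* which SmdBlock node holds the data    */
    uint64_t local_block;   /* block offset within that node's store */
} smd_location_t;

/* Ordinally-arranged contiguity: node 0 holds blocks [0, per_node),
 * node 1 holds [per_node, 2*per_node), and so on. Random accesses to
 * different extents tend to land on different nodes, spreading head
 * movement across the SmdBlock contributors.                        */
smd_location_t smd_map_contiguous(uint64_t lblock, uint64_t per_node)
{
    smd_location_t loc = { (uint32_t)(lblock / per_node),
                           lblock % per_node };
    return loc;
}

/* Interleaved (RAID0-style) striping: consecutive stripes rotate across
 * the nodes, so a long sequential transfer keeps every node's media
 * busy concurrently, improving throughput.                           */
smd_location_t smd_map_striped(uint64_t lblock, uint32_t nodes,
                               uint64_t stripe_blocks)
{
    uint64_t stripe = lblock / stripe_blocks;
    smd_location_t loc = { (uint32_t)(stripe % nodes),
                           (stripe / nodes) * stripe_blocks
                               + lblock % stripe_blocks };
    return loc;
}
```

The chained-declustering and RAID5 variants add replication or parity on top of such a mapping; in every case the Operating System continues to see a single ordinary disk.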
While not being limited to any particular performance indicator or diagnostic identifier, preferred embodiments of the invention can be identified one at a time by testing for the substantially highest performance. The test for the substantially highest performance can be carried out without undue experimentation by the use of a simple and conventional benchmark (speed) experiment.
The term substantially, as used herein, is defined as at least approaching a given state (e.g., preferably within 10% of, more preferably within 1% of, and most preferably within 0.1% of). The term coupled, as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically.
The term means, as used herein, is defined as hardware, firmware and/or software for achieving a result. The term program or phrase computer program, as used herein, is defined as a sequence of instructions designed for execution on a computer system. A program may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, and/or other sequence of instructions designed for execution on a computer system.

Practical Applications of the Invention
A practical application of the invention that has value within the technological arts is waveform transformation. Further, the invention is useful in conjunction with data input and transformation (such as are used for the purpose of speech recognition), or in conjunction with transforming the appearance of a display (such as are used for the purpose of video games), or the like. There are virtually innumerable uses for the invention, all of which need not be detailed here.
Advantages of the Invention
A system, representing an embodiment of the invention, can be cost effective and advantageous for at least the following reasons. The invention improves the speed of parallel computing systems. The invention improves the scalability of parallel computing systems.
All the disclosed embodiments of the invention described herein can be realized and practiced without undue experimentation. Although the best mode of carrying out the invention contemplated by the inventors is disclosed above, practice of the invention is not limited thereto. Accordingly, it will be appreciated by those skilled in the art that the invention may be practiced otherwise than as specifically described herein.
For example, although the shared memory disk described herein can be a separate module, it will be manifest that the shared memory disk may be integrated into the system with which it is associated. Furthermore, all the disclosed elements and features of each disclosed embodiment can be combined with, or substituted for, the disclosed elements and features of every other disclosed embodiment except where such elements or features are mutually exclusive.
It will be manifest that various additions, modifications and rearrangements of the features of the invention may be made without deviating from the spirit and scope of the underlying inventive concept. It is intended that the scope of the invention as defined by the appended claims and their equivalents cover all such additions, modifications, and rearrangements.

The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase "means for." Expedient embodiments of the invention are differentiated by the appended subclaims.

Claims (30)

What is claimed is:
1. A method, comprising:
setting aside a particular range of a shared memory as a MEMDISK; and providing control for each of several operating systems that compose processing nodes coupled to said shared memory such that none of the processing nodes will attempt to utilize pages within said particular region for non-MEMDISK purposes.
2. The method of claim 1, wherein a first process on a first processor delivers data to the MEMDISK as data becomes available from a file system and a second process on a second processor reads data from the MEMDISK
without interrupting the first processor after an initial START I/O process is begun.
3. The method of claim 2, wherein MEMDISK WRITES are protected by semaphores to eliminate data corruption.
4. The method of claim 3, wherein MEMDISK semaphores are applied on a region-by-region basis so that artificial interference is eliminated.
5. The method of claim 1, wherein an initial START I/O to a particular file is delivered, by a device driver, to a processor owning the I/O device and subsequent I/O to said particular file is dynamically redirected to the MEMDISK by the device driver and is managed transparent to all other processes including a remainder of the operating system.
6. The method of claim 5, wherein, when an access to the particular file is completed, subsequent I/O operations to other files will be satisfied by MEMDISK if present there, otherwise subsequent I/O operations to other files will be directed to an appropriate I/O owner by a file subsystem.
7. The method of claim 1, wherein once I/O data is placed in the MEMDISK, data stays there and is available to a process on a processor until the MEMDISK becomes full, at which time older information is replaced by newer information.
8. The method of claim 7, wherein a portion of the MEMDISK reflecting data on a particular I/O device is kept coherent by an owner of the particular I/O
device.
9. An electronic media, comprising: a computer program adapted to set aside a particular range of a shared memory as a MEMDISK; and provide control for each of several operating systems that compose processing nodes coupled to said shared memory such that none of the processing nodes will attempt to utilize pages within said particular region for non-MEMDISK
purposes.
10. A computer program comprising computer program means adapted to perform the steps of setting aside a particular range of a shared memory as a MEMDISK; and providing control for each of several operating systems that compose processing nodes coupled to said shared memory such that none of the processing nodes will attempt to utilize pages within said particular region for non-MEMDISK purposes when said computer program is run on a computer.
11. A computer program as claimed in claim 10, embodied on a computer-readable medium.
12. A system, comprising a multiplicity of processors, each with some private memory and the multiplicity with some shared memory, interconnected and arranged such that memory accesses to a first set of address ranges will be to local, private memory whereas memory accesses to a second set of address ranges will be to shared memory, and arranged such that at least some of said processors are provided with input-output subsystems and that said input-output (I/O) traffic started by one processor for an I/O device attached to another processor will be started by inter-processor signals but continued via use of a portion of shared memory accessed via I/O driver emulation means.
13. The system of claim 12, wherein the transition from physical I/O to memory-converted I/O is performed automatically at the I/O driver level so that all application and all other operating system interfaces are maintained so that it is fully transparent.
14. The system of claim 12, wherein said flow of I/O via shared memory is such that the two processors involved do not need to interrupt each other for the satisfying of a given I/O request after it is started and until it is complete.
15. The system of claim 12, wherein said shared-memory region (MEMDISK) retains the information placed therein for use by other processes.
16. The system of claim 15, wherein said information is replaced on a least-recently-used basis as the MEMDISK becomes full.
17. A computer system which provides operating system extensions to perform disk input-output (I/O) functions in a shared-memory environment, where said extensions perform the functions with direct Load and Store operations.
18. The computer system of claim 17, wherein each of said multiplicity of processors includes a separate operating system and a separate input-output.
19. A computer system that provides system-wide registration of shared-memory disk partitions at all of a multiplicity of processing nodes within the system.
20. The computer system of claim 19, wherein each of said multiplicity of processors includes a separate operating system and a separate input-output.
21. A computer system that provides system-wide registration of shared-memory disk access methodologies at all of a multiplicity of processing nodes within the system.
22. The computer system of claim 21, wherein each of said multiplicity of processors includes a separate operating system and a separate input-output.
23. A computer system that provides system-wide status of shared-memory disk operations at all of a multiplicity of processing nodes within the system.
24. The computer system of claim 23, wherein each of said multiplicity of processors includes a separate operating system and a separate input-output.
25. A computer system that provides for multiple instantiations in a shared-memory environment of a disk to satisfy disk I/O operations for all system members, transparent to the Operating System.
26. The computer system of claim 25, wherein each of a multiplicity of processors includes a separate operating system and a separate input-output.
27. A computer system that provides for caching of data system-wide in a shared-memory environment to satisfy disk I/O functions for all system members, transparent to the Operating System.
28. The computer system of claim 27, wherein each of a multiplicity of processors includes a separate operating system and a separate input-output.
29. A computer system that provides application appropriate access methodologies based on system-wide partitioning of a data store in a shared-memory environment.
30. The computer system of claim 29, wherein each of a multiplicity of processors includes a separate operating system and a separate input-output.
CA002382929A 1999-08-31 2000-08-31 Shared memory disk Abandoned CA2382929A1 (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US15215199P 1999-08-31 1999-08-31
US60/152,151 1999-08-31
US22074800P 2000-07-26 2000-07-26
US22097400P 2000-07-26 2000-07-26
US60/220,974 2000-07-26
US60/220,748 2000-07-26
PCT/US2000/024298 WO2001016743A2 (en) 1999-08-31 2000-08-31 Shared memory disk

Publications (1)

Publication Number Publication Date
CA2382929A1 true CA2382929A1 (en) 2001-03-08

Family

ID=27387201

Family Applications (3)

Application Number Title Priority Date Filing Date
CA002382929A Abandoned CA2382929A1 (en) 1999-08-31 2000-08-31 Shared memory disk
CA002382728A Abandoned CA2382728A1 (en) 1999-08-31 2000-08-31 Efficient event waiting
CA002382927A Abandoned CA2382927A1 (en) 1999-08-31 2000-08-31 Semaphore control of shared-memory

Family Applications After (2)

Application Number Title Priority Date Filing Date
CA002382728A Abandoned CA2382728A1 (en) 1999-08-31 2000-08-31 Efficient event waiting
CA002382927A Abandoned CA2382927A1 (en) 1999-08-31 2000-08-31 Semaphore control of shared-memory

Country Status (4)

Country Link
EP (3) EP1214653A2 (en)
AU (9) AU6949600A (en)
CA (3) CA2382929A1 (en)
WO (9) WO2001016760A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1527968A (en) * 2001-07-13 2004-09-08 Method of running a media application and a media system with job control
US6920485B2 (en) 2001-10-04 2005-07-19 Hewlett-Packard Development Company, L.P. Packet processing in shared memory multi-computer systems
US6999998B2 (en) 2001-10-04 2006-02-14 Hewlett-Packard Development Company, L.P. Shared memory coupling of network infrastructure devices
US7254745B2 (en) 2002-10-03 2007-08-07 International Business Machines Corporation Diagnostic probe management in data processing systems
JP2008046969A (en) * 2006-08-18 2008-02-28 Fujitsu Ltd Access monitoring method and device for shared memory
US7685381B2 (en) 2007-03-01 2010-03-23 International Business Machines Corporation Employing a data structure of readily accessible units of memory to facilitate memory access
US7899663B2 (en) 2007-03-30 2011-03-01 International Business Machines Corporation Providing memory consistency in an emulated processing environment
US9442780B2 (en) 2011-07-19 2016-09-13 Qualcomm Incorporated Synchronization of shader operation
US9064437B2 (en) 2012-12-07 2015-06-23 Intel Corporation Memory based semaphores
CN103608792B (en) 2013-05-28 2016-03-09 华为技术有限公司 The method and system of resource isolation under support multicore architecture

Family Cites Families (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3668644A (en) * 1970-02-09 1972-06-06 Burroughs Corp Failsafe memory system
US4484262A (en) * 1979-01-09 1984-11-20 Sullivan Herbert W Shared memory computer method and apparatus
US4403283A (en) * 1980-07-28 1983-09-06 Ncr Corporation Extended memory system and method
US4414624A (en) * 1980-11-19 1983-11-08 The United States Of America As Represented By The Secretary Of The Navy Multiple-microcomputer processing
US4725946A (en) * 1985-06-27 1988-02-16 Honeywell Information Systems Inc. P and V instructions for semaphore architecture in a multiprogramming/multiprocessing environment
JPH063589B2 (en) * 1987-10-29 1994-01-12 インターナシヨナル・ビジネス・マシーンズ・コーポレーシヨン Address replacement device
US5175839A (en) * 1987-12-24 1992-12-29 Fujitsu Limited Storage control system in a computer system for double-writing
DE68925064T2 (en) * 1988-05-26 1996-08-08 Hitachi Ltd Task execution control method for a multiprocessor system with post / wait procedure
US4992935A (en) * 1988-07-12 1991-02-12 International Business Machines Corporation Bit map search by competitive processors
US4965717A (en) * 1988-12-09 1990-10-23 Tandem Computers Incorporated Multiple processor system having shared memory with private-write capability
DE69124285T2 (en) * 1990-05-18 1997-08-14 Fujitsu Ltd Data processing system with an input / output path separation mechanism and method for controlling the data processing system
US5206952A (en) * 1990-09-12 1993-04-27 Cray Research, Inc. Fault tolerant networking architecture
US5434970A (en) * 1991-02-14 1995-07-18 Cray Research, Inc. System for distributed multiprocessor communication
JPH04271453A (en) * 1991-02-27 1992-09-28 Toshiba Corp Composite electronic computer
DE69227956T2 (en) * 1991-07-18 1999-06-10 Tandem Computers Inc Multiprocessor system with mirrored memory
US5315707A (en) * 1992-01-10 1994-05-24 Digital Equipment Corporation Multiprocessor buffer system
US5398331A (en) * 1992-07-08 1995-03-14 International Business Machines Corporation Shared storage controller for dual copy shared data
US5434975A (en) * 1992-09-24 1995-07-18 At&T Corp. System for interconnecting a synchronous path having semaphores and an asynchronous path having message queuing for interprocess communications
DE4238593A1 (en) * 1992-11-16 1994-05-19 Ibm Multiprocessor computer system
JP2963298B2 (en) * 1993-03-26 1999-10-18 富士通株式会社 Recovery method of exclusive control instruction in duplicated shared memory and computer system
US5590308A (en) * 1993-09-01 1996-12-31 International Business Machines Corporation Method and apparatus for reducing false invalidations in distributed systems
US5664089A (en) * 1994-04-26 1997-09-02 Unisys Corporation Multiple power domain power loss detection and interface disable
US5636359A (en) * 1994-06-20 1997-06-03 International Business Machines Corporation Performance enhancement system and method for a hierarchical data cache using a RAID parity scheme
US6587889B1 (en) * 1995-10-17 2003-07-01 International Business Machines Corporation Junction manager program object interconnection and method
US5940870A (en) * 1996-05-21 1999-08-17 Industrial Technology Research Institute Address translation for shared-memory multiprocessor clustering
US5784699A (en) * 1996-05-24 1998-07-21 Oracle Corporation Dynamic memory allocation in a computer using a bit map index
JPH10142298A (en) * 1996-11-15 1998-05-29 Advantest Corp Testing device for ic device
US5829029A (en) * 1996-12-18 1998-10-27 Bull Hn Information Systems Inc. Private cache miss and access management in a multiprocessor system with shared memory
US5918248A (en) * 1996-12-30 1999-06-29 Northern Telecom Limited Shared memory control algorithm for mutual exclusion and rollback
US6360303B1 (en) * 1997-09-30 2002-03-19 Compaq Computer Corporation Partitioning memory shared by multiple processors of a distributed processing system
DE69715203T2 (en) * 1997-10-10 2003-07-31 Bull Sa A data processing system with cc-NUMA (cache coherent, non-uniform memory access) architecture and cache memory contained in local memory for remote access

Also Published As

Publication number Publication date
WO2001016743A8 (en) 2001-10-18
WO2001016738A3 (en) 2001-10-04
CA2382927A1 (en) 2001-03-08
EP1214653A2 (en) 2002-06-19
WO2001016760A1 (en) 2001-03-08
WO2001016741A2 (en) 2001-03-08
AU7110000A (en) 2001-03-26
WO2001016738A2 (en) 2001-03-08
WO2001016750A3 (en) 2002-01-17
WO2001016738A9 (en) 2002-09-12
WO2001016743A3 (en) 2001-08-09
WO2001016750A2 (en) 2001-03-08
EP1214652A2 (en) 2002-06-19
WO2001016738A8 (en) 2001-05-03
WO2001016740A2 (en) 2001-03-08
AU7108500A (en) 2001-03-26
WO2001016742A2 (en) 2001-03-08
WO2001016761A2 (en) 2001-03-08
WO2001016743A2 (en) 2001-03-08
AU6949600A (en) 2001-03-26
CA2382728A1 (en) 2001-03-08
AU7100700A (en) 2001-03-26
EP1214651A2 (en) 2002-06-19
AU7113600A (en) 2001-03-26
AU7112100A (en) 2001-03-26
WO2001016737A2 (en) 2001-03-08
WO2001016737A3 (en) 2001-11-08
WO2001016761A3 (en) 2001-12-27
AU7108300A (en) 2001-03-26
AU7474200A (en) 2001-03-26
WO2001016741A3 (en) 2001-09-20
WO2001016742A3 (en) 2001-09-20
WO2001016740A3 (en) 2001-12-27
AU6949700A (en) 2001-03-26

Legal Events

Date Code Title Description
FZDE Dead