WO1998058318A1 - Computer system with transparent write cache memory policy - Google Patents

Computer system with transparent write cache memory policy

Info

Publication number
WO1998058318A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
bus
cache
controller
twcp
Prior art date
Application number
PCT/US1998/012363
Other languages
French (fr)
Inventor
Richard A. Sergo
Original Assignee
Paradigm Computer Systems, Inc.
Priority date
Filing date
Publication date
Application filed by Paradigm Computer Systems, Inc.
Priority to AU82564/98A (publication AU8256498A)
Publication of WO1998058318A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0804 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with main memory updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877 Cache access modes
    • G06F12/0884 Parallel mode, e.g. in parallel with main memory or CPU

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Efficient use of a cache memory in a computer system is achieved, the system comprising a processor (12), a local bus comprising local address (110) and local data (111) buses coupled to the processor, a cache memory (16) coupled to the local bus, a bus interface (20) coupled to the local bus for coupling the processor to a main memory via an external bus (141, 142), and a transparent write cache policy (TWCP) controller (14) functionally coupled between the processor and the bus interface. The TWCP controller monitors for a data write operation initiated by the processor, and signals the processor that the data write is complete before actual completion, to free the processor to engage in one or more subsequent operations that do not require the external bus. The TWCP controller causes the bus interface to complete the data write to main memory in parallel with the one or more subsequent operations.

Description

COMPUTER SYSTEM WITH TRANSPARENT WRITE CACHE MEMORY POLICY
BACKGROUND OF THE INVENTION
Field of the Invention
The present invention relates to computer systems, and, in particular, to the use of cache memory systems in such computer systems.
Description of the Related Art
Computer systems employ processors (often referred to as microprocessors or central processing units (CPUs)) in various architectures, to execute instructions and operate on data stored in memory. Computer processors thus routinely fetch instructions and data from various memory locations (system memory space addresses) in the computer system, generate new data based on the instructions and data fetched, and write the resultant modified or generated data to certain memory locations within the system's main memory. Accordingly, the information or computational bandwidth of a system is limited by the rate at which a processor can fetch and store information, not only by the instantaneous clock rate of the processor or by other factors such as bus and processor width. Cache memory systems are often used to improve the computational and bus bandwidths of computer systems. Such a system employs a hierarchical memory structure in which the cache memory is a high speed subset of the main memory. The use of cache memory can increase the effective rate at which a processor fetches instructions and data, as well as writes data, by storing, or "caching," in the cache, data or instructions fetched from main memory. The processor simply fetches instructions and/or data from the cache, instead of from main memory, if there is a "hit". The speed of memory write operations can also be improved through the use of caches, in particular write-back policy caches, as described below.
Because a cache memory is smaller than the main memory and has a limited amount of space, the cache memory is shared on a dynamic (time-variant) basis among the programs being executed and stored in the main memory. Due to this dynamic sharing and the limited space of the cache memory, the memory location specified by the processor may not be resident in the cache memory. This is referred to as a "miss," as opposed to a hit. If a miss occurs during a READ operation, the processor "stalls" (enters an idle condition or state) until the required instructions and/or data can be fetched from the main memory. If a miss occurs during a WRITE operation, the processor must write the variable directly to the main memory. In both cases, there is a performance penalty borne by the processor, since it is stalled during this time, which decreases the effective information rate. The penalty is the main memory cycle that must occur, which requires more time than an access to the cache memory would have taken.
With respect to READ operations, all cache memories incur the same basic type of penalty if the requested information is not resident in the cache. If a processor — or, more precisely, a cache controller — can fetch a block of information, and a processor can utilize the information content of a block while the block remains in the cache, the aforementioned penalty is minimized, and a performance improvement over a standard main memory read cycle is realized. Such an information block is often referred to as a "cache line," and may be, for example, 32 bytes in size. The amount of time required to fetch a cache line from main memory is (M+1)·ΔTb, where M is the number of information units constituting the cache line and ΔTb is the amount of time to transfer one information unit from the main memory to the cache memory under the so-called "page mode" of the main memory. (An information unit is determined by the bit-size of the bus of the computer system. For example, in a 32-bit system, in which 32 bits of data (4 bytes) can be fetched at a time, an information unit is 4 bytes, so that there are 8 information units (M=8) in a cache line. In a 64-bit system, an information unit is 8 bytes, so that M=4. In page mode, data or instructions are fetched from main memory using one row address and a series of column addresses, in "burst mode" fashion.) Because a standard main memory cycle requires 2·ΔTb units of time to fetch a single information unit, the amount of time to perform a cache line read is justified if a processor utilizes at least (M+1)/2 information units while the cache line is still resident in the cache memory. If more than (M+1)/2 information units are utilized from the cache, overall READ efficiency is increased.
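By way of illustration (not part of the original disclosure), the following Python sketch computes the line fill time and the break-even reuse count from the formulas above; the timing value and all names are assumptions chosen to match the typical figures quoted later in this description:

    import math

    def line_fill_time_ns(M, dTb_ns):
        # Time to fetch a cache line of M information units in page mode.
        return (M + 1) * dTb_ns

    def break_even_units(M):
        # Minimum information units that must be used from the cached line
        # for the fill to beat repeated standard (2 * dTb) memory reads.
        return math.ceil((M + 1) / 2)

    dTb_ns = 35          # one page-mode transfer; the text quotes ~30-40 ns
    M = 8                # 32-byte line on a 32-bit (4-byte-unit) bus
    print(line_fill_time_ns(M, dTb_ns))   # 9 * 35 = 315 ns to fill the line
    print(break_even_units(M))            # ceil(9 / 2) = 5 units must be reused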
With respect to WRITE operations, there are two conventional ways in which a processor writes data to the main memory and the cache memory. These techniques are known as cache policies, and include the so-called "write-through" and "write-back" cache policies. Each has its own advantages and disadvantages. Under a write-through cache policy, data is written to the main memory in real time and, if the location specified by the processor is resident in the cache, to the cache memory in parallel with the main memory operation. Given the real time update of the main memory, the duration of the write cycle for a write-through cache policy is always the duration of the main memory cycle, or 2·ΔTb units of time. Since the main memory cycle is always longer in duration than a cache memory cycle, a write-through cache policy can never take advantage of the improved speed of cache memory for WRITE operations, and thus cannot achieve the superior rate of the cache memory. However, the write-through cache policy does not give rise to other problems, such as deadly embrace (deadlock) due to lack of main memory coherency in multi-processor systems, as described further below. Under a write-back cache policy, by contrast, data is written only to the cache memory in real time, and the update of the main memory is deferred until a later time. Thus, the real time portion of the write operation is 2·ΔTc, where ΔTc is the cache memory access time. ΔTc (typically on the order of 8-10 ns, with current technology) is much smaller than ΔTb (typically on the order of 30-40 ns). At some later point in time, the contents written to the cache memory must be written to the main memory. Due to the addressing structure of typical cache memories, the entire cache line usually must be written back to the main memory (a process sometimes referred to as "flushing" the cache line), independent of the number of data variables written to the cache line. Thus, when it is time to flush a cache line, if only one byte therein has changed, the entire cache line is nevertheless written back to main memory. This can cause inefficient use of the cache memory.
Given the page mode capability of main memories, the cache line is written to the main memory in burst mode, resulting in a total write time of 2·ΔTc + (M+1)·ΔTb/m, where m is the number of information units written to the cache prior to the cache line being written to the main memory (1 ≤ m ≤ M). In order to gain a performance advantage over a write-through cache policy, a write-back cache policy must write at least (M+1)/2 information units to the cache prior to the cache line being written to the main memory.
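As a concrete (hypothetical) comparison of the two formulas, the sketch below evaluates the per-write cost of each policy for several values of m; the timings are illustrative values within the ranges quoted above:

    dTb_ns = 35   # main memory access, ~30-40 ns
    dTc_ns = 9    # cache access, ~8-10 ns
    M = 8         # information units per cache line

    def write_through_cost_ns():
        # Every write costs a full main memory cycle.
        return 2 * dTb_ns

    def write_back_cost_ns(m):
        # m writes absorbed by the cache, then one burst flush of the
        # whole line, amortized over the m writes.
        assert 1 <= m <= M
        return 2 * dTc_ns + (M + 1) * dTb_ns / m

    for m in (1, 4, 5, 8):
        print(m, write_back_cost_ns(m), write_through_cost_ns())
    # Neglecting the small 2*dTc term, the write-back traffic term beats
    # write-through once (M+1)/m < 2, i.e. m >= (M+1)/2, as the text states.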
In order for a processor to write to m consecutive locations prior to the cache line being written back to main memory, the data must exhibit the properties of spatial and temporal locality. Spatial locality refers to the information units being adjacently located in the main memory, while temporal locality refers to the information units being accessed during the time that the information units are resident in the cache memory. With respect to WRITE operations, spatial and temporal locality are important only for write-back cache policies. If either property is not present, the write-back cache policy performs write operations in a manner identical to a write-through cache policy.
Thus, a write-back cache policy can result in more efficient cache usage than a write-through cache policy. However, various problems are encountered when employing a write-back cache policy, such as the deadly embrace problem. For example, multiple processor computer systems are often employed, in which N processors share the same main memory. In such systems, main memory coherency is critical. If a processor writes a variable to main memory, cache memory, or both, and the same memory address or location is also contained in the cache memory of another processor, the latter processor's cache line containing the variable is invalidated, preventing the latter processor from accessing any information contained in the cache line until the updated information can be fetched from the main memory. Therefore, at any given time in a multiple processor configuration, there can be up to (N−1) invalid cache locations (and thus cache lines) per processor.
Under a write-through cache policy, main memory coherency is always maintained in real time, because an output variable is always written to the main memory, independent of whether the variable is resident in the cache memory. Although (N−1) invalid cache locations per processor can still exist, a given processor can always fetch an updated variable from the main memory. Thus, a write-through cache policy does not require any processor to wait for another processor to update main memory. Consequently, the write-through cache policy prevents inter-processor stalling and eliminates the potential for a so-called "deadly embrace" among processors. (The deadly embrace condition occurs when each processor waits for another processor to update the main memory, but none of the processors can update the main memory because each is waiting for another.)
Under a write-back cache policy, however, main memory coherency is not maintained in real time, due to the non-real time memory update. Updating of the main memory depends on changed cache lines in the various cache memories being flushed (written back to the main memory), and the manner and timing of such cache line flushes depend on the flushing algorithm employed. Typically, a cache line is flushed under one of two conditions: (1) when space is required in the cache to store a new cache line; or (2) when a context switch, i.e., a change in the program flow, is encountered. In very large cache memories, the amount of time between main memory updates can be very large or possibly infinite, such that main memory coherency may never be achieved. For example, a 2 megabyte cache memory is sufficient to hold the entire UNIX operating system and all of its associated registers, such that the value of a register might never be written back to the main memory. As a result, any processor depending on accessing these values would stall for an infinite amount of time, and the function being executed by the processor would be lost. Thus, due to the absence of real time main memory updates, write-back cache policies can result in up to N invalid main memory locations and (N−1) invalid cache locations per processor. Consequently, a write-back cache policy incurs a high probability of processor stalls and a potential for a "deadly embrace" condition.
Three conventional techniques have been employed to address the deadly embrace and main memory coherence problems that arise in multi-processor systems having processors that employ write-back cache policies. These techniques are: (1) Least Recently Used; (2) Random; and (3) First-In/First-Out. These techniques flush changed cache lines at a time that is not a function of the current process, but rather at a time that is statistical in nature, to break any deadly embraces that may have occurred.
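As one concrete illustration of the first of these disciplines — a generic sketch, not the patent's own mechanism — a least-recently-used flush victim can be selected with an ordered map; all names here are hypothetical:

    from collections import OrderedDict

    class LRUFlusher:
        # Tracks cache lines and picks the least recently used line as the
        # flush victim when a new line must be brought in.
        def __init__(self, capacity):
            self.capacity = capacity
            self.lines = OrderedDict()   # line address -> dirty flag

        def touch(self, addr, dirty=False):
            # Record an access; return the address of a dirty line that
            # must be flushed to main memory, if any.
            if addr in self.lines:
                self.lines[addr] = self.lines[addr] or dirty
                self.lines.move_to_end(addr)
                return None
            victim = None
            if len(self.lines) >= self.capacity:
                old_addr, was_dirty = self.lines.popitem(last=False)
                if was_dirty:
                    victim = old_addr
            self.lines[addr] = dirty
            return victim

    cache = LRUFlusher(capacity=2)
    cache.touch(0x1000, dirty=True)
    cache.touch(0x2000)
    print(hex(cache.touch(0x3000)))   # 0x1000: the LRU dirty line is flushed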
Additionally, in such multi-processor systems, using either type of cache policy, each processor must monitor the activity of the other processors to determine if another processor is performing a WRITE operation to the main memory and its respective cache. This process is commonly termed "snooping," and the bus upon which the process is performed is termed the "snoopy" bus. With write-through cache policies, data is always written to the main memory, so that the system bus is the only bus that must be monitored or snooped. Therefore, the snoopy bus is the system bus for a system with a write-through cache policy. By monitoring the system bus, each processor can determine if another processor is performing a WRITE operation.
For write-back cache policies, however, since local writes to caches are not always immediately written or flushed to the main memory (via the system bus), a special purpose snoopy bus must be employed and monitored, in addition to monitoring the system bus. The special purpose snoopy bus is required because no activity is present on the system bus when a processor updates only its cache memory due to the deferred main memory update. This special purpose snoopy bus provides the information that would normally be present on the system bus during a main memory write, at a level sufficient for a cache controller to detect a cache memory WRITE operation and the location (address) to which the data variable is being written. In addition, due to the deferred main memory update, each cache controller must also track the validity of the main memory locations as well as the validity of cache lines. Consequently, a write-back cache policy requires additional hardware, control logic, storage, and system real estate in comparison to a write-through cache policy.
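A minimal model of the snooping described above might look as follows (a sketch under assumed names and an assumed 32-byte line size; each controller invalidates any line matching a write address observed on the snooped bus):

    LINE_SIZE = 32

    class SnoopingCacheController:
        def __init__(self):
            self.valid_lines = set()   # base addresses of valid cached lines

        def fill(self, addr):
            # Cache the line containing addr.
            self.valid_lines.add(addr // LINE_SIZE * LINE_SIZE)

        def snoop_write(self, addr):
            # Called for every write observed on the snoopy (system) bus;
            # invalidate our copy of the written line, if any.
            self.valid_lines.discard(addr // LINE_SIZE * LINE_SIZE)

    a = SnoopingCacheController()
    b = SnoopingCacheController()
    a.fill(0x1000)
    b.fill(0x1000)
    b.snoop_write(0x1004)           # processor A writes within the line
    print(0x1000 in b.valid_lines)  # False: B has invalidated its copy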
Thus, in employing a write-through cache policy, there is inefficient use (underemployment) of the cache memory and stalling of the processor. In employing a write-back cache policy, there is either the possibility of a deadly embrace or the need for complex, expensive, and inefficient solutions to this problem.
SUMMARY
A computer system for efficiently utilizing a cache memory is disclosed. In one embodiment, the system comprises a processor, a local bus comprising local address and local data buses coupled to the processor, a cache memory coupled to the local bus, a bus interface coupled to the local bus for coupling the processor to a main memory via an external bus, and a transparent write cache policy (TWCP) controller functionally coupled between the processor and the bus interface. The TWCP controller monitors for a data write operation initiated by the processor, and signals the processor that the data write is complete before the data write is complete, to free the processor to engage in one or more subsequent operations that do not require the external bus. The TWCP controller causes the bus interface to complete the data write to main memory in parallel with the one or more subsequent operations.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other features, aspects, and advantages of the present invention will become more fully apparent from the following description, appended claims, and accompanying drawings in which:
Fig. 1 is a block diagram of a computer system having an improved cache memory system with a transparent write cache policy, in accordance with an embodiment of the present invention; Fig. 2 is a block diagram showing the computer system of Fig. 1 in further detail; and Fig. 3 is an illustrative timing diagram showing the comparative cache memory performance for various cache policies including the transparent write cache policy of the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENT
In conventional computer systems, READ and WRITE operations are executed sequentially, since the processor uses the same signal pins and signal lines for READ and WRITE operations, which requires the processor to complete an operation before moving on to the next operation. In the present invention, as described in further detail below, a computer system employs a transparent write cache policy (TWCP) instead of a write-through or write-back cache policy. The system comprises a special purpose controller, which may also be referred to as a TWCP controller, to implement the transparent write cache policy.
Using the transparent write cache policy of the present invention, WRITE operations of a processor are executed in parallel with subsequent cache READ or INTERNAL operations performed by the processor. WRITE operations are thus "transparent" to the processor, since the processor does not "see" the on-going write and instead acts as if the write has been completed even though it has not. Thus, the duration of a WRITE operation is governed by the write cycle characteristics of a processor rather than the speed of the cache memory or the main memory, giving rise to various improvements and advantages, as described in further detail below.
Referring now to Fig. 1, there is shown a block diagram of a computer system 100 having an improved cache memory system with a transparent write cache policy, in accordance with an embodiment of the present invention. System 100 includes processor 12, TWCP controller 14, bus controller 18, bus interface 20, and cache 16, interconnected as shown by various system buses and lines. These include local (or internal or processor) address and data busses 110, 111, local bus controller control (BC CNTL) signal group bus 113, local bus slave control (BS CNTL) signal group bus 114, local ready line 115, local bus master control (BM CNTL) bus 116, latch control bus 112, bus controller signal group bus 123, BS CNTL bus 124, CPU miss line 125, bus master control bus 126, bus controller ready line 127, output enable line 137, bus master control bus 135, BS CNTL bus 132, cache miss line 131, external (or system) address bus 141, external data bus 142, external bus master control bus 156, and external BS CNTL bus 152. Local buses and lines 110, 111, 113, 114, 115, and 116, which directly couple to processor 12 at all times, may be considered to be the local bus.
Referring now to Fig. 2, there is shown a block diagram of computer system 100 of Fig. 1 in further detail. In particular, Fig. 2 shows the width and interconnections of the busses of system 100, as well as the components of bus interface 20 and cache 16. Cache 16 comprises cache controller 32 and cache memory 31. As illustrated, bus interface 20 comprises latches 21 and 22, as well as transmitter and receiver modules 23 and 24, respectively. Thus, bus interface 20 preferably comprises latching capability for data and address latching, namely, latches 21 and 22. Latches 21 and 22 may be, for example, transparent latches or edge triggered registers. Transmitter and receiver modules 23 and 24 may also be implemented from latches, or may be components of a standard transceiver device.
In a preferred embodiment, bus controller 18 is implemented as a single integrated circuit or combination of integrated circuits, commonly termed the "chip set," that effect and/or control the bus operation. Cache 16 preferably comprises one or more cache controllers 32 and one or more levels of cache memory 31, and may be implemented in a single integrated circuit or multiple integrated circuits. Bus interface 20 may also be implemented as a single integrated circuit or combination of integrated circuits, which enable information to be shared between or among devices connected to a common communication media independent of the actual physical nature of the communication media. Processor 12 is preferably a single integrated circuit microprocessor, which may or may not include an integral cache controller and first level cache memory, or a processor realized through any combination of integrated and discrete components.
Referring once more to Fig. 1, it can be seen that TWCP controller 14 has been inserted between bus controller 18 and processor 12. As explained in further detail below, this allows processor 12 to begin a WRITE operation, which is then taken over and completed under the control of TWCP controller 14, which de-couples local buses 110 and 111 from external buses 141 and 142 so that processor 12 can use the local bus to perform a READ operation from cache 16 in parallel with the WRITE operation being completed by TWCP controller 14, bus controller 18, and bus interface 20 via the external bus.
As will be appreciated, processor 12, bus controller 18, and cache 16 generate the control signals for TWCP controller 14. During a READ operation, the existence of TWCP controller 14 is transparent to the other components of system 100. Thus, in the absence of a memory write operation indicated by the states of the aforementioned control signals, TWCP controller 14 is totally transparent, such that processor 12, bus controller 18, cache 16, and bus interface 20 operate as if TWCP controller 14 were a set of directly connected signal lines, and a normal transfer operation occurs, either from cache 16 (if there is a "hit") or from main memory (if there is a "miss"). Thus, for a READ, standard cache advantages are achieved, i.e., increases in READ efficiency are realized if enough hits to the cache occur.
However, when processor 12 engages in a WRITE operation, TWCP controller 14 performs the transparent write cache policy function, and operates as follows. When the states of the aforementioned control signals indicate a memory write operation, TWCP controller 14 generates control signals to bus interface 20, latches the control signals generated by processor 12, and re-drives these control signals to bus controller 18. Bus interface 20 also latches, with latches 21 and 22, the address and data placed by processor 12 on local buses 110, 111 for the WRITE operation. This latching by bus interface 20 and by TWCP controller 14 requires approximately 3-5 ns, in one embodiment using currently-available technology. In addition, under the TWCP of the present invention, cache 16 is updated with new data when there is a hit. Thus, if there is a hit, cache 16 at this time also latches and stores the data on local data bus 111 at the address or location specified by local address bus 110. This latching may require more time than the latching by bus interface 20 and TWCP controller 14, e.g., 8-10 ns.
TWCP controller 14 therefore monitors the control signal portion of the local bus of processor 12 to detect the presence of a memory write operation and the start of a transfer cycle. Upon detecting the presence of a memory write cycle, TWCP controller 14 generates a "latch control" signal or set of signals on latch control bus 112, enabling the address and the bus master control signals, BM CNTL (on buses 116, 126, and 136), to be latched by their respective circuits. In particular, latch 21 of bus interface 20 is enabled at this point to latch the address on local address bus 110, and TWCP controller 14 latches the state of the bus master control signals BM CNTL internally. The internally latched bus master control signals, BM CNTL, are driven by TWCP controller 14 to bus controller 18 for the duration of the WRITE operation. The internal latch of TWCP controller 14 is broken by the end of cycle command contained in the bus controller signal group, BC CNTL, generated by bus controller 18 when the write to external main memory is completed.
TWCP controller 14 continues to monitor the local bus of processor 12 to detect the start of the data transfer phase. (In some cases, this signal may be driven onto the local bus of processor 12 by bus controller 18.) Upon detecting the data transfer phase, TWCP controller 14 generates a latch control signal enabling the data on data bus 111 to be latched or registered by bus interface 20. Concurrent with the generation of the data latch signal, TWCP controller 14 generates an end of cycle or "ready" signal, RDY, to processor 12, via line 115, if cache miss line 131 indicates that there is no need to wait longer while the data is written to cache 16. If, however, cache miss line 131 indicates that there is a hit, TWCP controller 14 does not generate the RDY signal until after an additional delay sufficient to allow cache 16 to store the data on local data bus 111. The extra time is necessary since the latching by bus interface 20 and by TWCP controller 14 requires approximately 3-5 ns, in one embodiment, while approximately 8-10 ns (i.e., ΔTc) are required for cache 16 to store data. In either case, whether at the end of the 3-5 ns latching time when there is a cache miss, or at the end of the 8-10 ns cache store time when there is a cache hit, the WRITE to external main memory is still taking place, since such an operation requires on the order of 30-40 ns (i.e., ΔTb).
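The resulting ready-signal timing can be summarized in a short sketch (the delay figures are the approximate values quoted above; the structure and names are illustrative assumptions, not the patent's own implementation):

    LATCH_DELAY_NS = 4     # ~3-5 ns: bus interface / TWCP latching
    CACHE_STORE_NS = 9     # ~8-10 ns: cache store on a hit (dTc)
    MAIN_MEMORY_NS = 35    # ~30-40 ns: external write still in flight (dTb)

    def rdy_delay_ns(cache_hit):
        # Delay before the TWCP controller asserts RDY to the processor.
        if cache_hit:
            # On a hit, wait until the cache has also stored the data.
            return CACHE_STORE_NS
        return LATCH_DELAY_NS

    # In either case RDY arrives long before the external write completes,
    # freeing the processor's local bus for the next operation.
    print(rdy_delay_ns(cache_hit=False) < MAIN_MEMORY_NS)  # True (4 < 35)
    print(rdy_delay_ns(cache_hit=True) < MAIN_MEMORY_NS)   # True (9 < 35)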
Thus, processor 12 must drive information only for a duration sufficient for TWCP controller 14 to latch or register the information, so that the WRITE cycle is governed not by the speed of the cache memory or the main memory but rather by the set-up and hold times of the latches or registers. Thus, although the WRITE to external memory is not yet complete, processor 12 is released from supervising and completing this operation, and "thinks" the WRITE is completed; processor 12 is thus free to read from cache 16, as the local bus is de-coupled from the external system bus, or to perform other operations, such as instruction execution, that do not require the use of the external or system bus. The local bus is de-coupled from the external bus because, at this point, bus interface 20 has latched its necessary information and can isolate the external address and data buses 141, 142 from local address and data buses 110, 111; and TWCP controller 14 has latched the necessary bus control signals and thus can also isolate buses 123, 124, and 125 from local buses 113, 114, and 116. Thus, upon receipt of the end of cycle, or "ready" (RDY), signal by processor 12, the local bus of processor 12 is de-coupled from the remainder of the system and processor 12 can fetch information from cache 16.
As will be appreciated, cache 16 is always available to processor 12 because the cache is configured as in a write-through cache policy and therefore does not contain a location that is being written to by processor 12. If processor 12 executes an immediate WRITE operation following the first WRITE, such as would be incurred during an interrupt or context switch operation, the processor is placed in a "hold" state by TWCP controller 14, resulting in an action identical to that taken under a write-through or write-back cache policy. Thus, in the case of back-to-back writes, the performance of the present invention is no worse than with conventional cache policies.
Referring once more to Fig. 2, the BM CNTL signal group contains signals which define the nature and type of operation being performed. Typically, the nature of the operation is specified by a single signal. This signal defines either a WRITE (logical state 1) or READ (logical state 0) operation, so that processor 12 is always writing unless otherwise specified. With respect to the type of operation, a single signal is required to define whether the operation is a memory or I/O operation, so that a memory operation is always being performed unless otherwise specified by processor 12. Consequently, a memory write operation can typically be detected by TWCP controller 14 monitoring only two bus master control signals. For example, for an Intel® 80X86 processor, these signals are M/IO and W/R. Together with the timing strobe described below, this requires monitoring three signals for most processors.
Because these signals are always in a given state unless otherwise specified by the processor, and the unspecified state of these signals is a valid operation, an additional, time-variant signal is required to determine the point at which the state of these signals is valid. Processor 12 drives the state of these signals concurrent with driving the address of the desired location onto its local address bus 110. This signal is typically, but not always, termed the Host Address Strobe (HADS#), and is a low true signal, to prevent misinterpretation during the period in which the local bus of processor 12 is floating, as will be appreciated by those skilled in the art. Some processors start the transfer cycle concurrent with the HADS#, while other processors start the cycle a prescribed number of clock cycles later. In either case, processor 12 defines the nature and timing of the operation to be performed. The leading edge of the HADS# specifies the start of a window in which the address is valid and should be latched by bus controller 18 or any device connected to the processor's local address bus 110. The address is normally latched on the trailing edge of the strobe. During the duration of the window, the address is decoded. The selected device or bus controller 18 normally drives the RDY signal line or lines of a processor such as processor 12 to a logical low (0), indicating to the processor that the receiving device is not ready to receive the data. However, in the present invention, TWCP controller 14, rather than bus controller 18, drives the RDY signal to processor 12. Therefore, although processor 12 has driven the data onto its local data bus 111, processor 12 will continue to drive the data onto local data bus 111 until the RDY signal line or lines are released by the receiving device. When the RDY signal group is released (returned to a logical one (1) condition), processor 12 interprets the transition from the low to the high condition as the completion of the cycle and advances to the next operation.
TWCP controller 14 determines the nature of the operation and the phase of the transfer based on the states of M/IO, W/R, and HADS#, in conjunction with the timing of the specific processor 12 and system bus architecture, to generate the latch control signals and latch the control and address signals internally in TWCP controller 14. Immediately after generating the latch control signals, TWCP controller 14 generates the RDY signal to processor 12, indicating to processor 12 that the selected device has received the data, thereby completing the transfer cycle from the point of view of processor 12. Processor 12 is thus "falsely" signaled that the transfer cycle is complete, even though it is not. Thus, processor 12 is free to use the local bus to read data or instructions from cache 16, even while the WRITE operation in reality is not completed.
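A minimal decode of this detection might be sketched as follows (signal and function names are assumptions, following the 80X86-style conventions described above):

    def is_memory_write(m_io, w_r, hads_n):
        # True when the sampled signals indicate the address phase of a
        # memory WRITE cycle that the TWCP controller should take over.
        # HADS# is low true: 0 means the address phase is valid.
        return hads_n == 0 and m_io == 1 and w_r == 1

    print(is_memory_write(m_io=1, w_r=1, hads_n=0))  # True: memory write
    print(is_memory_write(m_io=1, w_r=0, hads_n=0))  # False: memory read
    print(is_memory_write(m_io=0, w_r=1, hads_n=0))  # False: I/O write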
Thus, TWCP controller 14 continues to drive the required control signals to bus controller 18, indicating to bus controller 18 that the transfer cycle has not been aborted by processor 12. Thus, even though processor 12 "thinks" the WRITE operation is done and may be engaged in reading more data from cache 16, the processor appears to be continuing the WRITE operation's transfer cycle from the point of view of bus controller 18.
Because TWCP controller 14 continues to drive the processor 12 control signals to bus controller 18 in lieu of processor 12, bus controller 18 continues the transfer cycle. Meanwhile, processor 12 is able to execute the next instruction or fetch data from cache 16, resulting in the performance gain normally associated with cache READ operations, as well as the real time main memory update normally associated with a write-through cache policy. Upon the true completion of the transfer cycle, as determined by bus controller 18 and the response signals from the main memory, bus controller 18 generates its RDY signal (via bus controller ready line 127) accordingly. This RDY signal is intercepted by TWCP controller 14 before reaching processor 12 (which therefore did not have to wait until the true end of the WRITE operation before going on to other tasks). At this point, bus interface 20's latches 21, 22 and transmitter and receiver modules 23, 24 are returned to a transparent state, from the point of view of processor 12, thereby enabling the next bus operation. The output enable (OE) signal controlling the output state of these components of bus interface 20 is the same signal that is normally generated by bus controller 18, so that the output control is unaltered.
During the initial phase of the transfer operation, in which the local bus of processor 12 overlays onto the system bus, bus interface 20's latches 21, 22 and transmitter and receiver modules 23, 24 are transparent, resulting in an in-line signal flow with respect to processor 12. This circuit configuration eliminates any requirement for sequential operations or special purpose timing to implement the present invention. Concurrent with the start of the data transfer phase of the cycle, TWCP controller 14 signals processor 12 that the cycle is complete, so that the next operation of processor 12 is performed in parallel with the previous WRITE operation.
Referring now to Fig. 3, there is depicted an illustrative timing diagram 300 showing the comparative cache memory performance for various cache policies, including the transparent write cache policy of the present invention, in a possible scenario. Diagram 300 illustrates one example of how the present invention reduces the main memory cycle time. Diagram 300 shows illustrative cache read (CR), cache write (CW), and cache idle periods for similar program flow under write-through, transparent (the present invention), and write-back cache policies, and also shows the activity of the system bus for each such policy, via write-through time-line 301 and associated system bus time-line 302, transparent time-line 303 and associated system bus time-line 304, and write-back time-line 305 with associated system bus and write-back (WB) valid memory time-lines 306 and 307.
Thus, as illustrated, a write-back cache policy (time-lines 305-307) can give rise to an "invalid main memory condition" 342 when there is a cache write (CW) 341 in a multiprocessor system. This arises because any WRITE by any processor in the system to its respective cache creates such an invalid main memory condition 342 with respect to the location or address written by CW 341, since other processors may not use data from that address or location until a main memory update 344 (the flush of the required cache line), which allows other processors that contain the invalid variable to fetch the cache line from main memory. Consequently, for a write-back cache policy, a processor whose cache contains the invalid variable will stall for a minimum of 2·(M+1)·ΔTb for each cache line containing an invalid variable. For a system configured with N processors, the worst case stall condition is (N−1)²·(M+1)·ΔTb. With typical cache lines having 8 information units, 4-way multiple processor configurations, and main memory cycles of 60 nanoseconds, the worst case processor stall is 4.86 μs. By virtue of the present invention, this potential stall condition is totally eliminated.
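The quoted worst case follows directly from these formulas, as the following sketch of the arithmetic shows (the values are those given in the text; function names are illustrative):

    def per_line_stall_ns(M, dTb_ns):
        # Minimum stall per cache line holding an invalid variable.
        return 2 * (M + 1) * dTb_ns

    def worst_case_stall_ns(N, M, dTb_ns):
        # Worst-case stall for an N-processor write-back configuration,
        # per the formula in the text.
        return (N - 1) ** 2 * (M + 1) * dTb_ns

    print(per_line_stall_ns(M=8, dTb_ns=60))         # 1080 ns per line
    print(worst_case_stall_ns(N=4, M=8, dTb_ns=60))  # 4860 ns = 4.86 us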
A write-through cache policy incurs a cache idle 343 each time there is a cache write, due to the long main memory cycle necessary for the WRITE to main memory (e.g., see the bandwidth (BW) occupied by the system bus 141, 142 in time-line 302, following a CW). For a cache access cycle of ΔTc and a main memory access cycle of ΔTb, the cache idle condition 343 is (ΔTb − ΔTc) and is incurred for every memory WRITE operation performed by the processor.
As can be seen from time-lines 303 and 304, however, under the transparent write cache policy there are no cache idles or invalid main memory conditions where there are no back-to-back cache writes. Each time a CW occurs, system bus bandwidth is used to update main memory, but several CRs can occur simultaneously with such updates, as long as there is a hit in the cache. The present invention not only eliminates the penalty incurred in a write-through cache policy, but also frees the local bus of the processor, thereby enabling the processor to fetch instructions and data at a higher rate. As a result, the computational bandwidth of the system is increased due to the reduced duration of a WRITE operation, with the computational bandwidth limited only by the set-up and hold times of the bus interface 20 latches.
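The per-WRITE comparison implied by the timing diagram can be put numerically as follows (a sketch; the latch set-up/hold figure is an assumption, and the other values are the typical figures quoted earlier in this description):

    dTc_ns = 9       # cache access, ~8-10 ns
    dTb_ns = 35      # main memory access, ~30-40 ns
    latch_ns = 4     # bus interface latch set-up/hold, ~3-5 ns (assumed)

    write_through = 2 * dTb_ns   # processor waits on the main memory cycle
    write_back = 2 * dTc_ns      # processor waits on the cache cycle
    twcp_miss = latch_ns         # TWCP: latch time only, on a cache miss
    twcp_hit = dTc_ns            # TWCP: cache store time, on a cache hit

    print(write_through, write_back, twcp_miss, twcp_hit)   # 70 18 4 9
    # In this sketch the transparent policy's real-time cost is bounded
    # by latch or cache timing, never by the external bus, and is less
    # than or equal to the write-back cost in every case.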
Accordingly, it can be seen that the present invention provides a number of advantages over conventional computer systems that employ either a write-through or write-back cache policy. In particular, the present invention employs TWCP controller 14 to implement a transparent write cache policy, to improve the overall performance of a computer system by increasing the effective information rate and, in particular, increasing the output information rate of a processor. The transparent write cache policy of the present invention also eliminates the potential for a deadly embrace condition that would be present in write-back cache policy multi-processor configurations, without, however, incurring the inefficiencies normally incurred through use of a write-through cache policy.
Thus, the present invention implements the advantages of both write-through and write-back cache policies, without their associated disadvantages. In particular, the complexities, costs, snoopy bus, memory incoherency due to non-real time main memory updates, and deadly embrace danger of a write-back policy are avoided or minimized, since the transparent write cache policy of the present invention does not defer main memory updates until a later time. For example, a special purpose snoopy bus, normally required for write-back cache policies, is not required for the transparent write policy of the present invention; instead, as in the write-through cache policy, the system or external bus can serve as the snoopy bus. Thus, in the present invention, the cache controller need monitor only the activity on the system or external bus during main memory write operations, thereby eliminating the requirement for a special purpose snoopy bus as well as the associated additional hardware costs of the snoopy bus. In addition, elimination of the snoopy bus frees up chip real estate that may be used to add more symmetric multiple processors to the computer system than could be implemented in a write-back cache policy system. Further, the transparent write cache policy of the present invention, unlike the write-back cache policy, does not depend on spatial and temporal locality in order to avoid main memory incoherency and deadly embrace conditions. In this respect, the present invention provides the advantages of the write-through cache policy, and avoids disadvantages accompanying a write-back cache policy.
However, by allowing the processor's local bus to be de-coupled, so that the processor can engage in further internal operations simultaneously with the completion of WRITE operations handled by the TWCP controller, the present invention also avoids the long real time write cycle experienced with a write-through cache policy. The duration of a WRITE operation is thus minimized, such that the duration of the WRITE operation is always less than or equal to the duration of the same operation using a write-back cache policy. In fact, with the present invention, processor WRITE operations are performed at the speed of the instruction execution rate of a processor, such that the effective information cycle can be less than the effective cycle time of the cache memory. In this respect, therefore, the present invention provides the efficiency advantages (or better) of the write-back cache policy, and avoids the disadvantages accompanying a write-through cache policy. It will be understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated above in order to explain the nature of this invention may be made by those skilled in the art without departing from the principle and scope of the invention as recited in the following claims.

Claims

CLAIMS
What is claimed is:
1. A computer system, comprising: (a) a processor;
(b) a local bus comprising local address and local data buses coupled to the processor;
(c) a cache memory coupled to the local bus;
(d) a bus interface coupled to the local bus for coupling the processor to a main memory via an external bus; and (e) a transparent write cache policy (TWCP) controller functionally coupled between the processor and the bus interface, wherein the TWCP controller is adapted to monitor for a data write operation initiated by the processor, to signal the processor that the data write is complete before the data write is complete to free the processor to engage in one or more subsequent operations that do not require the external bus, and to cause the bus interface to complete the data write to main memory in parallel with the one or more subsequent operations.
2. The computer system of claim 1, wherein, when the processor is signaled by the TWCP controller that the data write is complete, the local bus is de-coupled from the external bus.
3. The computer system of claim 1, further comprising:
(f) a bus controller functionally coupled between the TWCP controller and the bus interface for controlling the bus interface in response to control signals driven from the TWCP controller.
4. The computer system of claim 1, wherein the bus interface comprises a data latch for latching data placed on the local data bus by the processor to be placed on an external data bus of the external bus and an address latch for latching an address placed on the local address bus by the processor to be placed on an external address bus of the external bus.
5. The computer system of claim 1, wherein the cache memory is coupled by a cache miss line to the TWCP controller for notifying the TWCP controller when the address of data to be written by the processor to the main memory is not stored in the cache memory.
6. The computer system of claim 1, wherein, when the TWCP controller detects the initiation of the data write operation by the processor, the TWCP controller: internally latches control signals from the processor related to the data write operation; instructs the bus interface via a latch control bus to latch the address and data signals placed by the processor on the local address and local data buses; and transmits a ready signal to the processor after the internal latching by the TWCP controller and the latching by the bus interface are complete and before the data write to main memory is complete.
7. The computer system of claim 6, wherein, if the address of data to be written by the processor to the main memory is also stored in the cache memory, the TWCP controller does not transmit the ready signal to the processor until after the cache memory stores the data write.
8. The computer system of claim 7, wherein the cache memory is coupled by a cache miss line to the TWCP controller for notifying the TWCP controller whether the address of data to be written by the processor to the main memory is or is not stored in the cache memory.
9. The computer system of claim 1, wherein the one or more subsequent operations comprise one or more of a subsequent cache read or an instruction execution.
10. A transparent write cache policy (TWCP) controller for a computer system comprising a processor, a local bus comprising local address and local data buses coupled to the processor, a cache memory coupled to the local bus, and a bus interface coupled to the local bus for coupling the processor to a main memory via an external bus, the TWCP controller comprising: (a) means for functionally coupling the TWCP controller between the processor and the bus interface;
(b) means for monitoring for a data write operation initiated by the processor;
(c) means for signaling the processor that the data write is complete before the data write is complete to free the processor to engage in one or more subsequent operations that do not require the external bus; and
(d) means for causing the bus interface to complete the data write to main memory in parallel with the one or more subsequent operations.
11. In a computer system comprising a processor, a local bus comprising local address and local data buses coupled to the processor, a cache memory coupled to the local bus, a bus interface coupled to the local bus for coupling the processor to a main memory via an external bus, and a transparent write cache policy (TWCP) controller functionally coupled between the processor and the bus interface, a method for efficiently utilizing the cache memory comprising the steps of:
(a) monitoring, with the TWCP controller, for a data write operation initiated by the processor;
(b) signaling, with the TWCP controller, the processor that the data write is complete before the data write is complete to free the processor to engage in one or more subsequent operations that do not require the external bus; and
(c) completing, with the bus interface, the data write to main memory in parallel with the one or more subsequent operations.
PCT/US1998/012363 1997-06-16 1998-06-12 Computer system with transparent write cache memory policy WO1998058318A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU82564/98A AU8256498A (en) 1997-06-16 1998-06-12 Computer system with transparent write cache memory policy

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US4965297P 1997-06-16 1997-06-16
US60/049,652 1997-06-16

Publications (1)

Publication Number Publication Date
WO1998058318A1 true WO1998058318A1 (en) 1998-12-23

Family

ID=21960955

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/012363 WO1998058318A1 (en) 1997-06-16 1998-06-12 Computer system with transparent write cache memory policy

Country Status (2)

Country Link
AU (1) AU8256498A (en)
WO (1) WO1998058318A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4202693A1 (en) * 2021-12-22 2023-06-28 INTEL Corporation Cache evictions management in a two level memory controller mode

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5732241A (en) * 1990-06-27 1998-03-24 Mos Electronics, Corp. Random access cache memory controller and system
US5745732A (en) * 1994-11-15 1998-04-28 Cherukuri; Ravikrishna V. Computer system including system controller with a write buffer and plural read buffers for decoupled busses


Also Published As

Publication number Publication date
AU8256498A (en) 1999-01-04

Similar Documents

Publication Publication Date Title
KR100262906B1 (en) Data prefetch method and system
US5561779A (en) Processor board having a second level writeback cache system and a third level writethrough cache system which stores exclusive state information for use in a multiprocessor computer system
US5325504A (en) Method and apparatus for incorporating cache line replacement and cache write policy information into tag directories in a cache system
US5996048A (en) Inclusion vector architecture for a level two cache
US5642494A (en) Cache memory with reduced request-blocking
US6321296B1 (en) SDRAM L3 cache using speculative loads with command aborts to lower latency
US5797026A (en) Method and apparatus for self-snooping a bus during a boundary transaction
JPS58118083A (en) Cash control mechanism for multiple processing system
JPH0247756A (en) Reading common cash circuit for multiple processor system
US11494224B2 (en) Controller with caching and non-caching modes
ZA200205198B (en) A cache line flush instruction and method, apparatus, and system for implementing the same.
US11392498B2 (en) Aliased mode for cache controller
US5829027A (en) Removable processor board having first, second and third level cache system for use in a multiprocessor computer system
US5717894A (en) Method and apparatus for reducing write cycle wait states in a non-zero wait state cache system
EP0309995B1 (en) System for fast selection of non-cacheable address ranges using programmed array logic
JPH06318174A (en) Cache memory system and method for performing cache for subset of data stored in main memory
EP0681241A1 (en) Processor board having a second level writeback cache system and a third level writethrough cache system which stores exclusive state information for use in a multiprocessor computer system
KR100322223B1 (en) Memory controller with oueue and snoop tables
US6484230B1 (en) Method and system for speculatively processing a load instruction before completion of a preceding synchronization instruction
WO1998058318A1 (en) Computer system with transparent write cache memory policy
US7035981B1 (en) Asynchronous input/output cache having reduced latency
EP0735481B1 (en) System level mechanism for invalidating data stored in the external cache of a processor in a computer system
JP3340047B2 (en) Multiprocessor system and duplicate tag control method
EP0470737A1 (en) Cache memory operating method and structure
JPH02224042A (en) Transfer of cash data

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM GW HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
NENP Non-entry into the national phase

Ref country code: JP

Ref document number: 1999504620

Format of ref document f/p: F

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase

Ref country code: CA

122 Ep: pct application non-entry in european phase