MXPA99006144A - Non-uniform memory access (NUMA) data processing system that buffers potential third-node transactions to decrease communication latency - Google Patents

Non-uniform memory access (NUMA) data processing system that buffers potential third-node transactions to decrease communication latency

Info

Publication number
MXPA99006144A
MXPA/A/1999/006144A MX9906144A MXPA99006144A
Authority
MX
Mexico
Prior art keywords
transaction
processing node
node
buffer
communication
Prior art date
Application number
MXPA/A/1999/006144A
Other languages
Spanish (es)
Inventor
Gary Dale Carpenter
Mark Edward Dean
David Brian Glasco
Richard Nicholas Iachetta Jr
Original Assignee
International Business Machines Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corporation
Publication of MXPA99006144A


Abstract

A non-uniform memory access (NUMA) computer system includes an interconnect to which multiple processing nodes (including first, second, and third processing nodes) are coupled. Each of the first, second, and third processing nodes includes at least one processor and a local system memory. The NUMA computer system further includes a transaction buffer, coupled to the interconnect, which stores communication transactions transmitted on the interconnect that are both initiated by and targeted at a processing node other than the third processing node. In response to a determination that a particular communication transaction originally targeting another processing node must be processed by the third processing node, buffer control logic coupled to the transaction buffer causes the particular communication transaction to be retrieved from the transaction buffer and processed by the third processing node. In one embodiment, the interconnect includes a broadcast fabric, and the transaction buffer and the buffer control logic form a portion of the third processing node.

Description

NON-UNIFORM MEMORY ACCESS (NUMA) DATA PROCESSING SYSTEM THAT BUFFERS POTENTIAL THIRD-NODE TRANSACTIONS TO DECREASE COMMUNICATION LATENCY

DESCRIPTION

Background and field of the invention

The present invention relates in general to a method and system for data processing and, in particular, to a non-uniform memory access (NUMA) data processing system and to a method of communication in a NUMA data processing system. Still more particularly, the present invention relates to a NUMA data processing system and to a communication method in which potential third-node transactions are buffered to decrease communication latency.

It is well known in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processors in tandem. Multiprocessor (MP) computer systems can be designed with a number of different topologies, of which various ones may be better suited to particular applications depending upon the performance requirements and software environment of each application. One of the most common MP computer topologies is the symmetric multiprocessor (SMP) configuration, in which multiple processors share common resources, such as a system memory and an input/output (I/O) subsystem, that are typically coupled to a shared system interconnect. Such computer systems are said to be symmetric because all of the processors in an SMP computer system ideally have the same access latency with respect to data stored in the shared system memory.

Although SMP computer systems permit the use of relatively simple interprocessor communication and data-sharing methodologies, SMP computer systems have limited scalability. In other words, while the performance of a typical SMP computer system can generally be expected to improve with scale (that is, with the addition of more processors), inherent bus, memory, and input/output (I/O) bandwidth limitations prevent significant advantage from being obtained by scaling an SMP beyond an implementation-dependent size at which the utilization of these shared resources is optimized. Thus, the SMP topology itself suffers to a certain extent from bandwidth limitations, especially at the system memory, as the system scale increases. SMP computer systems also do not scale well from the standpoint of manufacturing efficiency. For example, although some components can be optimized for use in both uniprocessor and small-scale SMP computer systems, such components are often inefficient for use in large-scale SMPs. Conversely, components designed for use in large-scale SMPs are impractical for use in smaller systems from a cost standpoint.

As a result, an MP computer system topology known as non-uniform memory access (NUMA) has emerged as an alternative design that addresses many of the limitations of SMP computer systems at the expense of some additional complexity. A typical NUMA computer system includes a number of interconnected nodes that each include one or more processors and a local "system" memory. Such computer systems are said to have non-uniform memory access because each processor has lower access latency with respect to data stored in the system memory at its local node than with respect to data stored in the system memory at a remote node.
NUMA systems can be further classified as either non-cache-coherent or cache-coherent, depending on whether or not data coherency is maintained between caches in different nodes. The complexity of cache-coherent NUMA (CC-NUMA) systems is attributable in large measure to the additional communication required of the hardware to maintain data coherency not only between the various levels of cache memory and system memory within each node, but also between caches and system memories in different nodes. NUMA computer systems do, however, address the scalability limitations of conventional SMP computer systems in that each node within a NUMA computer system can be implemented as a smaller SMP system. Thus, the shared components within each node can be optimized for use by only a few processors, while the overall system benefits from the availability of larger-scale parallelism while maintaining relatively low latency.

A principal performance concern with CC-NUMA computer systems is the latency associated with communication transactions transmitted via the interconnect coupling the nodes. Because all data accesses can potentially trigger a coherency or data request transaction on the nodal interconnect, the latency associated with the transmission of requests to remote nodes and the transmission of responses from remote nodes can have a significant influence on overall system performance. As should thus be apparent, it would be desirable to provide a CC-NUMA computer system having low inter-node communication latency.
Summary of the invention

It is therefore an object of the present invention to provide an improved method and system for data processing.

It is another object of the present invention to provide an improved NUMA data processing system and an improved communication method in a NUMA data processing system.

It is still another object of the present invention to provide an improved NUMA data processing system and communication method in which potential third-node transactions are buffered to decrease communication latency.

The foregoing objects are achieved as described below. A non-uniform memory access (NUMA) computer system is provided that includes an interconnect to which first, second, and third processing nodes are coupled. Each of the first, second, and third processing nodes includes at least one processor and a local system memory. The NUMA computer system further includes a transaction buffer, coupled to the interconnect, which stores communication transactions transmitted on the interconnect that are both initiated by and targeted at a processing node other than the third processing node. In response to a determination that a particular communication transaction originally targeting another processing node must be processed by the third processing node, buffer control logic coupled to the transaction buffer causes the particular communication transaction to be retrieved from the transaction buffer and processed by the third processing node. In one embodiment, the interconnect includes a broadcast fabric, and the transaction buffer and the buffer control logic form a portion of the third processing node.

The above objects, as well as additional objects, features, and advantages of the present invention, will become apparent from the following detailed description.
Brief description of the drawings

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use and further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings.

Figure 1 depicts an illustrative embodiment of a NUMA computer system with which the present invention may advantageously be utilized.

Figure 2A is a more detailed block diagram of an interconnect architecture utilized in the illustrative embodiment shown in Figure 1.

Figure 2B depicts an illustrative embodiment of an I-command.

Figure 2C is a more detailed block diagram of the node controller shown in Figure 1.
Figures 3A-3D illustrate a third-node communication scenario according to the prior art.

Figure 4 is a high-level logical flow diagram of a third-node communication methodology according to the present invention.
Detailed description of an illustrative embodiment

System overview

Referring now to the figures, and in particular with reference to Figure 1, an illustrative embodiment of a NUMA computer system in accordance with the present invention is depicted. The depicted embodiment can be realized, for example, as a workstation, server, or mainframe computer. As illustrated, NUMA computer system 8 includes a number (N) of processing nodes 10a-10d, which are interconnected by node interconnect 22. Processing nodes 10a-10d each include at least one, and up to M, processors 12. Processors 12a-12d are preferably identical and may comprise a processor within the PowerPC™ line of processors available from International Business Machines (IBM) Corporation of Armonk, New York. In addition to the registers, instruction flow logic, and execution units utilized to execute program instructions, each of processors 12a-12d also includes an on-chip level one (L1) cache (not illustrated), which, together with a respective one of level two (L2) caches 14a-14d, is utilized to stage data to the associated processor 12 from system memories 18. In other words, the L1 caches and L2 caches 14a-14d function as intermediate storage between the system memories 18 and processors 12, temporarily buffering data that are likely to be accessed by the associated processor 12. The L2 caches 14 typically have a much larger storage capacity than the L1 caches, but at a longer access latency. For example, L2 caches 14a-14d may have a storage capacity of 1 to 16 megabytes (MB), while the on-chip L1 caches may have a storage capacity of 8 to 32 kilobytes. Although L2 caches 14a-14d are illustrated in Figure 1 as external to processors 12, it should be understood that L2 caches 14a-14d could alternatively be incorporated within the associated processor 12 as an additional level of on-chip cache memory. Furthermore, it should be understood that one or more additional levels of cache memory (L3, L4, etc.) could be utilized to provide additional data storage. In the following discussion, each processor 12 and its associated cache hierarchy (L1, L2, etc.) is considered to be a single snooper.

As shown, processing nodes 10a-10d further include a respective node controller 20, which, together with system memory 18 and L2 caches 14a-14d, is coupled to local interconnect 16. Each node controller 20 serves as a local agent for remote processing nodes 10 by performing at least two functions. First, node controllers 20 snoop the associated local interconnect 16 and facilitate the transmission of local communication transactions to remote processing nodes 10. Second, node controllers 20 snoop communication transactions on node interconnect 22 and master relevant communication transactions on the associated local interconnect 16. Communication on each local interconnect 16 is controlled by an arbiter 24. As discussed further below, arbiters 24 regulate access to local interconnects 16 based upon bus request signals generated by processors 12 and compile coherency responses for snooped communication transactions on local interconnects 16.
Of course, NUMA computer system 8 can further include additional devices that are not necessary for an understanding of the present invention and are accordingly omitted in order to avoid obscuring the present invention. For example, each node 10 may also support I/O devices (e.g., a display device, keyboard, or graphical pointer), non-volatile storage for storing an operating system and application software, and serial and parallel ports for connection to networks or attached devices.
Memory organization

All of the processors 12 in NUMA computer system 8 share a single physical memory space, meaning that each physical address is associated with only a single location in one of system memories 18. Thus, the overall contents of the system memory, which can generally be accessed by any processor 12 in NUMA computer system 8, can be viewed as partitioned between the four system memories 18. For example, in the illustrative embodiment of the present invention shown in Figure 1, processors 12 address a 16 gigabyte (GB) address space including both a general-purpose memory area and a reserved area. The general-purpose memory area is divided into 500 MB segments, with each of the four processing nodes 10 being allocated every fourth segment. The reserved area, which may contain approximately 2 GB, includes system control and peripheral memory and I/O areas that are each allocated to a respective one of processing nodes 10. For purposes of the present discussion, the processing node 10 that stores a particular datum in its system memory 18 is said to be the home node for that datum; conversely, the other processing nodes 10 are said to be remote nodes with respect to that particular datum.
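The mapping from a physical address to its home node follows directly from this partitioning. The C sketch below is not part of the patent: the interpretation of a 500 MB segment as 500 x 2^20 bytes, the assumed split of roughly 14 GB general-purpose / 2 GB reserved, and the round-robin segment assignment are assumptions used only to illustrate how a segment-interleaved home-node lookup could be computed.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_NODES    4
#define SEGMENT_SIZE (500ULL << 20)   /* 500 MB segments, as described above        */
#define GP_AREA_TOP  (14ULL << 30)    /* assumed boundary: ~14 GB general purpose,  */
                                      /* ~2 GB reserved ("approximately 2 GB")      */

/* Return the home node of a physical address in the general-purpose area,
 * assuming segments are dealt out round-robin so that each of the four
 * processing nodes owns every fourth 500 MB segment. Addresses in the
 * reserved area are allocated per node by other means and are not modeled. */
int home_node(uint64_t phys_addr)
{
    if (phys_addr >= GP_AREA_TOP)
        return -1;                    /* reserved area: out of scope here   */
    return (int)((phys_addr / SEGMENT_SIZE) % NUM_NODES);
}

int main(void)
{
    uint64_t addr = 3ULL << 30;       /* an address 3 GB into the space     */
    printf("home node of address %#llx is %d\n",
           (unsigned long long)addr, home_node(addr));
    return 0;
}
```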
Memory coherence

Because data stored within each system memory 18 can be requested, accessed, and modified by any processor 12 within NUMA computer system 8, NUMA computer system 8 implements a cache coherence protocol to maintain coherency both between caches in the same processing node and between caches in different processing nodes. Thus, NUMA computer system 8 is properly classified as a CC-NUMA computer system. The cache coherence protocol that is implemented is implementation-dependent and may comprise, for example, the well-known Modified, Exclusive, Shared, Invalid (MESI) protocol or a variant thereof. Hereafter, it will be assumed that the L1 caches, L2 caches 14a-14d, and arbiters 24 implement the conventional MESI protocol, of which node controllers 20 recognize the M, S, and I states and consider the E state to be merged into the M state for correctness. That is, node controllers 20 assume that data held exclusively by a remote cache has been modified, whether or not the data has actually been modified.
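The state reduction performed by node controllers 20 can be pictured as a simple mapping from the MESI states tracked by the caches to the MSI states tracked by the node controllers. The sketch below is illustrative only; the enumeration encodings and the function name are assumptions, not taken from the patent.

```c
/* Coherency states as seen by the caches (MESI) and by node controllers (MSI). */
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_state_t;
typedef enum { MSI_M, MSI_S, MSI_I } msi_state_t;

/* A node controller conservatively folds E into M: data held exclusively by
 * a remote cache is assumed to be modified, whether or not it actually is.  */
msi_state_t node_controller_view(mesi_state_t cache_state)
{
    switch (cache_state) {
    case MESI_M:
    case MESI_E:  return MSI_M;   /* E merged into M for correctness */
    case MESI_S:  return MSI_S;
    default:      return MSI_I;
    }
}
```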
Interconnect architecture

Local interconnects 16 and node interconnect 22 can each be implemented with any interconnect architecture, whether broadcast or point-to-point, for example, a bus or a crossbar switch. However, in a preferred embodiment, each of local interconnects 16 and node interconnect 22 is implemented as a hybrid bus architecture governed by the 6xx communication protocol developed by IBM Corporation.

Referring now to Figure 2A, a preferred embodiment of node interconnect 22 within NUMA computer system 8 is illustrated from the perspective of one of processing nodes 10. As shown, the illustrated embodiment of node interconnect 22 includes separate (i.e., non-multiplexed) address and data portions, which are decoupled to permit split transactions. The address portion of node interconnect 22 is implemented as a shared address bus 26, access to which is controlled by a central arbiter 27. A node controller 20 requests access to shared address bus 26 by asserting its respective address bus request (ABR) signal 25 and is informed of a grant of access by central arbiter 27 through the assertion of its respective address bus grant (ABG) signal 29. Each node controller 20 coupled to node interconnect 22 also snoops all communication transactions on shared address bus 26 in order to support memory coherency, as discussed further below. A summary of relevant signal names and definitions for shared address bus 26 is given in Table I.
TABLE I

Utilization of shared address bus 26 is preferably enhanced by implementing shared address bus 26 as a pipelined bus, meaning that a subsequent transaction can be sourced by a processing node 10 before the master of a previous communication transaction has received coherency responses from each of the other processing nodes 10.

While the data portion of node interconnect 22 could also be implemented as a shared bus, the data portion of node interconnect 22 is preferably implemented as a distributed switch having N-1 (e.g., 4-1 = 3) data-in channels 30 and a single data-out channel 28 for each processing node 10. The data output by a processing node 10 on its data-out channel 28 is transmitted to all of processing nodes 10, and each processing node 10 receives data from each of the other processing nodes 10 via data-in channels 30. By implementing the data portion of node interconnect 22 in this manner rather than as a shared bus, deadlocks are avoided and data bandwidth is advantageously increased. The relevant signal names and definitions for each channel within the preferred embodiment of the data portion of node interconnect 22 are summarized below in Table II.
TABLE II

As indicated in Table II, to permit recipients of data packets to determine the communication transaction to which each data packet belongs, each data packet is identified with a transaction tag. This permits the timings of shared address bus 26 and the data portion of node interconnect 22 to be completely decoupled, meaning that there is no fixed timing relationship between address tenures and data tenures and that data tenures may be ordered differently than the corresponding address tenures. Those of ordinary skill in the art will appreciate that data flow control logic and associated flow control signals should be utilized to regulate the use of the finite data communication resources.

As illustrated in Figure 2A, a preferred embodiment of node interconnect 22 also includes a high-speed I-command channel 31. This sideband channel, like the data portion of node interconnect 22, is preferably implemented as a distributed switch including one output channel (command-out channel 32) and N-1 input channels (command-in channels 34) for each processing node 10. Channels 32 and 34 permit the communication of I-commands between processing nodes 10 without creating additional loading on the address or data portions of node interconnect 22. An exemplary embodiment of an I-command is shown in Figure 2B. As illustrated, I-command 36 includes five (5) fields: a 4-bit command type field 33, an N-bit (e.g., 4-bit) destination node field 35, an N-bit source node field 37, a transaction tag field 38, and a valid (V) field 39. The command type field 33 provides an encoded indication of the type of I-command 36. Some of the possible I-commands that can be encoded within command type field 33 are listed below in Table III.

TABLE III

For each type of I-command, the recipient is specified in destination node field 35, the sending node is specified in source node field 37, and the transaction to which the I-command relates is specified within transaction tag field 38. The validity of I-command 36 is indicated by valid (V) field 39. Preferably, the I-commands issued by processing nodes 10 via I-command channel 31 have no necessary timing relationship with the address or data tenures of the associated communication transactions. And, because I-command channel 31 utilizes small packets and is non-blocking (i.e., use of I-command channel 31 by one processing node 10 does not inhibit or block usage by other processing nodes), I-commands can be transmitted at high speed between processing nodes 10.
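As a concrete picture of the five-field packet just described, the C sketch below lays out an I-command as a bit-field structure. It is not taken from the patent: the 8-bit tag width (chosen to match the address bits <8:15> used as a transaction tag later in the description), the particular command-type encodings, and the use of C bit-fields are all assumptions, since Table III is not reproduced in this copy of the document.

```c
#include <stdint.h>

/* Assumed command-type encodings; the actual values belong to the missing
 * Table III. The command names themselves appear in the description.       */
enum i_command_type {
    ICMD_3RD_NODE_RUN = 0x1,
    ICMD_REISSUE      = 0x2,
    ICMD_RESP_SHARED  = 0x3,
    ICMD_RESP_NULL    = 0x4
};

struct i_command {
    uint32_t cmd_type  : 4;   /* command type field 33                      */
    uint32_t dest_node : 4;   /* destination node field 35 (N = 4 bits)     */
    uint32_t src_node  : 4;   /* source node field 37                       */
    uint32_t tag       : 8;   /* transaction tag field 38 (address <8:15>)  */
    uint32_t valid     : 1;   /* valid (V) field 39                         */
};

/* Example: a home node (here node 1) tells a third node (here node 2) to
 * run the buffered transaction identified by the given tag.                */
struct i_command make_3rd_node_run(unsigned tag)
{
    struct i_command c = {
        .cmd_type  = ICMD_3RD_NODE_RUN,
        .dest_node = 2,
        .src_node  = 1,
        .tag       = tag,
        .valid     = 1
    };
    return c;
}
```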
Like node interconnect 22, local interconnects 16 include three distinct components: an address portion, a data portion, and a coherency portion. The address portion of each local interconnect 16 is preferably implemented as described above with respect to shared address bus 26 of node interconnect 22. The data portion of each local interconnect 16 utilizes the same data signals listed in Table II, but is preferably implemented as a shared data bus rather than a distributed switch (although either could be utilized). In lieu of the I-command channel discussed above, the coherency portion of each local interconnect 16 includes signal lines coupling each attached snooper to the local arbiter 24. The signal lines within local interconnects 16 that are utilized for coherency communication are summarized below in Table IV.

TABLE IV

In contrast to the coherency responses transmitted between processing nodes 10 via I-command channel 31, coherency responses transmitted via the AResp and AStat lines of local interconnects 16 preferably have a fixed but programmable timing relationship with the associated address bus transactions. For example, the AStatOut votes, which provide a preliminary indication of each snooper's response to a communication transaction on the local address bus, may be required in the second cycle following receipt of a request on the local address bus. Arbiter 24 compiles the AStatOut votes and then issues the AStatIn vote a fixed but programmable number of cycles later (e.g., 1 cycle). The possible AStat votes are summarized below in Table V.
TABLE V

Following the AStat period, the ARespOut votes may then be required a fixed but programmable number of cycles (e.g., 2 cycles) later. Arbiter 24 likewise compiles the ARespOut votes of each snooper and delivers an ARespIn vote, preferably during the next cycle. The possible AResp votes preferably include the coherency responses listed in Table III. In addition, the possible AResp votes include "ReRun," which is issued (usually by a node controller 20) to indicate that the snooped request has a long latency and that the source of the request will be instructed to reissue the transaction at a later time. Thus, in contrast to a Retry response, a ReRun response makes the recipient of a transaction voted ReRun (and not the originator of the transaction) responsible for causing the communication transaction to be reissued at a later time.
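The arbiter's role of compiling the individual ARespOut votes into a single ARespIn vote can be sketched as a priority selection. The snippet below is illustrative only: it models just the votes named in the surrounding text (Null, Shared, ReRun, Retry), and the priority ordering it assumes is not specified here, since the full vote tables are not reproduced in this copy of the document.

```c
/* Assumed priority order, lowest to highest; the real ordering would come
 * from the (missing) AResp vote table.                                     */
typedef enum { ARESP_NULL = 0, ARESP_SHARED, ARESP_RERUN, ARESP_RETRY } aresp_t;

/* Compile the ARespOut votes of all snoopers into one ARespIn vote by
 * letting the highest-priority vote win.                                   */
aresp_t compile_aresp(const aresp_t *votes, int n_snoopers)
{
    aresp_t result = ARESP_NULL;
    for (int i = 0; i < n_snoopers; i++)
        if (votes[i] > result)
            result = votes[i];
    return result;
}
```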
Node controller

Referring now to Figure 2C, a more detailed block diagram of a node controller 20 in NUMA computer system 8 of Figure 1 is illustrated. As shown in Figure 2C, each node controller 20, which is coupled between a local interconnect 16 and node interconnect 22, includes a transaction receive unit 40, a transaction send unit 42, a data receive unit (DRU) 44, and a data send unit (DSU) 46. Transaction receive unit 40, transaction send unit 42, DRU 44, and DSU 46 can be implemented, for example, with field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs). As indicated, the address and data paths through node controller 20 are bifurcated, with address signals being processed by transaction receive unit 40 and transaction send unit 42 and data signals being processed by DRU 44 and DSU 46.

Transaction receive unit 40, which is so designated to indicate transaction flow off of node interconnect 22, is responsible for accepting I-commands from I-command channel 31, accepting transactions and responses from node interconnect 22, issuing received transactions on local interconnect 16, and forwarding responses to transaction send unit 42. Transaction receive unit 40 is also responsible for maintaining transaction buffer 52. Transaction buffer 52 is an associative buffer in which transaction receive unit 40 stores communication transactions snooped on shared address bus 26 that are sourced by and targeted at processing nodes 10 other than the local processing node. Each entry in transaction buffer 52 stores a communication transaction in association with a transaction tag (i.e., address bits <8:15>) so that communication transactions can be accessed quickly, as discussed below with respect to Figure 4.
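A minimal software model of transaction buffer 52 might key a fixed table directly by the 8-bit transaction tag, as sketched below. This is not the patent's implementation: the direct-mapped 256-entry organization, the replacement behavior (a newer entry simply displaces an older one with the same tag), and the fields of struct snooped_txn are assumptions chosen only to show tag-indexed storage and retrieval.

```c
#include <stdbool.h>
#include <stdint.h>

/* One buffered communication transaction; the field choice is illustrative. */
struct snooped_txn {
    bool     valid;
    uint8_t  tag;        /* transaction tag (address bits <8:15>)           */
    uint64_t address;    /* snooped request address                         */
    uint8_t  ttype;      /* transaction type / TDescriptor summary          */
    uint8_t  src_node;   /* node that sourced the transaction               */
};

#define TXN_BUF_ENTRIES 256
static struct snooped_txn txn_buf[TXN_BUF_ENTRIES];

/* Called by the transaction receive unit when it snoops a transaction that
 * might later require third-node involvement.                              */
void txn_buf_store(const struct snooped_txn *t)
{
    txn_buf[t->tag] = *t;            /* a newer entry displaces an older one */
    txn_buf[t->tag].valid = true;
}

/* Called when a "3rd node run" I-command arrives; returns true on a hit.   */
bool txn_buf_lookup(uint8_t tag, struct snooped_txn *out)
{
    if (!txn_buf[tag].valid)
        return false;
    *out = txn_buf[tag];
    return true;
}
```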
Transaction send unit 42, which as indicated by its nomenclature is a conduit for transactions flowing onto node interconnect 22, interacts with transaction receive unit 40 to process memory request transactions and issues commands to DRU 44 and DSU 46 to control the transfer of data between local interconnect 16 and the data portion of node interconnect 22. Transaction send unit 42 also implements the selected coherency protocol (i.e., MSI) for node interconnect 22 and maintains coherence directory 50. Coherence directory 50 stores indications of the system memory addresses of data (e.g., cache lines) checked out to caches at remote nodes for which the local processing node is the home node. The address indication for each datum is stored in association with an identifier of each processing node having a copy of the datum and the coherency status of the datum at each such processing node. Possible coherency states for entries in coherence directory 50 are summarized in Table VI.

TABLE VI

As indicated in Table VI, the knowledge of the coherency states of cache lines held by remote processing nodes is imprecise. This imprecision is due to the fact that a remotely held cache line can make a transition from S to I, from E to I, or from E to M without notifying the node controller 20 of the home node.
Prior art "third node" communication scenario

Referring now to Figures 3A-3D, an exemplary "third node" communication scenario within a NUMA computer system according to the prior art is depicted. As shown, conventional NUMA computer system 58 includes first, second, third, and fourth nodes, which are illustrated respectively with reference numerals 60, 62, 64, and 66. Assuming that second node 62 is the home node for data held exclusively (i.e., in E or M state) by third node 64, first node 60 requests the data by broadcasting a read request on the node interconnect. As shown in Figure 3A, the request transaction is received by second node 62, third node 64, and fourth node 66; however, because the requested data is owned by second node 62, third node 64 and fourth node 66 filter out (i.e., ignore) the data request.

In response to receipt of the request transaction, second node 62 checks its node directory to determine whether a copy of the requested data is held at a remote node. Because the requested data is recorded in the node directory of second node 62 as held exclusively by a remote node, second node 62 is unable to respond immediately to the request transaction received from first node 60. This is because the copy of the requested data at second node 62 may be stale (i.e., a processor in third node 64 may have modified the requested data). Accordingly, as shown in Figure 3B, second node 62 sends third node 64, via the node interconnect, a request transaction specifying the address of the requested data. As indicated by arrow 70, in response to the request transaction, third node 64 runs the request transaction against the internal caches that may hold the requested data. The internal cache holding the requested data exclusively responds with a Shared coherency response and indicates that it will update the coherency state of the requested data to the S state. Then, as shown in Figure 3C, third node 64 transmits a Shared response to second node 62, informing second node 62 that the coherency state of the copy of the requested data held by third node 64 is being updated to the Shared state. Finally, with reference to Figure 3D, in response to receiving the Shared response from third node 64, second node 62 can process the request transaction, as illustrated by arrow 72. The requested data is then sourced to first node 60 with a Shared coherency state, as indicated by arrow 74.

While this conventional third-node communication scenario maintains data coherency between nodes in a NUMA computer system, it should be noted that the same communication transaction is transmitted to third node 64 twice, as shown in Figures 3A and 3B. The present invention advantageously eliminates this redundant communication over the node interconnect, thereby decreasing communication latency and enhancing the scalability of NUMA computer systems.
Innovative third-node communication scenario

Referring now to Figure 4, a high-level logical flow diagram of a third-node communication methodology in accordance with the present invention is illustrated. The flow chart shown in Figure 4 assumes the same initial conditions as the exemplary prior art scenario discussed above, namely, that one of the processors 12 of processing node 10a has issued a read request for a cache line that is held exclusively by processing node 10c and has processing node 10b as its home node.

As depicted, the process begins at block 80 and thereafter proceeds to block 82, which illustrates node controller 20 of processing node 10a transmitting, via shared address bus 26 of node interconnect 22, a communication transaction requesting the data at a specified address. Because shared address bus 26 is a broadcast medium in the preferred embodiment, the request transaction is received by each of processing nodes 10b, 10c, and 10d. Following block 82, the process proceeds both to blocks 84-88 and to block 90.

Blocks 84-88 illustrate the processing performed by processing node 10b (i.e., the home node of the requested cache line) in response to receipt of the request transaction on shared address bus 26. First, as shown at block 84, node controller 20 of processing node 10b arbitrates for ownership of its local interconnect 16 and masters the request transaction on local interconnect 16. The process then proceeds to block 86, which depicts node controller 20 of processing node 10b voting ReRun for its ARespOut coherency response to the request transaction. The ReRun vote indicates that transaction send unit 42 has determined, by reference to coherence directory 50, that the coherency state of the requested cache line cannot be resolved without involving a third processing node, namely, processing node 10c, which holds the requested data exclusively. As shown at block 88, in response to arbiter 24 of local interconnect 16 voting ReRun for ARespIn, transaction send unit 42 within node controller 20 of processing node 10b sends processing node 10c, via I-command channel 31, a "3rd node run" I-command together with the transaction tag of the original request transaction issued by processing node 10a. Because the I-command is transmitted via sideband I-command channel 31 rather than the address or data portions of node interconnect 22, the address bandwidth of node interconnect 22 that would otherwise be consumed can advantageously be utilized to communicate other transactions. In this manner, the communication latency on the bandwidth-limited portions of node interconnect 22 is decreased. Following block 88, the process passes to block 100, which is described below.

Block 90 illustrates the processing triggered at processing nodes 10c and 10d, neither of which is the source or the target of the request transaction, in response to receipt of the request transaction issued by processing node 10a. As indicated, transaction receive unit 40 within each of processing nodes 10c and 10d stores the request transaction and the transaction tag in an entry within its respective transaction buffer 52. In a preferred embodiment, not all snooped transactions are stored in the transaction buffers 52 of third nodes (that is, processing nodes that are neither the source nor the target of a transaction).
Instead, in order to conserve the limited storage capacity of transaction buffers 52, only transactions identified by the address signal lines <0:7> and the TDescriptors as transactions that may possibly require third-node involvement are buffered. Of course, other optimizations are possible to improve the storage efficiency of limited-size transaction buffers 52, such as storing only those transactions whose retransmission, if they were not buffered, would consume more than a threshold amount of communication resources.

The process continues from block 90 to block 100, which illustrates a determination of whether or not processing node 10c has received a "3rd node run" I-command. If not, the process illustrated in Figure 4 iterates at block 100 until a "3rd node run" I-command is received by processing node 10c. Of course, during the interval between the buffering of the request transaction by processing node 10c in transaction buffer 52 and the receipt of a "3rd node run" I-command by processing node 10c, processing nodes 10a-10d can initiate, receive, and process other communication transactions. Then, in response to a determination at block 100 that processing node 10c has received a "3rd node run" I-command, the process passes to block 102.

Block 102 illustrates a determination by transaction receive unit 40 within node controller 20 of processing node 10c of whether or not a transaction having a transaction tag matching the transaction tag received via I-command channel 31 is stored within transaction buffer 52. Depending upon the size of transaction buffer 52 and the number of communication transactions received by processing node 10c between blocks 90 and 102, the transaction specified by the transaction tag may no longer be stored within transaction buffer 52 due to its limited size. If a transaction having a matching transaction tag is stored within transaction buffer 52, the process passes from block 102 to block 108, which is described below. However, in response to a determination that the transaction tag received via I-command channel 31 does not match any of the transaction tags in transaction buffer 52, the process proceeds to block 104. Block 104 depicts processing node 10c transmitting a "Reissue" I-command to processing node 10b via I-command channel 31, together with the received transaction tag. As shown at block 106, in response to receipt of the "Reissue" I-command, processing node 10b retransmits the communication transaction to processing node 10c via shared address bus 26 of node interconnect 22, as described above with respect to Figure 3B. Thus, in the statistically unlikely event that the relevant communication transaction is not stored in transaction buffer 52 of processing node 10c, NUMA computer system 8 handles the third-node communication scenario like conventional NUMA computer system 58.

The process proceeds from either block 102 or block 106 to block 108, which illustrates transaction receive unit 40 within node controller 20 of processing node 10c mastering the request transaction (whether accessed from transaction buffer 52 or received from processing node 10b) on local interconnect 16 of processing node 10c.
In response to the request transaction, each of the snoopers coupled to local interconnect 16 votes a coherency response during the ARespOut period. The snooper holding the requested data exclusively votes Shared during the ARespOut period and initiates an update of the coherency state of the requested cache line to the S state; the other snoopers vote Null. As depicted at block 110, bus arbiter 24 of processing node 10c compiles these coherency responses and issues a Shared coherency response during the ARespIn period. In response to receipt of the Shared ARespIn coherency response, transaction send unit 42 within node controller 20 transmits an I-command containing a Shared response and the transaction tag to processing node 10b via I-command channel 31. The process then proceeds to block 112, which depicts node controller 20 of processing node 10b reissuing the request transaction on local interconnect 16 of processing node 10b. In response to snooping the ReRun request transaction, node controller 20 of processing node 10b votes Shared during the ARespOut period, indicating that processing node 10c holds the requested data in the Shared state. Bus arbiter 24 of processing node 10b thereafter compiles the coherency responses and votes Shared during the ARespIn period. Finally, as shown at block 114, node controller 20 of processing node 10b transmits an I-command containing a Shared response and the transaction tag to processing node 10a via I-command channel 31 and sources the requested cache line to processing node 10a via its data-out channel 28. Thereafter, the process terminates at block 116.

As has been described, the present invention provides an improved NUMA computer system and an improved third-node communication methodology in a NUMA computer system. In accordance with the present invention, transactions that may possibly require the involvement of a third node are buffered at a third node that is neither the source nor the target of the communication transaction. In the event that third-node involvement is required, the transaction can be accessed from the buffer rather than being retransmitted on the shared address bus of the node interconnect. In this manner, traffic on the bandwidth-limited portion of the node interconnect is advantageously reduced, thereby decreasing communication latency and improving overall system performance.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those of ordinary skill in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. For example, although the present invention has been described with respect to a preferred embodiment in which the node interconnect is a bus-based fabric (e.g., a shared bus), it should be understood that in alternative embodiments the node interconnect could be implemented with a point-to-point broadcast fabric, such as a crossbar switch. In that embodiment, the transaction buffer for each node and the associated control logic would be coupled to the crossbar switch rather than being incorporated within each node.
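As a closing illustration of the third-node handling just described (blocks 100-108 of Figure 4), the self-contained C sketch below models the decision made when a "3rd node run" I-command arrives: on a buffer hit the transaction is run locally, and on a miss a "Reissue" I-command falls back to the prior-art behavior. The helper functions, their names, and the stubbed buffer hit are hypothetical stand-ins for hardware behavior, not taken from the patent.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct buffered_txn { uint8_t tag; uint64_t address; };

/* Stub: consult the associative transaction buffer (see the earlier sketch).
 * For this example it simply reports a hit.                                 */
static bool txn_buf_lookup(uint8_t tag, struct buffered_txn *out)
{
    out->tag = tag;
    out->address = 0;
    return true;
}

/* Stub: master the transaction on the third node's local interconnect.      */
static void master_on_local_interconnect(const struct buffered_txn *t)
{
    printf("running buffered transaction %#x on the local interconnect\n",
           (unsigned)t->tag);
}

/* Stub: send a "Reissue" I-command back to the home node.                   */
static void send_reissue(int home_node_id, uint8_t tag)
{
    printf("asking node %d to reissue transaction %#x\n",
           home_node_id, (unsigned)tag);
}

/* Third-node response to a "3rd node run" I-command.                        */
static void on_3rd_node_run(uint8_t tag, int home_node_id)
{
    struct buffered_txn txn;

    if (txn_buf_lookup(tag, &txn))
        master_on_local_interconnect(&txn);   /* hit: no second broadcast     */
    else
        send_reissue(home_node_id, tag);      /* miss: fall back to prior art */
}

int main(void)
{
    on_3rd_node_run(0x5A, 1);   /* tag and home-node id are arbitrary example values */
    return 0;
}
```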

Claims (14)

1. A non-uniform memory access (NUMA) computer system, characterized in that it comprises: an interconnect; first, second, and third processing nodes coupled to the interconnect, each of the first, second, and third processing nodes including at least one processor and a local system memory; a transaction buffer, coupled to the interconnect, that stores communication transactions transmitted on the interconnect that are both initiated by and targeted at a processing node other than the third processing node; and buffer control logic, coupled to the transaction buffer, that, in response to a determination that a particular communication transaction originally targeting another processing node must be processed by the third processing node, causes the particular communication transaction to be retrieved from the transaction buffer and processed by the third processing node.
2. The NUMA computer system according to claim 1, further characterized in that the interconnect includes a broadcast interconnect and the transaction buffer and the buffer control logic form a portion of the third processing node.
3. The NUMA computer system according to claim 1, further characterized in that the transaction buffer stores only communication transactions that may possibly require processing by the third processing node.
4. The NUMA computer system according to claim 1, further characterized in that each communication transaction in the transaction buffer is accessed by an associated transaction tag.
5. The NUMA computer system according to claim 4, further characterized in that the second processing node further comprises a node controller that, in response to receipt of the particular communication transaction, which targets the second processing node, determines whether the particular communication transaction must be processed by the third processing node and, if so, transmits a transaction tag associated with the particular communication transaction to the buffer control logic.
6. The NUMA computer system according to claim 1, further characterized in that the interconnect includes a non-blocking interconnect that conveys the transaction tag from the second processing node to the buffer control logic.
7. The NUMA computer system according to claim 1, further characterized in that the buffer control logic transmits a reissue command to the other processing node in response to a determination that a communication transaction that originally targeted another processing node and that must be processed by the third processing node is not stored within the transaction buffer.
8. A method of operating a non-uniform memory access (NUMA) computer system including first, second, and third processing nodes coupled to an interconnect, each of the first, second, and third processing nodes including at least one processor and a local system memory, characterized in that the method comprises: transmitting on the interconnect, from the first processing node, a communication transaction that targets the second processing node; receiving the communication transaction both at the second processing node and at a transaction buffer coupled to the interconnect; storing the communication transaction in the transaction buffer; and in response to a determination that the communication transaction must be processed by the third processing node, retrieving the communication transaction from the transaction buffer and processing the communication transaction at the third processing node.
9. The method according to claim 8, wherein the interconnect includes a broadcast interconnect and the third processing node includes the transaction buffer and the buffer control logic, further characterized in that receiving the communication transaction at the transaction buffer comprises receiving the communication transaction at the third processing node.
10. The method according to claim 8, further characterized in that storing the communication transaction in the transaction buffer comprises storing the communication transaction in the transaction buffer if the communication transaction may possibly require processing by the third processing node.
11. The method according to claim 8, further characterized in that retrieving the communication transaction from the transaction buffer comprises retrieving the communication transaction from the transaction buffer utilizing an associated transaction tag.
12. The method according to claim 8, further characterized in that it further comprises: in response to receipt of the particular communication transaction at the second processing node, the particular communication transaction targeting the second processing node, determining at the second processing node whether the particular communication transaction must be processed by the third processing node; and in response to a determination that the particular communication transaction must be processed by the third processing node, transmitting an indication of the determination to the transaction buffer.
13. The method according to claim 12, further characterized in that transmitting an indication of the determination comprises transmitting the indication via a non-blocking interconnect.

14. The method according to claim 8, further characterized in that it further comprises: transmitting a reissue command from the buffer control logic to the second processing node in response to a determination that a communication transaction that originally targeted the second processing node and that must be processed by the third processing node is not stored within the transaction buffer.
MXPA/A/1999/006144A 1998-06-30 1999-06-30 Non-uniform memory access (NUMA) data processing system that buffers potential third-node transactions to decrease communication latency MXPA99006144A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US09106945 1998-06-30

Publications (1)

Publication Number Publication Date
MXPA99006144A true MXPA99006144A (en) 2000-04-24

Family

ID=

Similar Documents

Publication Publication Date Title
US6067611A (en) Non-uniform memory access (NUMA) data processing system that buffers potential third node transactions to decrease communication latency
KR100348947B1 (en) Non-uniform memory access(numa) data processing system that speculatively issues requests on a node interconnect
US6067603A (en) Non-uniform memory access (NUMA) data processing system that speculatively issues requests on a node interconnect
US6279084B1 (en) Shadow commands to optimize sequencing of requests in a switch-based multi-processor system
US6338122B1 (en) Non-uniform memory access (NUMA) data processing system that speculatively forwards a read request to a remote processing node
US6546429B1 (en) Non-uniform memory access (NUMA) data processing system that holds and reissues requests at a target processing node in response to a retry
JP3644587B2 (en) Non-uniform memory access (NUMA) data processing system with shared intervention support
US6249520B1 (en) High-performance non-blocking switch with multiple channel ordering constraints
JP3661761B2 (en) Non-uniform memory access (NUMA) data processing system with shared intervention support
US6546471B1 (en) Shared memory multiprocessor performing cache coherency
US6154816A (en) Low occupancy protocol for managing concurrent transactions with dependencies
US6108752A (en) Method and apparatus for delaying victim writes in a switch-based multi-processor system to maintain data coherency
US6631448B2 (en) Cache coherence unit for interconnecting multiprocessor nodes having pipelined snoopy protocol
US6122714A (en) Order supporting mechanisms for use in a switch-based multi-processor system
US6266743B1 (en) Method and system for providing an eviction protocol within a non-uniform memory access system
US6269428B1 (en) Method and system for avoiding livelocks due to colliding invalidating transactions within a non-uniform memory access system
US6085293A (en) Non-uniform memory access (NUMA) data processing system that decreases latency by expediting rerun requests
US6226718B1 (en) Method and system for avoiding livelocks due to stale exclusive/modified directory entries within a non-uniform access system
US20040268052A1 (en) Methods and apparatus for sending targeted probes
US6813694B2 (en) Local invalidation buses for a highly scalable shared cache memory hierarchy
MXPA99006144A (en) Non-uniform memory access (NUMA) data processing system that buffers potential third-node transactions to decrease communication latency
US20040030950A1 (en) Apparatus for imprecisely tracking cache line inclusivity of a higher level cache