KR19990024309A

KR19990024309A - Multiprocessor device with distributed shared memory structure

Info

Publication number: KR19990024309A
Application number: KR1019970040083A
Authority: KR
Inventors: 장성태; 전주식; 김형호
Original assignee: 전주식; 장성태; 김형호
Priority date: 1997-08-22
Filing date: 1997-08-22
Publication date: 1999-04-06
Also published as: KR19990023119A; KR100281465B1

Abstract

The present invention relates to a multiprocessor device having a distributed shared memory structure that includes a snooping ring bus. The device comprises: a group of processor nodes (500A to 500H), arranged in a ring and having a distributed shared memory structure, in which a given processor node (500A) generates a request signal requesting data, and each of the other processor nodes (500B to 500H) that receives the request signal snoops its internal state and outputs the data corresponding to the request signal; and a ring bus (510) that connects the processor nodes (500A to 500H) in a ring and provides a path along which the request signal and the corresponding data travel through the processor nodes and are delivered to the processor node (500A) that generated the request signal.

Description

Multiprocessor device with distributed shared memory structure

The present invention relates to a multiprocessor device having a distributed shared memory structure and to a method of controlling it, and more particularly to such a device that includes a snooping ring bus, and to its control method.

In general, a large-scale shared memory multiprocessor system with a single address space and coherent caches provides a flexible and powerful computing environment. The single address space and the coherent caches together simplify data partitioning and dynamic load balancing, and they provide a better environment for parallelizing compilers, standard operating systems, and multiprogramming, which allows the machine to be used more flexibly and effectively.

Depending on how the shared memory is accessed, shared memory multiprocessor systems can be classified into uniform memory access (UMA) multiprocessors, as shown in FIG. 1, and non-uniform memory access (NUMA), or distributed, shared memory multiprocessors, as shown in FIG. 2. In FIG. 1, the local caches (12, 12') are smaller than the shared memory (30, 30') but offer a much shorter access time. By holding the data blocks of the shared memory (30, 30') that the processors (11, 11') are expected to use frequently, they reduce the number of accesses to the shared bus (32) and the shared memory (30, 30') and so shorten the memory access time. Caches, however, introduce the cache coherence problem: when, for example, the processor (11) in one processor module (10) writes to a data block held in its local cache (12), the result of that write must be reflected in any copy of the block held in the local cache (12') of another processor module (10'). In bus-based shared memory multiprocessors, snooping schemes are widely used to maintain cache coherence.

As shown in FIG. 1, in a UMA multiprocessor the shared memory (30, 30') is accessed in the same way by every processor (11, 11') in the system. Scalability and performance are limited by the longer bus access delay caused by increased traffic on the system bus (32) to which the shared memory (30, 30') is attached, by the longer shared memory access delay, and by the memory bandwidth. The widely used distributed shared memory multiprocessor addresses these problems by distributing the shared memory (30, 30') close to the processors (11, 11'), as shown in FIG. 2. A processor then accesses its nearby shared memory (30, 30') faster than the shared memory (30', 30) located near another processor (11', 11), so the memory access time depends on where the instruction or data being accessed is stored. A distributed shared memory multiprocessor therefore steers as many accesses as possible to the local shared memory, which relieves traffic on the system bus (32), increases the aggregate memory bandwidth of the system, and reduces memory access latency, thereby improving system performance.

A multiprocessor system is built by connecting system resources such as processors and memories through an interconnection network. Because of the complexity and cost of interconnection networks that link processor modules, small and medium-sized commercial systems favor a bus, as in FIGS. 1 and 2. A bus-based system, however, suffers from limited scalability due to the physical characteristics of the bus and from limited bandwidth as bus utilization grows. These limits have become a serious obstacle to transferring the large volumes of data that increasingly powerful computers demand. Widening the bus does not raise the effective bandwidth proportionally, because bus arbitration and addressing impose a fixed overhead. Enlarging the line transfer size to reduce that overhead has almost no effect once the transfer exceeds the cache line size. Shortening the bus allows a faster signaling cycle but raises signal noise problems, and using multiple buses complicates control and makes cache coherence harder to maintain.

One way to overcome the limits of the bus is an interconnection network built from high-speed point-to-point links. Candidate topologies include the mesh, torus, hypercube, N-cube, MIN, Omega network, and ring. A ring is simpler to design and implement than the other topologies. Moreover, whereas a bus carries transactions one at a time, a ring can carry several transactions simultaneously, which increases bandwidth.

FIG. 3 shows a distributed shared memory multiprocessor system built on a unidirectional ring that maintains cache coherence with a directory scheme. In FIG. 3, when a cache read miss occurs in the first processor node (40), that node unicasts a request for the data block to the home processor node (41), the node whose memory region originally contains the block. If the block is held in an updated state in the cache of the second processor node (42), the home processor node (41) in turn unicasts the request to the second processor node (42).

The second processor node (42) then unicasts the requested data block to the home processor node (41), which updates its memory and forwards the requested block to the first processor node (40).
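The cost of the four unicast messages described above can be estimated with a small sketch. The ring size of 8 and the node positions are assumptions chosen for illustration only (FIG. 3 is not reproduced here), and `ring_hops` and `directory_read_miss_hops` are hypothetical helper names. The only modeling rule is that a packet on a unidirectional ring can travel in one direction.

```python
def ring_hops(src: int, dst: int, n_nodes: int) -> int:
    """Number of links crossed going from src to dst on a unidirectional ring."""
    return (dst - src) % n_nodes


def directory_read_miss_hops(requester: int, home: int, owner: int, n_nodes: int) -> int:
    """Total links crossed by the four unicast messages of the directory scheme:
    requester -> home -> owner -> home -> requester."""
    path = [requester, home, owner, home, requester]
    return sum(ring_hops(a, b, n_nodes) for a, b in zip(path, path[1:]))


N_NODES = 8  # assumed ring size
total = directory_read_miss_hops(requester=0, home=3, owner=6, n_nodes=N_NODES)
print(total)
```

When the three nodes are distinct, the requester-to-home and home-to-requester legs together cover one full lap, and so do the home-to-owner and owner-to-home legs, so under these assumptions the read miss occupies two full laps of ring links.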

A distributed shared memory multiprocessor system built on a unidirectional ring with directory-based cache coherence must therefore generate additional transactions to maintain coherence, which results in relatively high ring utilization and increased latency.

Accordingly, an object of the present invention is to provide a multiprocessor system having a distributed shared memory structure that maintains cache coherence by snooping, thereby reducing ring utilization and latency.

To achieve this object, the present invention comprises: a group of processor nodes, arranged in a ring and having a distributed shared memory structure, in which a given processor node generates a request signal requesting data, and each other processor node that receives the request signal snoops its internal state and outputs the data corresponding to the request signal; and a ring bus that connects the processor nodes in a ring and provides a path along which the request signal and the corresponding data travel through the processor nodes and are delivered to the processor node that generated the request signal.

FIG. 1 is a schematic diagram of a conventional bus-based uniform memory access (UMA) shared memory multiprocessor;

FIG. 2 is a schematic diagram of a conventional bus-based non-uniform memory access (NUMA) shared memory multiprocessor system;

FIG. 3 is a diagram of a prior-art ring-based distributed shared memory multiprocessor system that uses a directory scheme;

FIG. 4 is an overall diagram of one embodiment of a distributed shared memory multiprocessor device including a snooping ring bus according to the present invention;

FIG. 5 is a diagram explaining the operation of the embodiment of FIG. 4;

FIG. 6 is a diagram illustrating how processor nodes on a ring bus can observe snooping requests in different orders;

FIG. 7 is a diagram of another embodiment of the distributed shared memory multiprocessor device including a snooping ring bus according to the present invention;

FIGS. 8, 9, and 10 are diagrams illustrating modified structures of the distributed shared memory multiprocessor including a snooping ring bus according to embodiments of the present invention.

<Description of reference numerals for the main parts of the drawings>

100A, 100B: processor module
110A, 110B: processor
120A, 120B: local cache
140: local system bus
150: node controller
160: ring interface
170: remote cache
172A, 172B: remote tag cache
176: remote data cache
180: I/O bridge
190: pending buffer
300: local shared memory unit
310: data memory
320A, 320B: memory directory
320: system bus
500A to 500H: processor node
510: ring bus
600: interconnection network

The present invention will now be described in detail with reference to the accompanying drawings.

FIG. 4 shows a multiprocessor system having a distributed shared memory structure, connected by a unidirectional ring that supports snooping (hereinafter, the ring bus), according to one embodiment of the present invention.

In FIG. 4, the processor nodes (500A to 500H) are connected by a unidirectional point-to-point ring bus (510) that supports snooping. Each processor node (500A to 500H) comprises a plurality of processor modules (100A, 100B), each consisting of a processor (110) and a local cache (120), together with a local shared memory unit (300), an I/O bridge (180), a ring interface (160), a node controller (150), and a remote cache (170). The processor modules (100A, 100B), the I/O bridge (180), and the node controller (150) are connected through a local system bus (140).

The node controller (150) checks whether the instruction or data block requested by a processor module (100A, 100B) of the processor node (500A) is stored in a valid state in the remote cache (170) or in the local shared memory unit (300) of the processor node (500A). If it is, the node controller supplies the block to the processor module (100A, 100B). If it is not, the node controller sends a request signal to the other processor nodes (500B to 500H) through the ring interface (160). Conversely, when a request signal arrives from another processor node (500B to 500H) through the ring interface (160), the node controller (150) checks whether the corresponding data block is stored in a valid state in its own remote cache (170) or local shared memory unit (300) and, if so, has the valid data sent through the ring interface (160) to the processor node that issued the request.
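The two decision paths of the node controller can be sketched as follows. This is a minimal illustration, not the patented circuit: the class name `NodeController`, its method names, the returned action strings, and the use of dictionaries that hold only valid blocks are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeController:
    # Each dict maps address -> block and holds only blocks in a valid state.
    remote_cache: dict = field(default_factory=dict)   # stands in for remote cache 170
    local_memory: dict = field(default_factory=dict)   # stands in for local shared memory unit 300

    def handle_local_request(self, address):
        """Request from a processor module on the local system bus."""
        for store in (self.remote_cache, self.local_memory):
            if address in store:
                return ("supply_to_processor_module", store[address])
        # Not valid anywhere in this node: ask the other nodes over the ring.
        return ("send_request_on_ring", None)

    def handle_ring_request(self, address):
        """Request arriving from another node through the ring interface."""
        for store in (self.remote_cache, self.local_memory):
            if address in store:
                return ("send_data_to_requester", store[address])
        return ("no_data", None)


nc = NodeController(remote_cache={0x10: b"remote"}, local_memory={0x20: b"local"})
print(nc.handle_local_request(0x30))
```

Keeping the local path and the ring path as separate methods mirrors the text, which describes them as two distinct duties of the same controller.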

The ring interface (160) is the data path that connects the processor node (500A) to the ring bus (510). It sends request signals and data blocks from the node controller (150) over the ring bus (510) to the other processor nodes (500B to 500H), and it selects the request signals and data blocks arriving over the ring bus (510) from the other processor nodes (500B to 500H) and passes them to the node controller (150). When an arriving signal is a broadcast packet, it also bypasses the packet to the next processor node (500B). In addition, it is responsible for all flow control required for packet transmission.

The remote cache (170) in the processor node (500A) caches only data blocks that reside in the shared memory units of the other processor nodes (500B to 500H), that is, in remote shared memory. When a transaction from a processor module (100A, 100B) on the local system bus (140) references the remote shared memory address range and misses, the corresponding data block is allocated in the remote cache (170). Blocks in the local shared memory (300) region are not cached there.

Because the remote cache (170) of the processor node (500A) caches only the remote shared memory of the other processor nodes (500B to 500H), it can be made relatively small and fast, and it reduces the traffic on the local system bus (140) caused by data block replacement.

The remote cache (170) satisfies the multi-level inclusion (MLI) property with respect to the local caches (120) in the processor node (500A) for the memory regions of the remote shared memory units in the other processor nodes (500B to 500H). It therefore acts as a snoop filter for remote memory reference requests arriving from the other processor nodes (500B to 500H). The MLI property requires that every data block stored in a lower-level cache also be stored in the upper-level cache. To preserve it, when a line of the upper-level cache is replaced, that line must be guaranteed not to remain in any lower-level cache.

Consequently, the remote cache (170) holds every remote data block stored in the local caches (120) of the processor node (500A). If a remote memory reference request from another processor node (500B to 500H) targets a block that is not stored in a valid state in the remote cache (170), no request for that block needs to be placed on the local system bus (140). This is the snoop filtering function.
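The snoop filtering effect of the inclusion property can be illustrated with a short sketch. It assumes, for simplicity, that the remote cache and the local caches are modeled as plain sets of block addresses; the class `RemoteCacheFilter` and its method names are hypothetical and are not taken from the patent.

```python
class RemoteCacheFilter:
    """Toy model: remote cache (upper level) includes the local caches (lower level)."""

    def __init__(self):
        self.remote_lines = set()    # blocks valid in the remote cache
        self.local_lines = set()     # blocks valid in any local cache of the node
        self.local_bus_snoops = 0    # snoops actually placed on the local system bus

    def fill(self, address, into_local=True):
        # A remote block always enters the remote cache, so inclusion holds.
        self.remote_lines.add(address)
        if into_local:
            self.local_lines.add(address)

    def replace(self, address):
        # Inclusion: remove the line from the lower level before dropping it above.
        self.local_lines.discard(address)
        self.remote_lines.discard(address)

    def snoop_from_ring(self, address):
        if address not in self.remote_lines:
            return False             # filtered: no local cache can hold the block
        self.local_bus_snoops += 1   # must be forwarded to the local system bus
        return True


f = RemoteCacheFilter()
f.fill(0x40)
hit = f.snoop_from_ring(0x40)
miss = f.snoop_from_ring(0x80)
print(hit, miss, f.local_bus_snoops)
```

The filter is only safe because `replace` purges the lower level first. Without that step, a miss in the remote cache would not prove that the local caches are clean.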

Preferably, as shown in FIG. 4, the remote cache (170) consists of a remote data cache (176), which stores the contents of data blocks, and a remote tag cache (172), which stores the state and part of the address of each data block. This arrangement makes it easy to update the state of a data block stored in the remote data cache (176) or to supply the block when needed.

More preferably, the remote tag cache (172), which stores the addresses and states of remote data blocks, is implemented as two independent remote tag caches (172A, 172B). Accesses to the remote cache (170) from the processors (110A, 110B) on the local system bus (140) and accesses from the neighboring processor nodes (500B to 500H) through the ring interface (160) can then be processed in parallel.

A data block in the remote cache (170) is in one of four states: 'Updated', 'Updated-Shared', 'Shared', or 'Invalid'.

* Updated: the data block is valid, has been updated in this remote cache, and is the only valid copy.

* Updated-Shared: the data block is valid, has been updated in this remote cache, and is shared by another remote cache.

* Shared: the data block is valid, and another remote cache may be sharing it.

* Invalid: the data block is not valid.
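The four remote cache states listed above can be written down as an enumeration with a few derived predicates. This is a sketch under assumptions: the English identifiers are translations chosen for the example, and the predicate functions are hypothetical helpers that restate the bullet definitions rather than logic taken from the patent.

```python
from enum import Enum


class RemoteCacheState(Enum):
    UPDATED = "updated"                 # valid, updated here, only valid copy
    UPDATED_SHARED = "updated-shared"   # valid, updated here, shared by another remote cache
    SHARED = "shared"                   # valid, other remote caches may share it
    INVALID = "invalid"                 # not valid


def is_valid(state: RemoteCacheState) -> bool:
    return state is not RemoteCacheState.INVALID


def is_updated(state: RemoteCacheState) -> bool:
    """True when this remote cache holds an updated copy of the block."""
    return state in (RemoteCacheState.UPDATED, RemoteCacheState.UPDATED_SHARED)


def may_have_sharers(state: RemoteCacheState) -> bool:
    return state in (RemoteCacheState.UPDATED_SHARED, RemoteCacheState.SHARED)


for s in RemoteCacheState:
    print(s.value, is_valid(s), is_updated(s), may_have_sharers(s))
```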

The local shared memory unit (300) consists of a data memory (310), which stores the contents of data blocks, and a memory directory (320), which stores the state of each data block held in the data memory (310). The memory directory (320) is implemented as two independent memory directories (320A, 320B) so that local shared memory reference requests from the processors (110A, 110B) on the local system bus (140) and local shared memory reference requests arriving from the other processor nodes (500B to 500H) through the ring interface (160) can be processed in parallel.

The memory directory (320A) handles requests from the local system bus (140). To minimize cache coherence traffic for local shared memory reference requests arriving over the local system bus (140) and to avoid unnecessary transactions on the ring bus, it maintains six states: EXL, LS, RS, LRS, STALE, and GONE.

* EXL: the initial memory state.

* LS: the block may be stored read-only in a local cache of the processor node.

* RS: the block may be stored read-only in the remote cache of another processor node.

* LRS: the block may be stored read-only in a local cache of the processor node and in the remote cache of another processor node.

* GONE: the block is stored in a modified, valid state in the remote cache of another processor node.

* STALE: the block is stored in a modified, valid state in a local cache of the processor node.

The memory directory (320B), in turn, generates snoop results for snooping requests from the ring bus. It maintains three states, GN, EX, and NE, where GN corresponds to GONE, EX covers EXL, LS, and STALE, and NE covers RS and LRS.
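The relationship between the six states of memory directory 320A and the three states of memory directory 320B is a fixed many-to-one mapping, which the following sketch spells out. The state names come from the text; the enum classes, the `RING_VIEW` table, and the `ring_view` function are hypothetical names introduced for illustration.

```python
from enum import Enum


class BusDirState(Enum):
    """States kept by memory directory 320A (local system bus side)."""
    EXL = "EXL"
    LS = "LS"
    RS = "RS"
    LRS = "LRS"
    STALE = "STALE"
    GONE = "GONE"


class RingDirState(Enum):
    """States kept by memory directory 320B (ring bus side)."""
    GN = "GN"
    EX = "EX"
    NE = "NE"


# GN is GONE; EX covers EXL, LS, and STALE; NE covers RS and LRS.
RING_VIEW = {
    BusDirState.GONE: RingDirState.GN,
    BusDirState.EXL: RingDirState.EX,
    BusDirState.LS: RingDirState.EX,
    BusDirState.STALE: RingDirState.EX,
    BusDirState.RS: RingDirState.NE,
    BusDirState.LRS: RingDirState.NE,
}


def ring_view(state: BusDirState) -> RingDirState:
    """Return the coarser state that the ring-side directory would hold."""
    return RING_VIEW[state]


print({s.name: ring_view(s).name for s in BusDirState})
```

One way to read the grouping is that the ring side only distinguishes states by whether a remote cache holds the block (GN, NE) or not (EX); this interpretation is an inference from the state definitions, not a statement made in the text.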

All communication on the unidirectional point-to-point ring bus (510) that connects the processor nodes (500A to 500H) takes place in packets, which fall into three classes: request packets carrying request signals, response packets carrying response signals, and acknowledgment packets carrying acknowledgment signals. A request packet is sent by a processor node that needs a transaction on the ring bus and is either a broadcast packet or a unicast packet. Only broadcast packets are snooped by the other processor nodes.

A response packet is generated by the processor node that responds to a request packet and is always unicast. An acknowledgment packet is generated by the node that receives a unicast packet and is unicast back to the sending node. A processor node that has sent a unicast packet keeps the information about that packet until the corresponding acknowledgment packet arrives. When it arrives, the node checks the response information it carries and retries if necessary.
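The bookkeeping a sender performs for unicast packets can be sketched as a small table of outstanding packets. The text does not specify the format of the acknowledgment, so the boolean `accepted` flag, the packet identifiers, and the class `UnicastSender` are all assumptions made for this example.

```python
class UnicastSender:
    """Keeps each unicast packet until its acknowledgment arrives."""

    def __init__(self):
        self.outstanding = {}   # packet_id -> packet contents awaiting acknowledgment
        self.transmitted = []   # log of packet_ids placed on the ring, in order

    def send(self, packet_id, packet):
        self.outstanding[packet_id] = packet   # retain until acknowledged
        self.transmitted.append(packet_id)

    def on_acknowledgment(self, packet_id, accepted):
        """Check the acknowledgment and retry when the receiver did not accept."""
        packet = self.outstanding[packet_id]
        if accepted:
            del self.outstanding[packet_id]    # information no longer needed
            return "done"
        self.transmitted.append(packet_id)     # resend the retained packet
        self.outstanding[packet_id] = packet
        return "retried"


sender = UnicastSender()
sender.send(1, "MWBE block 0x40")
first = sender.on_acknowledgment(1, accepted=False)
second = sender.on_acknowledgment(1, accepted=True)
print(first, second, sender.transmitted)
```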

In more detail, the broadcast request packets are MRFR (Memory Read For Read), MRFW (Memory Read For Write), MINV (Memory Invalidate), MPRG (Memory Purge), MCASTOUT (Memory Castout), and MFLSH (Memory Flush), and the unicast packets are MWBE (Memory Writeback Exclusive), MWBS (Memory Writeback Shared), and MRPLY (Memory Reply). Among the broadcast request packets, MRFR, MRFW, and MINV are requests for memory regions of the remote shared memory units in the other processor nodes (500B to 500H), whereas MPRG, MCASTOUT, and MFLSH are requests that a processor node issues for the memory region of its own local shared memory unit.

① MRFR

A read request for a cache block, sent when a read request from a processor in the processor node misses in the remote cache.

② MRFW

A read-and-invalidate request for a cache block, sent when a write request from a processor in the processor node misses in the remote cache.

③ MINV

An invalidation request, sent when a write or invalidation request from a processor in the processor node finds the block in the shared state in the remote cache.

④ MPRG

An invalidation request, sent when a processor in the processor node issues a write or invalidation request to the local shared memory region, the local shared memory holds the block in a valid state, and the remote cache of another processor node holds a valid copy of the block.

⑤ MCASTOUT

A read-and-invalidate request, sent when a processor in the processor node issues a write or invalidation request to the local shared memory region and another processor node holds the block in its remote cache in a modified state (for example, 'Updated' or 'Updated-Shared').

⑥ MFLSH

A flush request, sent when a processor in the processor node issues a read request to the local shared memory region and another processor node holds the block in its remote cache in a modified state (for example, 'Updated' or 'Updated-Shared').

⑦ MWBE, MWBS

Writeback packets sent to the processor node that owns the memory region of a block being replaced in the remote cache. MWBE is sent when the block is in the 'Updated' state, and MWBS when it is in the 'Updated-Shared' state.

⑧ MRPLY

A response transaction that supplies the data for a request packet.
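The packet types defined above can be summarized as a lookup table. The packet names, their broadcast or unicast delivery, and the memory region each broadcast request targets are taken from the text; the table layout, the enum names, and the helper `is_snooped` are assumptions introduced for illustration.

```python
from enum import Enum


class Delivery(Enum):
    BROADCAST = "broadcast"
    UNICAST = "unicast"


class Region(Enum):
    REMOTE = "remote shared memory"   # memory of another processor node
    LOCAL = "local shared memory"     # memory of the issuing processor node


# name -> (delivery, targeted region or None when the text does not classify it)
PACKETS = {
    "MRFR": (Delivery.BROADCAST, Region.REMOTE),
    "MRFW": (Delivery.BROADCAST, Region.REMOTE),
    "MINV": (Delivery.BROADCAST, Region.REMOTE),
    "MPRG": (Delivery.BROADCAST, Region.LOCAL),
    "MCASTOUT": (Delivery.BROADCAST, Region.LOCAL),
    "MFLSH": (Delivery.BROADCAST, Region.LOCAL),
    "MWBE": (Delivery.UNICAST, None),
    "MWBS": (Delivery.UNICAST, None),
    "MRPLY": (Delivery.UNICAST, None),
}


def is_snooped(name: str) -> bool:
    """Only broadcast packets are snooped by the other processor nodes."""
    return PACKETS[name][0] is Delivery.BROADCAST


snooped = sorted(name for name in PACKETS if is_snooped(name))
print(snooped)
```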

FIG. 5 shows an example of the operation of the distributed shared memory multiprocessor with the ring bus structure of the present invention, configured as described above.

In FIG. 5, suppose a processor 110 of processor node 500A issues a memory reference request that processor node 500A cannot service internally, that is, the block containing the requested instruction or data is not stored in a valid state in the remote cache 170 or the local shared memory 300. Processor node 500A then broadcasts an MRFR or MFLSH request packet to the other processor nodes 500B to 500H over the ring bus 510.

The request packet then travels along the ring bus 510 sequentially from processor node 500B toward processor node 500H. While it circulates, each processor node 500B to 500H snoops the request, checking its internal remote tag cache or memory directory to determine in what state the corresponding data block is stored, and at the same time bypasses the request packet to its adjacent neighbor processor node.

For example, suppose snooping in processor node 500D finds that the data block is stored in its remote cache in a modified state (for example, 'update' or 'update-shared'), in which case no processor node holds the block in a valid state in its local shared memory, or that the data block is stored in a valid state in its local shared memory. Processor node 500D then determines that it is responsible for answering the request packet and unicasts a response packet containing the requested data block to processor node 500A, which generated the request packet.

Meanwhile, the broadcast request packet completes its circuit of the ring bus 510 and is removed by processor node 500A. When processor node 500A receives the response packet from processor node 500D, it unicasts an acknowledgment packet back to processor node 500D and, at the same time, forwards the data block over its local system bus 140 to the processor 110 that generated the request. In addition, if the data block belongs to the memory area of a remote shared memory unit, processor node 500A stores it in the remote cache 170 in a valid state; if it belongs to the memory area of its own local shared memory unit 300, it stores it in the local shared memory unit 300 in a valid state.
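The read flow above can be traced with a short simulation. This is a minimal sketch, not the patented controller logic: the ring direction A, B, ..., H, the function names, and the tuple encoding of each node's snoop result are assumptions made for illustration, and "response" and "acknowledgment" are plain labels for the unicast packets.

```python
RING = "ABCDEFGH"  # processor nodes 500A to 500H in an assumed ring order


def is_responsible(remote_state, local_valid):
    """A node answers if it holds the block modified in its remote cache
    or valid in its local shared memory."""
    return remote_state in ("update", "update-shared") or local_valid


def broadcast_request(requester, snoop):
    """Circulate one broadcast request; return (visited, responder, messages).

    snoop maps a node name to (remote_state, local_valid); nodes that are
    missing from the map are treated as holding no copy of the block.
    """
    start = RING.index(requester)
    visited, responder, messages = [], None, []
    for hop in range(1, len(RING)):
        node = RING[(start + hop) % len(RING)]
        visited.append(node)  # the node snoops and bypasses the packet
        remote_state, local_valid = snoop.get(node, ("invalid", False))
        if responder is None and is_responsible(remote_state, local_valid):
            responder = node
            messages.append((node, requester, "response"))  # unicast data
    # The request is now back at the requester, which removes it from the ring.
    if responder is not None:
        messages.append((requester, responder, "acknowledgment"))
    return visited, responder, messages


visited, responder, messages = broadcast_request("A", {"D": ("update", False)})
print(visited, responder, messages)
```

In this sketch every other node sees the request exactly once, and only the responsible node generates additional traffic.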

When a processor 110 of processor node 500A issues a memory write reference request and processor node 500A does not hold the data block in a valid state in its remote cache 170 or local shared memory 300, processor node 500A broadcasts an MRFW or MCASTOUT request packet to the other processor nodes 500B to 500H over the ring bus. While the request packet circulates around the ring bus, each processor node 500B to 500H snoops the request, checking its remote tag cache or the memory directory of its local shared memory unit to determine in what state the corresponding data block is stored, and at the same time bypasses the request packet to its adjacent neighbor processor node.

For example, suppose snooping in processor node 500D finds that the data block is stored in its remote cache in a modified state (for example, 'update' or 'update-shared'), in which case no processor node holds the block in a valid state in its local shared memory, or that the data block is stored in a valid state in its local shared memory unit. Processor node 500D then determines that it is responsible for answering the request packet and unicasts a response packet containing the requested data block to processor node 500A, which generated the request. At the same time, it either sets the state of the remote cache holding the data block to an invalidated state (for example, 'invalid') or updates the memory directory of the local shared memory unit holding the data block to an invalidated state (for example, the GN state). Meanwhile, the broadcast request packet completes its circuit of the ring bus and is removed by processor node 500A, and any processor node whose snoop finds the data block in its remote cache in an unmodified valid state (for example, 'shared') changes the remote cache state of that block to an invalidated state (for example, 'invalid').

When processor node 500A receives the response packet from processor node 500D, it unicasts an acknowledgment packet to processor node 500D and, at the same time, forwards the data block over its local system bus 140 to the processor 110 that generated the request. In addition, if the requested data block belongs to the memory area of a remote shared memory unit, processor node 500A stores the block in the remote cache 170 in a modified valid state (for example, 'update'); if it belongs to the memory area of its local shared memory unit, it stores the block in the local shared memory unit 300 in a valid state.
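The state changes that a snooping node applies on a write-miss request can be written as a small transition function. This is a hedged sketch: the function name, the `kind` argument, and the directory label `"valid"` are hypothetical stand-ins, while the remote cache states and the GN state are the names used in the description.

```python
MODIFIED = ("update", "update-shared")


def snoop_write_miss(kind, state):
    """Return (new_state, responds) for one snooping node on a write-miss request.

    kind  -- "remote_cache" if the node tracks the block in its remote tag cache,
             "directory" if the block belongs to the node's local shared memory
    state -- current state: 'update', 'update-shared', 'shared' or 'invalid' for a
             remote cache; 'valid' (a placeholder label) or 'GN' for a directory
    """
    if kind == "remote_cache":
        # Every remote copy ends up invalid; only a modified copy supplies data.
        return "invalid", state in MODIFIED
    if kind == "directory":
        if state == "valid":
            return "GN", True  # home memory supplies the data and gives it up
        return state, False
    raise ValueError(kind)


for kind, state in [("remote_cache", "update"), ("remote_cache", "shared"),
                    ("directory", "valid")]:
    print(kind, state, "->", snoop_write_miss(kind, state))
```

After the response arrives, the requesting node would hold a remote block in the 'update' state, which is why every other cached copy is invalidated here.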

Suppose a processor 110 of processor node 500A issues a memory write reference request or an invalidation request, processor node 500A holds the block in a valid state in its remote cache 170 or local shared memory unit 300, and the block is also stored in a valid state (for example, 'shared' or 'update-shared') in the remote cache of another processor node. Processor node 500A then broadcasts an MINV or MPRG request packet to the other processor nodes 500B to 500H over the ring bus 510. While the request packet circulates around the ring, each processor node 500B to 500H snoops the request, checking its internal remote tag cache or memory directory to determine in what state the corresponding data block is stored, and at the same time bypasses the request packet to its adjacent neighbor processor node. For example, suppose snooping in processor node 500D finds that the data block is stored in its remote cache in the 'update-shared' state, in which case no processor node holds the block in a valid state in its local shared memory, or that the data block is stored in a valid state in its local shared memory. Processor node 500D then determines that it is responsible for answering the request packet and unicasts a response packet for the invalidation request to processor node 500A, which generated the request packet. At the same time, it either sets the state of the remote cache holding the data block to an invalidated state (for example, 'invalid') or updates the memory directory of the local shared memory unit holding the data block to an invalidated state (for example, the GN state).

Meanwhile, the broadcast request packet completes its circuit of the ring bus and is removed by processor node 500A, and any processor node whose snoop finds the data block in its remote cache in an unmodified valid state (for example, 'shared') changes the remote cache state of that data block to an invalidated state (for example, 'invalid'). When processor node 500A receives the response packet from processor node 500D, it unicasts an acknowledgment packet to processor node 500D.

In addition, when a data block replacement in the remote cache 170 of processor node 500A evicts a data block that is in a modified state (for example, 'update' or 'update-shared'), processor node 500A unicasts an MWBE or MWBS request packet over the ring bus 510 to the processor node whose shared memory area the evicted data block originally belongs to, for example processor node 500D.

Processor node 500D then updates its data memory and memory directory in accordance with the request packet and unicasts a response packet for the eviction request to processor node 500A, which generated the request packet. On receiving the response packet from processor node 500D, processor node 500A unicasts an acknowledgment packet back to processor node 500D.
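The eviction exchange is a three-message unicast handshake, which the following sketch lists explicitly. The function name and the tuple layout are hypothetical, "response" and "acknowledgment" are plain labels rather than packet names from the description, and the empty result for unmodified blocks is an assumption, since the passage describes write-back only for the modified states.

```python
def eviction_messages(evicting, home, state):
    """Return the (source, destination, packet) sequence for evicting one block.

    evicting -- node whose remote cache replaces the block
    home     -- node whose shared memory area the block belongs to
    state    -- remote cache state of the block being evicted
    """
    packet = {"update": "MWBE", "update-shared": "MWBS"}.get(state)
    if packet is None:
        return []  # assumption: unmodified blocks are dropped without ring traffic
    return [
        (evicting, home, packet),            # write-back request with the data
        (home, evicting, "response"),        # home has updated memory and directory
        (evicting, home, "acknowledgment"),  # completes the handshake
    ]


for message in eviction_messages("A", "D", "update-shared"):
    print(message)
```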

Meanwhile, in the ring-bus-connected distributed shared memory multiprocessor device according to the present invention, unlike on a conventional bus, the processor nodes may observe the packets in different orders.

For example, referring to FIG. 6, when processor node 500A generates a first request packet R1 and processor node 500C generates a second request packet R2, processor node 500B observes the snooping requests in the order R1 and then R2, whereas processor node 500H observes them in the order R2 and then R1. Therefore, from the standpoint of a snooping processor node, the order in which it snoops requests is unrelated to the order in which those broadcast requests are serviced, and each snooping processor node decides its state transitions using only its own local information. The service order of multiple request packets for the same address is defined as the order in which they arrive at the processor node responsible for servicing them (hereinafter the owning processor node, of which exactly one exists), that is, the processor node that holds the requested data block in its remote cache in a modified valid state (for example, 'update' or 'update-shared') or the processor node that holds the requested data block in a valid state in its local shared memory. Consequently, every processor node that sends a request over the ring bus receives either a response packet or a retry packet from the owning processor node. In response to an MRFR, MRFW, MINV, MFLSH, MCASTOUT, or MPRG request packet from another processor node, the owning processor node sends that node an MRPLY packet that carries the requested block or acknowledges the invalidation request. If, before receiving the acknowledgment packet from the node to which it sent the MRPLY packet, it receives one of these requests for the same block from yet another processor node, it sends that node a packet telling it to retry.
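The difference in observation order follows directly from hop counts on a unidirectional ring. The sketch below assumes, purely for illustration, that packets travel in the direction A, B, ..., H, that each hop takes one time unit, and that both requests are injected at time 0; the function name `arrival_order` is hypothetical.

```python
def arrival_order(requests, nodes="ABCDEFGH"):
    """Return, for each node, the request names in the order they arrive.

    requests maps a packet name to (source node, injection time).
    A node does not snoop its own request, so that request is skipped.
    """
    n = len(nodes)
    order = {}
    for node in nodes:
        arrivals = []
        for name, (source, injected_at) in requests.items():
            if source == node:
                continue
            hops = (nodes.index(node) - nodes.index(source)) % n
            arrivals.append((injected_at + hops, name))
        order[node] = [name for _, name in sorted(arrivals)]
    return order


order = arrival_order({"R1": ("A", 0), "R2": ("C", 0)})
print(order["B"], order["H"])
```

Under these assumptions R1 reaches node B after 1 hop and R2 after 7 hops, while node H sees R2 after 5 hops and R1 after 7, which reproduces the two orders described for FIG. 6.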

Therefore, in another embodiment of the present invention, as shown in FIG. 7, each processor node includes, in addition to the memory directories 320A and 320B for the local shared memory unit 300 and the remote tag caches 172A and 172B for the remote cache 170, a pending buffer 190 that stores information about data block requests for which an MRPLY packet has been sent but no acknowledgment packet has yet been received. When snooping, the node consults not only the memory directory 320B for the local shared memory unit 300 and the remote tag cache 172B for the remote cache 170 but also the pending buffer 190, and when a request arrives for a block recorded in the pending buffer 190, it sends a packet telling the requester to retry.
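The role of the pending buffer can be illustrated with a small model of the owning node. This is a sketch under assumptions: the class and method names are hypothetical, the string "RETRY" stands in for the retry packet (whose name the description does not give), and any change of ownership after the reply is ignored.

```python
class OwningNode:
    """Models only the pending buffer check made while snooping."""

    def __init__(self):
        # block address -> node to which MRPLY was sent and whose
        # acknowledgment packet has not yet arrived
        self.pending = {}

    def on_request(self, block, requester):
        """Answer a snooped request for a block this node is responsible for."""
        if block in self.pending:
            return "RETRY"  # an earlier reply for this block is still unacknowledged
        self.pending[block] = requester
        return "MRPLY"

    def on_acknowledgment(self, block, sender):
        """Clear the entry once the node that received MRPLY acknowledges it."""
        if self.pending.get(block) == sender:
            del self.pending[block]


owner = OwningNode()
print(owner.on_request(0x40, "A"))  # first request is served
print(owner.on_request(0x40, "C"))  # same block, reply not yet acknowledged
owner.on_acknowledgment(0x40, "A")
print(owner.on_request(0x40, "C"))  # served after the acknowledgment
```

Keeping the entry until the acknowledgment arrives is what lets the arrival order at the owning node define the service order for requests to the same block.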

The embodiment has been described for the case in which, among the broadcast request packets, MRFR, MRFW, and MINV are requests for a remote shared memory area and MFLSH, MCASTOUT, and MPRG are separate requests for the local shared memory area. The present invention applies equally when the two kinds are merged, that is, when MRFR and MFLSH, MRFW and MCASTOUT, and MINV and MPRG are each handled as a single request. Likewise, block eviction has been described with separate MWBE and MWBS requests, but it should be understood that the invention applies equally when the two are handled as a single request.

Also, the embodiment has been described for a remote cache with the states 'update', 'update-shared', 'shared', and 'invalid', but the present invention applies equally to remote caches with other state sets, including 'update', 'shared', and 'invalid', or 'update', 'shared', 'exclusive', and 'invalid'. Likewise, the embodiment has been described with the two independently accessed directories for the local shared memory maintaining the states EXL, LS, RS, LRS, GONE, and STALE and the states GN, EX, and NE, respectively, but it should be understood that the invention applies equally when each directory uses a different set of states.

The embodiment has also been described for the case in which every processor node uses a plurality of processors, caches, an I/O system, and a single system bus. It is evident, however, that the present invention applies equally when some processor nodes contain only one or more processors and caches, as shown in FIG. 8; when some contain only a local shared memory module, as shown in FIG. 9; when a processor node contains a bus hierarchy of any number of levels; and when the nodes are connected by a plurality of ring buses that transmit in both directions, as shown in FIG. 10.

Those skilled in the art will also recognize that the present invention applies equally when the processors within a processor node are connected by an arbitrary interconnection network.

As described above, the present invention generates no separate transactions for maintaining cache coherence, and thereby reduces ring utilization and latency.

Claims

A multiprocessor device having a distributed shared memory structure, comprising: a processor node group consisting of a plurality of processor nodes (500A to 500H) that are arranged in a ring and have a distributed shared memory structure, wherein a given processor node (500A) generates a request signal requesting data, and each of the other processor nodes (500B to 500H) that receives the request signal snoops its own internal state and outputs data corresponding to the request signal; and

a ring bus (510) that connects the plurality of processor nodes (500A to 500H) in a ring and provides a path along which the request signal and the data corresponding to the request signal travel around the processor nodes and are delivered to the processor node (500A) that generated the request signal.

The multiprocessor device of claim 1,

wherein the request signal is a broadcast packet.

The multiprocessor device of claim 1,

wherein the request signal is a unicast packet.

The multiprocessor device of claim 1,

wherein the data corresponding to the request signal is a unicast packet.

The multiprocessor device of claim 1,

wherein the ring bus (510) is a unidirectional ring bus that connects the plurality of processor nodes in one direction.

The multiprocessor device of claim 1,

wherein the ring bus (510) is a bidirectional ring bus that connects the plurality of processor nodes in both directions.

The multiprocessor device of claim 1, wherein each of the processor nodes (500A to 500H) comprises:

a plurality of processor modules (100A, 100B);

a local shared memory unit (300) that stores data shareable by the plurality of processor modules (100A, 100B);

a remote cache (170) that stores the data corresponding to the request signal;

a node controller (150) that searches the data stored in the local shared memory unit (300) and the remote cache (170) and provides it to the processor module (100A or 100B); that, when the data requested by the processor module (100A or 100B) is not found, generates the request signal and, when the data corresponding to the request signal arrives, provides that data to the processor module (100A, 100B); and that, when a request signal from another processor node arrives over the ring bus (510), searches the data stored in the local shared memory unit (300) and the remote cache (170) and places valid data corresponding to that request signal on the ring bus; and

a ring interface (160) that connects the node controller (150) to the ring bus (510).

The multiprocessor device of claim 7,

further comprising a pending buffer (190) that stores information indicating that data corresponding to a request signal from another processor node has been transmitted over the ring bus but that the processor node that generated the request signal has not yet confirmed receipt of the data.

The multiprocessor device of claim 7 or 8,

wherein the remote cache (170) comprises:

a remote data cache (176) that stores the contents of the data; and

a remote tag cache (172) that stores the tag address and the state of the data stored in the remote data cache (176).

The multiprocessor device of claim 9,

wherein the remote tag cache (172) consists of two independent remote tag caches (172A, 172B).

The multiprocessor device of claim 7 or 8,

wherein the local shared memory unit (300) comprises:

a data memory (310) that stores the contents of the data; and

a directory memory (320) that stores information representing the state of the data stored in the data memory (310).

The multiprocessor device of claim 11,

wherein the directory memory (320) consists of two independent directory memories (320A, 320B).

The multiprocessor device of claim 7 or 8,

wherein the plurality of processor modules (100A, 100B) and the node controller (150) are connected by a local system bus.

The multiprocessor device of claim 7,

wherein the plurality of processor modules (100A, 100B) and the node controller (150) are connected by an interconnection network.