KR20060090512A

KR20060090512A - Resource sharing and pipelining in coarse-grained reconfigurable architecture

Info

Publication number: KR20060090512A
Application number: KR1020050011451A
Authority: KR
Inventors: 최기영; 정진용; 마영란; 김윤진; 박철수
Original assignee: 재단법인서울대학교산학협력재단
Priority date: 2005-02-07
Filing date: 2005-02-07
Publication date: 2006-08-11
Also published as: KR100722428B1

Abstract

The present invention relates to resource sharing and pipelining in a reconfigurable array architecture. Instead of implementing the same functional resource in every processing element of the array, the invention shares functional resources among the processing elements arranged in each row and/or column, which reduces the total number of resources the design requires. Pipelining the shared resources further yields a reconfigurable array architecture, with a resource sharing and pipelining configuration, that is more efficient in both area and delay.

The resource sharing and pipelining method of the present invention provides a reconfigurable array architecture optimized for a specific application domain, improves performance, and lowers manufacturing cost by reducing the total number of resources.

Resource sharing, pipelining, reconfigurable array architecture

Description

Resource Sharing and Pipelining in Coarse-Grained Reconfigurable Architecture

FIG. 1 shows a 4 × 4 array of processing elements (PEs) and a cache.

FIG. 2 shows the processing elements of FIG. 1 connected by data buses.

FIG. 3 shows the space-time table of the SIMD model applied to the matrix product of Equation 1.

FIG. 4 shows the space-time table of the loop-pipelined model applied to the matrix product of Equation 1.

FIG. 5 shows a snapshot of the operation in the fifth cycle of the space-time table of the SIMD model of FIG. 3.

FIG. 6 shows a snapshot of the operation in the fifth cycle of the space-time table of the loop pipelining model of FIG. 4.

FIG. 7 shows a 4 × 4 array of processing elements sharing multipliers according to the present invention.

FIG. 8 shows in detail the connections between a processing element and the shared multipliers in the array of FIG. 7.

FIG. 9 shows the configuration of a conventional processing element that is not pipelined.

FIG. 10 shows the configuration of a pipelined processing element.

FIG. 11 shows an array of processing elements to which only resource sharing is applied, together with its space-time table.

FIG. 12 shows an array of processing elements to which both resource sharing and pipelining are applied, together with its space-time table.

FIGS. 13a to 13d show four models of resource sharing (RS) or resource sharing and pipelining (RSP) array architectures.

FIG. 14 is a table showing the area and delay characteristics of the base architecture, the resource sharing architecture, and the resource sharing and pipelining architecture for an 8 × 8 array of processing elements.

FIG. 15 is a table showing evaluation results for kernels of the Livermore loops benchmark mapped onto the resource sharing and pipelining architecture of the present invention.

FIG. 16 is a table showing evaluation results for specific applications, including 2D-FDCT, SAD, MVM, and FFT, mapped onto the resource sharing and pipelining architecture of the present invention.

As demand for high-quality multimedia services grows, particularly in portable systems, techniques for efficient audio and/or video data processing applications are advancing. Such applications typically perform complex, data-intensive computation, and they can be implemented in two different ways.

One is a software implementation (SI) running on a general-purpose processor; the other is a hardware implementation (HI) in the form of an ASIC (Application-Specific Integrated Circuit). A software implementation is flexible enough to support a wide range of applications but may lack the capacity to cope with their complexity. A hardware implementation can be optimized for both power and performance but is limited to a specific application.

The Coarse-Grained Reconfigurable Architecture (hereinafter "CGRA") compensates for the drawbacks of both approaches: it offers higher performance than a software implementation and greater flexibility than a hardware implementation.

A CGRA is distinct from a Field Programmable Gate Array (hereinafter "FPGA"). Both are reconfigurable hardware, but whereas an FPGA has logic blocks that manipulate bit-level data, a CGRA has datapath blocks, such as arithmetic logic units (ALUs), that process word-level data.

Through reconfiguration, a CGRA can adapt to the differing characteristics of each application, and it can raise performance by adopting specialized hardware engines. For these reasons, many sophisticated CGRAs have been proposed. Most consist of a fixed structure of specialized processing elements and the interconnect fabric between them. Because the operation of each processing element and of the interconnect can be controlled at run time, they can be reconfigured dynamically.

Among the various CGRAs proposed so far, two-dimensional mesh structures provide abundant communication resources for parallelism.

For example, Morphosys consists of an 8 × 8 array of reconfigurable cells coupled with a Tiny_RISC processor. The array performs 16-bit operations such as addition and multiplication in single instruction, multiple data (SIMD) fashion. It performs well on regular code segments in computation-intensive domains, but its hardware cost is high.

The XPP reconfigurable system-on-chip (SoC) architecture is another example with abundant communication resources. XPP has a 4 × 4 or 8 × 8 reconfigurable array together with a LEON processor on an AMBA bus. An XPP processing element (PE) consists of one ALU and a few registers. Because the processing element contains only primitive resources, the total hardware cost is not high, but the range of applicable domains is limited.

Thus most CGRAs adopt a regular array structure to allow parallel execution of multiple operations and flexible operation scheduling. Such regular structures demand many hardware resources regardless of the characteristics of the application domain, and a larger resource count inevitably increases the array size and the design cost. The fixed configuration of the processing elements also limits optimization for application domains made up of diverse programs.

To overcome these problems, domain-specific optimization has been proposed, in which a template of the reconfigurable architecture is defined and design space exploration is carried out over it. However, most design space exploration techniques proposed so far consider only which functional resources to add inside each processing element. For example, the main question studied has been how to distribute functional resources such as multipliers, adders, and shifters among the processing elements of the array.

Fixing resources inside the processing elements in this way can generally deliver high performance, but it also raises manufacturing cost, because even a basic processing element design must include the basic functional resources needed for high performance.

There is therefore a need for a reconfigurable array architecture that performs the functions suited to each application domain while using functional resources across the array efficiently, thereby reducing the total number of resources and the manufacturing cost.

To solve the above problems, an object of the present invention is to provide a reconfigurable array architecture of reduced size by sharing the functional resources that occupy the largest area within a processing element among the processing elements arranged in each row and/or each column.

Another object of the present invention is to provide a reconfigurable array architecture with reduced size and reduced critical path delay by pipelining the shared resources.

To achieve the above objects, the present invention provides a reconfigurable array architecture with a resource sharing and pipelining configuration, operated in loop-pipelined fashion, comprising: processing elements that receive operands from a memory bus or from other processing elements; bus switches, each connected one-to-one to a processing element to receive its operands; shared resources, each shared by two or more of the processing elements and connected to the bus switches through a data bus, that receive the operands from the bus switches and perform operations; and a configuration cache that, when there are two or more shared resources, controls to which shared resource each bus switch sends the operands.

Coarse-grained reconfigurable architectures are generally designed to suit embedded systems, because they offer both flexibility and high performance. Most reconfigurable architectures use a regular structure for parallel execution of multiple operations and flexible operation scheduling. However, such regular structures demand many hardware resources regardless of the characteristics of the application domain.

To overcome this limitation, the present invention provides a new reconfigurable array architecture template. The template divides functional resources into two groups: primitive resources and shared/pipelined resources. The resource that occupies the largest area or has the longest delay within a processing element is selected to be shared among processing elements. Primitive resources are located inside every processing element, while shared resources are placed outside the processing elements and shared across the array. A shared resource can be connected to all the processing elements in the same row or column so that they share it.

The number of shared resources may vary with the application domain. The shared resources can be pipelined to shorten the overall critical path, which raises the clock frequency of the whole architecture. In this way the template of the present invention can be used to build a configuration optimized for a specific application domain while preserving regularity.

It will be shown that the resource sharing and pipelining architecture of the present invention greatly reduces area and critical path delay compared with the base architecture, and that the performance of several applications improves greatly when they run on it.

The resource sharing and pipelining method of the present invention is assumed to be based on loop-pipelined execution rather than single instruction, multiple data (SIMD) execution. SIMD-style execution is efficient for data-parallel computation and, because execution is regular, economical in the storage needed for configurations, but it is inflexible in that the processing elements cannot operate independently.

Loop pipelining, by contrast, requires more storage to control each pipeline stage but is more flexible in selecting the operation of each processing element. It can also outperform SIMD execution, because it reduces the synchronization overhead incurred in performing identical operations simultaneously.

First, the execution efficiency of the SIMD and loop pipelining schemes is compared. Both can be regarded as methods for executing loops on a coarse-grained reconfigurable array. A matrix multiplication example illustrates the difference between the two models. Consider first a coarse-grained reconfigurable array of processing elements based on a general mesh interconnect.

FIG. 1 shows a 4 × 4 array of processing elements (PEs) and a cache. Each processing element (2) in FIG. 1 contains one arithmetic logic unit (ALU), one array multiplier, and so on, and is controlled by a configuration cache (4).

FIG. 2 shows the processing elements of FIG. 1 connected by data buses. As shown, the processing elements (2) in each row share two read buses (5, 6) and one write bus (8).

To compare the operation of the SIMD model, which is not loop-pipelined, with that of the loop pipelining model, the following matrix equation is used.

For the matrix product of two square matrices X and Y of order N, the output matrix Z can be expressed as follows.

Here i, j = 0, 1, ..., N-1, and C is a constant held in the configuration cache.
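Equation 1 can be written in the following form. This is a sketch under two assumptions drawn from the surrounding description: the constant C multiplies the accumulated sum of products (consistent with the second multiplication that appears in the schedules), and the indices are zero-based and run to N - 1.

```latex
Z(i,j) \;=\; C \cdot \sum_{k=0}^{N-1} X(i,k)\, Y(k,j),
\qquad i, j = 0, 1, \ldots, N-1
```

For N = 4, each output element then needs four multiplications for the products, additions to accumulate them, and one further multiplication by C.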

FIG. 3 shows the space-time table of the SIMD model for the matrix product of Equation 1, and FIG. 4 shows that of the loop-pipelined model. Both tables show execution on the processing element array of FIG. 1 for N = 4 in Equation 1. In the space-time tables of FIGS. 3 and 4, the first row gives the cycle count from the start of the loop, and the first column gives the column number within the processing element array.

As shown in FIG. 3, the SIMD model takes 12 cycles, and in the fifth and eighth cycles the processing elements of all columns (col#1, col#2, col#3, col#4) perform multiplication. In the SIMD model, load operations are executed in all columns through the fourth cycle. The processing elements in each row are then configured identically along the row and perform the same operation.

By contrast, the loop pipelining model, as shown in FIG. 4, takes 9 cycles, and in the fifth and eighth cycles multiplication is performed in the processing elements of the first column (col#1) and the fourth column (col#4).

In the loop pipelining model, all processing elements in the first column load operands in the first cycle and perform multiplication in the second cycle. In the third and fourth cycles, the first-column processing elements perform additions to accumulate the products, while the processing elements of the following columns perform multiplication. In the fifth cycle, the first-column processing elements multiply again while the fourth column performs its first multiplication.

Thus, in the SIMD model, data transfer and arithmetic are separated in time, so the processing elements are not used efficiently. In the loop-pipelined model, the data for the next operation is loaded while the current operation executes, so the processing elements are used efficiently.

In this example, loop pipelining saves 3 cycles relative to SIMD (9 cycles versus 12). If the loop contained more memory operations, the gain from pipelining would be larger still.
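The cycle counts above can be reproduced with a small scheduling sketch. It assumes that each column runs the six-step sequence load, multiply, add, add, multiply, store. The final store step, and the assumption that each memory phase of the SIMD model occupies one cycle per column, are chosen to match the 12- and 9-cycle totals and are not details taken from FIGS. 3 and 4. All names are illustrative.

```python
# Per-column operation sequence for the N = 4 example.
# "ST" (store) is an assumed final step; the text names only load, multiply and add.
STAGES = ["LD", "MUL", "ADD", "ADD", "MUL", "ST"]
N_COLS = 4


def simd_schedule(n_cols=N_COLS, stages=STAGES):
    """SIMD: every column is in the same phase in every cycle."""
    table, cycle = {}, 1
    for op in stages:
        # Assumed: a memory phase lasts n_cols cycles, a compute phase one cycle.
        duration = n_cols if op in ("LD", "ST") else 1
        for _ in range(duration):
            table[cycle] = {col: op for col in range(1, n_cols + 1)}
            cycle += 1
    return table


def pipelined_schedule(n_cols=N_COLS, stages=STAGES):
    """Loop pipelining: column c starts one cycle after column c - 1."""
    table = {}
    for col in range(1, n_cols + 1):
        for step, op in enumerate(stages):
            table.setdefault(col + step, {})[col] = op
    return table


def multiplying_columns(table, cycle):
    """Columns whose processing elements multiply in the given cycle."""
    return sorted(col for col, op in table.get(cycle, {}).items() if op == "MUL")


simd = simd_schedule()
pipe = pipelined_schedule()
print(max(simd), max(pipe))           # total cycles of each model
print(multiplying_columns(simd, 5))   # SIMD, cycle 5
print(multiplying_columns(pipe, 5))   # loop pipelining, cycle 5
```

Under these assumptions the sketch gives 12 cycles for SIMD and 9 for loop pipelining, with all four columns multiplying in cycle 5 under SIMD and only columns 1 and 4 under loop pipelining.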

This is explained in more detail below. FIG. 5 is a snapshot of the operation in the fifth cycle of the SIMD space-time table of FIG. 3, and FIG. 6 is a snapshot of the operation in the fifth cycle of the loop pipelining space-time table of FIG. 4. Comparing FIG. 5 and FIG. 6 shows the multiplier utilization of the two models.

As seen above, the loop pipelining model spreads identical operations over several cycles across the processing elements, so the processing elements do not all need the same functional resource at the same time.

In the fifth cycle shown in FIGS. 5 and 6, the SIMD model performs multiplication in every column, whereas in the loop pipelining model only the processing elements of the first and fourth columns multiply.

Consequently, in the SIMD model the processing elements of every column must have a multiplier, which increases the area and implementation cost of the whole array. In the loop-pipelined model, processing elements in the same column or the same row can share area-critical resources.
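The sharing opportunity can be made concrete by counting how many columns need a multiplier in the same cycle. The sketch below assumes the multiplication cycles of the N = 4 example: every column multiplies in cycles 5 and 8 under SIMD, while under loop pipelining column c multiplies in cycles c + 1 and c + 4 (an extrapolation from the schedule described for the first and fourth columns). Function names are illustrative.

```python
from collections import Counter

N_COLS = 4


def simd_mul_cycles(col):
    """SIMD: all columns multiply in the same two cycles."""
    return [5, 8]


def pipelined_mul_cycles(col):
    """Loop pipelining (assumed): column c multiplies in cycles c + 1 and c + 4."""
    return [col + 1, col + 4]


def multiplier_demand(mul_cycles, n_cols=N_COLS):
    """Count, per cycle, how many columns request a multiplication."""
    demand = Counter()
    for col in range(1, n_cols + 1):
        for cycle in mul_cycles(col):
            demand[cycle] += 1
    return demand


simd_peak = max(multiplier_demand(simd_mul_cycles).values())
pipe_peak = max(multiplier_demand(pipelined_mul_cycles).values())
print(simd_peak, pipe_peak)  # peak simultaneous multiplications per row
```

Since each row holds one processing element per column, the peak value is the number of multipliers a row needs at once: four under SIMD, two under loop pipelining in this example.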

In the 4 × 4 matrix multiplication example of Equation 1, the multiplier occupies the largest area within a processing element, so multipliers are classified as shared resources and the other resources as primitive resources.

FIG. 7 shows a 4 × 4 array of processing elements sharing multipliers according to the present invention. As shown, each row has four processing elements (12) and two shared multipliers (14). In the conventional SIMD model every processing element has its own multiplier, so a 4 × 4 array has 16 multipliers. In the embodiment of FIG. 7, by contrast, a total of eight multipliers (14) are shared by the sixteen processing elements (12).

The shared multipliers (14) are located outside the processing elements and are connected to the processing elements (12) indirectly through bus switches (16). The shared multipliers (14) perform operations according to the operation of the bus switches (16).

In the embodiment of FIG. 7, each processing element (12) is connected one-to-one to a bus switch (16) through a data bus, and the bus switch (16) is connected to the shared multipliers (14) through another data bus. The bus switch is controlled through the configuration cache in the processing element.

To make resource allocation on the array straightforward, the shared resources need to form a regular structure. In one embodiment, the shared resources are placed in line with the rows and/or columns of the reconfigurable array.

FIG. 7 shows only the buses related to resource sharing. Although a 4 × 4 two-dimensional array is described here, the resource sharing interconnect also applies to other square arrays and to one-dimensional arrays.

FIG. 8 shows in detail the connections between a processing element and the shared multipliers in the array of FIG. 7. As shown, two n-bit data buses (21, 22) carrying the operands and one 2n-bit data bus (23) carrying the result connect the processing element (12) and the bus switch (16). Likewise, two n-bit data buses (24, 25) and one 2n-bit data bus (26) connect the bus switch (16) and each shared multiplier (14).
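The 2n-bit width of the result bus follows from the operand width: the product of two unsigned n-bit values always fits in 2n bits. A quick check, assuming unsigned operands (the text does not state signedness) and using an illustrative function name:

```python
def product_fits_in_2n_bits(n):
    """Check that the largest product of two unsigned n-bit operands needs at most 2n bits."""
    max_operand = (1 << n) - 1
    return (max_operand * max_operand).bit_length() <= 2 * n


print(all(product_fits_in_2n_bits(n) for n in (8, 16, 32)))
```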

The configuration cache (18) assigned to each processing element (12) is connected directly to the bus switch (16) through a control line (17) and thereby controls the bus switch. The dynamic mapping between shared multipliers (14) and processing elements (12) is determined at compile time, and that information is included in the configuration instructions stored in the configuration cache (18).
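Because the mapping is fixed at compile time, it can be pictured as a table computed once from the schedule and then stored with the configuration. The sketch below is an illustrative allocator for one row, not the patent's algorithm: `allocate_multipliers` and the request format are hypothetical, and the request list assumes that column c multiplies in cycles c + 1 and c + 4, as in the N = 4 loop-pipelined example.

```python
def allocate_multipliers(requests, n_shared=2):
    """Assign shared multipliers to the PEs of one row, cycle by cycle.

    requests: {cycle: [pe_column, ...]} listing the PEs that multiply in that cycle.
    Returns {cycle: {pe_column: multiplier_index}}, i.e. the select value that a
    configuration entry would carry for each PE's bus switch.
    """
    mapping = {}
    for cycle in sorted(requests):
        pes = sorted(requests[cycle])
        if len(pes) > n_shared:
            raise ValueError(
                f"cycle {cycle}: {len(pes)} requests but only {n_shared} shared multipliers"
            )
        mapping[cycle] = {pe: index for index, pe in enumerate(pes)}
    return mapping


# Loop-pipelined multiplication requests for one row of the 4 x 4 example (assumed).
requests = {}
for col in range(1, 5):
    for cycle in (col + 1, col + 4):
        requests.setdefault(cycle, []).append(col)

mapping = allocate_multipliers(requests)
print(mapping[5])  # select values for the PEs that multiply in cycle 5
```

A schedule in which more PEs of a row multiply in one cycle than there are shared multipliers, such as the SIMD schedule, is rejected by this allocator, which is the constraint a compiler would have to respect.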

With reference to FIG. 8, an operation using a shared resource proceeds as follows. A processing element (12) receives operands from a read bus (not shown). If the configuration instruction received from the configuration cache (18) specifies an operation that uses a shared resource (14), the processing element first sends the operands to the bus switch (16).

Meanwhile, the configuration cache (18) sends the bus switch (16) a control signal selecting one of the shared resources (14), and the bus switch (16) forwards the operands received from the processing element (12) to the shared multiplier indicated by the control signal. The shared multiplier (14) multiplies the operands and returns the result to the bus switch (16) over the 2n-bit bus (26), and the bus switch (16) passes the result back to the originating processing element.
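The data flow just described can be sketched as a small behavioral model. The class names, the dictionary-based configuration entry, and the `mul_select` field are hypothetical. The model ignores timing and treats operands as unsigned n-bit values, which is an assumption.

```python
class SharedMultiplier:
    """Multiplier placed outside the PEs; takes two n-bit operands."""

    def __init__(self, n_bits):
        self.mask = (1 << n_bits) - 1

    def multiply(self, a, b):
        # Operands arrive on two n-bit buses; the product leaves on a 2n-bit bus.
        return (a & self.mask) * (b & self.mask)


class BusSwitch:
    """Forwards operands to the multiplier chosen by the control signal."""

    def __init__(self, multipliers):
        self.multipliers = multipliers

    def route(self, select, a, b):
        return self.multipliers[select].multiply(a, b)


class ProcessingElement:
    """PE with a primitive adder and a one-to-one link to its bus switch."""

    def __init__(self, switch):
        self.switch = switch

    def execute(self, config, a, b):
        if config["op"] == "mul":
            # Shared-resource operation: operands go through the bus switch.
            return self.switch.route(config["mul_select"], a, b)
        if config["op"] == "add":
            return a + b  # primitive resource inside the PE
        raise ValueError(f"unsupported operation: {config['op']}")


# One row: four PEs, each with its own switch, sharing two 16-bit multipliers.
row_multipliers = [SharedMultiplier(16), SharedMultiplier(16)]
row = [ProcessingElement(BusSwitch(row_multipliers)) for _ in range(4)]

product = row[0].execute({"op": "mul", "mul_select": 1}, 3, 5)
total = row[3].execute({"op": "add"}, product, 7)
print(product, total)
```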

As the embodiment of FIG. 7 shows, instead of implementing a multiplier in every processing element, the processing elements of each row share two multipliers, so the same computation can be performed in a shorter time with only half as many resources (8) as the conventional array requires (16).
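The resource count is simple arithmetic, shown here as a sketch with an illustrative function name. It covers only the 4 × 4 case stated in the text and makes no claim about other array sizes.

```python
def multiplier_count(rows, cols, shared_per_row=None):
    """Multipliers needed: one per PE, or a fixed number shared by each row."""
    if shared_per_row is None:
        return rows * cols          # conventional array: one multiplier in every PE
    return rows * shared_per_row    # row-wise sharing


conventional = multiplier_count(4, 4)
shared = multiplier_count(4, 4, shared_per_row=2)
print(conventional, shared)
```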

In addition to sharing the multipliers as described above, each multiplier can be pipelined so that it is used more efficiently. If a resource has both the largest area and the longest delay within the processing element (the multiplier, in this embodiment), resource sharing and resource pipelining can be applied to it together. When the two techniques are combined, the conditions for resource sharing are relaxed, and the resources can therefore be utilized more efficiently.

Resource pipelining is briefly described as follows. FIG. 9 shows the configuration of a general processing element that is not pipelined, and FIG. 10 shows the configuration of a pipelined processing element.

In FIG. 9, data (operands) are input from the output register (42) of a neighboring processing element (50). The operands are processed by the critical resource (54), which has the longest delay, or by another functional resource (56), and the result is temporarily stored in the output register (52). Because moving data from one output register (42) to the next register (52) takes one cycle, a longer delay in the critical resource (54) lengthens the cycle of the processing element (60) and ultimately slows down the entire array structure.

To remedy this problem, the critical resource with the longest delay is pipelined by inserting a register into it. As shown in FIG. 10, inserting a register (72) into the critical resource (74) changes the one-cycle operation into a two-cycle operation. In a general reconfigurable array structure, program execution time is fixed by the critical path. In an array structure built from the pipelined processing elements of the present invention, multi-cycle operations are allowed, so execution time can vary considerably with the operations performed. This is because pipelining raises the system clock frequency, while operations on the pipelined resource take multiple cycles.

FIG. 11 shows an array of processing elements (14) to which only resource sharing is applied, together with its space-time table, and FIG. 12 shows an array of processing elements (114) to which both resource sharing and pipelining are applied, together with its space-time table. A comparison of FIG. 11 and FIG. 12 shows that the configuration with both resource sharing and pipelining performs the same operations as the configuration with resource sharing only, without any stall, while reducing the number of multipliers from eight to four.

This is because two processing elements that share one pipelined multiplier can perform two multiplications at the same time by occupying different pipeline stages.
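The overlap of two multiplications in different pipeline stages can be illustrated with a small space-time table. The sketch below is an assumption-laden model, not the table of FIG. 12: it assumes a two-stage multiplier, issue cycles chosen for the example, and the hypothetical function name `pipeline_timetable`.

```python
def pipeline_timetable(issue_cycles, stages=2):
    """Build {cycle: {stage: pe}} for one pipelined shared multiplier.

    issue_cycles maps a processing-element name to the cycle in which it
    issues its multiplication. An operation issued in cycle t occupies
    stage 0 in cycle t, stage 1 in cycle t + 1, and so on.
    """
    table = {}
    for pe, start in issue_cycles.items():
        for stage in range(stages):
            slot = table.setdefault(start + stage, {})
            if stage in slot:
                # Two operations would need the same stage in the same cycle.
                raise ValueError(f"stage {stage} conflict in cycle {start + stage}")
            slot[stage] = pe
    return table


# PE0 issues in cycle 0 and PE1 in cycle 1 (assumed issue order).
table = pipeline_timetable({"PE0": 0, "PE1": 1})
for cycle in sorted(table):
    print(cycle, table[cycle])
```

In cycle 1 both multiplications are in flight: PE0 occupies the second stage while PE1 occupies the first. If both elements issued in the same cycle, the model reports a conflict, which corresponds to the case where a stall would be needed.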

The following examines how the resource sharing and pipelining techniques of the present invention improve the efficiency of a reconfigurable array structure. For convenience of description, an 8 × 8 array of processing elements is considered in three variants, which are compared with one another: the base array structure, the resource sharing (RS) structure, and the resource sharing and pipelining (RSP) structure.

First, for resource sharing (RS), the processing elements and resources can be arranged as shown in FIG. 13. FIGS. 13a to 13d show four models of the resource sharing (RS) or resource sharing and pipelining (RSP) array structure, as follows.

FIG. 13a shows an array in which, in each row, one multiplier is shared, or shared and pipelined, by the eight processing elements. FIG. 13b shows an array in which, in each row, two multipliers are shared, or shared and pipelined, by the eight processing elements. FIG. 13c shows an array in which, in each row, two multipliers are shared, or shared and pipelined, by the eight processing elements and, in each column, one multiplier is shared, or shared and pipelined, by the eight processing elements. FIG. 13d shows an array in which two multipliers per row and two multipliers per column are shared, or shared and pipelined, by the eight processing elements of that row or column.
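The number of multipliers implied by each of the four models can be worked out with simple arithmetic. The sketch below assumes, as in the conventional structure described earlier, that the base array has one multiplier per processing element; the function name `multiplier_count` and the dictionary keys are labels chosen for this example.

```python
N = 8  # the 8 x 8 array considered in the comparison


def multiplier_count(per_row, per_col, n=N):
    """Total shared multipliers for n rows and n columns."""
    return n * per_row + n * per_col


# (multipliers per row, multipliers per column) for FIGS. 13a to 13d
models = {"13a": (1, 0), "13b": (2, 0), "13c": (2, 1), "13d": (2, 2)}
counts = {name: multiplier_count(*shape) for name, shape in models.items()}
base_count = N * N  # assumed: one multiplier in every processing element

for name, count in counts.items():
    print(name, count, "of", base_count)
```

Under these assumptions the four models use 8, 16, 24, and 32 multipliers respectively, against 64 in the base array.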

For the 8 × 8 array of processing elements, the table of FIG. 14 shows the area and delay characteristics of the base array structure (in which resources are neither shared nor pipelined), the resource sharing (RS) structures, and the resource sharing and pipelining (RSP) structures.

As the table of FIG. 14 shows, the structures with shared resources (RS#) and the structures with shared and pipelined resources (RSP#) have substantially smaller area and delay than the base array structure (Base), which has neither.

In particular, in the configuration where one multiplier is shared by the eight processing elements of each row (RS#1), the area is reduced by 42.8% relative to the base array structure, and in the configuration where that multiplier is both shared and pipelined (RSP#1), the delay is reduced by 34.69%.

As the table of FIG. 14 shows, an array to which the resource sharing and pipelining method of the present invention is applied is more efficient than the original base array structure in terms of area and delay. It also shows that pipelining the shared resources, rather than relying on resource sharing alone, yields an array structure with smaller area and shorter delay.

When designing for resource sharing, the number of operations that must be executed on shared resources has to be counted and compared, cycle by cycle, with the number of shared resources. If, in some cycle, the number of operations that use shared resources exceeds the number of shared resources, the shared resources are insufficient in that cycle. In that case, stall cycles can be inserted to avoid a resource conflict. This is referred to as an RS stall.
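The per-cycle comparison described above can be sketched as a small counting routine. This is a simplified model under stated assumptions: excess operations in a cycle are served in extra stall cycles inserted right after it, every shared resource completes one operation per cycle, and the function name `rs_stall_cycles` and the demand figures are hypothetical.

```python
def rs_stall_cycles(demand, shared):
    """Count stall cycles needed to avoid shared-resource conflicts.

    demand[i] is the number of operations that need a shared resource in
    cycle i of the original schedule; shared is the number of shared
    resources available.
    """
    stalls = 0
    for ops in demand:
        # Cycles needed to serve this cycle's operations (ceiling division);
        # the first of them is the original cycle itself.
        needed = max(1, -(-ops // shared))
        stalls += needed - 1
    return stalls


demand = [2, 5, 0, 3]  # hypothetical per-cycle demand
stalls = rs_stall_cycles(demand, shared=2)
print(stalls)
```

With two shared resources, the second cycle needs two extra cycles and the fourth needs one, so three stall cycles are inserted in total. With five or more shared resources no stall is needed for this demand pattern.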

Furthermore, with resource pipelining (RP), operations executed on pipelined resources take multiple cycles. Operations that depend on the pipelined resources must therefore be stalled as well. This is referred to as an RP stall. The total number of stall cycles can thus be estimated from the RS stall and RP stall cycles.

The initial configuration codes for resource sharing and pipelining (RSP) are rearranged according to resource sharing (RS) and resource pipelining (RP). Two rules govern this rearrangement. First, in designing for resource sharing, shared resources are assigned to processing elements in loop-iteration order. Therefore, if the shared resources are insufficient, the operations of later loop iterations are moved to the next cycle.

Second, in designing for resource pipelining, operations on pipelined resources take multiple cycles, so other operations that depend on the outputs of the pipelined resources must be stalled along with them. Furthermore, for consecutive pipelined operations, the overlapped cycles between those operations must be removed. For RSP, both rules are applied to the initial configuration codes.
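The second rule can be illustrated with a minimal dependency-driven schedule. The sketch assumes operations are listed in dependency order and start as soon as their inputs are ready; the operation names, latencies, and the function `start_cycles` are hypothetical and not taken from the configuration codes of the invention.

```python
def start_cycles(ops):
    """Earliest start cycle of each operation.

    ops is a list of (name, latency_in_cycles, dependencies), given in an
    order where every dependency appears before the operation using it.
    """
    start, finish = {}, {}
    for name, latency, deps in ops:
        # An operation waits until every operation it depends on is done.
        start[name] = max((finish[d] for d in deps), default=0)
        finish[name] = start[name] + latency
    return start


# One-cycle multiplier (not pipelined) versus two-cycle pipelined multiplier.
single = start_cycles([("mul", 1, []), ("add", 1, ["mul"]), ("sub", 1, [])])
piped = start_cycles([("mul", 2, []), ("add", 1, ["mul"]), ("sub", 1, [])])
rp_stall = piped["add"] - single["add"]
print(single, piped, rp_stall)
```

Here the addition that consumes the product starts one cycle later once the multiplier takes two cycles, which corresponds to one RP stall cycle, while the independent subtraction is unaffected.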

In a resource sharing and pipelining design according to the present invention, the parameters to consider include the types of shared resources, the types of pipelined resources, the number of pipeline stages of the pipelined resources, the number of rows of shared resources, and the number of columns of shared resources.

The resource sharing and pipelining method for application-domain-specific optimization according to the present invention is applicable to any square array (n × n) of processing elements. It can therefore be applied to a general array structure template that covers many existing coarse-grained reconfigurable arrays.

The effect of resource sharing and pipelining according to the present invention can be confirmed, for example, by applying it to several kernels of the Livermore loops benchmark or to various applications including 2D-FDCT, SAD, MVM, and FET.

The results of applying resource sharing and pipelining according to the present invention are examined in detail below. Several kernels of the Livermore loops benchmark and specific applications are mapped onto the resource sharing (RS) and resource sharing and pipelining (RSP) structures.

The table of FIG. 15 evaluates the results of mapping kernels of the Livermore loops benchmark onto the resource sharing and pipelining array configurations of the present invention. In the table of FIG. 15, ET(ns), DR(%), and stall have the following meanings.

ET(ns) = execution time = number of cycles × maximum delay

DR(%) = delay reduction rate relative to the base array structure

stall = number of operations stalled due to lack of resources
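The two computed quantities can be written out as a short calculation. The sketch assumes that DR(%) is obtained by comparing execution times against the base array structure; the cycle counts and delays used below are made-up illustrative values, not figures from FIG. 15, and the function names are hypothetical.

```python
def execution_time(cycles, max_delay_ns):
    """ET(ns) = number of cycles x maximum (critical-path) delay."""
    return cycles * max_delay_ns


def delay_reduction(base_et, et):
    """DR(%) = reduction in execution time relative to the base array."""
    return 100.0 * (base_et - et) / base_et


# Hypothetical numbers: pipelining adds cycles but shortens the clock period.
base_et = execution_time(cycles=100, max_delay_ns=20.0)
rsp_et = execution_time(cycles=110, max_delay_ns=15.0)
dr = delay_reduction(base_et, rsp_et)
print(base_et, rsp_et, dr)
```

The example shows why a structure that needs more cycles can still finish sooner: the shorter maximum delay outweighs the extra cycles as long as the cycle count does not grow too much.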

As the table of FIG. 15 shows, when the resource sharing and pipelining configuration of the present invention is applied to three kernels (Hydro, ICCG, and Tri-diagonal), the delay is reduced, relative to the base array structure without resource sharing or pipelining, by 15.92% with RSP#2, 32.12% with RSP#1, and 31.91% with RSP#1, respectively.

FIG. 16 is a table of evaluation results for specific applications, including 2D-FDCT, SAD, MVM, and FET, mapped onto the resource sharing and pipelining array configurations of the present invention. In FIG. 16, "MVM" stands for matrix-vector multiplication. The table shows that, for the four kernels (2D-FDCT, SAD, MVM, and FET), the delay is reduced, relative to the base array structure, by 17.01% with RSP#2, 35.7% with RSP#1, 32.31% with RSP#1, and 22.07% with RSP#2, respectively.

In the table of FIG. 16, RSP structure 2 (RSP#2) is shown to support all of the selected Livermore loops kernels without stalls and to give the best performance. The tables of FIGS. 15 and 16 show that the optimal array structure without performance degradation is RSP#1 or RSP#2.

Thus, the performance improvement obtained by applying the resource sharing and pipelining method of the present invention can vary with the application being executed. For example, on arrays with the same resource sharing and pipelining structure, the SAD function, which requires no multiplications, achieves a greater improvement than 2D-FDCT, which does require them, as the table of FIG. 16 shows. This is because pipelining the multiplier raises the clock frequency. In general, kernels with many multiplications cannot gain as much speedup, because multiplications take multiple cycles in RSP structures.

When the reconfigurable array structure with the resource sharing and pipelining configuration described above is implemented in hardware, the total number of resources can be reduced, which reduces the size and manufacturing cost of the hardware.

Although specific embodiments of the present invention have been described above, those skilled in the art to which the invention pertains may modify it in various ways without departing from its spirit. Such modified embodiments should be construed as falling within the scope of the claims of the present invention, insofar as they do not depart from its spirit.

As described above, the reconfigurable array structure with the resource sharing and pipelining configuration according to the present invention shares resources that perform arithmetic operations among processing elements arranged in the same row and/or column, thereby reducing the total number of resources needed to build the array. Furthermore, pipelining the resources shared among the processing elements yields an array structure that is more efficient in terms of area and delay.

In addition, the reconfigurable array structure with the resource sharing and pipelining configuration according to the present invention provides an array structure optimized for a specific application domain and reduces hardware implementation cost without degrading performance.

Claims

1. A reconfigurable array structure of loop-pipelined processing elements, comprising:

two or more processing elements that receive operands from a memory bus or from another processing element;

a bus switch connected one-to-one with the processing element to receive operands from the processing element;

a shared resource shared by the two or more processing elements, the shared resource being coupled to the bus switch via a data bus to receive operands from the bus switch and operate on them; and

a configuration cache that, when there are two or more shared resources, controls to which shared resource the bus switch transmits the operands.

2. The reconfigurable array structure of claim 1,

wherein the shared resource is pipelined by inserting a register into the resource.

3. The reconfigurable array structure of claim 1,

wherein the shared resource is a multiplier.

4. The reconfigurable array structure of claim 1,

wherein the configuration cache controls, via a control line directly connected to the bus switch, to which shared resource the bus switch transmits the operands.

5. The reconfigurable array structure of claim 1,

wherein the configuration cache and the processing element are connected one-to-one.

6. The reconfigurable array structure of claim 1,

further comprising a data memory having multiple read/write data buses.

7. The reconfigurable array structure of claim 1,

wherein the same resource is shared by the processing elements arranged in the same row or in the same column.

8. The reconfigurable array structure of claim 1,

wherein the data bus connecting the processing element and the bus switch comprises two n-bit buses and one 2n-bit bus.

9. The reconfigurable array structure of claim 1,

wherein the data bus connecting the bus switch and the shared resource comprises two n-bit buses and one 2n-bit bus.

10. A hardware device comprising the reconfigurable array structure of any one of claims 1 to 9.