KR101270763B1

KR101270763B1 - Method for producing reconfigurable processing array architecture

Info

Publication number: KR101270763B1
Application number: KR1020110015038A
Authority: KR
Inventors: 윤종희; 이종은; 박상현; 김용주; 조두산; 백윤흥
Original assignee: 서울대학교산학협력단
Priority date: 2011-02-21
Filing date: 2011-02-21
Publication date: 2013-06-03
Also published as: KR20120095615A

Abstract

The present invention relates to a method for generating a reconfigurable processing array structure. The method obtains, from a compiler, information about the operations that constitute at least one compiled application and about the data dependencies between those operations and, based on this information, generates an extended reconfigurable processing array structure by adding interconnections between processing elements to a base reconfigurable processing array structure.

Description

Method for producing reconfigurable processing array architecture

The present invention relates to a method for generating a reconfigurable processing array structure and, more particularly, to a method that generates an extended reconfigurable processing array structure by adding interconnections between processing elements to a base reconfigurable processing array structure.

The research for the present invention was supported by the BK21 project (project number: 0567-20100001), administered by the Information Technology Business Division of the Ministry of Education, Science and Technology.

To cope with rising NRE (non-recurring engineering) costs and rapidly changing applications, reconfigurable processors are increasingly being used: they deliver performance close to that of an ASIC (Application-Specific Integrated Circuit) while remaining flexible enough to serve a variety of applications.

Reconfigurable processors have a flexible hardware structure and can change their internal configuration dynamically, which lets them maintain high performance while performing a variety of operations. They have therefore attracted considerable attention recently.

Reconfigurable processors are broadly divided into fine-grained reconfigurable processors, which use FPGAs and can be reconfigured at the bit level, and coarse-grained reconfigurable processors, which can be reconfigured at the word level.

One of the most important issues in determining the connection structure between the functional units of such a reconfigurable processor is design space exploration (DSE). DSE is the process of searching, during hardware design, for a design that provides the desired performance at minimal cost.

Prior studies on DSE for reconfigurable processors have mainly been limited to analyzing the performance of a few regular structures and proposing the regular structure that performs best.

Accordingly, there is a need for a technique that provides a reconfigurable processor with an extended structure, obtained by adding DSE-derived components to an existing regular structure.

The technical problem to be solved by the present invention is to provide a method for generating a reconfigurable processing array structure that generates an extended reconfigurable processing array structure by adding interconnections between processing elements to a base reconfigurable processing array structure.

To achieve the above technical object, a method for generating a reconfigurable processing array structure according to the present invention includes: obtaining, from a compiler, information about the operations constituting at least one compiled application and about the data dependencies representing the data dependence relations between the operations; and a generating step of generating, based on the obtained information about the operations and data dependencies, an extended reconfigurable processing array structure by adding, to a base reconfigurable processing array structure, interconnections between processing elements belonging to the base reconfigurable processing array structure.

Preferably, the base reconfigurable processing array structure may be a two-dimensional array in which processing elements, at least as many as the obtained operations, are connected by a mesh network.

Preferably, the method may further include deriving the required number of processing elements based on the obtained information about the operations and data dependencies, and selecting a base reconfigurable processing array structure that includes at least the derived number of processing elements.

Preferably, the generating step may include: a mapping case generating step of mapping the processing elements of the base reconfigurable processing array structure and the interconnections between them to the obtained operations and data dependencies, thereby generating mapping cases in which interconnections able to serve the data dependencies that correspond to no interconnection of the base reconfigurable processing array structure are added to that structure; a mapping case selecting step of selecting, from the generated mapping cases, the mapping case with the fewest added interconnections; and generating the extended reconfigurable processing array structure based on the selected mapping case.
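The selection among mapping cases can be sketched on a tiny example. This is only an illustration under assumptions: the function names, the 2x2 mesh, and the brute-force enumeration of placements are hypothetical and are not taken from the patent, which does not state here how mapping cases are enumerated.

```python
from itertools import permutations

# Hypothetical 2x2 base mesh: PEs are (row, col) pairs; links are undirected.
PES = [(0, 0), (0, 1), (1, 0), (1, 1)]
BASE_LINKS = {
    frozenset({(0, 0), (0, 1)}),
    frozenset({(1, 0), (1, 1)}),
    frozenset({(0, 0), (1, 0)}),
    frozenset({(0, 1), (1, 1)}),
}

def added_links(mapping, deps, base_links):
    """Links required by the data dependencies but absent from the base mesh."""
    needed = {frozenset({mapping[a], mapping[b]}) for a, b in deps}
    return needed - base_links

def best_mapping_case(ops, deps, pes, base_links):
    """Enumerate every placement (mapping case); keep the one with the fewest added links."""
    best = None
    for placement in permutations(pes, len(ops)):
        mapping = dict(zip(ops, placement))
        extra = added_links(mapping, deps, base_links)
        if best is None or len(extra) < len(best[1]):
            best = (mapping, extra)
    return best

# Three mutually dependent operations form a triangle, which a mesh cannot
# host without one extra link.
mapping, extra = best_mapping_case(
    ["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")], PES, BASE_LINKS
)
```

Exhaustive enumeration grows factorially with the number of processing elements, so it is only practical for very small arrays.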

Preferably, the mapping case generating step may include: converting the base reconfigurable processing array structure into a first data structure in the form of a graph in which each processing element belonging to the structure is a vertex and each interconnection between processing elements is an edge; converting the obtained operations and data dependencies into a second data structure in the form of a graph in which the operations are vertices and the data dependencies are edges; and generating the mapping cases by mapping the first and second data structures to each other.
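The two graph-form data structures might be built as follows. This is a minimal sketch: the names `mesh_graph` and `dataflow_graph`, the (row, col) vertex labels, and the use of undirected edges are illustrative assumptions, not details given in the patent.

```python
from itertools import product

def mesh_graph(rows, cols):
    """First data structure: PEs as vertices, mesh interconnections as edges."""
    vertices = list(product(range(rows), range(cols)))
    edges = set()
    for r, c in vertices:
        if r + 1 < rows:  # link to the PE below
            edges.add(frozenset({(r, c), (r + 1, c)}))
        if c + 1 < cols:  # link to the PE on the right
            edges.add(frozenset({(r, c), (r, c + 1)}))
    return vertices, edges

def dataflow_graph(dependencies):
    """Second data structure: operations as vertices, data dependencies as edges."""
    vertices = sorted({op for dep in dependencies for op in dep})
    edges = {frozenset(dep) for dep in dependencies}
    return vertices, edges

pe_vertices, pe_edges = mesh_graph(2, 2)
op_vertices, op_edges = dataflow_graph([("a", "b"), ("b", "c")])
```

A mapping case then corresponds to an assignment of `op_vertices` to distinct `pe_vertices`; dependency edges that land on no mesh edge are the candidates for added interconnections.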

Preferably, the mapping case generating step may further include classifying the obtained operations by row of the two-dimensional array of the base reconfigurable processing array structure, based on the obtained information about the operations and data dependencies and on a shared-resource constraint under which a dedicated processing element allocated to each row of the base reconfigurable processing array structure is a shared resource that the other processing elements in the same row can share.

Preferably, the dedicated processing element may include at least one of an arithmetic-logic unit (ALU) that handles arithmetic and logic operations, a multiplier that handles multiplication operations, a floating-point unit (FPU) that handles floating-point operations, and a load/store unit (LSU) that handles memory operations.

Preferably, the mapping case generating step may further include: searching, based on the classification result, for operations that simultaneously have data dependencies with a plurality of operations that are at a one-hop distance from themselves and belong to the same row, and for operations that have a data dependency with an operation that is at a distance of two or more hops from themselves and belongs to a different row; and an additional interconnection determining step of determining the interconnections to be added to the base reconfigurable processing array structure based on the data dependencies of the operations found.

Preferably, the additional interconnection determining step may include: mapping data dependencies of the operations found to routing interconnections that are routed through unmapped processing elements, that is, processing elements to which no operation obtained from the compiler is mapped; and mapping the remaining data dependencies of the operations found, which were not mapped to routing interconnections, to interconnections to be added to the base reconfigurable processing array structure.
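The split between routed dependencies and added interconnections can be illustrated with a greedy sketch. Several points are assumptions made only for this example: the name `route_or_add`, routing through exactly one intermediate processing element, and letting each unmapped processing element carry at most one route.

```python
# Hypothetical neighbor table for a 2x2 mesh of (row, col) PEs.
NEIGHBORS = {
    (0, 0): {(0, 1), (1, 0)},
    (0, 1): {(0, 0), (1, 1)},
    (1, 0): {(0, 0), (1, 1)},
    (1, 1): {(0, 1), (1, 0)},
}

def route_or_add(unmatched, neighbors, free_pes):
    """Route each unmatched dependency through a free PE if possible, else add a link.

    unmatched: list of (src_pe, dst_pe) pairs with no direct base link.
    free_pes:  PEs to which no operation is mapped.
    """
    free = set(free_pes)
    routed, added = {}, []
    for src, dst in unmatched:
        # A free PE adjacent to both endpoints can forward the value.
        via = sorted(free & neighbors[src] & neighbors[dst])
        if via:
            routed[(src, dst)] = via[0]
            free.remove(via[0])  # assumption: one route per free PE
        else:
            added.append((src, dst))  # must become a new interconnection
    return routed, added

# The diagonal dependency (0,0) -> (1,1) is routed through the idle PE (0,1).
routed, added = route_or_add([((0, 0), (1, 1))], NEIGHBORS, {(0, 1)})
```

When no unmapped processing element is adjacent to both endpoints, the dependency falls into `added`, which corresponds to the interconnections that extend the base structure.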

Preferably, the generating step may include: performing an A* search that adds the obtained operations one by one as nodes of a tree structure, compares the data dependencies between the operations added as nodes with the interconnections between the processing elements of the base reconfigurable processing array structure, computes both the incurred additional cost of the interconnections that must be added according to the comparison result and the minimum expected additional cost that may arise from the data dependencies not yet compared, and uses the sum of the two as the weight in searching for the minimum-cost graph; and generating the extended reconfigurable processing array structure based on the minimum-cost graph derived by the A* search.
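A compact sketch of such an A* search follows. It is an illustration under stated assumptions, not the patent's exact procedure: cost is counted as one unit per added interconnection, operations are placed in list order, dependencies are distinct unordered pairs of distinct operations, and the lower bound used for the expected cost (pending dependencies minus base links still unused) is a heuristic chosen here because it never overestimates.

```python
import heapq
from itertools import count

def astar_min_added_links(ops, deps, pes, base_links):
    """Place ops on PEs one at a time; return (mapping, number of added links)."""

    def score(placed):
        mapping = dict(zip(ops, placed))
        incurred = used = pending = 0
        for a, b in deps:
            if a in mapping and b in mapping:
                if frozenset({mapping[a], mapping[b]}) in base_links:
                    used += 1      # dependency served by an existing link
                else:
                    incurred += 1  # dependency needs an added link
            else:
                pending += 1       # not yet compared
        # Pending dependencies can reuse at most the base links still unused.
        estimate = max(0, pending - (len(base_links) - used))
        return incurred, estimate

    tie = count()  # tie-breaker so the heap never compares placements
    g, h = score(())
    frontier = [(g + h, next(tie), ())]
    while frontier:
        f, _, placed = heapq.heappop(frontier)
        if len(placed) == len(ops):
            return dict(zip(ops, placed)), f  # estimate is 0 here, so f is the cost
        for pe in pes:
            if pe not in placed:
                child = placed + (pe,)
                g, h = score(child)
                heapq.heappush(frontier, (g + h, next(tie), child))
    return None  # more operations than PEs

PES = [(0, 0), (0, 1), (1, 0), (1, 1)]
MESH_LINKS = {
    frozenset({(0, 0), (0, 1)}), frozenset({(1, 0), (1, 1)}),
    frozenset({(0, 0), (1, 0)}), frozenset({(0, 1), (1, 1)}),
}
# Four fully connected operations on a 2x2 mesh.
ALL_PAIRS = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
mapping, cost = astar_min_added_links(["a", "b", "c", "d"], ALL_PAIRS, PES, MESH_LINKS)
```

Because the search tree never revisits a partial placement and the estimate is a lower bound, the first complete placement popped from the queue has the minimum number of added links under these assumptions.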

Preferably, when there are multiple applications to execute, the compiler may select at least one of them based on priorities predetermined for the applications, compile the selected applications in priority order, and provide information about the operations constituting the compiled applications and about the data dependencies representing the data dependence relations between the operations.

Preferably, the base reconfigurable processing array structure and the extended reconfigurable processing array structure may each be a coarse-grained reconfigurable array structure.

To achieve the above technical object, a computer-readable recording medium according to the present invention stores a program for generating a reconfigurable processing array structure, the program including: a function of obtaining, from a compiler, information about the operations constituting at least one compiled application and about the data dependencies representing the data dependence relations between the operations; and a function of generating, based on the obtained information about the operations and data dependencies, an extended reconfigurable processing array structure by adding, to a base reconfigurable processing array structure, interconnections between processing elements belonging to the base reconfigurable processing array structure.

According to the present invention, an extended reconfigurable processor array structure can be generated that delivers high performance at relatively low cost compared with conventional reconfigurable processor array structures (e.g., mesh, 1-hop, diagonal, and mixed types).

FIG. 1A is a diagram illustrating a general template of a coarse-grained reconfigurable array structure, which is a preferred example of a reconfigurable processing array structure.
FIG. 1B is a diagram illustrating the configuration of a processing element, which is one component of the coarse-grained reconfigurable array structure that is a preferred example of a reconfigurable processing array structure.
FIGS. 2A to 2D are diagrams illustrating reconfigurable processing array structures of typical reconfigurable processors.
FIG. 3 is a diagram illustrating an embodiment of a method for generating a reconfigurable processing array structure according to a preferred embodiment of the present invention.
FIGS. 4A to 4C are diagrams illustrating mapping cases produced by the mapping case generation process of the method for generating a reconfigurable processing array structure according to a preferred embodiment of the present invention.
FIG. 5 is a diagram illustrating the flow of the process, in the method for generating a reconfigurable processing array structure according to a preferred embodiment of the present invention, of mapping the information obtained from the compiler to the base reconfigurable processing array structure and deriving the interconnections to be added to the existing base reconfigurable processing array structure.
FIGS. 6A to 6H are diagrams illustrating in more detail the process of deriving an extended reconfigurable processing array structure by performing an A* search according to a preferred embodiment of the present invention.

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various apparatuses that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. Furthermore, all conditional terms and embodiments listed herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and should be understood as not limiting the invention to the specifically listed embodiments and conditions. It should also be understood that all detailed descriptions reciting the principles, aspects, and embodiments of the invention, as well as specific embodiments, are intended to encompass their structural and functional equivalents. Such equivalents should be understood to include not only currently known equivalents but also equivalents developed in the future, that is, all elements invented to perform the same function regardless of structure.

Accordingly, the functions of the various elements shown in the drawings, including functional blocks labeled as processors or similar concepts, may be provided through dedicated hardware as well as through hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which may be shared. In addition, the use of terms such as processor, control, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware and ROM, RAM, and non-volatile memory for storing software. Other well-known, commonly used hardware may also be included.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In describing the present invention, where a detailed description of related known technology is judged likely to obscure the gist of the invention unnecessarily, that description is omitted or given only briefly.

Meanwhile, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

Hereinafter, the present invention will be described in detail according to preferred embodiments with reference to the accompanying drawings.

FIG. 1A is a diagram illustrating a general template of a coarse-grained reconfigurable array (CGRA, 100) structure, which is a preferred example of a reconfigurable processing array structure.

Referring to FIG. 1A, the CGRA 100 structure includes a processing element array (PE array, 120) in which processing elements (PEs, 110) for parallel computation are connected by a mesh network to form a two-dimensional array.

Each processing element 110 of the processing element array 120 passively receives and transmits input/output data between the local memory 130 and the main memory 160 through a set of data buses that operate on each row of the processing element array 120.

The processing elements (PEs) 110 are computational units capable of executing arithmetic or logical operations, multiplications, or loads/stores. A processing element 110 can load data from or store data to the local memory 130, and can also process the outputs of neighboring processing elements through the interconnection network.

Many resource-constrained CGRA designs have processing elements dedicated to specific functions. For example, in each row of the processing element array, some processing elements are reserved for arithmetic operations or multiplication, while others are reserved for loading from and storing to the local memory 130. The function assigned to each processing element (i.e., the selection of source operands, the destination of the result, and the operation to execute) is stored in the configuration memory 140, and the configuration is generated as the result of compiling an application for the CGRA.

Representative examples of such dedicated processing elements are the arithmetic-logic unit (ALU), which handles arithmetic and logic operations; the multiplier, which handles multiplication operations; the floating-point unit (FPU), which handles floating-point operations; and the load/store unit (LSU), which handles memory operations. These may be restricted to shared resources, with only one allocated to each row of the processing element array 120 and shared by the other processing elements in that row.
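The one-unit-per-row restriction can be checked mechanically once operations have been assigned to rows. The sketch below is an assumption-laden illustration: the function name, the dictionaries `row_of` and `unit_of`, and the reading that at most one operation per row may claim a given shared unit within the same configuration are hypothetical and not specified in the patent.

```python
from collections import Counter

def shared_resource_conflicts(row_of, unit_of, units_per_row=1):
    """Return (row, unit) pairs where more operations need a shared unit than the row has.

    row_of:  operation name -> row index it is classified into.
    unit_of: operation name -> shared unit it needs ("LSU", "MUL", ...);
             operations absent from unit_of need no shared unit.
    """
    usage = Counter()
    for op, row in row_of.items():
        unit = unit_of.get(op)
        if unit is not None:
            usage[(row, unit)] += 1
    return sorted(key for key, n in usage.items() if n > units_per_row)

# Two loads placed in row 0 compete for that row's single LSU.
row_of = {"ld0": 0, "ld1": 0, "add": 0, "st": 1}
unit_of = {"ld0": "LSU", "ld1": "LSU", "st": "LSU"}
conflicts = shared_resource_conflicts(row_of, unit_of)
```

A row classification with an empty conflict list satisfies the constraint under this reading; otherwise one of the conflicting operations would have to move to another row.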

For example, FIG. 1A marks the LSUs among the processing elements of the processing element array with hatching. Referring to FIG. 1A, one LSU is allocated to each row of the processing element array.

In general, a CGRA is used as a coprocessor to the main processor 150. The main processor 150 manages CGRA execution, such as loading the CGRA configuration and initializing CGRA execution, through memory-mapped I/O. Once the CGRA starts executing, the main processor 150 can perform other tasks. An interrupt may be used to signal the end of CGRA execution. The local memory 130 of the CGRA is managed by the CGRA through a DMA controller (not shown). Hardware double buffering allows fast switching between buffers as well as full overlap between computation and data transfer in the CGRA. This is particularly important for large loops, which may require multiple buffer switches during the execution of a single loop.

FIG. 1B is a diagram illustrating the configuration of a processing element 110, which is one component of the coarse-grained reconfigurable array structure 100 shown in FIG. 1A.

The processing element 110 includes first and second multiplexers 111 and 112, which receive the operation outputs of neighboring processing elements and the processing element's own operation output; a functional unit (FU) 113, which performs operations; an output register 114, which stores the output of the functional unit 113 and outputs it externally; and a register file 115, which feeds back the output of the functional unit 113. The processing element 110 is connected to the configuration memory 140 and performs the function assigned to it (i.e., the selection of source operands, the destination of the result, and the operation to execute) as stored in the configuration memory 140.

As shown in FIG. 1A, a typical CGRA has two storage units: the configuration memory 140 and the local memory 130. The configuration memory 140 supplies the configuration of the instructions (configuration code) executed by the processing elements 110. The instructions include information about the operations executed by the processing elements 110 and about the interconnections between them. The instructions also specify when data in the local memory 130 must be loaded onto the data bus and when data on the bus must be stored to the local memory 130, so the processing elements 110 can receive and transmit data without a separate address generation unit (not shown).

CGRA execution can generally be divided into three phases: an input data loading phase (first phase), a computation phase (second phase), and an output data transfer phase (third phase).

In the input data loading phase (first phase), input data is transferred to the local memory 130. The input data may be written directly by the processor 150 as a result of task execution, or transferred from the main memory 160. The processing element array 120 begins computation by fetching instructions from the configuration memory 140 and input data from the local memory 130. The processing element array 120 then generates output data and writes it back to the local memory 130, completing the computation phase (second phase). In the output data transfer phase (third phase), the output data may be processed by the processor 150 or by other masters of the system (not shown), or copied to the main memory 160 for later use.

FIGS. 2A to 2D are diagrams illustrating reconfigurable processing array structures of typical reconfigurable processors. FIGS. 2A to 2D show, respectively, a mesh reconfigurable processing array structure, a 1-hop reconfigurable processing array structure, a diagonal reconfigurable processing array structure, and a mixed reconfigurable processing array structure.

In the embodiments of the present invention described below, the mesh reconfigurable processing array structure shown in FIG. 2A is selected as the base reconfigurable processing array structure, which is the basic unit for extending a reconfigurable processing array structure, and a method is proposed for generating an extended reconfigurable processing array structure from it. Selecting the mesh structure as the base reconfigurable processing array structure is, however, only one embodiment, and the present invention is not limited thereto.

FIG. 3 is a diagram illustrating an embodiment of a method for generating a reconfigurable processing array structure according to a preferred embodiment of the present invention.

Referring to FIG. 3, a compiler first compiles at least one application (S310) and generates information about the operations constituting the applications and about the data dependencies representing the data dependence relations between those operations.

In this embodiment, when the compiler compiles two or more applications, a priority may be assigned to each application and the applications may be compiled according to those priorities.

For example, the applications may be compiled in order from highest to lowest priority to generate the operation and data dependency information, or only some high-priority applications may be compiled to generate that information, which is then provided as reference information for generating the reconfigurable processing array structure.
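Both variants (compile everything in priority order, or compile only the top few) reduce to a sort followed by an optional cut-off. In this sketch the function name, the (name, priority) tuples, and the convention that a larger number means a higher priority are all assumptions made for illustration.

```python
def select_applications(apps, limit=None):
    """Order applications from highest to lowest priority; keep the first `limit`.

    apps:  list of (name, priority) pairs; larger priority is compiled first
           (an assumed convention).
    limit: number of applications to keep, or None to keep all of them.
    """
    ordered = sorted(apps, key=lambda app: app[1], reverse=True)
    return [name for name, _ in ordered[:limit]]

apps = [("fft", 2), ("jpeg", 5), ("aes", 1)]
compile_order = select_applications(apps)       # every application, by priority
top_two = select_applications(apps, limit=2)    # only the high-priority subset
```

The resulting list is the order in which the applications would be handed to the compiler to produce the operation and data dependency information.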

The information about the operations and data dependencies generated in step S310 is then obtained from the compiler (S320).

Mapping cases are generated by mapping the operations and data dependencies obtained in step S320 onto the processing elements of the base reconfigurable processing array structure and the interconnections between those processing elements (S330).

Each mapping case in step S330 takes the form of the base reconfigurable processing array structure with interconnections added for the data dependencies that have no corresponding interconnection.

The base reconfigurable processing array structure in this embodiment is a mesh-type structure in which at least as many processing elements as the operations obtained in step S320 are connected in a mesh network to form a two-dimensional array. This is only one embodiment, and the present invention is not limited to it.

In general, the array must contain more processing elements than there are operations to map. The number of processing elements required of the base structure can therefore be derived from the information about the operations and data dependencies obtained in step S320, and a base reconfigurable processing array structure containing at least that many processing elements can be selected.

Among the mapping cases generated in step S330, the one that requires the fewest added interconnections is selected (S340).

Based on the mapping case selected in step S340, interconnections are added to the base reconfigurable processing array structure to generate the extended reconfigurable processing array structure (S350).

FIGS. 4A to 4C illustrate mapping cases produced by the mapping case generation step of the method shown in FIG. 3.

Referring to FIGS. 4A to 4C, this embodiment may convert the base reconfigurable processing array structure and the information obtained from the compiler into graph data structures and map one onto the other.

More specifically, FIG. 4A shows the base reconfigurable processing array structure converted into a first graph data structure in which each processing element is a VERTEX and each interconnection between processing elements is an EDGE.

FIG. 4B shows the operations and data dependencies obtained from the compiler converted into a second graph data structure, with the operations as VERTEX and the data dependencies as EDGE.

FIG. 4C shows an example of a mapping case generated by mapping the first and second data structures onto each other.

Once both are in graph form, an inexact graph matching algorithm can be used to generate mapping cases more easily.
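The two graph data structures can be illustrated with a minimal Python sketch. It builds a 2x3 mesh as the first data structure, takes the four-operation example used later in this description as the second, and lists the data dependencies that do not land on an existing mesh edge for one candidate placement. The function names `mesh_graph` and `unmatched_dependencies` and the dictionary-based placement are assumptions made for illustration; the description does not prescribe a representation, and this is not an inexact graph matching implementation.

```python
def mesh_graph(rows, cols):
    """First data structure: PEs as vertices, mesh links as undirected edges."""
    vertices = [f"PE{r * cols + c + 1}" for r in range(rows) for c in range(cols)]
    edges = set()
    for r in range(rows):
        for c in range(cols):
            here = r * cols + c
            if c + 1 < cols:  # link to the right-hand neighbour
                edges.add(frozenset((vertices[here], vertices[here + 1])))
            if r + 1 < rows:  # link to the neighbour below
                edges.add(frozenset((vertices[here], vertices[here + cols])))
    return vertices, edges


def unmatched_dependencies(dependencies, placement, pe_edges):
    """Dependencies whose endpoints do not sit on directly linked PEs."""
    return [
        (src, dst)
        for src, dst in dependencies
        if frozenset((placement[src], placement[dst])) not in pe_edges
    ]


# Second data structure: operations as vertices, data dependencies as directed edges.
operations = [1, 2, 3, 4]
dependencies = [(1, 3), (1, 4), (2, 3), (3, 4)]

pe_vertices, pe_edges = mesh_graph(2, 3)
placement = {1: "PE1", 4: "PE2", 2: "PE4", 3: "PE5"}  # one candidate mapping case
missing = unmatched_dependencies(dependencies, placement, pe_edges)
print(missing)
```

Each entry of `missing` is a dependency that would need either routing or an added interconnection in that mapping case.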

FIG. 5 illustrates the flow of mapping the information obtained from the compiler onto the base reconfigurable processing array structure and deriving the interconnections that must be added to the base structure, in the method of generating a reconfigurable processing array structure according to a preferred embodiment of the present invention.

Referring to FIG. 5, the operations obtained from the compiler are first scattered across the rows of the two-dimensional array of the base reconfigurable processing array structure onto which they will be mapped, based on the operation and data-dependency information obtained from the compiler and on a shared-resource constraint (S510).

Here, the shared-resource constraint means that, to use limited resources more efficiently, a dedicated processing element is allocated to each row of the base reconfigurable processing array structure and is restricted to being a shared resource that the other processing elements in the same row can share. The dedicated processing element may be, for example, an ALU (arithmetic-logic unit) for arithmetic and logic operations, a multiplier for multiplication operations, an FPU (floating point unit) for floating-point operations, or an LSU (load/store unit) for memory operations.

Based on the result of step S510, the data dependencies that are difficult to map onto the interconnections of the base reconfigurable processing array structure are searched for (S520).

These difficult-to-map data dependencies fall into two broad categories.

First, when an operation has data dependencies with several adjacent operations (operations at a 1-hop distance) that all belong to the same row, those dependencies form a fork shape and include a dependency that is difficult to map onto the mesh network of the base reconfigurable processing array. This embodiment calls this case a "fork".

Second, a data dependency between an operation and an operation in a different row at a distance of two or more hops is difficult to map onto the mesh network of the base reconfigurable processing array. This embodiment calls this case an "over-length edge".

In other words, step S520 searches for the difficult-to-map data dependencies that correspond to forks and over-length edges.
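A minimal Python sketch of the search in step S520 follows. The description does not spell out the exact detection rule, so two readings are assumed here and marked in the comments: a fork is an operation with two or more dependency neighbours that share a single row other than its own, and an over-length edge joins operations whose rows are two or more apart. The name `find_hard_dependencies` and the `row_of` dictionary are hypothetical.

```python
from collections import defaultdict


def find_hard_dependencies(dependencies, row_of):
    """Return the fork operations and the over-length edges for a row assignment."""
    # Neighbours of each operation in the data-dependency graph, grouped by row.
    by_row = defaultdict(lambda: defaultdict(set))
    for src, dst in dependencies:
        by_row[src][row_of[dst]].add(dst)
        by_row[dst][row_of[src]].add(src)

    # Assumption: a fork is an operation with two or more dependency
    # neighbours that share one row other than the operation's own row.
    forks = [
        op
        for op, rows in by_row.items()
        if any(row != row_of[op] and len(ops) >= 2 for row, ops in rows.items())
    ]

    # Assumption: an over-length edge joins operations two or more rows apart.
    over_length = [
        (src, dst)
        for src, dst in dependencies
        if abs(row_of[src] - row_of[dst]) >= 2
    ]
    return sorted(forks), over_length


# Example from FIG. 6B: operations 1 and 4 in the first row, 2 and 3 in the second.
row_of = {1: 0, 4: 0, 2: 1, 3: 1}
dependencies = [(1, 3), (1, 4), (2, 3), (3, 4)]
forks, over_length = find_hard_dependencies(dependencies, row_of)
print(forks, over_length)
```

Under these assumptions the FIG. 6B example yields one fork (at operation 3) and no over-length edge, which agrees with the count given for that example in this description.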

For each difficult-to-map data dependency found in step S520, it is first determined whether the dependency can be mapped to a routing interconnection, that is, a path routed through an unmapped processing element to which no operation obtained from the compiler has been mapped. If it can, the dependency is mapped to that routing interconnection (S530).

This is because implementing a data dependency by routing it through an unmapped processing element of the base reconfigurable processing array structure costs far less than adding a new interconnection to the base structure.

Only the data dependencies that could not be mapped to a routing interconnection in step S530 are determined to be interconnections that must be added to the base reconfigurable processing array structure (S540).

FIGS. 6A to 6H illustrate in more detail the process of deriving the extended reconfigurable processing array structure by performing an A* search according to a preferred embodiment of the present invention.

In this embodiment, the extended reconfigurable processing array structure is derived with the A* search algorithm.

A* search is a well-known graph search algorithm that finds a least-cost path from a start point to a goal point. Its distinguishing feature is that it uses a heuristic function to estimate how close a node is to the goal.

That is, the cost function f(n) of A* search is the sum of g(n), the weight accumulated from the start point (for example, the root of a tree) to the current node n, and h(n), the weight expected to be incurred if the best choices are made from the current node n to the goal. Because h(n) estimates the best case, a heuristic function customized to the problem is used for it.
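For readers unfamiliar with A*, the following is a generic Python sketch of the f(n) = g(n) + h(n) ordering on a toy graph. It is not the embodiment's search over mapping states; the function `a_star`, the toy graph, and the zero heuristic are all illustrative assumptions.

```python
import heapq


def a_star(start, is_goal, successors, h):
    """Return (cost, path) of a cheapest path, or None if the goal is unreachable.

    successors(node) yields (next_node, step_cost) pairs; h(node) estimates
    the remaining cost and must not overestimate it.
    """
    frontier = [(h(start), 0, [start])]  # entries are (f, g, path)
    best_g = {start: 0}
    while frontier:
        _, g, path = heapq.heappop(frontier)
        node = path[-1]
        if is_goal(node):
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # a cheaper route to this node was already expanded
        for nxt, step in successors(node):
            g2 = g + step
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, path + [nxt]))
    return None


# Toy graph (hypothetical): edge lists of (neighbour, cost).
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
result = a_star("A", lambda n: n == "D", lambda n: graph[n], lambda n: 0)
print(result)
```

With h fixed at 0 the search degenerates to uniform-cost search; a problem-specific h, such as the one this embodiment defines, prunes the search while keeping the result optimal as long as it never overestimates.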

In this embodiment, the operations obtained from the compiler are added one by one as nodes of a tree, and A* search is performed to derive the least-cost graph.

In this embodiment, g(s) is the additional cost already incurred: the data dependencies between the operations added as nodes up to the current state s are compared with the interconnections between the processing elements of the base reconfigurable processing array structure, and the cost of what must be added as a result is computed. h(s) is the minimum additional cost that can be expected from the data dependencies not yet compared. The sum of the two is evaluated as the cost.

In computing the incurred cost in g(s) and the expected cost in h(s), the unit cost of routing a difficult-to-map data dependency through an unmapped processing element is defined as C₁, and the unit cost of realizing a data dependency by adding a separate interconnection is defined as C₂. In this embodiment they are arbitrarily assigned the values 1 and 3, respectively.

In this embodiment, N_r denotes the number of data dependencies, among those not yet mapped in the current state s, that are difficult to map onto the interconnections of the base reconfigurable processing array structure. As Equation 1 shows, it is the sum of the number of forks found in the current state (N_f) and the number of over-length edges (|E₀|).

$$N_r = N_f + |E_0| \qquad \text{(Equation 1)}$$

The specific formula for computing h(s) in this embodiment is as follows.

$$h(s) = \begin{cases} C_1 N_r, & N_r \le N_i \\ C_1 N_i + C_2\,(N_f + |E_0| - N_i), & N_r > N_i \end{cases} \qquad \text{(Equation 2)}$$

Here, N_i is the number of operations not yet mapped in the current state s. The two branches give the same value when N_r = N_i.

First, the number of difficult-to-map data dependencies among those not yet mapped in the current state (N_r) is computed according to Equation 1. Then, to judge whether they can later be resolved by routing, N_r is compared with the number of operations not yet mapped in the current state (N_i).

If there are more unmapped nodes than difficult-to-map data dependencies, h(s) is the cost of resolving all of those dependencies by routing between the unmapped nodes, that is, N_r multiplied by the unit routing cost C₁.

Conversely, if there are more difficult-to-map data dependencies than unmapped nodes, as many as possible are assumed to be resolved by routing (C₁·N_i) and the rest by adding interconnections (C₂·(N_f + |E₀| − N_i)), and h(s) is the sum of the two.
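The two cases described above can be written as a small Python function. This is a sketch under the stated unit costs (C₁ = 1, C₂ = 3); the function and parameter names are hypothetical.

```python
C1 = 1  # unit cost of routing through an unmapped processing element
C2 = 3  # unit cost of adding a new interconnection


def heuristic(n_forks, n_over_length, n_unmapped_ops, c1=C1, c2=C2):
    """Estimate h(s) from the fork count, over-length edge count, and N_i."""
    n_hard = n_forks + n_over_length  # N_r = N_f + |E_0|
    if n_hard <= n_unmapped_ops:
        # Enough unmapped nodes remain to route every hard dependency.
        return c1 * n_hard
    # Route as many as possible, add interconnections for the rest.
    return c1 * n_unmapped_ops + c2 * (n_hard - n_unmapped_ops)


print(heuristic(1, 0, 4))  # one fork, four unmapped operations
print(heuristic(3, 1, 2))  # four hard dependencies, two unmapped operations
```

The first call corresponds to the starting state of the example in FIG. 6B (one fork, no over-length edge, all four operations unmapped); the second is a made-up state that exercises the second branch.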

Referring to FIG. 6A, this embodiment uses as the base reconfigurable processing array structure a mesh-type structure consisting of six processing elements (PE1, PE2, PE3, PE4, PE5, PE6) in a 2×3 two-dimensional array.

Referring to FIG. 6B, the description assumes that four operations 1, 2, 3, and 4 and the data dependencies 1->3, 1->4, 2->3, and 3->4 were received from the compiler, and that under the shared-resource constraint operations 1 and 4 were assigned to the first row and operations 2 and 3 to the second row.

In this case the data dependencies 1->3 and 3->4 produce one fork, and there is no over-length edge. Therefore h(s) is N_r(1) × C₁(1) = 1 until the fork is resolved by routing or by adding an interconnection, and 0 afterwards.

Referring to FIG. 6C, operation 1 is first mapped to each of the first-row processing elements PE1 to PE3 in turn (s=1 to 3). In every case nothing has yet been resolved by routing or by adding an interconnection, so g(s) and h(s) are 0 and 1, respectively.

Because the cost is the same in each case, assume that 1 is mapped to PE1. Operation 4, the next operation assigned to the first row, is then mapped to each of the remaining first-row processing elements, PE2 and PE3, and the cost of each mapping is computed (s=4, 5).

When 4 is mapped to PE2 (s=4), the data dependency with the previously mapped operation 1, namely 1->4, maps to PE1->PE2, which is one of the mesh interconnections of the base reconfigurable processing array structure. No additional cost is incurred, so g(4) is 0, and since the fork is still unresolved, h(4) is 1.

When 4 is mapped to PE3 (s=5), on the other hand, 1->4 must be mapped as 1(PE1)->PE2->4(PE3), routed through PE2, an unmapped processing element to which no operation has been mapped yet. This incurs an additional cost of C₁(1), so g(5) is 1, and since the fork is still unresolved, h(5) remains 1.

Operations 2 and 3 are therefore mapped to the second row, PE4 to PE6, starting from the cheaper case in which 1 is mapped to PE1 and 4 to PE2.

First, when 2 is mapped to each of PE4 to PE6 (s=6 to 8), 2 has no data dependency with the previously mapped operations 1 and 4. Compared with the previous step (s=4), no additional cost is incurred and the fork is not resolved, so g(s) and h(s) stay at 0 and 1, respectively.

Next, assume that 2 is mapped to PE4. When 3 is mapped to PE5 (s=9), the data dependencies between 3 and the previously mapped operations 1, 2, and 4, namely 1->3, 2->3, and 3->4, are mapped onto the base reconfigurable processing array structure.

Referring to FIG. 6D, which shows this step in more detail, 2->3 and 3->4 can be mapped onto mesh interconnections of the base structure, but 1->3 cannot, and no processing element is available between 1 and 3 to route it. The data dependency 1->3 can therefore be resolved only by adding a new interconnection to the base structure. The incurred cost is C₂(3), so g(9) is 3, and because the fork is thereby resolved, h(9) is 0.

When 2 is mapped to PE4 and 3 to PE6 (s=10), the data dependencies between 3 and the previously mapped operations 1, 2, and 4, namely 1->3, 2->3, and 3->4, are mapped onto the base reconfigurable processing array structure.

Referring to FIG. 6E, which shows this step in more detail, none of the data dependencies to be mapped at this step, 1->3, 2->3, and 3->4, can be mapped onto a mesh interconnection of the base structure.

Of these, 2->3 can be resolved by a routing connection through the unmapped processing element PE5 (2(PE4)->PE5->3(PE6)), and 3->4 by a routing connection through PE3, which is also unmapped (3(PE6)->PE3->4(PE2)).

By contrast, no processing element can route between 1 and 3, so 1->3 can be resolved only by adding a new interconnection to the base structure.

Therefore, when 1 is mapped to PE1, 4 to PE2, 2 to PE4, and 3 to PE6 (s=10), the incurred cost is C₁(1)+C₁(1)+C₂(3), so g(10)=5, and because the fork is thereby resolved, h(10) is 0.

When 2 is mapped to PE5 and 3 to PE4 (s=11), the data dependencies between 3 and the previously mapped operations 1, 2, and 4, namely 1->3, 2->3, and 3->4, are mapped onto the base reconfigurable processing array structure.

Referring to FIG. 6F, which shows this step in more detail, 1->3 and 2->3 can be mapped onto mesh interconnections of the base structure, but 3->4 cannot.

Moreover, no processing element can route between 3 and 4, so 3->4 must be resolved by adding a new interconnection to the base structure. The incurred cost is C₂(3), so g(11)=3, and because the fork is thereby resolved, h(11) is 0.

When 2 is mapped to PE5 and 3 to PE6 (s=12), the data dependencies between 3 and the previously mapped operations 1, 2, and 4, namely 1->3, 2->3, and 3->4, are mapped onto the base reconfigurable processing array structure.

Referring to FIG. 6G, which shows this step in more detail, 2->3 can be mapped onto a mesh interconnection of the base structure, but 1->3 and 3->4 cannot.

Of these, 3->4 can be resolved by a routing connection through the unmapped processing element PE3 (3(PE6)->PE3->4(PE2)). No processing element can route between 1 and 3, however, so 1->3 must be resolved by adding a new interconnection to the base structure. The incurred cost is C₁(1)+C₂(3), so g(12) is 4 and h(12) is 0.

When 2 is mapped to PE6 and 3 to PE4 (s=13), of the data dependencies between 3 and the previously mapped operations 1, 2, and 4, namely 1->3, 2->3, and 3->4, only 1->3 can be mapped onto a mesh interconnection of the base structure. 2->3 must be resolved by a routing connection through the unmapped processing element PE5, and 3->4 must be resolved by adding a new interconnection. The incurred cost is C₁(1)+C₂(3), so g(13) is 4 and h(13) is 0.

Finally, when 2 is mapped to PE6 and 3 to PE5 (s=14), the data dependencies between 3 and the previously mapped operations 1, 2, and 4, namely 1->3, 2->3, and 3->4, are mapped onto the base reconfigurable processing array structure. FIG. 6H shows this step in more detail.

Of the data dependencies to be mapped at this step, 2->3 and 3->4 can be mapped onto mesh interconnections of the base structure, while 1->3 must be resolved by a routing connection through the unmapped processing element PE4 (1(PE1)->PE4->3(PE5)). The incurred cost is therefore only C₁(1), so g(14) is 1 and h(14) is 0.

Consequently, the A* search described above yields the least-cost graph at s=14, so the mapping case corresponding to that state is selected.

That is, the selected mapping case maps operations 1, 4, 2, and 3 to PE1, PE2, PE6, and PE5, respectively, which is a case in which no separate interconnection needs to be added to the base reconfigurable processing array structure.
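As a cross-check of the walkthrough above, the following Python sketch recomputes the incurred cost for the six placements of operations 2 and 3 (s=9 to s=14), with 1 fixed at PE1 and 4 at PE2. Several points are assumptions rather than statements of the embodiment: a routed dependency passes through exactly one unmapped processing element adjacent to both endpoints, each such element relays at most one dependency (suggested by the s=13 case), and dependencies are handled greedily in list order. The function names are hypothetical.

```python
from itertools import permutations

ROWS, COLS = 2, 3
C1, C2 = 1, 3  # unit costs of routing and of adding an interconnection


def neighbors(pe):
    """Mesh neighbours of a 1-based PE index on a ROWS x COLS grid."""
    r, c = divmod(pe - 1, COLS)
    out = set()
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < ROWS and 0 <= cc < COLS:
            out.add(rr * COLS + cc + 1)
    return out


def added_cost(placement, deps):
    """Cost of realizing deps: 0 on a mesh link, C1 if routed, C2 if a link is added."""
    occupied = set(placement.values())
    used_relays = set()
    cost = 0
    for src, dst in deps:
        a, b = placement[src], placement[dst]
        if b in neighbors(a):
            continue  # maps directly onto an existing mesh interconnection
        # Assumption: a relay is one unmapped PE adjacent to both endpoints,
        # and each relay carries at most one routed dependency.
        relays = sorted((neighbors(a) & neighbors(b)) - occupied - used_relays)
        if relays:
            used_relays.add(relays[0])
            cost += C1
        else:
            cost += C2
    return cost


DEPS = [(1, 3), (1, 4), (2, 3), (3, 4)]
costs = {}
for pe2, pe3 in permutations((4, 5, 6), 2):
    placement = {1: 1, 4: 2, 2: pe2, 3: pe3}
    costs[(pe2, pe3)] = added_cost(placement, DEPS)

best = min(costs, key=costs.get)
print(costs, best)
```

Under these assumptions the sketch gives costs 3, 5, 3, 4, 4, and 1 for s=9 through s=14, and the cheapest placement puts operation 2 on PE6 and operation 3 on PE5. The greedy relay choice is enough for this small example but is not guaranteed to be optimal in general.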

Using the extended reconfigurable processing array structure generated according to this embodiment yields higher performance at relatively lower cost than conventional reconfigurable processing array structures (for example, mesh, 1-hop, diagonal, and hybrid types).

In particular, applying the method presented in this embodiment to search for the interconnections to be added to the base reconfigurable processing array structure makes it easier to generate the extended reconfigurable processing array structure.

The present invention can also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over networked computer systems so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers in the technical field to which the invention belongs.

The present invention has been described with reference to an embodiment shown in the accompanying drawings, but this is merely illustrative, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible from it. The true scope of protection of the present invention should therefore be determined only by the appended claims.

Claims

A method of generating a reconfigurable processing array structure, the method comprising:
obtaining, from a compiler, information about the operations constituting one or more compiled applications and about the data dependencies between those operations; and
generating an extended reconfigurable processing array structure by adding, based on the obtained information about the operations and data dependencies, interconnections between processing elements belonging to a base reconfigurable processing array structure to that base structure.

The method of claim 1,
wherein the base reconfigurable processing array structure is a two-dimensional array in which at least as many processing elements as the obtained operations are connected in a mesh network.

The method of claim 1,
further comprising deriving the number of processing elements required, based on the obtained information about the operations and data dependencies, and selecting a base reconfigurable processing array structure that contains at least the derived number of processing elements.

The method of claim 1,
wherein the generating step comprises:
a mapping case generating step of mapping the processing elements of the base reconfigurable processing array structure, and the interconnections between those processing elements, to the obtained operations and data dependencies, and generating mapping cases in each of which an interconnection is added for every data dependency that does not correspond to an existing interconnection;
a mapping case selecting step of selecting, from among the generated mapping cases, the mapping case having the smallest number of added interconnections; and
a step of generating the extended reconfigurable processing array structure based on the selected mapping case.
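
As an illustrative, non-limiting sketch, the mapping case generating and selecting steps can be written as an exhaustive enumeration. The function names (`mesh_links`, `mapping_cases`, `select_mapping_case`), the representation of processing elements as `(row, column)` tuples, and the treatment of interconnections as undirected are assumptions made for illustration only. Exhaustive enumeration is practical only for very small arrays.

```python
from itertools import permutations


def mesh_links(rows, cols):
    """Interconnections of a rows x cols mesh, each an undirected PE pair."""
    links = set()
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                links.add(frozenset({(r, c), (r + 1, c)}))
            if c + 1 < cols:
                links.add(frozenset({(r, c), (r, c + 1)}))
    return links


def mapping_cases(ops, deps, rows, cols):
    """Yield (mapping, added) for every placement of operations on PEs.

    `added` holds the PE pairs that need a new interconnection because the
    corresponding data dependency has no matching mesh link.
    """
    pes = [(r, c) for r in range(rows) for c in range(cols)]
    links = mesh_links(rows, cols)
    for placement in permutations(pes, len(ops)):
        mapping = dict(zip(ops, placement))
        needed = {frozenset({mapping[s], mapping[d]}) for s, d in deps}
        yield mapping, needed - links


def select_mapping_case(ops, deps, rows, cols):
    """Return the mapping case with the fewest added interconnections."""
    return min(mapping_cases(ops, deps, rows, cols), key=lambda case: len(case[1]))
```

For example, calling `select_mapping_case(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")], 2, 2)` returns a placement together with the set of interconnections that would extend the 2 x 2 base structure.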

5. The method of claim 4,
wherein the mapping case generating step comprises:
converting the base reconfigurable processing array structure into a first data structure in the form of a graph, in which each processing element belonging to the base reconfigurable processing array structure is a vertex and each interconnection between the processing elements is an edge;
converting the obtained operations and data dependencies into a second data structure in the form of a graph, in which each operation is a vertex and each data dependency is an edge; and
generating the mapping case by mapping the first data structure and the second data structure to each other.
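
As an illustrative, non-limiting sketch, the two graph-form data structures and their mapping can be represented as follows. The names `mesh_graph`, `dependency_graph`, and `uncovered_dependencies`, and the choice of plain Python lists and sets as the graph representation, are assumptions introduced here and are not taken from the specification.

```python
from itertools import product


def mesh_graph(rows, cols):
    """First data structure: PEs as vertices, mesh interconnections as edges."""
    vertices = list(product(range(rows), range(cols)))
    edges = set()
    for r, c in vertices:
        if r + 1 < rows:
            edges.add(frozenset({(r, c), (r + 1, c)}))
        if c + 1 < cols:
            edges.add(frozenset({(r, c), (r, c + 1)}))
    return vertices, edges


def dependency_graph(operations, dependencies):
    """Second data structure: operations as vertices, dependencies as edges."""
    known = set(operations)
    for src, dst in dependencies:
        if src not in known or dst not in known:
            raise ValueError(f"dependency ({src}, {dst}) names an unknown operation")
    return list(operations), set(dependencies)


def uncovered_dependencies(mapping, pe_edges, dep_edges):
    """Dependencies whose endpoints are mapped to PEs with no direct edge."""
    return {
        (src, dst)
        for src, dst in dep_edges
        if frozenset({mapping[src], mapping[dst]}) not in pe_edges
    }
```

Each dependency returned by `uncovered_dependencies` for a given mapping corresponds to one interconnection that the mapping case would have to add.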

5. The method of claim 4,
wherein the mapping case generating step comprises classifying the obtained operations by row of the two-dimensional array of the base reconfigurable processing array structure, based on the obtained information about the operations and data dependencies and on a shared-resource constraint under which a dedicated processing element allocated to each row of the base reconfigurable processing array structure is shared by the other processing elements belonging to the same row.

The method according to claim 6,
wherein the dedicated processing element includes at least one of an arithmetic-logic unit (ALU) for arithmetic and logic operations, a multiplier for multiplication operations, a floating-point unit (FPU) for floating-point operations, and a load/store unit (LSU) for memory operations.

The method according to claim 6,
wherein the mapping case generating step comprises:
searching, based on the classification result, for operations each having data dependencies with at least two operations that belong to one and the same row located one hop away from that operation; and
an additional interconnection determining step of determining the interconnections to add to the base reconfigurable processing array structure based on the data dependencies of the found operations.

9. The method of claim 8,
wherein the additional interconnection determining step comprises:
mapping data dependencies of the found operations to a routing interconnection that is routed through an unmapped processing element, that is, a processing element to which none of the operations obtained from the compiler is mapped; and
mapping each remaining data dependency of the found operations that is not mapped to a routing interconnection to an interconnection to be added to the base reconfigurable processing array structure.
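
As an illustrative, non-limiting sketch, the two mapping steps above can be expressed as a single pass over the dependencies that lack a direct interconnection. The sketch assumes that a routing interconnection passes through exactly one unmapped processing element adjacent to both endpoints and that each unmapped processing element forwards at most one value; neither assumption is stated in the claims, and the names `mesh_links`, `neighbours`, and `resolve_dependencies` are hypothetical.

```python
def mesh_links(rows, cols):
    """Interconnections of a rows x cols mesh, each an undirected PE pair."""
    links = set()
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                links.add(frozenset({(r, c), (r + 1, c)}))
            if c + 1 < cols:
                links.add(frozenset({(r, c), (r, c + 1)}))
    return links


def neighbours(pe, links):
    """PEs directly connected to `pe`."""
    return {q for link in links if pe in link for q in link if q != pe}


def resolve_dependencies(pe_pairs, links, free_pes):
    """Split unconnected PE pairs into routed pairs and pairs needing a new link.

    Returns (routed, added): `routed` maps a PE pair to the unmapped PE that
    forwards its value; `added` lists the pairs that need a new interconnection.
    """
    free = set(free_pes)
    routed, added = {}, []
    for src, dst in pe_pairs:
        # An unmapped PE adjacent to both endpoints can act as a router.
        candidates = sorted(neighbours(src, links) & neighbours(dst, links) & free)
        if candidates:
            via = candidates[0]
            free.remove(via)  # assumption: one routed value per unmapped PE
            routed[(src, dst)] = via
        else:
            added.append((src, dst))
    return routed, added
```

Under these assumptions, only the pairs returned in `added` contribute new interconnections to the extended structure; the pairs in `routed` reuse existing mesh links.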

The method of claim 1,
wherein the generating step comprises:
performing an A* search in which the obtained operations are added to nodes of a tree structure, the data dependencies between the operations added to the nodes are compared with the interconnections between the processing elements of the base reconfigurable processing array structure, the additional cost incurred by the interconnections to be added as a result of the comparison and the minimum additional cost expected for the data dependencies not yet compared are calculated, and a minimum-cost graph form is searched for using the sum of the two costs as a weight; and
generating the extended reconfigurable processing array structure based on the minimum-cost graph form derived through the A* search.
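
As an illustrative, non-limiting sketch, the A* search can be organized so that each tree node places one more operation, the incurred cost g counts the interconnections added so far, and the estimate h is a lower bound on the interconnections still to be added. The particular lower bound used here (for each unplaced operation, the fewest new interconnections it would need toward already placed operations on any free processing element), the fixed placement order, and the name `a_star_placement` are assumptions made for illustration, not the cost functions of the specification.

```python
import heapq
from itertools import count


def mesh_links(rows, cols):
    """Interconnections of a rows x cols mesh, each an undirected PE pair."""
    links = set()
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                links.add(frozenset({(r, c), (r + 1, c)}))
            if c + 1 < cols:
                links.add(frozenset({(r, c), (r, c + 1)}))
    return links


def a_star_placement(ops, deps, rows, cols):
    """Place operations on a mesh, minimising the number of added links."""
    pes = [(r, c) for r in range(rows) for c in range(cols)]
    if len(ops) > len(pes):
        raise ValueError("more operations than processing elements")
    links = mesh_links(rows, cols)
    # Treat dependencies as undirected, de-duplicated pairs without self-loops.
    pairs = {frozenset(d) for d in deps if d[0] != d[1]}
    partners = {op: {v for p in pairs if op in p for v in p if v != op} for op in ops}

    def new_links(op, pe, placed):
        """PE pairs lacking a mesh link if `op` is placed on `pe`."""
        wanted = {frozenset({pe, placed[v]}) for v in partners[op] if v in placed}
        return wanted - links

    def estimate(placed):
        """Lower bound (h) on links still to be added for unplaced operations."""
        used = set(placed.values())
        free = [p for p in pes if p not in used]
        return sum(
            min(len(new_links(u, p, placed)) for p in free)
            for u in ops if u not in placed
        )

    tie = count()  # tie-breaker so the heap never compares dicts
    frontier = [(0, next(tie), 0, {}, frozenset())]  # (f, tie, g, placed, added)
    while frontier:
        _, _, g, placed, added = heapq.heappop(frontier)
        if len(placed) == len(ops):
            return placed, set(added)
        op = ops[len(placed)]  # next operation in the fixed placement order
        used = set(placed.values())
        for pe in pes:
            if pe in used:
                continue
            extra = new_links(op, pe, placed)
            child = {**placed, op: pe}
            g_child = g + len(extra)
            f_child = g_child + estimate(child)  # weight = incurred + expected
            heapq.heappush(frontier, (f_child, next(tie), g_child, child, added | extra))
    raise RuntimeError("no placement found")
```

Because the estimate never exceeds the true remaining cost, the first complete placement taken from the priority queue has the fewest added interconnections among all placements of the given operations.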

The method of claim 1,
wherein, when there are a plurality of applications to be executed, the compiler selects at least one application from among the plurality of applications based on a predetermined priority assigned to the plurality of applications, compiles the selected at least one application sequentially, and provides information about the operations constituting the compiled applications and about the data dependencies indicating dependency relationships between the operations.

The method of claim 1,
wherein the base reconfigurable processing array structure and the extended reconfigurable processing array structure are coarse-grained reconfigurable array structures.

A computer-readable recording medium storing a program that implements a function of generating a reconfigurable processing array structure, the function comprising:
obtaining, from a compiler, information about operations constituting at least one compiled application and about data dependencies indicating dependency relationships between the operations; and
generating, based on the obtained information about the operations and data dependencies, an extended reconfigurable processing array structure by adding, to a base reconfigurable processing array structure, interconnections between processing elements belonging to the base reconfigurable processing array structure.