KR20190101409A

KR20190101409A - Program code conversion to improve image processor runtime efficiency

Info

Publication number: KR20190101409A
Application number: KR1020197021659A
Authority: KR
Inventors: Hyunchul Park (박현철); Albert Meixner (알버트 메익스너)
Original assignee: Google LLC (구글 엘엘씨)
Priority date: 2017-05-12
Filing date: 2018-01-16
Publication date: 2019-08-30
Also published as: US20200050488A1; US20180329745A1; JP2020519976A; EP3622474A1; CN110192220B; JP6775088B2; TWI690850B; US10489199B2; KR102278021B1; CN110192220A; US10996988B2; WO2018208341A1; TW201908969A

Abstract

A method is described. The method includes constructing an image processing software data flow in which a buffer stores and forwards image data being transferred from a producing kernel to one or more consuming kernels. The method also includes recognizing that the buffer has insufficient resources to store and forward the image data. The method also includes modifying the image processing software data flow to include multiple buffers that store and forward the image data during the transfer of the image data from the producing kernel to the one or more consuming kernels.

Description

Program code conversion to improve image processor runtime efficiency

The field of the invention pertains generally to image processing and, more specifically, to program code conversion to improve image processor runtime efficiency.

Image processing typically involves the processing of pixel values that are organized into an array. Here, a spatially organized two-dimensional array captures the two-dimensional nature of images (additional dimensions may include time (e.g., a sequence of two-dimensional images) and data type (e.g., colors)). In a typical scenario, the arrayed pixel values are provided by a camera that has generated a still image or a sequence of frames to capture images of motion. Traditional image processors typically fall on either side of two extremes.

The first extreme performs image processing tasks as software programs executing on a general purpose processor or a general purpose-like processor (e.g., a general purpose processor with vector instruction enhancements). Although the first extreme typically provides a highly versatile application software development platform, its use of finer-grained data structures, combined with the associated overhead (e.g., instruction fetch and decode, handling of on-chip and off-chip data, speculative execution), ultimately results in larger amounts of energy being consumed per unit of data during execution of the program code.

The second, opposite extreme applies fixed-function hardwired circuitry to much larger blocks of data. The use of larger blocks of data applied directly to custom-designed circuits greatly reduces power consumption per unit of data. However, the use of custom-designed fixed-function circuitry generally results in a limited set of tasks that the processor is able to perform. As such, the widely versatile programming environment associated with the first extreme is lacking in the second extreme.

A technology platform that provides both highly versatile application software development opportunities and improved power efficiency per unit of data remains a desirable yet missing solution.

The following description and accompanying drawings are used to illustrate embodiments of the invention. In the drawings:
FIG. 1 shows an embodiment of an image processor hardware architecture.
FIGS. 2a, 2b, 2c, 2d and 2e depict the parsing of image data into a line group, the parsing of a line group into a sheet, and the operation performed on a sheet with overlapping stencils.
FIG. 3a shows an embodiment of a stencil processor.
FIG. 3b shows an embodiment of an instruction word of the stencil processor.
FIG. 4 shows an embodiment of a data computation unit within a stencil processor.
FIGS. 5a, 5b, 5c, 5d, 5e, 5f, 5g, 5h, 5i, 5j and 5k depict an example of the use of a two-dimensional shift array and an execution lane array to determine a pair of neighboring output pixel values with overlapping stencils.
FIG. 6 shows an embodiment of a unit cell for an integrated execution lane array and two-dimensional shift array.
FIGS. 7a and 7b pertain to a first program code conversion.
FIGS. 8a, 8b and 8c pertain to a second program code conversion.
FIGS. 9a and 9b pertain to a third program code conversion.
FIGS. 10a and 10b pertain to a fourth program code conversion.
FIGS. 11a and 11b pertain to a fifth program code conversion.
FIG. 12 pertains to a sixth program code conversion.
FIGS. 13a and 13b pertain to a seventh program code conversion.
FIG. 14 pertains to an eighth program code conversion.
FIG. 15 shows a program code conversion method.
FIG. 16 pertains to a software development environment.
FIG. 17 pertains to a computing system.

i. Introduction

The description below describes numerous embodiments of a new image processing technology platform that provides a widely versatile application software development environment and that uses larger blocks of data (e.g., the line groups and sheets described further below) to provide improved power efficiency.

1.0 Hardware Architecture Embodiments

a. Image Processor Hardware Architecture and Operation

FIG. 1 shows an embodiment of an architecture 100 for an image processor implemented in hardware. The image processor may be targeted, for example, by a compiler that converts program code written for a virtual processor within a simulated environment into program code that is actually executed by the hardware processor. As observed in FIG. 1, the architecture 100 includes a plurality of line buffer units 101_1 through 101_M (hereinafter "line buffers", "line buffer units" or the like) interconnected to a plurality of stencil processor units 102_1 through 102_N (hereinafter "stencil processors", "stencil processor units", "image processing cores", "cores" or the like) and corresponding sheet generator units 103_1 through 103_N (hereinafter "sheet generators", "sheet generator units" or the like) through a network 104 (e.g., a network on chip (NOC) including an on-chip switch network, an on-chip ring network or other kind of network). In an embodiment, any line buffer unit may connect to any sheet generator and corresponding stencil processor through the network 104.

In an embodiment, program code is compiled and loaded onto a corresponding stencil processor 102 to perform the image processing operations previously defined by a software developer (depending on design and implementation, program code may also be loaded onto the stencil processor's associated sheet generator 103). In at least some instances, an image processing pipeline may be realized by loading a first kernel program for a first pipeline stage into a first stencil processor 102_1, loading a second kernel program for a second pipeline stage into a second stencil processor 102_2, and so on, where the first kernel performs the functions of the first stage of the pipeline, the second kernel performs the functions of the second stage of the pipeline, and additional control flow methods are installed to pass output image data from one stage of the pipeline to the next stage of the pipeline.

In other configurations, the image processor may be realized as a parallel machine having two or more stencil processors 102_1, 102_2 operating the same kernel program code. For example, a high-density, high-data-rate stream of image data may be processed by spreading frames across multiple stencil processors, each of which performs the same function.

In yet other configurations, essentially any directed acyclic graph (DAG) of kernels may be loaded onto the hardware processor by configuring respective stencil processors with their own respective kernel of program code and by configuring appropriate control flow hooks into the hardware to direct output images from one kernel to the input of the next kernel in the DAG design.

As a general flow, frames of image data are received by a macro I/O unit 105 and passed to one or more of the line buffer units 101 on a frame-by-frame basis. A particular line buffer unit parses its frame of image data into a smaller region of image data, referred to as a "line group", and then passes the line group through the network 104 to a particular sheet generator. A complete or "full" singular line group may be composed, for example, of the data of multiple contiguous complete rows or columns of a frame (for brevity, the present specification mainly refers to contiguous rows). The sheet generator further parses the line group of image data into a smaller region of image data, referred to as a "sheet", and presents the sheet to its corresponding stencil processor.
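By way of illustration only, the two levels of parsing described above (frame into line groups, line group into sheets) can be sketched in Python as follows. The function names, the use of nested lists for image data, and the non-overlapping sheet boundaries are assumptions made for this sketch and are not taken from the described hardware.

```python
from typing import List

Frame = List[List[int]]  # rows of pixel values (hypothetical representation)


def split_into_line_groups(frame: Frame, rows_per_group: int) -> List[Frame]:
    """Line buffer role: cut a frame into groups of contiguous full-width rows."""
    return [frame[r:r + rows_per_group]
            for r in range(0, len(frame), rows_per_group)]


def split_into_sheets(line_group: Frame, sheet_width: int) -> List[Frame]:
    """Sheet generator role: cut a line group into narrower sheets.

    Sheets do not overlap here; overlap for stencil borders is ignored
    to keep the sketch short.
    """
    width = len(line_group[0])
    return [[row[c:c + sheet_width] for row in line_group]
            for c in range(0, width, sheet_width)]


# A 4-row by 8-column frame whose pixel value encodes its position.
frame = [[r * 8 + c for c in range(8)] for r in range(4)]
groups = split_into_line_groups(frame, rows_per_group=2)
sheets = split_into_sheets(groups[0], sheet_width=4)
```

In this sketch the first line group holds rows 0 and 1 of the frame, and each sheet is a 2 x 4 region of that line group.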

In the case of an image processing pipeline or a DAG flow having a single input, input frames are generally directed to the same line buffer unit 101_1, which parses the image data into line groups and directs the line groups to the sheet generator 103_1 whose corresponding stencil processor 102_1 is executing the code of the first kernel in the pipeline/DAG. Upon completion of operations by the stencil processor 102_1 on the line groups it processes, the sheet generator 103_1 sends output line groups to a "downstream" line buffer unit 101_2 (in some cases, the output line group may be sent back to the same line buffer unit 101_1 that earlier sent the input line groups).

One or more "consumer" kernels, which represent the next stage/operation in the pipeline/DAG and execute on their own respective other sheet generator and stencil processor (e.g., sheet generator 103_2 and stencil processor 102_2), then receive from the downstream line buffer unit 101_2 the image data generated by the first stencil processor 102_1. In this manner, a "producer" kernel operating on a first stencil processor has its output data forwarded to a "consumer" kernel operating on a second stencil processor, where the consumer kernel performs the next set of tasks after the producer kernel, consistent with the design of the overall pipeline or DAG. Here, the line buffer unit 101_2 stores and forwards the image data generated by the producer kernel as part of the transfer of the image data from the producer kernel to the consumer kernel.
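The store-and-forward role of the downstream line buffer unit can be sketched, under simplifying assumptions, as a bounded queue between a producer kernel and a consumer kernel. The class `LineBuffer`, its `capacity` parameter, and the two toy kernels below are hypothetical names introduced only for this illustration; the actual line buffer hardware is not specified at this level of detail here.

```python
from collections import deque


class LineBuffer:
    """Toy store-and-forward buffer between a producer and a consumer kernel."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed limit on stored line groups
        self._queue = deque()

    def store(self, line_group):
        """Accept a line group from the producer; return False when full."""
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append(line_group)
        return True

    def forward(self):
        """Hand the oldest stored line group to the consumer, or None if empty."""
        return self._queue.popleft() if self._queue else None


def producer_kernel(line_group):
    """Stand-in for the first pipeline stage: doubles every pixel."""
    return [[2 * pixel for pixel in row] for row in line_group]


def consumer_kernel(line_group):
    """Stand-in for the next pipeline stage: adds one to every pixel."""
    return [[pixel + 1 for pixel in row] for row in line_group]


buffer_101_2 = LineBuffer(capacity=2)
input_line_groups = [[[1, 2]], [[3, 4]]]
stored = [buffer_101_2.store(producer_kernel(g)) for g in input_line_groups]

outputs = []
line_group = buffer_101_2.forward()
while line_group is not None:
    outputs.append(consumer_kernel(line_group))
    line_group = buffer_101_2.forward()
```

A `store` call that returns `False` models, very loosely, a buffer that lacks the resources to hold the data passing through it, which is the situation the abstract describes.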

A stencil processor 102 is designed to operate simultaneously on multiple overlapping stencils of image data. The multiple overlapping stencils and the internal hardware processing capacity of the stencil processor effectively determine the size of a sheet. Here, within a stencil processor 102, arrays of execution lanes operate in unison to simultaneously process the image data surface area covered by the multiple overlapping stencils.

As described in more detail below, in various embodiments, sheets of image data are loaded into a two-dimensional register array structure within the stencil processor 102. The use of sheets and the two-dimensional register array structure is believed to effectively provide power consumption improvements by moving a large amount of data into a large amount of register space, e.g., as a single load operation, with the processing tasks performed directly on the data immediately thereafter by an execution lane array. Additionally, the use of an execution lane array and a corresponding register array provides for different stencil sizes that are easily programmable/configurable.

FIGS. 2a through 2e illustrate, at a high level, embodiments of both the parsing activity of a line buffer unit 101 and the finer-grained parsing activity of a sheet generator unit 103, as well as the stencil processing activity of the stencil processor 102 that is coupled to the sheet generator unit 103.

FIG. 2a depicts an embodiment of an input frame of image data 201. FIG. 2a also depicts an outline of three overlapping stencils 202 (each having a dimension of 3 pixels x 3 pixels) that a stencil processor is designed to operate over. The output pixel for which each stencil respectively generates output image data is highlighted in solid black. For brevity, the three overlapping stencils 202 are depicted as overlapping only in the vertical direction. It is pertinent to recognize that, in actuality, a stencil processor may be designed to have overlapping stencils in both the vertical and horizontal directions.

Because of the vertically overlapping stencils 202 within the stencil processor, as observed in FIG. 2a, there exists a wide band of image data within the frame over which a single stencil processor can operate. As described in more detail below, in an embodiment, the stencil processors process data within their overlapping stencils in a left-to-right fashion across the image data (and then repeat for the next set of lines, in top-to-bottom order). Thus, as the stencil processors continue forward with their operation, the number of solid black output pixel blocks grows to the right in the horizontal direction. As discussed above, a line buffer unit 101 is responsible for parsing, from an incoming frame, a line group of input image data that is sufficient for the stencil processors to operate over for an extended number of upcoming cycles. An exemplary depiction of a line group is illustrated as a shaded region 203. In an embodiment, as described further below, the line buffer unit 101 can comprehend different dynamics for sending a line group to, or receiving a line group from, a sheet generator. For example, according to one mode, referred to as "full group", the complete full-width lines of image data are passed between a line buffer unit and a sheet generator. According to a second mode, referred to as "virtual height", a line group is passed initially with a subset of full-width rows. The remaining rows are then passed sequentially in smaller (less than full-width) pieces.

With the line group 203 of the input image data having been defined by the line buffer unit and passed to the sheet generator unit, the sheet generator unit further parses the line group into finer sheets that more precisely fit the hardware limitations of the stencil processor. More specifically, as described in more detail further below, in an embodiment, each stencil processor includes a two-dimensional shift register array. The two-dimensional shift register array essentially shifts image data "beneath" an array of execution lanes, where the pattern of the shifting causes each execution lane to operate on data within its own respective stencil (that is, each execution lane processes its own stencil of information to generate an output for that stencil). In an embodiment, sheets are surface areas of input image data that "fill" or are otherwise loaded into the two-dimensional shift register array.
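The idea that shifting data beneath a fixed array of execution lanes lets every lane visit its own stencil can be illustrated with the following Python sketch. It is a software analogy under stated assumptions, not the hardware design: `shift` and `box_sum_by_shifting` are hypothetical helpers, positions shifted in from outside the plane are filled with zero, and the stencil is a plain 3 x 3 sum.

```python
def shift(plane, dy, dx, fill=0):
    """Return a copy of `plane` in which every value has moved by (dy, dx)."""
    h, w = len(plane), len(plane[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = plane[y][x]
    return out


def box_sum_by_shifting(plane):
    """Accumulate a 3 x 3 sum per lane using nine shifts and nine adds."""
    h, w = len(plane), len(plane[0])
    acc = [[0] * w for _ in range(h)]
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            shifted = shift(plane, dy, dx)
            # Every lane performs the same add on the value now beneath it.
            for y in range(h):
                for x in range(w):
                    acc[y][x] += shifted[y][x]
    return acc


plane = [[r * 5 + c for c in range(5)] for r in range(5)]
acc = box_sum_by_shifting(plane)
```

After the nine shift-and-add steps, the lane at position (2, 2) holds the sum of the nine values in the 3 x 3 neighborhood centered on that position, without any lane having addressed a neighbor directly.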

Thus, as observed in FIG. 2b, the sheet generator parses an initial sheet 204 from the line group 203 and provides it to the stencil processor (here, the exemplary sheet of data corresponds to the 5 x 5 shaded region generally identified by reference number 204). As observed in FIGS. 2c and 2d, the stencil processor operates on the sheet of input image data by effectively moving the overlapping stencils 202 in a left-to-right fashion over the sheet. As of FIG. 2d, the pixels for which an output value can be calculated from the data within the sheet are exhausted (nine, in a darkened 3 x 3 array); no other pixel positions can have an output value determined from the information within the sheet. For simplicity, the border regions of the image have been ignored.
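The count of nine output pixels for a 5 x 5 sheet and a 3 x 3 stencil can be checked with a short sketch. The function `apply_stencil` is a hypothetical helper, and a plain sum over the stencil is assumed as the per-stencil operation; any other 3 x 3 operation would yield the same number of valid output positions.

```python
def apply_stencil(sheet, k=3):
    """Compute one output per position whose k x k stencil fits inside the sheet."""
    h, w = len(sheet), len(sheet[0])
    r = k // 2  # number of border pixels needed on each side of an output pixel
    out = []
    for y in range(r, h - r):
        row = []
        for x in range(r, w - r):
            row.append(sum(sheet[y + dy][x + dx]
                           for dy in range(-r, r + 1)
                           for dx in range(-r, r + 1)))
        out.append(row)
    return out


sheet = [[1] * 5 for _ in range(5)]  # 5 x 5 sheet of ones
out = apply_stencil(sheet)           # 3 x 3 array of outputs
num_outputs = sum(len(row) for row in out)
```

Only the inner (5 - 2) x (5 - 2) = 9 positions have a complete 3 x 3 neighborhood inside the sheet, which is why the sheet is exhausted after nine outputs.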

As observed in FIG. 2e, the sheet generator then provides a next sheet 205 for the stencil processor to continue operations on. The initial positions of the stencils as they begin operation on the next sheet are the next progression to the right from the point of exhaustion on the first sheet (as depicted previously in FIG. 2d). With the new sheet 205, the stencils simply continue moving to the right as the stencil processor operates on the new sheet in the same manner as with the processing of the first sheet.

Note that there is some overlap between the data of the first sheet 204 and the data of the second sheet 205, owing to the border regions of the stencils that surround an output pixel location. The overlap can be handled simply by the sheet generator transmitting the overlapping data twice. In alternative implementations, to feed the next sheet to the stencil processor, the sheet generator may send only the new data to the stencil processor, and the stencil processor reuses the overlapping data from the previous sheet.
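The amount of overlap between consecutive sheets can be worked out with a small sketch. It assumes, beyond what is stated here, that sheets advance horizontally and that the overlap is exactly the stencil width minus one column; `sheet_column_ranges` is a hypothetical helper name.

```python
def sheet_column_ranges(line_width, sheet_width, stencil_width):
    """List the half-open [start, end) column range covered by each sheet.

    Consecutive sheets share `stencil_width - 1` columns so that stencils
    next to a sheet boundary still see their full neighborhood.
    """
    halo = stencil_width - 1
    step = sheet_width - halo  # columns of new data per sheet
    if step <= 0:
        raise ValueError("sheet must be wider than the stencil overlap")
    ranges = []
    start = 0
    while start + sheet_width <= line_width:
        ranges.append((start, start + sheet_width))
        start += step
    return ranges


ranges = sheet_column_ranges(line_width=11, sheet_width=5, stencil_width=3)
overlap = ranges[0][1] - ranges[1][0]  # columns shared by the first two sheets
```

With 5-column sheets and a 3 x 3 stencil, each sheet contributes three columns of new data and repeats two columns of the previous sheet. Those two columns are the data that is either transmitted twice or reused by the stencil processor.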

b. Stencil Processor Design and Operation

FIG. 3a shows an embodiment of a stencil processor unit architecture 300. As observed in FIG. 3a, the stencil processor includes a data computation unit 301, a scalar processor 302 with associated memory 303, and an I/O unit 304. The data computation unit 301 includes an array of execution lanes 305, a two-dimensional shift array structure 306, and separate respective random access memories 307 associated with specific rows or columns of the array.

The I/O unit 304 is responsible for loading "input" sheets of data received from the sheet generator into the data computation unit 301 and for storing "output" sheets of data from the stencil processor into the sheet generator. In an embodiment, loading sheet data into the data computation unit 301 entails parsing a received sheet into rows/columns of image data and loading the rows/columns of image data into the two-dimensional shift register structure 306 or into the respective random access memories 307 of the rows/columns of the execution lane array (described in more detail below). If the sheet is initially loaded into the memories 307, the individual execution lanes within the execution lane array 305 may then load sheet data from the random access memories 307 into the two-dimensional shift register structure 306 when appropriate (e.g., by a load instruction just prior to operation on the sheet's data). Upon completion of the loading of a sheet of data into the register structure 306 (whether directly from a sheet generator or from the memories 307), the execution lanes of the execution lane array 305 operate on the data and eventually either send the finished data as a sheet directly back to the sheet generator or "write back" the finished data into the random access memories 307. If the execution lanes write back to the random access memories 307, the I/O unit 304 fetches the data from the random access memories 307 to form an output sheet, which is then forwarded to the sheet generator.

The scalar processor 302 includes a program controller 309 that reads the instructions of the stencil processor's program code from scalar memory 303 and issues the instructions to the execution lanes in the execution lane array 305. In an embodiment, a single same instruction is broadcast to all execution lanes within the array 305 to effect single instruction multiple data (SIMD)-like behavior from the data computation unit 301. In an embodiment, the instruction format of the instructions read from scalar memory 303 and issued to the execution lanes of the execution lane array 305 includes a very-long-instruction-word (VLIW) type format that includes one or more opcodes per instruction. In a further embodiment, the VLIW format includes both an ALU opcode that directs a mathematical function performed by each execution lane's ALU (which, as described below, may in an embodiment specify one or more conventional ALU operations) and a memory opcode (which directs a memory operation for a specific execution lane or set of execution lanes).

The term "execution lane" refers to a set of one or more execution units capable of executing an instruction (e.g., logic circuitry that can execute an instruction). An execution lane can, in various embodiments, include more processor-like functionality beyond just execution units. For example, besides one or more execution units, an execution lane may also include logic circuitry that decodes a received instruction or, in the case of more multiple instruction multiple data (MIMD)-like designs, logic circuitry that fetches and decodes an instruction. With respect to MIMD-like approaches, although a centralized program control approach is largely described herein, a more distributed approach may be implemented in various alternative embodiments (e.g., including program code and a program controller within each execution lane of the array 305).

The combination of the execution lane array 305, the program controller 309 and the two-dimensional shift register structure 306 provides a widely adaptable/configurable hardware platform for a broad range of programmable functions. For example, application software developers are able to program kernels having a wide range of different functional capabilities as well as dimensions (e.g., stencil size), given that the individual execution lanes are able to perform a wide variety of functions and are able to readily access input image data proximate to any output array location.

Apart from acting as a data store for image data being operated on by the execution lane array 305, the random access memories 307 may also keep one or more look-up tables. In various embodiments, one or more scalar look-up tables may also be instantiated within the scalar memory 303. Look-up tables are often used by image processing tasks, for example to obtain filter or transform coefficients for different array locations, or to implement complex functions (e.g., gamma curves, sine, cosine) where the look-up table provides the function output for an input index value. Here, it is expected that SIMD image processing sequences will often perform a look-up into the same look-up table during the same clock cycle. Similarly, one or more constant tables may be stored in the scalar memory 303. Here, for example, it is expected that the different execution lanes may need the same constant or other value on the same clock cycle (e.g., a particular multiplier to be applied to an entire image). Thus, an access into a constant look-up table returns the same scalar value to each of the execution lanes. Look-up tables are typically accessed with an index value.

A scalar look-up involves passing the same data value, from the same index of the same look-up table, to each of the execution lanes within the execution lane array 305. In various embodiments, the VLIW instruction format described above is expanded to also include a scalar opcode that directs a look-up operation performed by the scalar processor into a scalar look-up table. The index that is specified for use with the opcode may be an immediate operand or may be fetched from some other data storage location. Regardless, in an embodiment, a look-up from a scalar look-up table within scalar memory essentially involves broadcasting the same data value to all execution lanes within the execution lane array 305 during the same clock cycle. Additional details concerning the use and operation of look-up tables are provided further below.
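The broadcast behavior of a scalar look-up can be illustrated with a minimal Python sketch. The function name `scalar_lookup`, the gamma table and the 4x4 lane array are hypothetical and chosen only for illustration; the sketch models the behavior described above (one read, the same value delivered to every lane), not the hardware itself.

```python
def scalar_lookup(table, index, rows, cols):
    """Read one table entry and broadcast it to every lane of a rows x cols array."""
    value = table[index]  # a single read, performed once on the scalar side
    # Every execution lane receives the same scalar value in the same cycle.
    return [[value for _ in range(cols)] for _ in range(rows)]


# Hypothetical 256-entry gamma-curve table indexed by an 8-bit input value.
gamma_table = [round(255 * (i / 255) ** (1 / 2.2)) for i in range(256)]

# Hypothetical 4x4 execution lane array; all lanes see the entry at index 128.
lanes = scalar_lookup(gamma_table, 128, rows=4, cols=4)
```

In this sketch the index plays the role of the immediate operand (or fetched value) that accompanies the scalar opcode.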

FIG. 3B summarizes the VLIW instruction word embodiment(s) discussed above. As observed in FIG. 3B, the VLIW instruction word format includes fields for three separate instructions: 1) a scalar instruction 351 that is executed by the scalar processor; 2) an ALU instruction 352 that is broadcast and executed in SIMD fashion by the respective ALUs within the execution lane array; and 3) a memory instruction 353 that is broadcast and executed in a partial SIMD fashion (e.g., if execution lanes along the same row of the execution lane array share the same random access memory, then one execution lane from each of the different rows actually executes the instruction; the format of the memory instruction 353 may include an operand that identifies which execution lane from each row executes the instruction).

A field 354 for one or more immediate operands is also included. Which of the instructions 351, 352, 353 uses which immediate operand information may be identified in the instruction format. Each of the instructions 351, 352, 353 also includes its own respective input operand and resultant information (e.g., local registers for ALU operations, and a local register and a memory address for memory access instructions). In an embodiment, the scalar instruction 351 is executed by the scalar processor before the execution lanes within the execution lane array execute either of the other two instructions 352, 353. That is, the execution of the VLIW word includes a first cycle in which the scalar instruction 351 is executed, followed by a second cycle in which the other instructions 352, 353 may be executed (note that in various embodiments instructions 352 and 353 may be executed in parallel).

In an embodiment, the scalar instructions executed by the scalar processor 302 include commands issued to the sheet generator 103 to load/store sheets from/into the memories or the 2D shift register 306 of the data computation unit 301. Here, the sheet generator's operation can depend on the operation of the line buffer unit 101 or on other variables that prevent pre-runtime comprehension of the number of cycles it will take the sheet generator 103 to complete any command issued by the scalar processor 302. As such, in an embodiment, any VLIW word whose scalar instruction 351 corresponds to, or otherwise causes, a command to be issued to the sheet generator 103 also includes no-operation (NOOP) instructions in the other two instruction fields 352, 353. The program code then enters a loop of NOOP instructions for instruction fields 352, 353 until the sheet generator completes its load/store to/from the data computation unit. Here, upon issuing a command to the sheet generator, the scalar processor may set a bit of an interlock register that the sheet generator resets upon completion of the command. During the NOOP loop, the scalar processor monitors the interlock bit. When the scalar processor detects that the sheet generator has completed its command, normal execution begins again.
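The interlock handshake described above can be sketched as a small Python simulation. Everything named here (`VliwWord`, `SheetGenerator`, `LOAD_SHEET`, `POLL_INTERLOCK`, the `busy` flag) is hypothetical and introduced only for illustration; the fixed `latency` stands in for a completion time that, per the description, is not known before runtime.

```python
from dataclasses import dataclass

NOOP = "NOOP"


@dataclass
class VliwWord:
    scalar: str          # scalar instruction field (351)
    alu: str = NOOP      # ALU instruction field (352)
    memory: str = NOOP   # memory instruction field (353)


class SheetGenerator:
    """Toy model: completes a command after `latency` ticks, then clears the interlock bit."""

    def __init__(self, latency):
        self.latency = latency
        self.remaining = 0

    def issue(self, interlock):
        self.remaining = self.latency
        interlock["busy"] = self.latency > 0   # scalar processor sets the bit on issue

    def tick(self, interlock):
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                interlock["busy"] = False      # sheet generator resets the bit on completion


def run_load_sheet(latency):
    """Return the VLIW words executed while waiting for one sheet-generator command."""
    interlock = {"busy": False}
    generator = SheetGenerator(latency)
    trace = [VliwWord(scalar="LOAD_SHEET")]    # fields 352/353 carry NOOPs
    generator.issue(interlock)
    while interlock["busy"]:                   # NOOP loop: only the scalar side does work
        trace.append(VliwWord(scalar="POLL_INTERLOCK"))
        generator.tick(interlock)
    return trace


trace = run_load_sheet(latency=3)
```

The number of words in the trace depends only on the latency observed at runtime, which is why the program cannot schedule useful ALU or memory work into those slots ahead of time.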

FIG. 4 shows an embodiment of a data computation unit 401. As observed in FIG. 4, the data computation unit 401 includes an array of execution lanes 405 that are logically positioned "above" a two-dimensional shift register array structure 406. As discussed above, in various embodiments, a sheet of image data provided by a sheet generator is loaded into the two-dimensional shift register 406. The execution lanes then operate on the sheet data from the register structure 406.

The execution lane array 405 and the shift register structure 406 are fixed in position relative to one another. However, the data within the shift register array 406 shifts in a strategic and coordinated fashion so that each execution lane in the execution lane array processes a different stencil within the data. As such, each execution lane determines the output image value for a different pixel in the output sheet being generated. From the architecture of FIG. 4 it should be clear that overlapping stencils are arranged not only vertically but also horizontally, because the execution lane array 405 includes vertically adjacent execution lanes as well as horizontally adjacent execution lanes.

Some notable architectural features of the data computation unit 401 include the shift register structure 406 having wider dimensions than the execution lane array 405. That is, there is a "halo" of registers 409 outside the execution lane array 405. Although the halo 409 is shown to exist on two sides of the execution lane array, depending on the implementation, the halo may exist on fewer (one) or more (three or four) sides of the execution lane array 405. The halo 409 serves to provide "spill-over" space for data that spills outside the bounds of the execution lane array 405 as the data is shifting "beneath" the execution lanes 405. As a simple case, a 5x5 stencil centered on the right edge of the execution lane array 405 will need four halo register locations further to the right when the stencil's leftmost pixels are processed. For ease of drawing, FIG. 4 shows the registers on the right side of the halo as having only horizontal shift connections and the registers on the bottom side of the halo as having only vertical shift connections when, in a nominal embodiment, registers on either side (right, bottom) would have both horizontal and vertical connections.

Additional spill-over room is provided by random access memories 407 that are coupled to each row and/or each column of the array, or portions thereof (e.g., a random access memory may be assigned to a "region" of the execution lane array that spans four execution lane rows and two execution lane columns). For simplicity, the remainder of the application refers mainly to row and/or column based allocation schemes. Here, if an execution lane's kernel operations require it to process pixel values outside of the two-dimensional shift register array 406 (which some image processing routines may require), the plane of image data is able to spill over further, e.g., from the halo region 409 into the random access memory 407. For example, consider a 6x6 stencil where the hardware includes a halo region of only four storage elements to the right of an execution lane on the right edge of the execution lane array. In this case, the data would need to be shifted further to the right, off the right edge of the halo 409, to fully process the stencil. Data that is shifted outside the halo region 409 would then spill over to the random access memory 407. Other applications of the random access memories 407 and the stencil processor of FIG. 3 are provided further below.
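The two halo examples above (a 5x5 stencil needing four halo locations, and a 6x6 stencil overflowing a four-element halo) are consistent with a simple rule of thumb, sketched here in Python. The rule is an assumption made for illustration: it supposes that an edge lane sweeps across the full width of its stencil, so the data beneath it moves by up to one position less than the stencil width. The function names are hypothetical.

```python
def halo_columns_needed(stencil_width):
    """Columns beyond the array edge needed so no stencil data is lost (assumed rule)."""
    # Sweeping a stencil of width W moves the data by at most W - 1 positions.
    return stencil_width - 1


def spill_columns(stencil_width, halo_width):
    """Columns that overflow the halo and must spill over to random access memory."""
    return max(0, halo_columns_needed(stencil_width) - halo_width)


needed_5x5 = halo_columns_needed(5)            # the 5x5 case from the text
spill_6x6 = spill_columns(6, halo_width=4)     # the 6x6 case with a four-element halo
```

Under this assumption the 5x5 case fits exactly within a four-element halo, while the 6x6 case pushes one column of data past the halo into random access memory.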

FIGS. 5A through 5K illustrate an example of the manner in which image data is shifted within the two-dimensional shift register array "beneath" the execution lane array, as described above. As observed in FIG. 5A, the data contents of the two-dimensional shift array are depicted in a first array 507 and the execution lane array is depicted by a frame 505. Also, two neighboring execution lanes 510 within the execution lane array are depicted in simplified form. In this simplified depiction 510, each execution lane includes a register R1 that can accept data from the shift register, accept data from an ALU output (e.g., to behave as an accumulator across cycles), or write output data to an output destination.

Each execution lane also has available, in a local register R2, the contents "beneath" it in the two-dimensional shift array. Thus, R1 is a physical register of the execution lane while R2 is a physical register of the two-dimensional shift register array. The execution lane includes an ALU that can operate on operands provided by R1 and/or R2. As described in more detail further below, in an embodiment the shift register is actually implemented with multiple (a "depth" of) storage/register elements per array location, but the shifting activity is limited to one plane of storage elements (e.g., only one plane of storage elements can shift per cycle). FIGS. 5A through 5K depict one of these deeper register locations as being used to store the resultant X from the respective execution lanes. For ease of illustration, the deeper resultant register is drawn alongside, rather than beneath, its counterpart register R2.

FIGS. 5A through 5K focus on the calculation of two stencils whose central positions are aligned with the pair of execution lane positions 511 depicted within the execution lane array 505. For ease of illustration, the pair of execution lanes 510 are drawn as horizontal neighbors when in fact, according to the following example, they are vertical neighbors.

As observed initially in FIG. 5A, the execution lanes 511 are centered on their central stencil locations. FIG. 5B shows the object code executed by both execution lanes 511. As observed in FIG. 5B, the program code of both execution lanes 511 causes the data within the shift register array 507 to shift down one position and shift right one position. This aligns both execution lanes 511 with the upper left corner of their respective stencils. The program code then causes the data located (in R2) at their respective locations to be loaded into R1.

As observed in FIG. 5C, the program code next causes the pair of execution lanes 511 to shift the data within the shift register array 507 one unit to the left, which causes the value to the right of each execution lane's respective position to be shifted into that execution lane's position. The value in R1 (the previous value) is then added to the new value that has shifted into the execution lane's position (in R2). The resultant is written into R1. As observed in FIG. 5D, the same process as described above for FIG. 5C is repeated, which causes the resultant R1 to now include the value A+B+C in the upper execution lane and F+G+H in the lower execution lane. At this point both execution lanes 511 have processed the upper row of their respective stencils. Note the spill-over into a halo region on the left side of the execution lane array 505 (if one exists on the left side) or into random access memory if a halo region does not exist on the left side of the execution lane array 505.

As observed in FIG. 5E, the program code next causes the data within the shift register array to shift one unit up, which causes both execution lanes 511 to be aligned with the right edge of the middle row of their respective stencils. Register R1 of both execution lanes 511 currently includes the summation of the stencil's top row and the middle row's rightmost value. FIGS. 5F and 5G demonstrate continued progress moving leftward across the middle row of both execution lanes' stencils. The accumulative addition continues such that, at the end of the processing of FIG. 5G, both execution lanes 511 include the summation of the values of the top row and the middle row of their respective stencils.

FIG. 5H shows another shift to align each execution lane with its corresponding stencil's lowest row. FIGS. 5I and 5J show continued shifting to complete processing over the course of both execution lanes' stencils. FIG. 5K shows additional shifting to align each execution lane with its correct position in the data array and write the resultant thereto.
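The shift-and-accumulate sequence of FIGS. 5A through 5K can be modeled for a 3x3 summation stencil with the following Python sketch. It is an illustration under stated assumptions, not the patented implementation: wrap-around indexing stands in for the halo and spill-over storage, every lane runs the same code on its own stencil, and the names `shift`, `WALK` and `stencil_sum_3x3` are hypothetical. The shift convention (positive X is right, positive Y is up) is assumed to match the SHIFT object code notation.

```python
def shift(plane, dx, dy):
    """Shift all data dx positions right and dy positions up (wrap-around assumed)."""
    rows, cols = len(plane), len(plane[0])
    return [[plane[(r + dy) % rows][(c - dx) % cols] for c in range(cols)]
            for r in range(rows)]


# Serpentine walk over the rest of the stencil, expressed as (dx, dy) data shifts.
WALK = [(-1, 0), (-1, 0),   # finish the top row (data moves left)
        (0, 1),             # shift up: lane sits on the right edge of the middle row
        (1, 0), (1, 0),     # cross the middle row (data moves right)
        (0, 1),             # shift up: lane sits on the left edge of the bottom row
        (-1, 0), (-1, 0)]   # finish the bottom row


def stencil_sum_3x3(image):
    """Return (per-lane 3x3 sums held in R1, realigned shift-register plane R2)."""
    r2 = shift(image, 1, -1)          # right one, down one: upper-left stencil corner
    r1 = [row[:] for row in r2]       # load R2 into the lane-local register R1
    for dx, dy in WALK:
        r2 = shift(r2, dx, dy)        # only the shift-register plane moves
        r1 = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(r1, r2)]
    r2 = shift(r2, 1, -1)             # final shift: realign data with the lanes
    return r1, r2


image = [[r * 5 + c for c in range(5)] for r in range(4)]
sums, restored = stencil_sum_3x3(image)
```

Note that R1 never shifts; it stays with its lane and accumulates whatever value the shifting plane places beneath it, which is the essential point of the figures.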

In the example of FIGS. 5A through 5K, note that the object code for the shift operations may include an instruction format that identifies the direction and magnitude of the shift expressed in (X, Y) coordinates. For example, the object code for a shift up by one location may be expressed as SHIFT 0, +1. As another example, a shift to the right by one location may be expressed in object code as SHIFT +1, 0. In various embodiments, shifts of larger magnitude may also be specified in object code (e.g., SHIFT 0, +2). Here, if the 2D shift register hardware only supports shifts by one location per cycle, the instruction may be interpreted by the machine as requiring multiple cycles of execution, or the 2D shift register hardware may be designed to support shifts by more than one location per cycle. Embodiments of the latter are described in more detail further below.
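For hardware that supports only one-location shifts per cycle, the multi-cycle interpretation of a larger SHIFT can be sketched as follows. The function `expand_shift` is hypothetical, and performing all X steps before all Y steps is an arbitrary choice made for the illustration; the description does not specify an order.

```python
def expand_shift(dx, dy):
    """Split SHIFT dx, dy into unit shifts, one per cycle (X steps first, then Y)."""
    step_x = (1 if dx > 0 else -1, 0)
    step_y = (0, 1 if dy > 0 else -1)
    return [step_x] * abs(dx) + [step_y] * abs(dy)


cycles = expand_shift(0, 2)   # SHIFT 0, +2 becomes two single-location shifts up
```

The length of the returned list is the number of cycles such a machine would spend on the instruction, namely |X| + |Y|.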

FIG. 6 shows another, more detailed depiction of the unit cell for the array execution lane and shift register structure (registers in the halo region do not include a corresponding execution lane). In an embodiment, the execution lane and the register space associated with each location in the execution lane array are implemented by instantiating the circuitry observed in FIG. 6 at each node of the execution lane array. As observed in FIG. 6, the unit cell includes an execution lane 601 coupled to a register file 602 consisting of four registers R1 through R4. During any cycle, the execution lane 601 may read from or write to any of registers R0 through R4. For instructions requiring two input operands, the execution lane may retrieve both operands from any of R0 through R4.

In an embodiment, the two-dimensional shift register structure is implemented by permitting, during a single cycle, the contents of any (only) one of registers R1 through R3 to be shifted "out" to one of its neighbors' register files through output multiplexer 603, and having the contents of any (only) one of registers R1 through R3 replaced with content that is shifted "in" from a corresponding one of its neighbors through input multiplexers 604, such that shifts between neighbors are in the same direction (e.g., all execution lanes shift left, all execution lanes shift right, etc.). Although it may be common for the same register to have its contents shifted out and replaced with content that is shifted in on the same cycle, the multiplexer arrangement 603, 604 permits different shift source and shift target registers within the same register file during the same cycle.

As depicted in FIG. 6, note that during a shift sequence an execution lane will shift content out from its register file 602 to each of its left, right, top and bottom neighbors. In conjunction with the same shift sequence, the execution lane will also shift content into its register file from a particular one of its left, right, top and bottom neighbors. Again, the shift-out target and the shift-in source should be consistent with the same shift direction for all execution lanes (e.g., if the shift out is to the right neighbor, the shift in should be from the left neighbor).
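The unit-cell shift behavior, including the case where the shift source and shift target registers differ, can be sketched in Python as follows. The dictionary-based register files, the function name `shift_cells` and the wrap-around at the array edges are assumptions made for the illustration; register names follow the R1 through R3 naming used above.

```python
def shift_cells(cells, src, dst, dx, dy):
    """One shift cycle: every cell shifts register `src` out and receives into `dst`.

    All cells shift in the same direction (dx right, dy up); edges wrap around,
    which is an assumption standing in for halo registers.
    """
    rows, cols = len(cells), len(cells[0])
    # Snapshot the outgoing values first so the cycle behaves as one parallel step.
    outgoing = [[cell[src] for cell in row] for row in cells]
    for r in range(rows):
        for c in range(cols):
            # A rightward shift means each cell receives from its left neighbor.
            cells[r][c][dst] = outgoing[(r + dy) % rows][(c - dx) % cols]


# Hypothetical 2x3 array of register files.
cells = [[{"R1": r * 3 + c, "R2": -1, "R3": -1} for c in range(3)] for r in range(2)]
shift_cells(cells, src="R1", dst="R2", dx=1, dy=0)   # shift R1 right into R2
```

Because the source and target differ in this call, R1 keeps its contents locally while R2 is overwritten by the left neighbor's R1, and R3 is untouched.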

Although in one embodiment the content of only one register is permitted to be shifted per execution lane per cycle, other embodiments may permit the content of more than one register to be shifted in/out. For example, the content of two registers may be shifted out/in during the same cycle if a second instance of the multiplexer circuitry 603, 604 observed in FIG. 6 is incorporated into the design of FIG. 6. Of course, in embodiments where the content of only one register is permitted to be shifted per cycle, shifts from multiple registers may take place between mathematical operations by consuming more clock cycles for the shifts between those operations (e.g., the contents of two registers may be shifted between mathematical operations by consuming two shift operations between them).

If less than all of the content of an execution lane's register file is shifted out during a shift sequence, note that the content of the registers of each execution lane that are not shifted out remains in place (does not shift). As such, any non-shifted content that is not replaced with shifted-in content persists locally to the execution lane across the shifting cycle. The memory unit ("M") observed in each execution lane is used to load/store data from/to the random access memory space that is associated with the execution lane's row and/or column within the execution lane array. Here, the M unit acts as a standard M unit in that it is often used to load/store data that cannot be loaded/stored from/to the execution lane's own register space. In various embodiments, the primary operation of the M unit is to write data from a local register into memory, and to read data from memory and write it into a local register.

With respect to the instruction set architecture (ISA) opcodes supported by the ALU unit of the hardware execution lane 601, in various embodiments, the mathematical opcodes supported by the hardware ALU are integrally tied with (e.g., substantially the same as) the mathematical opcodes supported by a virtual execution lane (e.g., ADD, SUB, MOV, MUL, MAD, ABS, DIV, SHL, SHR, MIN/MAX, SEL, AND, OR, XOR, NOT). As described above, memory access instructions can be executed by the execution lane 601 to fetch/store data from/to its associated random access memory. Additionally, the hardware execution lane 601 supports shift operation instructions (right, left, up, down) to shift data within the two-dimensional shift register structure. As described above, program control instructions are largely executed by the scalar processor of the stencil processor.

2.0 Program code conversion to improve runtime efficiency

As described above, application software developed for the image processor may be defined by combining smaller, finer-grained software programs, referred to herein as kernels, into a larger overall structure such as a directed acyclic graph. The definition generally includes coupling the different kernels into a particular data flow pattern in which multiple "producing" kernels provide their output image data to one or more "consuming" kernels. At least one kernel receives the overall input image that the application software program operates on and, typically, one of the kernels generates the overall output image of the application software.
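A producer/consumer data flow of this kind can be written down as a directed acyclic graph and ordered so that every producing kernel precedes its consumers. The Python sketch below uses the standard library's `graphlib.TopologicalSorter`; the kernel names K1 through K4 and the particular connections are hypothetical, and the ordering step is only an illustration of the graph structure, not a description of how the image processor schedules kernels.

```python
from graphlib import TopologicalSorter


def schedule(consumes):
    """Order kernels so each one appears after every kernel it consumes from.

    `consumes` maps a kernel name to the set of producing kernels feeding it.
    """
    return list(TopologicalSorter(consumes).static_order())


# Hypothetical application: K1 takes the input image, K4 makes the output image.
data_flow = {
    "K1": set(),          # receives the overall input image
    "K2": {"K1"},         # K1 is a producing kernel for K2 ...
    "K3": {"K1"},         # ... and for K3 (one producer, two consumers)
    "K4": {"K2", "K3"},   # K4 consumes from two producers
}
order = schedule(data_flow)
```

Each edge in `data_flow` corresponds to a point where image data must be stored and forwarded between a producing kernel and a consuming kernel.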

Each kernel is mapped to a particular stencil processor. Each stencil processor has an associated sheet generator that receives the image data on which the associated stencil processor's kernel is to operate. In various embodiments, the image data is received by the sheet generator in line groups. For example, the sheet generator may receive the image data as a number of rows spanning the full width of the input image frame. The sheet generator then forms two-dimensional "sheets" of image data that are provided to the stencil processor and ultimately loaded into the stencil processor's two-dimensional shift register array.
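The step from a line group to sheets can be pictured with a short Python sketch. It is a simplification made under stated assumptions: sheets are cut side by side with no overlap between neighbors (a real sheet generator may need to handle stencil overlap), the line group height equals the sheet height, and the name `form_sheets` is hypothetical.

```python
def form_sheets(line_group, sheet_width):
    """Cut a line group (rows spanning the full image width) into adjacent sheets.

    Returns a list of sheets, each a list of rows that are `sheet_width` pixels wide.
    If the image width is not a multiple of `sheet_width`, the last sheet is narrower.
    """
    width = len(line_group[0])
    return [[row[x:x + sheet_width] for row in line_group]
            for x in range(0, width, sheet_width)]


# Hypothetical line group: 4 rows of an 8-pixel-wide image, cut into 4x4 sheets.
line_group = [[r * 8 + c for c in range(8)] for r in range(4)]
sheets = form_sheets(line_group, sheet_width=4)
```

Each element of `sheets` is what would be handed to the stencil processor and loaded into its two-dimensional shift register array.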

In various embodiments, the sheet generator is implemented with dedicated hardware logic circuitry (e.g., application specific integrated circuit (ASIC) logic circuitry), programmable logic circuitry (e.g., field programmable gate array logic circuitry), embedded processor logic circuitry, or any combination thereof, to implement the sheet generator's functionality. The dedicated hardware logic circuitry (if any) has associated configuration registers that are set with information generated by the compilation process of the application software, which causes the sheet generator to perform sheet generation activity for the kernel that has been mapped to the sheet generator's associated stencil processor. The programmable logic circuitry (if any) is programmed with information generated by the compilation process of the application software, which causes the programmable logic circuitry to implement sheet generator functionality for the kernel that is to execute on the sheet generator's associated stencil processor. The embedded processor circuitry (if any) is provided with program code generated by the compilation process of the application software that, when executed by the embedded processor, causes the embedded processor to implement sheet generator functionality for the kernel that is to execute on the sheet generator's associated stencil processor. The scalar processor of the stencil processor may also be programmed to perform, assist or otherwise be involved in various sheet generation tasks. The same kinds of circuit implementation possibilities, and associated compiled program code and/or information, may also exist with respect to the line buffer unit.

The application software development process therefore includes not only the mapping of a kernel to a particular stencil processor, but also the generation of the associated configuration information and/or program code that is used to perform the sheet generation activity for the kernel.

In various application software program development environments, a compiler accepts a higher-level description of the application software program and, in response, generates lower-level program code (e.g., object code) and any associated configuration information for execution by the image processor. Such a compiler recognizes various inefficiencies in the application software and changes the program code being compiled to improve upon or reduce those inefficiencies. The program code that is changed may be program code and/or configuration information for one or more sheet generators and/or the kernels they feed and/or line buffer units.

FIG. 7A pertains to a first potential inefficiency. As observed in FIG. 7A, an input image 701 is received by a sheet generator, for example as a number of line groups sent by a line buffer unit. The input image is down-sampled 702 by the sheet generator before it is processed by, for example, a kernel K1 that executes on the stencil processor to which the sheet generator is coupled. Alternatively, kernel K1 may be programmed to perform the down-sampling.

In various embodiments, a stencil processor naturally produces output image sheets having the same dimensions as the stencil processor's execution lane array. For example, in an embodiment where the execution lane array dimension is 16x16 pixels, the kernel program code K1 of the stencil processor is initially constructed to produce 16x16 pixel output image sheets by default.

A large amount of buffering space is needed if the stencil processor is configured to produce output sheets of the same dimension as the execution lane array from a down-sampled input image. For example, referring to FIG. 7A, if the down-sampling 702 is performed by the sheet generator to create a 16x16 pixel down-sampled sheet 703 for loading into the stencil processor's two-dimensional shift register array, the sheet generator needs to queue a full 32x32 pixels of input image data 701 in order to form the 16x16 pixel down-sampled input sheet 703 for consumption by kernel K1. Allocating the large amount of memory needed for this queuing is a form of inefficiency.

As such, in an embodiment, the compiler restructures the application software program (including, e.g., any pertinent configuration information) as shown in FIG. 7B. In particular, the compiler constructs the program code so that kernel K1 operates with the execution lane array less than fully utilized. Continuing the present example, kernel K1 is designed to operate on 8x8 pixel input sheets 703b, which causes kernel K1 to produce 8x8 pixel output sheets 704b.

By configuring kernel K1 to operate on the smaller 8x8 pixel input sheets 703b, the down-sampling activity 702b (performed, e.g., by the sheet generator) needs to queue only half as much input image data 701b as the input image data 701a of FIG. 7A. Here, the input image data 701a of FIG. 7A corresponds to 32 rows of image data, whereas the input image data 701b of FIG. 7B corresponds to only 16 rows of input image data. With only 16 rows of input image data 701b, the down-sampling activity 702b can perform 2:1 down-sampling that produces a series of 8x8 pixel input sheets 703b spanning the full width of the image.
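The buffering saving described above can be checked with simple arithmetic. The following is a minimal Python sketch, assuming a uniform 2:1 down-sampling factor in the vertical direction; the function name `rows_to_queue` is hypothetical and is not part of the described system.

```python
def rows_to_queue(sheet_height: int, downsample_factor: int) -> int:
    """Rows of input image data that must be queued before one
    down-sampled sheet of the given height can be formed."""
    return sheet_height * downsample_factor


# FIG. 7A: 16-row sheets formed by 2:1 down-sampling.
rows_fig_7a = rows_to_queue(sheet_height=16, downsample_factor=2)
# FIG. 7B: 8-row sheets formed by the same 2:1 down-sampling.
rows_fig_7b = rows_to_queue(sheet_height=8, downsample_factor=2)

print(rows_fig_7a, rows_fig_7b)
```

Under these assumptions the smaller sheet halves the number of rows that must be queued, which matches the 32-row and 16-row figures given in the text.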

FIGS. 8A and 8B show another inefficiency, in which up-sampling is performed on the output image data 801 of kernel K1 and down-sampling by the same amount is then performed before the image data is operated on by K1's consuming kernel K2. Here, as observed in FIG. 8A, the producing kernel K1 produces a series 801 of output sheets (A0 through A3). The image data of these output sheets 801 is then interleaved so as to effectively up-sample K1's output. That is, as depicted in FIG. 8A, for example, the top line of each of output sheets A0 through A3 is interleaved to form the top output line of the up-sampled K1 output 803, which is stored in a line buffer 802 that temporarily queues K1's output data before it is consumed by K1's consuming kernel K2. In various embodiments, the up-sampling may be performed by any of K1, the sheet generator coupled to the stencil processor on which K1 executes, or the line buffer 802 to which K1 sends its output.

As observed in FIG. 8B, the input processing for kernel K2, which consumes K1's output, is configured to down-sample its input by the same factor by which K1's output was up-sampled. The process of feeding K2 with correctly sized input data therefore has to reverse the up-sampling process that was performed on K1's output. That is, referring to FIG. 8B, the interleaved queued data 803 in line buffer 802 is ultimately de-interleaved to re-form the output sheets A0 through A3 that were originally formed by K1. The down-sampling may be performed by any of the line buffer 802, the sheet generator coupled to the stencil processor on which K2 executes, or K2 itself.

In an embodiment, the compiler is designed to recognize when the up-sampled output of a producing kernel is down-sampled by the same factor (e.g., 1:2 up-sampling and 2:1 down-sampling) for the kernel (which may include multiple kernels) that consumes the producing kernel's output. In response, the compiler restructures the program code being developed so as to eliminate both the up-sampling and the down-sampling along the producer-to-consumer data path. This solution is depicted in FIG. 8C. Here, K1's non-up-sampled output is simply queued in the line buffer 802 that is coupled between K1 and K2. The non-up-sampled K1 output is then fed directly to K2 without any down-sampling. As such, both the up-sampling activity of FIG. 8A and the down-sampling activity of FIG. 8B are avoided.
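The recognition step above can be illustrated with a small data flow model. The following is a minimal Python sketch, not the compiler's actual implementation: it assumes the data flow is described by two hypothetical dictionaries, one holding each producer's up-sampling factor and one holding the down-sampling factor on each producer-to-consumer edge.

```python
def cancel_resampling(upsample, downsample):
    """Remove matched up-sample/down-sample pairs.

    upsample:   {producer: factor applied to the producer's output}
    downsample: {(producer, consumer): factor applied to the consumer's input}
    A pair is removed only if every consumer of the producer down-samples
    by exactly the factor by which the producer's output was up-sampled.
    """
    new_up = dict(upsample)
    new_down = dict(downsample)
    for producer, factor in upsample.items():
        edges = [e for e in downsample if e[0] == producer]
        if factor > 1 and edges and all(downsample[e] == factor for e in edges):
            new_up[producer] = 1          # K1 output is queued as-is
            for e in edges:
                new_down[e] = 1           # consumers read it directly
    return new_up, new_down


# FIG. 8A/8B situation: 1:2 up-sample followed by 2:1 down-sample.
up, down = cancel_resampling({"K1": 2}, {("K1", "K2"): 2})
print(up, down)
```

In this sketch a producer whose consumers down-sample by differing factors is left untouched, since the up-sampled data is then genuinely needed by at least one consumer.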

FIGS. 9A and 9B pertain to another inefficiency that can arise, for example, in the case of multi-component output images. As is known in the art, digital images may have multiple components (e.g., RGB, YUV, etc.). Various application software programs may be designed or configured to process the different components as different planes of data. Here, for example, a complete output image 901 may be fully generated by a producing kernel K1 by generating one or more sheets of data composed only of the first component (R), generating one or more sheets of data composed only of the second component (G), and generating one or more sheets of data composed only of the third component (B). In various embodiments, it may be natural, or a standard default, to queue all of the data of an image being passed between a producing kernel and a consuming kernel in the same line buffer 902. Accordingly, FIG. 9A shows the image data of all three components 901 being queued in the same line buffer unit 902.

However, in the case of large output images, for example, storing the image data of all three components in the same line buffer unit may strain or otherwise consume a large amount of line buffer memory resources. Therefore, in an embodiment, referring to FIG. 9B, a compiler that compiles an application software program automatically recognizes when storing the different components of a multi-component image strains line buffer memory resources. For example, the compiler may initially allocate a fixed amount of buffer memory resources for storing and forwarding the image, or may allocate an amount of buffer memory resources that correlates to the size and/or amount of the data to be transferred, and, in view of the allocation, determine that the automatically allocated amount is insufficient or has reached a maximum threshold. In other approaches, the compilation process may include simulating the application software program and recognizing that the line buffer unit is a bottleneck (e.g., it often does not have the memory space to store a line group generated by the producing kernel, or does not have the bandwidth to respond to read requests from a consuming kernel). In response, the compilation process automatically modifies the application software and/or reconfigures the image processor so that the different components of the output image data of producing kernel K1 are queued in different line buffer units. Here, FIG. 9B shows R, G, and B image data being queued in different line buffer units 902_1, 902_2, and 902_3, respectively.
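The allocation check and the resulting split can be sketched as follows. This is a minimal Python illustration under stated assumptions: the per-component sizes, the capacity figure, and the function name `plan_line_buffers` are all hypothetical, and a real compiler might also use simulation results rather than a simple size comparison.

```python
def plan_line_buffers(component_bytes, capacity_bytes):
    """Return a list of line buffer units, each a list of component names.

    component_bytes: {component name: bytes that must be queued for it}
    capacity_bytes:  memory available in a single line buffer unit (assumed)
    """
    total = sum(component_bytes.values())
    if total <= capacity_bytes:
        # Default of FIG. 9A: one line buffer unit holds every component.
        return [list(component_bytes)]
    # FIG. 9B: resources are insufficient, so queue each component separately.
    return [[name] for name in component_bytes]


image = {"R": 100_000, "G": 100_000, "B": 100_000}
print(plan_line_buffers(image, capacity_bytes=512_000))
print(plan_line_buffers(image, capacity_bytes=256_000))
```

The first call keeps the single-buffer default because the data fits; the second call splits the three components across three line buffer units.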

The solution of FIG. 9B may also be used when producing kernel K1 has many consumers. In this case, if the default solution of FIG. 9A were adopted, the single line buffer unit storing all components of the image data 901 could become a system bottleneck, because the many consumers would require many loads/reads from the line buffer to receive all of the information for a single input image. Therefore, in an embodiment, the approach of FIG. 9B is adopted, in which each line buffer holds data for only one component type. In the example being discussed, this reduces the consumers' read requests to any single line buffer resource by about 66% compared to the default approach of FIG. 9A. That is, each of the line buffer units 902_1, 902_2, 902_3 of FIG. 9B only needs to support about 33% of the consumer read load of the line buffer unit 902 of FIG. 9A. The same reduction in demand also applies to the activity of writing the producing kernel's image data into the line buffer resources.

Another situation in which the approach of FIG. 9B can reduce inefficiency is when particular consumers consume only a subset of the components. For example, in an extreme case, one consumer consumes the R component, another consumer consumes the G component, and another consumer consumes the B component. In this case, each different consumer is configured with its own dedicated line buffer resource, which streamlines the different component-based data flows along different data paths (through different line buffer unit connections). By contrast, if the approach of FIG. 9A were used, the different component-based data flows would converge at a single point, the line buffer 902 of FIG. 9A, in which case the data flow of one component would be stalled by the large amount of read and write activity at line buffer unit 902 devoted to forwarding the other components.

FIGS. 10A and 10B show another efficiency improvement based on spreading line buffer resources downstream from a single producer. Here, the existence of too many consumers may make it necessary to use multiple line buffer units to forward the output image data of a single producing kernel. FIG. 10A shows the potential inefficiency in which four different consumers K2 through K5 consume the output of a single producing kernel K1 from a single line buffer unit 1002. Again, the single line buffer unit 1002 can become a bottleneck because it cannot remove queued data until all of its consumers have consumed it. In this case, the overall data flow from line buffer unit 1002 is reduced at least to the input rate of its slowest consumer. In addition, given the large number of consumers it supports, line buffer unit 1002 receives a heavy load of read requests, which may overwhelm its resources.

As such, as depicted in FIG. 10B, a first subset of the consumers K2, K3 is assigned to a first line buffer unit 1002_1, and a second subset of the consumers K4, K5 is assigned to a second line buffer unit 1002_2. The output image stream of producing kernel K1 is fed to both line buffer units 1002_1, 1002_2. Spreading the total consumer load across multiple line buffer unit resources 1002_1, 1002_2 reduces the total demand on any particular line buffer unit resource (as compared to the approach of FIG. 10A). In addition, the compiler can feed kernels that consume their input stream faster from the same line buffer unit (and/or feed kernels that consume their input stream more slowly from a different line buffer unit), so that the faster consuming kernels are not stalled by the consumption rate of the slower consuming kernels.
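One way to group consumers by consumption rate is sketched below in Python. It is a simple heuristic offered for illustration only: the consumption rates, the number of line buffer units, and the function name `assign_consumers` are assumptions, and the text does not prescribe any particular grouping algorithm.

```python
def assign_consumers(rates, num_buffers):
    """Group consumers of similar speed onto the same line buffer unit.

    rates:       {consumer name: relative input consumption rate} (assumed)
    num_buffers: number of line buffer units available for this producer
    Returns a list of consumer groups, fastest group first.
    """
    ordered = sorted(rates, key=rates.get, reverse=True)
    group_size = -(-len(ordered) // num_buffers)   # ceiling division
    return [ordered[i:i + group_size]
            for i in range(0, len(ordered), group_size)]


# Four consumers of K1, as in FIG. 10A, spread over two line buffer units.
groups = assign_consumers({"K2": 4, "K3": 3, "K4": 2, "K5": 1}, num_buffers=2)
print(groups)
```

With these assumed rates the faster consumers K2 and K3 share one line buffer unit and the slower consumers K4 and K5 share the other, mirroring the assignment shown in FIG. 10B.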

FIG. 11A depicts a "split and join" inefficiency that can arise from application software programs (or components thereof) designed as a DAG. As observed in FIG. 11A, the output of source kernel K1 is fed to two different consuming kernels K2 and K3. In addition, kernel K3 consumes the output of kernel K2. Kernel K3's dual dependency on the output of kernel K1 can cause both runtime computational inefficiency and modeling/design inefficiency. With respect to runtime inefficiency, the LB2 line buffer 1102_2 may need to be made very large in order to queue a large amount of K1's output data. In general, kernel K3 does not request the next line group from LB2 1102_2 until approximately the time at which the next line group from LB3 1102_3, which kernel K3 will process together with the next line group from LB2 1102_2, becomes available. Because of the potentially large propagation delay through K2, LB2 1102_2 can become very large. This discrepancy between when data in LB2 1102_2 is ready for consumption and when its sibling input data from kernel K2 to kernel K3 is available in LB3 1102_3 can also make the modeling or optimization process more difficult during design of the application software.

FIG. 11B shows a solution in which the compiler imposes a pipeline structure on the split-and-join structure. Here, the K2 kernel of FIG. 11A is expanded into a different kernel K2' that includes the original K2 kernel together with a load/store algorithm 1103 that simply consumes content from LB1 1102_1 and forwards it to LB4 1102_4. Importantly, the load/store algorithm 1103 can introduce some propagation delay into the unprocessed stream from K1, which eliminates the discrepancy between when the original output data from K1 is ready for consumption by K3 and when the output data from K2 is ready for consumption by K3 in LB3 1102_3.
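The restructuring from a split-and-join graph to a pipeline can be sketched on a simple edge-set model. The Python below is an illustrative assumption rather than the compiler's actual algorithm: kernels are strings, the data flow is a set of (producer, consumer) pairs, and the function name `pipeline_split_join` is hypothetical.

```python
def pipeline_split_join(edges):
    """Reroute bypass edges through the intermediate kernel.

    edges: set of (producer, consumer) pairs describing the data flow.
    If P feeds both A and C, and A also feeds C, the direct edge P -> C is
    removed and recorded as a stream that A must forward unchanged
    (the load/store pass-through of the expanded kernel).
    Returns (new_edges, passthrough), where passthrough maps the
    intermediate kernel to the list of streams it now forwards.
    """
    edges = set(edges)
    new_edges = set(edges)
    passthrough = {}
    for (p, c) in edges:
        for (p2, a) in edges:
            if p2 == p and a != c and (a, c) in edges:
                new_edges.discard((p, c))
                passthrough.setdefault(a, []).append((p, c))
    return new_edges, passthrough


# FIG. 11A: K1 feeds K2 and K3, and K2 also feeds K3.
new_edges, passthrough = pipeline_split_join(
    {("K1", "K2"), ("K1", "K3"), ("K2", "K3")})
print(sorted(new_edges), passthrough)
```

After the rewrite only the chain K1 to K2 to K3 remains, and K2 is marked as forwarding K1's unprocessed stream to K3, which corresponds to the expanded kernel K2' with its additional output buffer in FIG. 11B.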

Recall from the discussion of FIG. 3A that, in various embodiments, the scalar memory 303 may be configured to hold look-up tables or constant tables. In certain applications, the input image data processed by a kernel is a fixed constant rather than variable information (such as would be generated by a source kernel operating on varying input data). An example is lens shading correction, in which correction values for a lens are recorded for fairly coarse-grained regions of the lens surface. Coarse granularity corresponds to low-resolution image data (if the recorded data is implemented as separate entries, each corresponding to a different granule, the recorded data does not contain many entries).

When the image processor processes an image from a camera that includes the lens, one of these recorded correction values corresponds to the image region being processed by the execution lane array. The recorded value is therefore applied to each execution lane as an input value. In this sense, the lens correction values are implemented similarly to a look-up table or constant table. In addition, because the total amount of data needed to implement the correction values is limited, the correction values do not consume a large amount of memory space. As such, as observed in FIG. 12, in various embodiments, input image data 1210 that is fixed and small enough to fit into scalar memory 1203 is loaded into scalar memory 1203 (e.g., as part of the initial configuration of the application software) and is referenced during runtime as a look-up table or constant table by the kernel executing on the stencil processor's execution lane array, rather than being generated by a source kernel and fed to the kernel through a line buffer unit.

FIG. 13A shows another runtime issue that can potentially result in a large amount of data being queued in a line buffer unit and/or sheet generator. Here, FIG. 13A depicts three line groups 1301, 1302, 1303 that are queued, for example, in a sheet generator after having been provided by a line buffer unit. As an example, assume that each of line groups 1301, 1302, 1303 contains 16 rows of image data and that the dimension 1305 of the execution lane array of the sheet generator's corresponding stencil processor is also 16x16 pixels. Further assume that the dimension 1306 of the two-dimensional shift register array is 24x24 pixels in order to support a halo region that forms a border 4 pixels wide around the periphery of the execution lane array. Under at least these circumstances, a natural configuration would be to align the 16 rows of the execution lane array 1305 with the 16 rows of a particular line group. That is, the sheet generator forms sheets centered on a particular line group. FIG. 13A shows this approach, in which the execution lane array 1305 is aligned to operate over the height of the second line group 1302.

The problem is that, as depicted in FIG. 13A, because of the presence of the halo 1306, a complete sheet to be fed into the two-dimensional shift register array requires data from the lower region of the first line group 1301 and from the upper region of the third line group 1303 (the halo region covers these line groups as well). As such, in an embodiment, as depicted in FIG. 13B, the alignment is changed so that the minimum number of line groups needs to be present in order to form a full-size sheet. In this example, the alignment of FIG. 13B is shifted up by four pixel rows relative to the alignment of FIG. 13A, so that only two line groups 1301, 1302 need to be present in the sheet generator to form a full-size sheet. Doing so not only reduces the memory space needed in the sheet generator (and potentially the line buffer), but also means that a sheet waits for only two line groups, rather than three, before processing can begin.
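The effect of the alignment change can be verified by counting how many line groups a full-size sheet overlaps. The following minimal Python sketch uses the dimensions from the example (16-row line groups, a 16-row execution lane array, and a 4-row halo); the function name `line_groups_needed` and the zero-based row numbering are assumptions made for illustration.

```python
def line_groups_needed(first_lane_row, lanes=16, halo=4, group_rows=16):
    """Number of line groups overlapped by one full-size sheet.

    first_lane_row: zero-based image row aligned with the top row of the
                    execution lane array.
    The sheet spans the execution lane rows plus the halo above and below.
    """
    top = first_lane_row - halo
    bottom = first_lane_row + lanes + halo - 1
    return bottom // group_rows - top // group_rows + 1


# FIG. 13A: execution lanes aligned with the second line group (rows 16..31).
fig_13a = line_groups_needed(first_lane_row=16)
# FIG. 13B: alignment shifted up by four rows (lanes on rows 12..27).
fig_13b = line_groups_needed(first_lane_row=12)

print(fig_13a, fig_13b)
```

In the first case the sheet spans rows 12 through 35 and touches three line groups; after the four-row shift it spans rows 8 through 31 and touches only two.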

FIG. 14 pertains to a de-interleaving process performed by a sheet generator as an input process for a kernel that is fed with input image data that includes multiple pixels per data lane or, said another way, for which the base unit of data to be processed by the sheet generator's kernel includes multiple pixels. As an example, FIG. 14 shows the pixels of an input image received by a sheet generator that is structured as a mosaic 1401 of different color pixels, for example in a Bayer pattern format. Here, the input image is received by the sheet generator as line groups provided by a line buffer unit. As such, for example, each row of each line group received by the sheet generator contains R, G, and B pixels. Here, the base unit of the input image data is a unit cell 1402 of four pixels that includes an R pixel, a B pixel, and two G pixels.

Rather than simply parsing sheets directly from the received input image structure 1401 (which would produce sheets having the Bayer pattern), the sheet generator instead performs a de-interleaving process on the input image data structure 1401 to generate a new input structure 1403 for the kernel that includes four different types of sheets. That is, as observed in FIG. 14, the new input structure 1403 includes: 1) sheets composed only of, or otherwise derived only from, the R pixels of the input image; 2) sheets composed only of, or otherwise derived only from, the G pixels located at the same first position within the unit cells of the input image; 3) sheets composed only of, or otherwise derived only from, the G pixels located at the same second position within the unit cells of the input image; and 4) sheets composed only of, or otherwise derived only from, the B pixels of the input image. The sheets may be composed only of input image pixels, or may be up-sampled, for example by interpolating values into the input image locations where a different color is located.

The newly structured sheets are then provided to the sheet generator's associated kernel, which processes them and generates output sheets of the same structure 1403 (one color per sheet) that are provided back to the sheet generator. The sheet generator then performs an interleaving process on the single-color structure 1403 to generate an output image for consumption that has the original structure 1401, including unit cells of mixed color.
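The de-interleaving and re-interleaving steps can be sketched with plain nested lists. The Python below assumes an RGGB unit cell layout (R at the top left, B at the bottom right), which is only one possible Bayer arrangement, and the function names are hypothetical; interpolation-based up-sampling of the single-color sheets is not shown.

```python
def deinterleave_bayer(mosaic):
    """Split a mosaic (list of pixel rows, assumed RGGB unit cells)
    into four single-color planes: R, G (first), G (second), B."""
    r = [row[0::2] for row in mosaic[0::2]]
    g1 = [row[1::2] for row in mosaic[0::2]]
    g2 = [row[0::2] for row in mosaic[1::2]]
    b = [row[1::2] for row in mosaic[1::2]]
    return r, g1, g2, b


def interleave_bayer(r, g1, g2, b):
    """Rebuild the mixed-color mosaic from the four single-color planes."""
    mosaic = []
    for r_row, g1_row, g2_row, b_row in zip(r, g1, g2, b):
        top, bottom = [], []
        for rv, g1v, g2v, bv in zip(r_row, g1_row, g2_row, b_row):
            top += [rv, g1v]        # upper row of each unit cell
            bottom += [g2v, bv]     # lower row of each unit cell
        mosaic += [top, bottom]
    return mosaic


mosaic = [
    ["R0", "G0", "R1", "G1"],
    ["g0", "B0", "g1", "B1"],
    ["R2", "G2", "R3", "G3"],
    ["g2", "B2", "g3", "B3"],
]
planes = deinterleave_bayer(mosaic)
print(planes[0])                       # the R-only plane
print(interleave_bayer(*planes) == mosaic)
```

Because each plane holds a single color, the kernel can process the planes independently, and the interleaving function restores the original mixed-color layout afterwards.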

In various embodiments, the line buffers or line buffer units described above may be characterized more generally as buffers that store and forward image data between producing and consuming kernels. That is, in various embodiments, a buffer need not necessarily queue line groups. In addition, the hardware platform of the image processor may include a plurality of line buffer units having associated memory resources, and one or more line buffers may be configured to operate out of a single line buffer unit. That is, a single hardware line buffer unit may be configured to store and forward different image data flows between different producing/consuming kernel pairs.

FIG. 15 shows a method described above. The method includes constructing 1501 an image processing software data flow in which a buffer stores and forwards image data being transferred from a producing kernel to one or more consuming kernels. The method includes recognizing 1502 that the buffer has insufficient resources to store and forward the image data. The method also includes modifying 1503 the image processing software data flow to include multiple buffers that store and forward the image data during the transfer of the image data from the producing kernel to the one or more consuming kernels.

3.0 Construction of Low-Level Program Code

FIG. 16 shows a pre-runtime development environment in which a programmer designs a high-level image processing function and the application development environment provides any or all of the transformations described above in Section 2.0, so that the developer does not have to identify the inefficiencies and/or write the transformations from scratch.

Here, the development environment automatically recognizes any of the inefficiencies described above and automatically imposes the corresponding transformational improvement, for example by referring to a library 1601 that contains descriptions of the inefficiencies (for which the development environment scans the program code under development) and the corresponding fixes (which are applied if an inefficiency is found). That is, the development environment automatically inserts program code from the library 1601 that performs the more efficient process (e.g., as part of a compilation process), or otherwise modifies the program code so as to replace the inefficient code with new code that includes the fix for the inefficiency.

Thus, the program code that performs the operations described above, or alternative embodiments thereof, may be expressed as higher-level program code or as lower-level object code. In various embodiments, higher-level virtual instruction set architecture (ISA) code may specify the data values to be operated on as memory reads having x, y address coordinates, whereas the object code may instead treat these data accesses as two-dimensional shift register operations (such as any of the shift operations described above or similar embodiments).

A compiler may convert the x, y reads in the development environment into corresponding shifts of the two-dimensional shift register that are specified in the object code (e.g., a read in the development environment having x, y coordinates (+2, +2) is realized in object code as a shift to the left by two spaces and a shift down by two spaces).
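The mapping from a relative read to shifts can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the compiler described here: the function names are hypothetical, the y axis is assumed to point upward (so that a +y read corresponds to shifting data down), and the register plane wraps around at its edges purely to keep the simulation short.

```python
def shifts_for_read(dx, dy):
    """Translate a relative read at (dx, dy) into shift operations."""
    ops = []
    if dx > 0:
        ops.append(("left", dx))
    elif dx < 0:
        ops.append(("right", -dx))
    if dy > 0:
        ops.append(("down", dy))
    elif dy < 0:
        ops.append(("up", -dy))
    return ops


def apply_shifts(plane, ops):
    """Simulate the shifts on plane[y][x], where y grows upward (wrap-around assumed)."""
    h, w = len(plane), len(plane[0])
    for direction, n in ops:
        sx = n if direction == "left" else -n if direction == "right" else 0
        sy = n if direction == "down" else -n if direction == "up" else 0
        # After a left shift by n, position x holds what used to be at x + n.
        plane = [[plane[(y + sy) % h][(x + sx) % w] for x in range(w)]
                 for y in range(h)]
    return plane


# Each cell starts out holding its own (x, y) coordinates.
plane = [[(x, y) for x in range(5)] for y in range(5)]
ops = shifts_for_read(+2, +2)
shifted = apply_shifts(plane, ops)
```

After the two shifts, the lane at position (1, 1) holds the value that started at (3, 3), which is what a read at offset (+2, +2) asks for.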

Depending on the environment, the developer may have visibility into both of these levels (or, e.g., only the higher virtual ISA level). In still other embodiments, such prewritten routines may be invoked at runtime rather than pre-runtime (e.g., by a just-in-time compiler).

4.0 Conclusion

From the preceding sections, it is pertinent to recognize that an image processor as described in Section 1.0 may be embodied in hardware on a computer system (e.g., as part of a handheld device's system on chip (SOC) that processes data from the handheld device's camera).

It is pertinent to point out that the various image processor architecture features described above are not necessarily limited to image processing in the traditional sense, and may therefore be applied to other applications that may (or may not) cause the image processor to be re-characterized. For example, if any of the image processor architecture features described above were used in the creation and/or generation and/or rendering of animation, as opposed to the processing of actual camera images, the image processor may be characterized as a graphics processing unit. Additionally, the image processor architecture features described above may be applied to other technical applications such as video processing, vision processing, image recognition and/or machine learning. Applied in this manner, the image processor may be integrated with (e.g., as a co-processor to) a more general-purpose processor (e.g., one that is, or is part of, a CPU of a computing system), or may be a stand-alone processor within a computing system.

The hardware design embodiments discussed above may be embodied within a semiconductor chip and/or as a description of a circuit design for eventual targeting toward a semiconductor manufacturing process. In the latter case, such circuit descriptions may take the form of a (e.g., VHDL or Verilog) register transfer level (RTL) circuit description, a gate level circuit description, a transistor level circuit description, a mask description, or various combinations thereof. Circuit descriptions are typically embodied on a computer-readable storage medium (such as a CD-ROM or other type of storage technology).

From the preceding sections, it is pertinent to recognize that an image processor as described above may be embodied in hardware on a computer system (e.g., as part of a handheld device's system on chip (SOC) that processes data from the handheld device's camera). Where the image processor is embodied as a hardware circuit, note that the image data processed by the image processor may be received directly from a camera. Here, the image processor may be part of a discrete camera, or part of a computing system having an integrated camera. In the latter case, the image data may be received directly from the camera or from the computing system's system memory (e.g., the camera sends its image data to system memory rather than to the image processor). Note also that many of the features described in the preceding sections may be applicable to a graphics processing unit (which renders animation).

FIG. 17 provides an exemplary depiction of a computing system. Many of the components of the computing system described below are applicable to a computing system having an integrated camera and an associated image processor (e.g., a handheld device such as a smartphone or tablet computer). Those of ordinary skill in the art will readily be able to distinguish between the two.

As observed in FIG. 17, the basic computing system may include a central processing unit 1701 (which may include, e.g., a plurality of general-purpose processing cores 1715_1 through 1715_N and a main memory controller 1717 disposed on a multi-core processor or application processor), system memory 1702, a display 1703 (e.g., touchscreen, flat panel), a local wired point-to-point link (e.g., USB) interface 1704, various network I/O functions 1705 (such as an Ethernet interface and/or a cellular modem subsystem), a wireless local area network (e.g., WiFi) interface 1706, a wireless point-to-point link (e.g., Bluetooth) interface 1707 and a GPS interface 1708, various sensors 1709_1 through 1709_N, one or more cameras 1710, a battery 1711, a power management control unit 1724, a speaker and microphone 1713, and an audio coder/decoder 1714.

An application processor or multi-core processor 1750 may include one or more general-purpose processing cores 1715 within its CPU 1701, one or more graphics processing units 1716, a memory management function 1717 (e.g., a memory controller), an I/O control function 1718 and an image processing unit 1719. The general-purpose processing cores 1715 typically execute the operating system and application software of the computing system. The graphics processing units 1716 typically execute graphics-intensive functions to, e.g., generate graphics information that is presented on the display 1703. The memory control function 1717 interfaces with the system memory 1702 to write data to and read data from the system memory 1702. The power management control unit 1724 generally controls the power consumption of the system 1700.

The image processing unit 1719 may be implemented according to any of the image processing unit embodiments described in the preceding sections. Alternatively or in combination, the IPU 1719 may be coupled to either or both of the GPU 1716 and the CPU 1701 as a co-processor thereof. Additionally, in various embodiments, the GPU 1716 may be implemented with any of the image processor features described above.

Each of the touchscreen display 1703, the communication interfaces 1704 through 1707, the GPS interface 1708, the sensors 1709, the camera 1710, and the speaker/microphone codec 1713, 1714 can be viewed as a form of I/O (input and/or output) relative to the overall computing system, which, where appropriate, also includes an integrated peripheral device (e.g., the one or more cameras 1710). Depending on the implementation, various ones of these I/O components may be integrated on the application processor/multi-core processor 1750, or may be located off the die or outside the package of the application processor/multi-core processor 1750.

In an embodiment, the one or more cameras 1710 include a depth camera capable of measuring the depth between the camera and an object in its field of view. Application software, operating system software, device driver software and/or firmware executing on a general-purpose CPU core (or other functional block having an instruction execution pipeline to execute program code) of an application processor or other processor may perform any of the functions described above. Here, many of the components of the computing system of FIG. 17 may be present within a higher-performance computing system (e.g., a server) that executes program code corresponding to the application development environment of FIG. 16, including a compiler that performs any or all of the transformations described above.

Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific hardware components that contain hardwired logic for performing the processes, or by any combination of programmed computer components and custom hardware components.

Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs and magneto-optical disks, flash memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media, or other types of media/machine-readable media suitable for storing electronic instructions. For example, the elements may be downloaded as a computer program transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).

In the foregoing specification, specific exemplary embodiments have been described. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Some examples are described below.

Example 1: A machine-readable storage medium containing program code that, when processed by a computing system, causes the computing system to perform a method, the method comprising:

a) constructing an image processing software data flow in which a buffer stores and forwards image data that is sent from a producing kernel to one or more consuming kernels;

b) recognizing that the buffer has insufficient resources to store and forward the image data; and

c) modifying the image processing software data flow to include multiple buffers that store and forward the image data as it is transferred from the producing kernel to the one or more consuming kernels.

Example 2: The machine-readable storage medium of Example 1, wherein the method further comprises modifying the image processing software data flow so that different portions of the image data are sent from the producing kernel to different ones of the multiple buffers.

Example 3: The machine-readable storage medium of Example 2, wherein the different portions correspond to different color components of the image data.

Example 4: The machine-readable storage medium of at least one of the preceding examples, wherein the method further comprises modifying the image processing software data flow so that the same image data is sent from the producing kernel to first and second buffers of the multiple buffers.

Example 5: The machine-readable storage medium of Example 4, wherein the modifying further comprises configuring the first buffer of the multiple buffers to feed a first consuming kernel of the one or more consuming kernels, and configuring the second buffer of the multiple buffers to feed a second consuming kernel of the one or more consuming kernels.
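The two ways of spreading data over multiple buffers in Examples 2 to 5 can be contrasted with a short Python sketch. The function names, the buffer naming scheme, and the representation of an image as a list of (r, g, b) tuples are hypothetical choices made only for illustration.

```python
def split_by_color(image):
    """Examples 2-3: each color component is sent to its own buffer."""
    return {
        "buf_r": [p[0] for p in image],
        "buf_g": [p[1] for p in image],
        "buf_b": [p[2] for p in image],
    }


def replicate_per_consumer(image, consumers):
    """Examples 4-5: the same data is sent to one buffer per consuming kernel."""
    return {f"buf_{c}": list(image) for c in consumers}


pixels = [(10, 20, 30), (40, 50, 60)]
by_color = split_by_color(pixels)
by_consumer = replicate_per_consumer(pixels, ["K2", "K3"])
```

Splitting divides the storage demand among the buffers, whereas replication keeps the full data in each buffer but gives each consuming kernel its own source.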

Example 6: The machine-readable storage medium of at least one of the preceding examples, wherein the method further comprises:

recognizing that the image processing software data flow is to upsample an image by a factor to form an upsampled image;

recognizing that the image processing software data flow is to downsample the upsampled image by the factor; and

removing the upsampling of the image and the downsampling of the upsampled image from the image processing software data flow.

Example 7: The machine-readable storage medium of at least one of the preceding examples, wherein the method further comprises:

recognizing a split-and-join pattern in the image processing software data flow; and

restructuring the image processing software data flow as a series of stages to remove the split-and-join pattern.

Example 8: The machine-readable storage medium of at least one of the preceding examples, configured to operate on an architecture having a plurality of line buffer units interconnected with a plurality of stencil processor units and/or at least one corresponding sheet generator unit.

Example 9: The machine-readable storage medium of at least one of the preceding examples, configured to process stencils, in particular overlapping stencils.

Example 10: The machine-readable storage medium of at least one of the preceding examples, configured to operate on a data computation unit that includes a shift register structure having wider dimensions than an execution lane array, in particular with registers outside the execution lane array.

Example 11: A machine-readable storage medium containing program code that, when processed by a computing system, causes the computing system to perform a method, the method comprising:

recognizing that an image to be processed by a kernel of program code will be downsampled, wherein the kernel of program code is to execute on an image processing core of an image processor, the image processing core comprising a two-dimensional execution lane array; and

structuring the operation of the kernel to process the image with fewer than all of the execution lanes of the execution lane array, so as to reduce the consumption of memory resources used to support the downsampling of the image.

Example 12: The machine-readable storage medium of Example 11, wherein the image processor comprises a plurality of image processing cores including the image processing core, and one of the plurality of image processing cores has an associated memory for storing constant information, the method further comprising:

configuring the associated memory to store a constant input image, wherein the constant input image is processed by each kernel of program code that executes on the one of the plurality of image processing cores having the associated memory.

Example 13: The machine-readable storage medium of Example 11 or 12, wherein the image processor comprises a plurality of image processing cores including the image processing core, and input image data for one of the plurality of image processing cores is received as groups of lines of image data, the method further comprising:

aligning an input image region on which the one image processing core is to operate so that the input image region overlaps a minimum number of the groups of lines.
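The benefit of the alignment in Example 13 comes down to integer arithmetic, sketched below in Python. The function names are hypothetical, a fixed number of lines per group is assumed, and the sketch assumes the region's starting row may be moved; whether that is permissible depends on the kernel in question.

```python
def groups_overlapped(start_row, height, group_height):
    """Count the line groups touched by rows [start_row, start_row + height)."""
    first = start_row // group_height
    last = (start_row + height - 1) // group_height
    return last - first + 1


def aligned_start(start_row, group_height):
    """Snap the region start to the line-group boundary at or before it."""
    return (start_row // group_height) * group_height


unaligned = groups_overlapped(6, 16, 8)                     # rows 6..21
aligned = groups_overlapped(aligned_start(6, 8), 16, 8)     # rows 0..15
```

With 8-line groups, a 16-row region starting at row 6 touches three groups, while the same region starting on a group boundary touches two, which is the minimum for that height.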

Example 14: The machine-readable storage medium of at least one of Examples 11 to 13, wherein the image processor comprises a plurality of image processing cores including the image processing core, and input image data for one of the plurality of image processing cores comprises a mosaic of multiple pixels per unit of data on which the one image processing core is to operate, the method further comprising:

configuring the input image data to be de-interleaved before being processed by the one image processing core; and

configuring output image data generated by the one image processing core to be interleaved into a mosaic of multiple pixels per unit of data.
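De-interleaving and re-interleaving a mosaic can be sketched as follows. This Python example assumes a 2x2 tile (as in a Bayer pattern) and image dimensions that are multiples of the tile size; the function names and the dictionary-of-planes representation are hypothetical.

```python
def deinterleave(mosaic, tile=2):
    """Split a mosaic into tile*tile planes, one per position within the tile."""
    h, w = len(mosaic), len(mosaic[0])
    planes = {}
    for ty in range(tile):
        for tx in range(tile):
            planes[(ty, tx)] = [
                [mosaic[y][x] for x in range(tx, w, tile)]
                for y in range(ty, h, tile)
            ]
    return planes


def interleave(planes, tile=2):
    """Inverse of deinterleave: weave the planes back into a single mosaic."""
    ph, pw = len(planes[(0, 0)]), len(planes[(0, 0)][0])
    out = [[None] * (pw * tile) for _ in range(ph * tile)]
    for (ty, tx), plane in planes.items():
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                out[y * tile + ty][x * tile + tx] = value
    return out


bayer = [
    ["R", "G", "R", "G"],
    ["G", "B", "G", "B"],
    ["R", "G", "R", "G"],
    ["G", "B", "G", "B"],
]
planes = deinterleave(bayer)
restored = interleave(planes)
```

Between the two calls, each plane holds samples of a single kind, so a kernel can process it as an ordinary image before the output is woven back into mosaic form.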

Example 15: The machine-readable storage medium of at least one of Examples 11 to 14, configured to operate on an architecture having a plurality of line buffer units interconnected with a plurality of stencil processor units and/or at least one corresponding sheet generator unit.

Example 16: The machine-readable storage medium of at least one of Examples 11 to 15, configured to process stencils, in particular overlapping stencils.

Example 17: The machine-readable storage medium of at least one of Examples 11 to 16, configured to operate on a data computation unit that includes a shift register structure having wider dimensions than the execution lane array, in particular with registers outside the execution lane array.

Example 18: A computing system, comprising:

one or more general-purpose processing cores;

a system memory;

a memory controller coupled between the system memory and the one or more general-purpose processing cores; and

a storage medium containing program code that, when processed by the computing system, causes the computing system to perform a method, the method comprising:

Example 19: The computing system of Example 18, wherein the method further comprises modifying the image processing software data flow so that different portions of the image data are sent from the producing kernel to different ones of the multiple buffers.

Example 20: The computing system of Example 19, wherein the different portions correspond to different color components of the image data.

Example 21: The computing system of at least one of Examples 18 to 20, wherein the method further comprises modifying the image processing software data flow so that the same image data is sent from the producing kernel to first and second buffers of the multiple buffers.

Example 22: The computing system of at least one of Examples 18 to 21, wherein the modifying further comprises configuring a first buffer of the multiple buffers to feed a first consuming kernel of the one or more consuming kernels, and configuring a second buffer of the multiple buffers to feed a second consuming kernel of the one or more consuming kernels.

Example 23: The computing system of at least one of Examples 18 to 22, wherein the compiling comprises:

recognizing that the image processing software data flow is to upsample an image by a factor to form an upsampled image;

Example 24: The computing system of at least one of Examples 18 to 23, wherein the compiling comprises:

restructuring the image processing software data flow as a series of stages to remove the split-and-join pattern.

Example 25: The computing system of at least one of Examples 18 to 24, wherein the buffer stores and forwards groups of lines of the image data.

Example 26: The computing system of at least one of Examples 18 to 25, wherein the buffer is implemented in a line buffer unit of an image processor, the image processor including a network coupled between the line buffer unit and a plurality of processing cores that respectively execute the producing kernel of program code and the one or more consuming kernels of program code.

Example 27: The computing system of at least one of Examples 18 to 26, comprising a processor with an architecture having a plurality of line buffer units interconnected with a plurality of stencil processor units and/or at least one corresponding sheet generator unit.

Example 28: The computing system of at least one of Examples 18 to 27, configured to process stencils, in particular overlapping stencils.

Example 29: The computing system of at least one of Examples 18 to 28, comprising a data computation unit that includes a shift register structure having wider dimensions than an execution lane array, in particular with registers outside the execution lane array.

Claims

1. A machine-readable storage medium comprising program code that, when processed by a computing system, causes the computing system to perform a method, the method comprising:
a) constructing an image processing software data flow in which a buffer stores and forwards image data that is sent from a producing kernel to one or more consuming kernels;
b) recognizing that the buffer has insufficient resources to store and forward the image data; and
c) modifying the image processing software data flow to include multiple buffers that store and forward the image data as it is transferred from the producing kernel to the one or more consuming kernels.

2. The machine-readable storage medium of claim 1, wherein the method further comprises modifying the image processing software data flow so that different portions of the image data are sent from the producing kernel to different ones of the multiple buffers.

3. The machine-readable storage medium of claim 2, wherein the different portions correspond to different color components of the image data.

4. The machine-readable storage medium of claim 1, wherein the method further comprises modifying the image processing software data flow so that the same image data is sent from the producing kernel to first and second buffers of the multiple buffers.

5. The machine-readable storage medium of claim 4, wherein the modifying further comprises configuring the first buffer of the multiple buffers to feed a first consuming kernel of the one or more consuming kernels, and configuring the second buffer of the multiple buffers to feed a second consuming kernel of the one or more consuming kernels.

6. The machine-readable storage medium according to at least one of the preceding claims, wherein the method further comprises:
recognizing that the image processing software data flow is to upsample an image by a factor to form an upsampled image;
recognizing that the image processing software data flow is to downsample the upsampled image by the factor; and
removing the upsampling of the image and the downsampling of the upsampled image from the image processing software data flow.

7. The machine-readable storage medium according to at least one of the preceding claims, wherein the method further comprises:
recognizing a split-and-join pattern in the image processing software data flow; and
restructuring the image processing software data flow as a series of stages to remove the split-and-join pattern.

8. The machine-readable storage medium according to at least one of the preceding claims, configured to operate on an architecture having a plurality of line buffer units interconnected with a plurality of stencil processor units and/or at least one corresponding sheet generator unit.

9. The machine-readable storage medium according to at least one of the preceding claims, configured to process stencils, in particular overlapping stencils.

10. The machine-readable storage medium according to at least one of the preceding claims, configured to operate on a data computation unit comprising a shift register structure having wider dimensions than an execution lane array, in particular with registers outside the execution lane array.

A machine-readable storage medium comprising program code that, when processed by a computing system, causes the computing system to perform a method, the method comprising:
Recognizing that an image to be processed by a kernel of program code will be downsampled, wherein the kernel of program code is to be executed on an image processing core of an image processor, the image processing core comprising a two-dimensional execution lane array; and
Structuring the kernel to process the image with fewer than all of the execution lanes of the execution lane array, to reduce consumption of memory resources used to support the downsampling of the image.

The machine-readable storage medium of claim 11, wherein the image processor comprises a plurality of image processing cores including the image processing core, and one of the plurality of image processing cores has an associated memory for storing constant information, the method further comprising:
Configuring the associated memory to store a constant input image, wherein the constant input image is processed by each kernel of program code executed on the one of the plurality of image processing cores having the associated memory.

13. The machine-readable storage medium of claim 11 or 12, wherein the image processor comprises a plurality of image processing cores including the image processing core, and input image data for one of the plurality of image processing cores is received as groups of lines of image data, the method further comprising:
Aligning an input image region on which the one image processing core is to operate so that the input image region overlaps a minimum number of the groups of lines.
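
The effect of aligning a region to line-group boundaries can be sketched as follows. This is an illustration only: it assumes line groups of a fixed height that start at row 0, and the helper names `groups_overlapped` and `aligned_start` are hypothetical.

```python
def groups_overlapped(start_row: int, height: int, group_height: int) -> int:
    """Count the line groups touched by rows start_row .. start_row + height - 1."""
    first = start_row // group_height
    last = (start_row + height - 1) // group_height
    return last - first + 1


def aligned_start(start_row: int, group_height: int) -> int:
    """Round the region's first row down to a line-group boundary."""
    return (start_row // group_height) * group_height


# A 16-row region starting at row 6, with line groups of 8 rows each.
unaligned = groups_overlapped(6, 16, 8)
aligned = groups_overlapped(aligned_start(6, 8), 16, 8)
print(unaligned, aligned)
```

Under these assumptions, a region that starts on a group boundary touches the fewest groups possible for its height, so fewer groups need to be received before the core can operate.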

14. The machine-readable storage medium of claim 11, wherein the image processor comprises a plurality of image processing cores including the image processing core, and input image data for one of the plurality of image processing cores comprises a mosaic of a plurality of pixels per unit of data on which the image processing core is to operate, the method further comprising:
Configuring the input image data to be de-interleaved before being processed by the one image processing core; and
Configuring the output image data generated by the one image processing core to be interleaved into a mosaic of a plurality of pixels per unit of data.
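
De-interleaving and re-interleaving of a mosaic can be sketched as below. The sketch assumes a 2x2 mosaic (for example a Bayer-like layout) stored as a list of rows with even width and height. The function names and the 2x2 tile size are assumptions made for illustration and are not taken from the claims.

```python
from typing import Dict, List, Tuple

Plane = List[list]


def deinterleave(mosaic: Plane) -> Dict[Tuple[int, int], Plane]:
    """Split a 2x2 mosaic into four planes keyed by (row offset, column offset)."""
    return {
        (dy, dx): [row[dx::2] for row in mosaic[dy::2]]
        for dy in (0, 1)
        for dx in (0, 1)
    }


def interleave(planes: Dict[Tuple[int, int], Plane]) -> Plane:
    """Rebuild the 2x2 mosaic from the four planes produced by deinterleave."""
    height = len(planes[(0, 0)])
    width = len(planes[(0, 0)][0])
    mosaic = [[None] * (2 * width) for _ in range(2 * height)]
    for (dy, dx), plane in planes.items():
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                mosaic[2 * y + dy][2 * x + dx] = value
    return mosaic


mosaic = [
    ["R", "G", "R", "G"],
    ["G", "B", "G", "B"],
    ["R", "G", "R", "G"],
    ["G", "B", "G", "B"],
]
planes = deinterleave(mosaic)
print(planes[(0, 0)])  # every sample in this plane comes from the same mosaic position
print(interleave(planes) == mosaic)
```

After de-interleaving, each plane holds samples of a single mosaic position, so a kernel can treat it as an ordinary single-component image before the output is interleaved again.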

The machine-readable storage medium of claim 11, configured to operate in an architecture having a plurality of line buffer units interconnected with a plurality of stencil processor units and/or at least one corresponding sheet generator unit.

The machine-readable storage medium of claim 11, configured to process stencils, in particular overlapping stencils.

18. A computing system, comprising:
One or more general purpose processing cores;
System memory;
A memory controller coupled between the system memory and the one or more general purpose processing cores; and
A storage medium containing program code that, when processed by the computing system, causes the computing system to perform a method, the method comprising:
a) constructing an image processing software data flow in which a buffer stores and forwards image data sent from a production kernel to one or more consuming kernels;
b) recognizing that resources are insufficient for the buffer to store and forward the image data; and
c) modifying the image processing software data flow to include a plurality of buffers for storing and forwarding the image data while transferring the image data from the production kernel to the one or more consuming kernels.
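
Steps a) through c) can be sketched as a small data-flow rewrite. The sketch assumes that the insufficient resource is buffer capacity measured in bytes and that the image data is split by color component, as in the dependent claims on different portions and color components. The `Flow` class, the function `split_if_needed`, and all sizes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Flow:
    producer: str
    consumers: List[str]
    # buffer name -> image components carried by that buffer
    buffers: Dict[str, List[str]] = field(default_factory=dict)


def split_if_needed(flow: Flow, component_bytes: Dict[str, int], capacity: int) -> Flow:
    """Return the flow unchanged if its single buffer fits in `capacity`,
    otherwise return a new flow with one buffer per image component."""
    name, components = next(iter(flow.buffers.items()))
    needed = sum(component_bytes[c] for c in components)
    if needed <= capacity:
        return flow
    split = {f"{name}_{c}": [c] for c in components}
    return Flow(flow.producer, flow.consumers, split)


# a) one buffer between a production kernel and two consuming kernels
flow = Flow("K1", ["K2", "K3"], {"buf": ["R", "G", "B"]})
# b) 3 x 40 KiB does not fit in a 64 KiB buffer, so c) the flow is modified
new_flow = split_if_needed(flow, {"R": 40960, "G": 40960, "B": 40960}, 65536)
print(sorted(new_flow.buffers))
```

Splitting by component is only one way to distribute the data. The claims also cover sending the same image data to several buffers that each supply a different consuming kernel.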

19. The computing system of claim 18, wherein the method further comprises modifying the image processing software data flow such that different portions of the image data are sent from the production kernel to different ones of the plurality of buffers.

20. The computing system of claim 19, wherein said different portions correspond to different color components of said image data.

21. The computing system of claim 18, wherein the method further comprises modifying the image processing software data flow such that the same image data is sent from the production kernel to first and second buffers of the plurality of buffers.

22. The computing system of claim 18, wherein the modifying comprises: configuring a first buffer of the plurality of buffers to supply a first consuming kernel of the one or more consuming kernels; and configuring a second buffer of the plurality of buffers to supply a second consuming kernel of the one or more consuming kernels.

23. The computing system of claim 18, wherein the compiling comprises:
Recognizing that the image processing software data flow upsamples an image by a factor to form an upsampled image;
Recognizing that the image processing software data flow downsamples the upsampled image by the same factor; and
Removing the upsampling of the image and the downsampling of the upsampled image from the image processing software data flow.

24. The computing system of claim 18, wherein the compiling comprises:
Recognizing a split-and-join pattern in the image processing software data flow; and
Restructuring the image processing software data flow as a series of stages to remove the split-and-join pattern.

25. The computing system of claim 18, wherein the buffer stores and forwards groups of lines of image data.

26. The computing system of claim 18, wherein the buffer is implemented in a line buffer unit of an image processor, the image processor comprising a network coupled between the line buffer unit and a plurality of processing cores that respectively execute the production kernel of program code and the one or more consuming kernels of program code.

27. The computing system of any one of claims 18 to 26, comprising a processor having an architecture with a plurality of line buffer units interconnected with a plurality of stencil processor units and/or at least one corresponding sheet generator unit.

28. The computing system of any one of claims 18 to 27, configured to process stencils, in particular overlapping stencils.

29. The computing system of any one of claims 18 to 28, comprising a data computation unit comprising a shift register structure having wider dimensions than the execution lane array, in particular with registers outside the execution lane array.