KR20230159596A - Parallel processing architecture using speculative encoding

Info

Publication number: KR20230159596A
Application number: KR1020237036559A
Authority: KR
Inventors: 피터 폴리
Original assignee: 아세니움 인코포레이티드
Priority date: 2021-03-26
Filing date: 2022-03-25
Publication date: 2023-11-21
Also published as: WO2022204450A1; EP4315045A1

Abstract

Techniques for program execution in a parallel processing architecture using speculative encoding are disclosed. A two-dimensional array of compute elements is accessed. Each compute element within the array is known to a compiler and is coupled to its neighboring compute elements within the array. Control for the array of compute elements is provided on a cycle-by-cycle basis. The control is enabled by a stream of wide, variable length control words generated by the compiler. Two or more operations are coalesced into a control word, which includes a branch decision and operations associated with the branch decision. The coalesced control word includes speculatively encoded operations for at least two possible branch paths. The at least two possible branch paths generate independent side effects. Operations associated with the branch decision that are not indicated by the branch decision are suppressed.

Description

Parallel processing architecture using speculative encoding

Related applications

This application claims priority to U.S. Provisional Patent Application Serial No. 63/166,298, "Parallel Processing Architecture Using Speculative Encoding," filed March 26, 2021.

The foregoing application is hereby incorporated by reference in its entirety in jurisdictions where permissible.

Technical field

This application relates generally to program execution and more particularly to a parallel processing architecture using speculative encoding.

The volume of data that is collected, stored as datasets, analyzed, processed, and used for purposes ranging from surveillance to marketing is growing explosively. Where data was once stored and accessed as pages in a handbook or laboratory journal, or as tables in a collection of reports in a file cabinet, the advent of digital data formats and the falling cost of digital storage technologies have greatly increased the ease with which data can be stored and accessed. Data formats allow storage of many kinds of data, including numbers, sounds, pictures, and videos. Storage technologies allow data to be stored locally, remotely, and securely on small, portable devices. Some storage technologies are based on electromechanical systems, including "hard" disk drives that store data magnetically on rotating disks. Others are based on "solid state" techniques, which have no moving parts and store data using electronic devices such as transistors. Data can also be stored on optical disks and magnetic tape. The choice of storage technology depends on the amount of data to be stored, the speed at which the data is accessed, the frequency with which it is accessed, and, of course, cost. That is, storage requirements for constantly accessed data are more stringent and more expensive than those for rarely accessed data.

The collected data is stored as sets of data, or datasets, which can be very large. Processing of vast datasets is frequently performed by organizations for commercial, governmental, medical, educational, research, or retail purposes, among many others. These organizations expend enormous resources on data processing because the success or failure of a given organization depends directly on its ability to process data for financial and competitive benefit. An organization thrives when its data processing meets the organization's goals; when it does not, the organization founders. Many and varied data collection techniques are used to collect data from a wide and diverse range of individuals, including customers, citizens, patients, purchasers, students, test subjects, and volunteers. Sometimes the individuals participate willingly; at other times they are unwitting subjects of data collection. Common data collection techniques include "opt-in" techniques, in which an individual signs up, registers, creates an account, or otherwise willingly agrees to participate in the data collection. Other techniques are legislated, such as a government requiring citizens to obtain a registration number and to use that number when interacting with government agencies, law enforcement, emergency services, and so on. Additional data collection techniques, such as tracking purchase histories, website visits, button clicks, and menu selections, are more subtle or completely hidden. Whatever techniques are used to collect it, the collected data is highly valuable to the organizations, and processing it rapidly is critical.

Organizations perform many mission-critical data processing jobs. Whether running payroll, analyzing research data, or training a neural network for machine learning, a processing job is composed of many complex tasks. The tasks can include loading and storing various datasets, accessing processing components and systems, and so on. The tasks themselves are typically based on subtasks, which can be used to handle specific jobs such as loading or reading data from storage, performing computations on the data, storing or writing data back to storage, and handling inter-subtask communication such as data and control. The datasets that are accessed are often vast and can easily swamp processing architectures that are ill-suited to the processing tasks or inflexible in their designs. To greatly improve task processing efficiency and throughput, two-dimensional (2D) arrays of elements can be used for processing the tasks and subtasks. The 2D arrays include compute elements, multiplier elements, caches, queues, controllers, decompressors, arithmetic logic units (ALUs), storage elements, and other components that can communicate among themselves. These arrays of elements are configured and operated by providing control to the array on a cycle-by-cycle basis. Control of the 2D array is accomplished by providing control words generated by a compiler. The control includes a stream of control words, where the control words can include wide, variable length, microcode control words generated by the compiler. The control words are used to configure the array and to control the processing of the tasks and subtasks. Further, the arrays can be configured in a topology best suited to the task processing. The topologies into which the arrays can be configured include a systolic, vector, cyclic, spatial, streaming, or Very Long Instruction Word (VLIW) topology, among others. The topologies can include a topology that enables machine learning functionality.

Task processing is based on a parallel processing architecture using speculative encoding. A processor-implemented method for task processing is disclosed comprising: accessing a two-dimensional (2D) array of compute elements, wherein each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements; providing control for the array of compute elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length control words generated by the compiler; and coalescing two or more operations into a control word, wherein the control word includes a branch decision and operations associated with the branch decision. The coalesced control word includes speculatively encoded operations for at least two possible branch paths. In embodiments, the branch decision supports subroutine execution. The branch decision can further support a programming loop. The programming loop can include coalesced operations from both the end of the loop and the beginning of the loop.

Embodiments include promoting data for a downstream operation. The downstream operation can include an arithmetic, vector, matrix, or tensor operation; a Boolean operation; and so on. The downstream operation can include an operation within a directed acyclic graph (DAG). Promoting the data produced by the taken branch path can be based on the compiler scheduling a commit write to occur outside a branch indecision window. Further embodiments include suppressing one or more operations associated with the branch decision that are not indicated by the branch decision. The suppressing is performed dynamically. The suppressing can include idling compute elements that would have been used to process the operations had they not been suppressed. The suppressing enables power reduction in the 2D array of compute elements. A suppressed operation processes no data, produces no data, and generates no control signals. In embodiments, the suppressing prevents data from being committed. Further embodiments include removing results from a side of the branch not indicated by the branch decision. Removing the results can be performed to eliminate race conditions and to avoid data ambiguity. The decision to promote taken branch path data is based on the branch decision. Thus, data produced by either branch path cannot be considered valid until the branch decision has been made. In embodiments, operation of the array is halted if the untaken side of the branch attempts to execute a commit write after the branch decision.
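The commit behavior described above can be illustrated with a minimal Python sketch. It is not the disclosed hardware: the class `SpeculativeStore`, the exception `ArrayHalted`, and the side labels "taken" and "not_taken" are hypothetical names introduced here, and holding in-window writes in a pending buffer is an assumption that stands in for the compiler scheduling commit writes outside the branch indecision window. The sketch shows data from either path staying uncommitted until the branch decision, promotion of the indicated path's data, removal of the other path's results, and a halt if the side that was not indicated attempts a commit write after the decision.

```python
class ArrayHalted(RuntimeError):
    """Hypothetical signal that operation of the array has been halted."""


class SpeculativeStore:
    """Toy model of commit writes around a branch indecision window."""

    def __init__(self):
        self.memory = {}       # committed data, visible to downstream operations
        self.pending = {}      # side -> {address: value}; not yet considered valid
        self.indicated = None  # side selected by the branch decision, once made

    def commit_write(self, side, address, value):
        if self.indicated is None:
            # Branch undecided: data from either path cannot be treated as valid.
            self.pending.setdefault(side, {})[address] = value
        elif side == self.indicated:
            self.memory[address] = value
        else:
            raise ArrayHalted(
                f"{side} side attempted a commit write after the branch decision"
            )

    def resolve(self, indicated):
        """Apply the branch decision: promote one side, remove the other's results."""
        self.indicated = indicated
        self.memory.update(self.pending.pop(indicated, {}))
        self.pending.clear()


store = SpeculativeStore()
store.commit_write("taken", 0x10, 7)
store.commit_write("not_taken", 0x10, 9)   # same address, competing value
store.resolve("taken")
print(store.memory)
```

Clearing the pending buffer at the decision is what removes the ambiguity between the two competing values for the same address in this model.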

Various features, aspects, and advantages of various embodiments will become more apparent from the following further description.

The following detailed description of certain embodiments may be understood by reference to the following figures.
Figure 1 is a flow diagram for a parallel processing architecture using speculative encoding.
Figure 2 is a flow diagram for suppressing operations.
Figure 3A illustrates a system block diagram for a highly parallel architecture with a shallow pipeline.
Figure 3B illustrates compute element array detail.
Figure 4 illustrates branches in code.
Figure 5A shows code blocks coalesced by the compiler.
Figure 5B shows a programming loop coalesced by the compiler.
Figure 6 illustrates a compiler view of a code image.
Figure 7 shows suppressed operations for an expected branch path.
Figure 8 illustrates a compressed code word fetch and decompress pipeline.
Figure 9 shows code word encoding and a naive demand-driven fetch pipeline overlay.
Figure 10 illustrates compiler coarse branch prefetch hints.
Figure 11 shows a system block diagram for compiler interactions.
Figure 12 is a system diagram for a parallel processing architecture using speculative encoding.

Techniques for program execution based on a parallel processing architecture using speculative encoding are disclosed. The programs that are executed can serve a wide range of applications based on data manipulation, such as image or audio processing applications, AI applications, business applications, and so on. The programs can include operations, where the operations can be based on tasks, and the tasks themselves can be based on subtasks. The tasks that are executed can perform a variety of operations including arithmetic operations, shift operations, logical operations including Boolean operations, vector or matrix operations, tensor operations, and so on. The subtasks can be executed based on precedence, priority, coding order, amount of parallelization, data flow, data availability, compute element availability, communication channel availability, and so on. The data manipulations are performed on a two-dimensional (2D) array of compute elements. The compute elements within the 2D array can be implemented with central processing units (CPUs), graphics processing units (GPUs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), processing cores, or other processing components or combinations of processing components. The compute elements can include heterogeneous processors, homogeneous processors, processor cores within an integrated circuit or chip, and so on. The compute elements can be coupled to local storage, which can include local memory elements, register files, cache storage, and so on. The cache, which can include a hierarchical cache such as an L1, L2, and L3 cache, can be used to store data such as intermediate results, compressed control words, coalesced control words, decompressed control words, relevant portions of a control word, and so on. The cache can store data produced by a taken branch path, where the taken branch path is determined by a branch decision. The control words are used to control one or more compute elements within the array of compute elements. Both compressed and decompressed control words can be used to control the array of elements. Multiple layers of the two-dimensional (2D) array of compute elements can be "stacked" to comprise a three-dimensional array of compute elements.

The tasks, subtasks, and so on associated with the operations are generated by a compiler. The compiler can include a general-purpose compiler, a hardware-description-based compiler, a compiler written or "tuned" for the array of compute elements, a constraint-based compiler, a satisfiability-based compiler (SAT solver), and so on. Control is provided to the hardware in the form of control words, where the control words are provided on a cycle-by-cycle basis. One or more control words are generated by the compiler. The control words can include wide, variable length, microcode control words. The length of a microcode control word can be adjusted by compressing the control word, by recognizing that a compute element is not needed by a task so that the control bits within the control word are not required for that compute element, and so on. The control words can be used to route data, to set up operations to be performed by the compute elements, to idle individual compute elements or rows and/or columns of compute elements, and so on. The compiled microcode control words associated with the compute elements are distributed to the compute elements. The compute elements are controlled by a control unit that operates on decompressed control words. The control words enable processing by the compute elements. Task processing is enabled by executing the one or more control words.

To accelerate execution of tasks, to reduce or eliminate stalling of the array of compute elements, and so on, two or more operations can be coalesced into a control word. The compiler can perform the coalescing at compile time, before the control words are loaded into the array of compute elements. The coalesced control word includes a branch decision and operations associated with the branch decision. The coalesced control word can enable partial or complete execution of two or more potential branch decisions or sides. The coalesced control word can also enable partial or complete execution of a single branch decision that results in two branch paths or sides. In a usage example, the coalesced control word includes a branch and the operations associated with the sides of the branch. Since the outcome of the branch is unlikely to be known a priori to execution of the branch, execution of the control words associated with all possible sides of the branch can be started, or "pre-executed," using the parallel resources available within the array. When the branch decision is made, the operations associated with the indicated side of the branch can proceed. The operations associated with the side of the branch that is not indicated can be suppressed or ignored. Suppressing the operations can include idling the compute elements that would have been used to execute the operations associated with the side of the branch that is not indicated. Suppressing operations and, further, idling compute elements reduces power consumption within the 2D array. Ignoring operations increases operational simplicity. Suppressing operations or ignoring operations can be enabled on a per compute element basis. A combination of suppressing operations and ignoring operations can also be enabled on a per compute element basis.
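The pre-execution and suppression described above can be sketched in Python as a small simulation. This is an illustrative model under stated assumptions, not the disclosed control word format: `Operation`, `CoalescedControlWord`, `execute`, the `decision_cycle` field, and the per-side scratch storage are all hypothetical, and the model assumes the branch decision resolves at the start of its cycle (with `decision_cycle` non-negative). Operations from both sides run until the decision resolves; afterward, operations on the side that is not indicated are suppressed and produce no data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Registers = Dict[str, int]


@dataclass
class Operation:
    side: str                             # "taken" or "not_taken"
    cycle: int                            # cycle in which the compiler scheduled it
    dest: str                             # register that receives the result
    compute: Callable[[Registers], int]


@dataclass
class CoalescedControlWord:
    decide: Callable[[Registers], bool]   # True indicates the "taken" side
    decision_cycle: int                   # cycle in which the decision resolves
    operations: List[Operation]


def execute(word: CoalescedControlWord, registers: Registers):
    scratch: Dict[str, Registers] = {"taken": {}, "not_taken": {}}
    suppressed: List[Tuple[str, str]] = []
    indicated = None
    last_cycle = max([word.decision_cycle] + [op.cycle for op in word.operations])
    for cycle in range(last_cycle + 1):
        if cycle == word.decision_cycle:
            indicated = "taken" if word.decide(registers) else "not_taken"
        for op in word.operations:
            if op.cycle != cycle:
                continue
            if indicated is not None and op.side != indicated:
                # Suppressed: the compute element idles and produces no data.
                suppressed.append((op.side, op.dest))
                continue
            # Each side sees the shared registers plus its own speculative results.
            view = {**registers, **scratch[op.side]}
            scratch[op.side][op.dest] = op.compute(view)
    # Only results from the indicated side are promoted.
    return {**registers, **scratch[indicated]}, suppressed


word = CoalescedControlWord(
    decide=lambda r: r["a"] > r["b"],
    decision_cycle=1,
    operations=[
        Operation("taken", 0, "x", lambda r: r["a"] + r["b"]),
        Operation("not_taken", 0, "x", lambda r: r["a"] - r["b"]),
        Operation("taken", 1, "y", lambda r: r["x"] * 2),
        Operation("not_taken", 1, "y", lambda r: r["x"] * 3),
    ],
)
result, suppressed = execute(word, {"a": 5, "b": 3})
print(result, suppressed)
```

In this example both `x` operations are pre-executed in cycle 0 because the outcome is not yet known, while the cycle 1 operation on the side that is not indicated never runs.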

A parallel architecture using speculative encoding enables program execution. A two-dimensional (2D) array of compute elements is accessed. The compute elements can include compute elements, processors, or cores within an integrated circuit; processors or cores within an application specific integrated circuit (ASIC); cores programmed within a programmable device such as a field programmable gate array (FPGA); and so on. The compute elements can include homogeneous or heterogeneous processors. Each compute element within the 2D array of compute elements is known to a compiler. The compiler, which can include a general-purpose compiler, a hardware-oriented compiler, or a compiler specific to the compute elements, can compile code for each of the compute elements. Each compute element is coupled to its neighboring compute elements within the array of compute elements. The coupling of the compute elements enables data communication between and among compute elements. Thus, the compiler can control data flow between and among the compute elements and can further control data commitment to memory outside of the array. Control is provided to the hardware through one or more control words generated by the compiler. The control can be provided on a cycle-by-cycle basis. A cycle can include a clock cycle, a data cycle, a processing cycle, a physical cycle, an architectural cycle, and so on.

The control is enabled by a stream of wide, variable length control words generated by the compiler. The control words can include microcode control words, instruction equivalents, building blocks for high-level languages, and so on. In embodiments, each operation from the operations represents an instruction equivalent. In embodiments, the instruction equivalent comprises a building block for a high-level language. A control word length, such as a microcode control word length, can vary based on the type of control, on compression, on simplifications such as identifying that a compute element is not needed, and so on. The control words, which can include compressed control words, coalesced control words, and so on, can be decoded and provided to a control unit that controls the array of compute elements. A coalesced control word can include a branch instruction and one or more operations. A control word can be decompressed to a level of fine control granularity, where each compute element (whether an integer compute element, floating point compute element, address generation compute element, write buffer element, read buffer element, etc.) is individually and uniquely controlled. Each compressed control word is decompressed to allow control on a per element basis. The segment of a control word that applies to an individual compute element can be referred to as a bunch. The decoding can depend on whether a given compute element is needed for processing a task or subtask; whether the compute element has a specific control word associated with it or receives a repeated control word (e.g., a control word used for two or more compute elements); and so on. A program is executed on the array of compute elements based on a set of instructions. The execution can be accomplished by executing a plurality of subtasks associated with the compiled task. Executing the plurality of subtasks can include executing operations on one or more compute elements within the array of compute elements.
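One way to picture a variable length control word that is decompressed into per-element bunches is the Python sketch below. The encoding is an assumption made purely for illustration (a leading presence mask followed by the bunches for the compute elements that are needed); the source does not specify the compression scheme, and `compress`, `decompress`, and the use of `None` for an unneeded element are hypothetical. The sketch shows the word shrinking when elements are not needed and expanding back to one entry per compute element.

```python
from typing import List, Optional

Bunch = Optional[int]   # control bits for one compute element; None means not needed


def compress(bunches: List[Bunch]) -> List[int]:
    """Pack per-element bunches into [presence_mask, bunch, bunch, ...]."""
    mask = 0
    payload = []
    for index, bunch in enumerate(bunches):
        if bunch is not None:
            mask |= 1 << index      # mark this element as having control bits
            payload.append(bunch)
    return [mask] + payload


def decompress(word: List[int], element_count: int) -> List[Bunch]:
    """Expand a compressed word back to one bunch per compute element."""
    mask = word[0]
    payload = iter(word[1:])
    return [
        next(payload) if mask & (1 << index) else None
        for index in range(element_count)
    ]


# Four compute elements; the middle two are not needed by the task.
bunches = [0x1A, None, None, 0x2B]
word = compress(bunches)
restored = decompress(word, element_count=4)
print(len(word), restored == bunches)
```

A repeated control word shared by several compute elements, which the text also mentions, would need an additional field and is left out of this sketch.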

Figure 1 is a flow diagram for a parallel processing architecture using speculative encoding. Groupings of compute elements (CEs), such as CEs assembled within a 2D array of CEs, can be configured to execute a variety of operations associated with programs. The operations can be based on tasks and on subtasks associated with the tasks. The 2D array can further interface with other elements such as controllers, storage elements, ALUs, multiplier elements, and so on. The operations can accomplish a variety of processing objectives such as application processing, data manipulation, and so on. The operations can operate on a variety of data types including integer, real, and character data types; vectors and matrices; tensors; and so on. Control is provided to the array of compute elements on a cycle-by-cycle basis, where the control is based on control words generated by a compiler. The control words, which can include microcode control words, enable or idle various compute elements; provide data; route results between or among CEs, caches, and storage; and so on. The control enables compute element operation, memory access priority, and so on. Compute element operation and memory access priority enable the hardware to properly sequence compute element results. The control enables execution of a compiled program on the array of compute elements. Further, both sides of a branch are speculatively encoded and coalesced into control that includes a branch decision and the operations associated with the branch decision. When the branch decision is made, the operations associated with the "taken" side of the branch can continue to be executed, can operate on data, and so on. The operations associated with the "not taken" side of the branch can be suppressed or ignored. Suppressing the operations for the untaken side of the branch can reduce or prevent execution delay, enable power reduction, prevent data from being committed, and so on. Ignoring the operations can occur when no unwanted data commit is pending.

The flow 100 includes accessing a two-dimensional (2D) array 110 of computational elements, where each computational element within the array is known to a compiler and is coupled to its neighboring computational elements within the array. The computational elements can be based on a variety of types of processors. The computational elements or CEs can include central processing units (CPUs), graphics processing units (GPUs), processors or processing cores within application specific integrated circuits (ASICs), processing cores programmed within field programmable gate arrays (FPGAs), and so on. In embodiments, the computational elements within the array have identical functionality. The computational elements can include heterogeneous compute resources, where the heterogeneous compute resources may or may not be collocated within a single integrated circuit or chip. The computational elements can be configured in a topology, where the topology can be built into the array, programmed or configured within the array, etc. In embodiments, the array of computational elements is configured by a control word to implement one or more of a systolic, a vector, a cyclic, a spatial, a streaming, or a Very Long Instruction Word (VLIW) topology.

The computational elements can further include a topology suited to machine learning computation. The computational elements can be coupled to other elements within the array of CEs. In embodiments, the coupling of the computational elements can enable one or more topologies. The other elements to which the CEs can be coupled can include storage elements such as one or more levels of cache storage; multiplier units; address generator units for generating load (LD) and store (ST) addresses; queues; and so on. The compiler to which each computational element is known can include a C, C++, or Python compiler. The compiler to which each computational element is known can include a compiler written especially for the array of computational elements. The coupling of each CE to its neighboring CEs enables sharing of elements such as cache elements, multiplier elements, ALU elements, or control elements; communication between or among neighboring CEs; and the like.

The flow 100 includes providing control 120 for the array of computational elements on a cycle-by-cycle basis. The control for the array can include configuration of elements such as computational elements within the array; loading and storing data; routing data to, from, and among computational elements; and so on. In the flow 100, the control is enabled 122 by a stream of wide, variable length control words. The control words can configure the computational elements and other elements within the array; enable or disable individual computational elements, rows and/or columns of computational elements; load and store data; route data to, from, and among computational elements; and so on. The one or more control words are generated 124 by the compiler. The compiler that generates the control words can include a general-purpose compiler such as a C, C++, or Python compiler; a hardware description language compiler such as a VHDL or Verilog compiler; a compiler written for the array of computational elements; and so on. The compiler can be used to map functionality to the array of computational elements. In embodiments, the compiler can map machine learning functionality to the array of computational elements. The machine learning can be based on a machine learning (ML) network, a deep learning (DL) network, a support vector machine (SVM), etc. In embodiments, the machine learning functionality can include a neural network (NN) implementation. A control word generated by the compiler can be used to configure one or more CEs, to enable data to flow to or from a CE, to configure a CE to perform an operation, and so on. Depending on the type and size of the task that is compiled to control the array of computational elements, one or more of the CEs can be controlled, while other CEs are unneeded by the particular task. A CE that is unneeded can be marked in the control word as unneeded. An unneeded CE requires no data, nor is a control word required by it. In embodiments, the unneeded computational element can be controlled by a single bit. In other embodiments, a single bit can control an entire row of CEs by instructing hardware to generate idle signals for each CE in the row. The single bit can be set for "unneeded", reset for "needed", or set for a similar usage of the bit to indicate when a particular CE is unneeded by a task.
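By way of illustration only, and not as part of the disclosure, the single bit that controls an entire row of CEs can be modeled with a short Python sketch. The `RowControl` type, its field names, and the field-count measure are hypothetical; the sketch assumes one idle bit per row, followed by opcode fields only for rows that are needed, which is one plausible way a control word could become variable in length.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RowControl:
    """Hypothetical slice of a control word that governs one row of CEs."""
    row_idle: bool                 # single bit: True marks the whole row as unneeded
    ce_ops: Tuple[str, ...] = ()   # one opcode per CE, present only for needed rows


def expand_idle_signals(rows: List[RowControl], cols: int) -> List[List[bool]]:
    """Model the hardware fanning one row bit out to an idle signal per CE."""
    return [[row.row_idle] * cols for row in rows]


def field_count(rows: List[RowControl]) -> int:
    """Count encoded fields: one idle bit per row plus opcodes for needed rows only."""
    return sum(1 + (0 if row.row_idle else len(row.ce_ops)) for row in rows)


rows = [RowControl(row_idle=True), RowControl(row_idle=False, ce_ops=("add", "mul"))]
idle_signals = expand_idle_signals(rows, cols=2)
fields = field_count(rows)
```

In this sketch the idle row contributes a single bit and no opcode fields, so a task that leaves most rows unneeded would yield a shorter control word.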

The control words generated by the compiler can include conditionality. In embodiments, the control includes branching. Code, which can include code associated with an application such as image processing, audio processing, and so on, can include conditions that cause execution of a sequence of code to transfer to a different sequence of code. The conditionality can be based on evaluating an expression such as a Boolean or arithmetic expression. In embodiments, the conditionality can determine code jumps. The code jumps can include conditional jumps as just described, or unconditional jumps such as a jump to a halt, exit, or terminate instruction. The conditionality can be determined within the array of elements. In embodiments, the conditionality can be established by a control unit. In order to establish conditionality, the control unit can operate based on a control word provided to the control unit. In embodiments, the control unit can operate on decompressed control words. The control words can be decompressed by a decompressor logic block that decompresses words from a compressed control word cache on their way to the array. In embodiments, the set of directions can include a spatial allocation of subtasks on one or more computational elements within the array of computational elements. In other embodiments, the set of directions can enable multiple programming loop instances circulating within the array of computational elements. The multiple programming loop instances can include multiple instances of the same programming loop, multiple programming loops, etc.

The relevant portions of the control word can be stored within a cache, a register file, or other storage associated with the array of computational elements. The control words stored in a decompressed control word (DCW) cache can include compressed control words, decompressed control words, and so on. In embodiments, an access queue can be associated with a cache, where the access queue can be used to queue requests to access the cache, storage, and so on, for storing data and loading data. The data cache can include a multilevel cache such as a level 1 (L1) cache, a level 2 (L2) cache, and so on. The L1 caches can be used to store blocks of data to be processed. The L1 cache can include a small, fast memory that is quickly accessible by the computational elements and other components. The L2 caches can include larger, slower storage in comparison to the L1 caches. The L2 caches can store "next up" data, results such as intermediate results, and so on. In embodiments, the L1 and L2 caches can further be coupled to level 3 (L3) caches. The L3 caches can be larger than the L2 and L1 caches and can include slower storage. Accessing data from L3 caches is still faster than accessing main storage. In embodiments, the L1, L2, and L3 caches can include 4-way set associative caches. In embodiments, the cache can include a dual read, single write (2R1W) data cache. As the name implies, a 2R1W data cache can support up to two read operations and one write operation simultaneously without causing read/write conflicts, race conditions, data corruption, and the like. In embodiments, the 2R1W data cache can support simultaneous fetching of data for potential branch paths for the compiler. Recall that a branch condition can select among two or more branch paths; that is, the branch decision determines both the branch path that is taken and the other branch paths that are not taken.
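By way of illustration only, the per-cycle port limits of a 2R1W data cache can be sketched as a toy Python model. The class name, the `cycle` method, and the choice that reads observe the state before the same-cycle write are assumptions made for this sketch; the disclosure does not specify them.

```python
from typing import Dict, Optional, Sequence, Tuple


class TwoReadOneWriteCache:
    """Toy model of a 2R1W data cache: two reads and one write per cycle."""

    def __init__(self) -> None:
        self._lines: Dict[int, int] = {}

    def cycle(
        self,
        reads: Sequence[int] = (),
        write: Optional[Tuple[int, int]] = None,
    ) -> Tuple[Optional[int], ...]:
        """Service one cycle; reads observe the state before this cycle's write."""
        if len(reads) > 2:
            raise ValueError("a 2R1W cache services at most two reads per cycle")
        values = tuple(self._lines.get(addr) for addr in reads)
        if write is not None:
            addr, value = write
            self._lines[addr] = value
        return values


cache = TwoReadOneWriteCache()
cache.cycle(write=(0x10, 7))    # operand for one potential branch path
cache.cycle(write=(0x20, 9))    # operand for the other potential branch path
# Fetch operands for both potential branch paths in one cycle while a write lands.
both_paths = cache.cycle(reads=(0x10, 0x20), write=(0x30, 1))
```

The last call shows the case of interest here: data for two potential branch paths is fetched in the same cycle in which an unrelated write completes.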

The flow 100 includes merging (coalescing) two or more operations 130 into a control word. The merging can include identifying operations that can be executed independently of one another. Note that a program that can be executed on the 2D array of processors can be represented by a tree, where the branches of the tree can represent sequences of operations or commands to be executed. The decision about which branch or branches of the program tree are executed can be based on a branch decision. The branch decision can be based on a value such as the value of a variable, a condition, an expression such as a Boolean expression, and so on. In the flow 100, the control word includes a branch decision and operations associated with the branch decision 132. The operations can be associated with one or more sides of the branch decision. The branch decision can be based on an anticipated outcome, a predicted outcome, and so on. In embodiments, the merged control word can include speculatively encoded operations for at least two possible branch paths (or threads). For operations to be merged, the operations must be independent of one another. The operations can share data such as input data; however, the tasks performed or the output data generated by an operation associated with one thread do not affect the tasks performed or the output data generated by an operation associated with another thread. In embodiments, the at least two possible branch paths can generate independent side effects. The independent side effects can include control signals generated, output data to be routed to another operation, data to be written to storage, and the like. In other embodiments, the at least two possible branch paths can generate computational element actions that should be committed. A computational element action that should be committed can include a write operation to storage outside the 2D array of computational elements. In further embodiments, the at least two possible branch paths can be executed in parallel by the 2D array of computational elements.
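By way of illustration only, the independence requirement for merging can be sketched in Python. The `Op` type, the read and write sets, and the dictionary layout of the merged word are hypothetical. The sketch assumes that two branch paths are independent when they may share inputs but neither writes a location that the other reads or writes.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Sequence


@dataclass(frozen=True)
class Op:
    """Hypothetical operation with the storage locations it reads and writes."""
    name: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def _union(ops: Sequence[Op], attr: str) -> FrozenSet[str]:
    """Collect every location named by `attr` across a sequence of operations."""
    return frozenset().union(*(getattr(op, attr) for op in ops))


def independent(path_a: Sequence[Op], path_b: Sequence[Op]) -> bool:
    """Paths may share inputs, but neither may write what the other reads or writes."""
    writes_a, writes_b = _union(path_a, "writes"), _union(path_b, "writes")
    reads_a, reads_b = _union(path_a, "reads"), _union(path_b, "reads")
    return not (writes_a & writes_b or writes_a & reads_b or writes_b & reads_a)


def coalesce(decision: str, taken: Sequence[Op], not_taken: Sequence[Op]) -> Dict[str, object]:
    """Merge a branch decision and the operations of both paths into one word."""
    if not independent(taken, not_taken):
        raise ValueError("branch paths have dependent side effects")
    return {"decision": decision, "taken": tuple(taken), "not_taken": tuple(not_taken)}


then_path = [Op("add", frozenset({"x"}), frozenset({"r1"}))]
else_path = [Op("sub", frozenset({"x"}), frozenset({"r2"}))]
word = coalesce("x > 0", then_path, else_path)
```

Both paths read the shared input `x`, yet each writes its own destination, so the side effects stay independent and the merge is permitted.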

The operations that can be merged into the control word can configure computational elements within the 2D array, enable or disable computational elements, and so on. In embodiments, the two or more operations control data flow within the 2D array of computational elements. The data flow can include providing or routing data to, from, between, or among computational elements, and so on. The data flow can be controlled by one or more bits within the control word. The flow 100 further includes suppressing 140 one or more operations associated with the branch decision that are not indicated by the branch decision. The suppressing can include halting execution of the operations associated with the side of the branch that was not indicated, stopping the operations, and so on. In embodiments, the suppressing can be accomplished dynamically. In a usage example, the suppressing can occur after the branch decision is determined. Thus, the suppressing can be accomplished "on the fly" based on the branch decision. The suppressing can accomplish other processing or latency reduction techniques. In embodiments, the suppressing can prevent speculative branch execution latency. In other embodiments, the suppressing can enable power reduction in the 2D array of computational elements. Suppression of operations can be accomplished by idling the computational elements to which the now-suppressed operations were assigned. Idling computational elements reduces the amount of power consumed by those computational elements and, by extension, the amount of power consumed by the 2D array. In further embodiments, the suppressing can prevent data from being committed. Data that is committed can be transferred out of the 2D array of computational elements and stored within storage outside the 2D array. Data generated by suppressed operations can be ignored, so the data need not be committed.
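By way of illustration only, the effect of suppression on pending commits and on idled computational elements can be sketched in Python. The function name, the pending-write lists, and the CE assignment map are hypothetical; the sketch assumes that speculative writes are held as pending until the branch decision is known.

```python
from typing import Dict, List, Set, Tuple

Write = Tuple[str, int]  # hypothetical (destination, value) pair awaiting commit


def resolve_branch(
    pending: Dict[str, List[Write]],
    assigned_ces: Dict[str, Set[int]],
    indicated: str,
) -> Tuple[List[Write], Set[int]]:
    """Commit writes of the indicated path; idle CEs used only by other paths."""
    committed = list(pending[indicated])
    idled: Set[int] = set()
    for path, ces in assigned_ces.items():
        if path != indicated:
            # CEs shared with the indicated path must keep running.
            idled |= ces - assigned_ces[indicated]
    return committed, idled


pending = {"taken": [("mem[0]", 5)], "not_taken": [("mem[1]", -5)]}
assigned = {"taken": {0, 1}, "not_taken": {2, 3}}
committed, idled = resolve_branch(pending, assigned, indicated="taken")
```

Only the write from the indicated path leaves the array, and the CEs that served the other path are idled, which corresponds to the commit prevention and power reduction described above.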

The flow 100 further includes ignoring 142 one or more operations associated with the branch decision that are not indicated by the branch decision. Ignoring the operations can include ignoring the data generated by the operations, so that the control word(s) of the taken branch path do not use results produced by "speculative" execution of the early portion of a path that turns out not to be taken. In the flow 100, the ignoring is accomplished atomically 144. An atomic operation is an operation that can be isolated from other operations that may be executing in parallel with it. An atomic operation can be considered "indivisible" in that the operation can be executed independently of other operations. In the flow 100, the ignoring is accomplished by setting an idle bit 146 in a control word. The idle bit can be used to enable or idle an individual computational element, to enable or idle a row or column of computational elements, to send a control word to an individual computational element, and so on. The ignoring can include the control word(s) of the taken branch path no longer using a particular computational element and thereby setting its idle bit.

In some embodiments, certain operations are performed in the array for two or more sides of a branch instruction. The results of the branch path or paths not taken can be ignored, and any side effects of the branch path or paths not taken can be cancelled or ignored. However, minimizing the number of speculatively executed operations both minimizes the side effects that must be cancelled or ignored and reduces power consumption in the array. To accomplish this, the compiler can implement speculative encoding, in which a control word is speculatively encoded so that the encoding includes the operations before the branch, the branch decision operation itself, and the operations after the branch on multiple sides of the branch, for both the taken and the not-taken branch paths. Because the array of computational elements can provide large resource capability, a compressed control word (CCW) can speculatively encode a large number of parallel operations, which can include multiple branch paths.
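By way of illustration only, the compiler-side encoding can be sketched in Python. The function names, the dictionary layout, and the `depth` parameter are hypothetical; the sketch assumes that the compiler bounds speculation by encoding only the first few operations of each branch path next to the pre-branch operations and the decision.

```python
from typing import Dict, Sequence


def speculatively_encode(
    pre_branch: Sequence[str],
    decision: str,
    paths: Dict[str, Sequence[str]],
    depth: int,
) -> Dict[str, object]:
    """Build one merged control word (hypothetical dictionary layout)."""
    return {
        "pre_branch": tuple(pre_branch),
        "decision": decision,
        # Encode only the first `depth` operations of every branch path, which
        # bounds the side effects that must later be cancelled or ignored.
        "speculative": {name: tuple(ops[:depth]) for name, ops in paths.items()},
    }


def speculative_op_count(word: Dict[str, object]) -> int:
    """Number of operations that execute before the decision is known."""
    return sum(len(ops) for ops in word["speculative"].values())


paths = {"taken": ["add", "store"], "not_taken": ["sub", "store"]}
shallow = speculatively_encode(["load x"], "x > 0", paths, depth=1)
deep = speculatively_encode(["load x"], "x > 0", paths, depth=2)
```

Comparing `speculative_op_count(shallow)` with `speculative_op_count(deep)` shows the trade-off named above: a smaller depth leaves fewer speculative results to ignore, at the cost of less work completed before the decision resolves.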

Various steps in the flow 100 may be changed in order, repeated, omitted, or the like without departing from the disclosed concepts. Various embodiments of the flow 100 can be included in a computer program product embodied in a computer readable medium that includes code executable by one or more processors.

Fig. 2 is a flow diagram for operation suppression. As discussed throughout, various types of programs can be executed on an array of computational elements. The programs include tasks, subtasks, and so on that can be processed on the computational elements. The tasks can include general operations such as arithmetic, vector, array, or matrix operations; Boolean operations such as NAND, NOR, XOR, or NOT operations; operations based on applications such as neural network or deep learning operations; and so on. In order for the tasks to be processed correctly, control words are provided to the array of computational elements on a cycle-by-cycle basis. The control words configure the array to execute the tasks. The control words can be provided to the array of computational elements by a compiler. The control words can include merged control words, where a merged control word includes a branch decision and operations associated with the branch decision. Providing control words that control placement, scheduling, data transfers, and so on within the array can maximize task processing throughput. This maximization ensures that a task that generates data required by a second task is processed prior to processing of the second task, and so on. A branch decision is based on a task that includes a branch operation. The branch operation can be based on conditionality, where the conditionality can be established by a program that controls operation of a control unit. A branch can include a plurality of "ways", "paths", or "sides" that can be taken based on the conditionality. The conditionality can include evaluating an expression such as an arithmetic or Boolean expression, transferring from one sequence of instructions to a second sequence of instructions, and so on. In embodiments, the conditionality can determine code jumps. Because the branch path that will be taken is not known a priori, before the conditionality is evaluated, the operations associated with the sides of the branch can be executed on a speculative basis. When the conditionality is determined and the branch decision is made, the operations associated with the taken side can proceed, while the operations associated with the side not indicated can be ignored. Further, the operations associated with the untaken side of the branch can be suppressed. Suppression of the operations can be accomplished by idling the computational elements that were configured for the operations associated with the untaken side of the branch. Note that this is not done "dynamically" by the control unit, other than by directing control word fetches down the taken path; rather, the control words of the taken side of the branch idle the speculatively encoded operations for the other side of the branch. This is encoded into the control words at compile time.
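By way of illustration only, the division of labor between the control unit and the compiled control words can be sketched in Python. The addresses, the CE numbering, the `IDLE` marker, and the control word store are hypothetical; the sketch assumes that the control unit only steers the fetch, while the idling of the other side is already written into the control words of each path.

```python
from typing import Dict, Set

IDLE = "idle"  # hypothetical marker for a CE whose idle bit is set

# Hypothetical compiled control word store: address -> {CE index: operation}.
# Address 0 holds the merged word: CE 2 evaluates the decision while CEs 0
# and 1 speculatively start the two sides of the branch.
CONTROL_WORDS: Dict[int, Dict[int, str]] = {
    0: {0: "add", 1: "sub", 2: "cmp"},
    1: {0: "store", 1: IDLE, 2: IDLE},   # taken side: idles CE 1 (other side)
    2: {0: IDLE, 1: "store", 2: IDLE},   # not-taken side: idles CE 0
}


def next_fetch_address(decision: bool, taken_addr: int, not_taken_addr: int) -> int:
    """The control unit's only dynamic role: steer the fetch to the indicated path."""
    return taken_addr if decision else not_taken_addr


def idled_ces(word: Dict[int, str]) -> Set[int]:
    """Return the CEs that a control word idles."""
    return {ce for ce, op in word.items() if op == IDLE}


addr = next_fetch_address(decision=True, taken_addr=1, not_taken_addr=2)
fetched = CONTROL_WORDS[addr]
```

No suppression logic runs at fetch time in this sketch: the word at address 1 idles CE 1 because the compiler wrote it that way.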

The merged control word can be determined by speculative encoding. Merging control words enables a parallel processing architecture using speculative encoding. A two-dimensional (2D) array of computational elements is accessed, where each computational element within the array is known to a compiler and is coupled to its neighboring computational elements within the array. Control for the array of computational elements is provided on a cycle-by-cycle basis, where the control is enabled by a stream of wide, variable length control words generated by the compiler. Two or more operations are merged into a control word, where the control word includes a branch decision and operations associated with the branch decision.

The flow 200 includes suppressing or ignoring one or more operations 210 associated with the branch decision that are not indicated by the branch decision. The suppressing can include terminating execution of the operations, overwriting or deleting the operations, and so on. The ignoring can include allowing the operations in the computational elements to complete, as long as the state of the system known by the compiler is not corrupted. In the flow 200, the suppressing is accomplished dynamically 212, in the sense that the branch decision is dynamic; the "suppression" itself is already encoded in the control word(s) of the taken branch path. The suppressing can be based on the branch decision, where the branch decision can identify the side of the branch not indicated, the not-taken side of the branch, and so on. Any results, such as data generated by the operations associated with the side of the branch not indicated, are unneeded for further execution of the program, processing of the task or subtask, and so on. Rather than having to expend clock cycles, architectural cycles, and so on associated with the array of computational elements in order to flush, overwrite, or delete the unneeded data, no cycles are expended on ignoring the data. Subsequent control words "actively" ignore the results, still in the array, that were generated by the first few control word(s) of the untaken path, both for the control word containing the decision and, typically, for a few control words that follow it. That is, unwanted results are generated during a short "branch shadow" after the decision is made in the array. The branch shadow is the few cycles following the decision within the array, which are spent driving the decision from the array to the control unit and having the control unit act on the decision and potentially change the path of the fetched control word(s). Registers, cache, or other storage associated with the unneeded data can be made available for further processing. Further embodiments can include removing results from the side of the branch not indicated by the branch decision. The unneeded data can be removed from storage, registers, cache, and so on if leaving the data associated with the side of the branch not indicated could cause a race condition, data ambiguity, or some other possible processing conflict.
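By way of illustration only, the "branch shadow" can be sketched in Python as a per-cycle timeline. The function name, the cycle numbering, and the status labels are hypothetical; the sketch assumes that every branch path keeps producing results during the decision cycle and for a fixed number of shadow cycles after it, and that results from paths other than the indicated one are simply marked as ignored.

```python
from typing import List, Tuple


def branch_shadow_results(
    decision_cycle: int,
    shadow_cycles: int,
    indicated: str,
    paths: Tuple[str, ...] = ("taken", "not_taken"),
) -> List[Tuple[int, str, str]]:
    """List (cycle, path, status) for results produced through the branch shadow."""
    results = []
    # The decision cycle itself plus the shadow cycles that follow it.
    for cycle in range(decision_cycle, decision_cycle + shadow_cycles + 1):
        for path in paths:
            status = "used" if path == indicated else "ignored"
            results.append((cycle, path, status))
    return results


timeline = branch_shadow_results(decision_cycle=4, shadow_cycles=2, indicated="taken")
ignored = [entry for entry in timeline if entry[2] == "ignored"]
```

The timeline spends no additional cycles on flushing or deleting: results from the path not indicated appear in the same cycles as the useful results and are simply never consumed.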

In flow 200, suppression prevents speculative branch execution delays 220 because it allows speculative execution of paths that will not be taken. As discussed throughout, speculative branch execution may be based on executing control words, such as speculatively encoded merged control words. A speculatively encoded control word may be based on merging a branch instruction and operations associated with two or more sides of the branch instruction. One of the sides may be taken more often during program execution, may be more likely to be taken, or may otherwise be predicted to be the side of the branch that will be taken. Speculative encoding may include one or more operations from a side of the branch that is less likely to be taken. If a branch decision indicates one of the sides of the branch that is less likely to be taken, some of the operations associated with the control words for the decided branch side are either available for execution or have already been executed in parallel with operations associated with the other branch sides. Additional control words may be fetched, decoded, and provided to the 2D array of computational elements. Accordingly, execution delays can be prevented. In flow 200, suppression enables power reduction 222 in the 2D array of computational elements. Power reduction can be achieved by idling computational elements that would have been assigned to operations that are suppressed once a branch decision is made. Idling computational elements enables power reduction within the 2D array. In flow 200, suppression prevents data from being committed 224. An operation can access data for processing and can produce output data. Output data from processed operations is typically stored within storage associated with the 2D array, such as a cache, register file, memory, etc., or in storage beyond or external to the 2D array. By suppressing operations on the side of a branch that was not indicated, any data resulting from executing the now-suppressed operations can be ignored, deleted, overwritten, or flushed. Therefore, data is not committed by suppressed operations.
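The commit-prevention behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Op` record, the `commit_results` helper, and the operation labels are hypothetical names introduced for illustration and are not part of the disclosed hardware. It shows operations from two branch sides encoded together, with only the results from the indicated side being committed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical record for one speculatively encoded operation."""
    name: str    # illustrative operation label
    side: str    # branch side the operation belongs to
    result: int  # value the operation would write if committed


def commit_results(ops, decided_side):
    """Commit only results from the branch side indicated by the decision."""
    committed = {}
    for op in ops:
        if op.side != decided_side:
            continue  # suppressed: the result is ignored and never committed
        committed[op.name] = op.result
    return committed


# Operations from both sides of a branch, encoded together (assumed example).
merged_word = [Op("B0", "B", 10), Op("C0", "C", 20),
               Op("B1", "B", 11), Op("C1", "C", 21)]
print(commit_results(merged_word, "C"))
```

In this model, suppression costs nothing beyond skipping the write, which mirrors the point that no cycles need to be spent flushing results from the side that was not indicated.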

Figure 3A shows a system block diagram for a highly parallel architecture with a shallow pipeline. The highly parallel architecture may include components such as computational elements, processing elements, buffers, one or more levels of cache storage, system management, arithmetic logic units, etc. The various components can be used to accomplish task processing, where the task processing is associated with program execution. Tasks can be processed on the highly parallel architecture using speculative encoding. A two-dimensional (2D) array of computational elements is accessed, where each computational element in the array is known to the compiler and is coupled to its neighboring computational elements in the array. Control of the array of computational elements is provided on a cycle-by-cycle basis, where the control is enabled by a stream of wide, variable length control words generated by the compiler. Two or more operations are merged into a control word, where the control word includes a branch decision and operations associated with the branch decision.

A system block diagram 300 is shown for a highly parallel architecture with a shallow pipeline. The system block diagram may include a computational element array 310. The computational element array 310 may be based on computational elements, where the computational elements may include processors, central processing units (CPUs), graphics processing units (GPUs), coprocessors, etc. The computational elements may be based on processing cores configured within chips such as application specific integrated circuits (ASICs), processing cores programmed into programmable chips such as field programmable gate arrays (FPGAs), etc. The computational elements may comprise a homogeneous array of computational elements. System block diagram 300 may include translation lookaside buffers, such as translation lookaside buffers 312 and 338. The translation lookaside buffers can supplement the memory caches, which can be used to reduce storage access times. The system block diagram may include logic for load and access ordering and selection. The logic for load and access ordering and selection may include logic 314 and logic 340. Logic 314 and 340 can accomplish load and access ordering and selection for the bottom data blocks (316, 318, and 320) and the top data blocks (342, 344, and 346), respectively. This layout technique can double the access bandwidth, reduce interconnect complexity, etc. Logic 340 may be coupled to computational element array 310 through the queues, address generators, and multiplier units 347 component. In the same way, logic 314 may be coupled to computational element array 310 through the queues, address generators, and multiplier units 317 component.

The system block diagram may include access queues. The access queues may include access queues 316 and 342. The access queues can be used to queue requests to access caches, storage, etc., for storing data and loading data. The system block diagram may include level 1 (L1) data caches, such as L1 caches 318 and 344. The L1 caches can be used to store blocks of data, such as data to be processed together, data to be processed sequentially, etc. An L1 cache may include a small, fast memory that is quickly accessible by the computational elements and other components. The system block diagram may include level 2 (L2) data caches. The L2 caches may include L2 caches 320 and 346. An L2 cache may include larger, slower storage compared with an L1 cache. The L2 caches can store "next up" data, results such as intermediate results, etc. The L1 and L2 caches can further be coupled to level 3 (L3) caches. The L3 caches may include L3 caches 322 and 348. An L3 cache may be larger than the L2 and L1 caches and may include slower storage. Accessing data from an L3 cache is still faster than accessing main storage. In embodiments, the L1, L2, and L3 caches may include 4-way set associative caches.
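The 4-way set associative organization mentioned for the caches can be sketched as follows. This is a minimal model under stated assumptions: the class name, the set count, the line size, and the least-recently-used (LRU) replacement policy are all illustrative choices; the description specifies only the 4-way associativity and does not name a replacement policy.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set associative cache; LRU replacement is an assumption."""

    def __init__(self, num_sets=4, ways=4, line_bytes=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One ordered map per set: tag -> present, oldest entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = address // self.line_bytes
        index = line % self.num_sets   # which set the line maps to
        tag = line // self.num_sets    # identifies the line within the set
        cache_set = self.sets[index]
        if tag in cache_set:
            cache_set.move_to_end(tag)  # mark as most recently used
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # evict the least recently used way
        cache_set[tag] = True
        return False
```

With four ways, four distinct lines that map to the same set can reside in the cache at once; a fifth forces an eviction.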

Block diagram 300 may include a system management buffer 324. The system management buffer can be used to store system management code or control words that can be used to control the array 310 of computational elements. The system management buffer can be used to hold opcodes, codes, routines, functions, etc., which can be used for exception or error handling, management of the parallel architecture for processing tasks, etc. The system management buffer may be coupled to a decompressor 326. The decompressor can be used to decompress system management compressed control words (CCWs) from the system management compressed control word buffer 328 and can store the decompressed system management control words in the system management buffer 324. The compressed system management control words can require less storage than the uncompressed control words. The system management CCW component 328 may also include a spill buffer. The spill buffer may comprise a large static random access memory (SRAM) that can be used to support multiple nested levels of exceptions.

The computational elements within the array of computational elements may be controlled by a control unit, such as control unit 330. While the compiler, through the control word, controls the individual elements, the control unit can pause the array to ensure that new control words are not driven into the array. The control unit may receive a decompressed control word from decompressor 332. The decompressor can decompress a control word (discussed below) to enable or idle rows or columns of computational elements, to enable or idle individual computational elements, to transmit control words to individual computational elements, etc. The decompressor may be coupled to a compressed control word store, such as compressed control word cache 1 (CCWC1) 334. CCWC1 may include a cache, such as an L1 cache, that holds one or more compressed control words. CCWC1 may be coupled to a further compressed control word store, such as compressed control word cache 2 (CCWC2) 336. CCWC2 can be used as an L2 cache for compressed control words. CCWC2 can be larger and slower than CCWC1. In embodiments, CCWC1 and CCWC2 may be 4-way set associative. In embodiments, the CCWC1 cache may contain decompressed control words, in which case it may be designated DCWC1. In that case, decompressor 332 may be coupled between CCWC1 334 (now DCWC1) and CCWC2 336.

Figure 3B shows computational element array detail 302. A computational element array can be coupled to components that enable the computational elements to process one or more tasks. The components can access and provide data, perform specific high-speed operations, and so on. The computational element array and its associated components enable a parallel processing architecture using speculative encoding. The computational element array 350 can perform a variety of processing tasks, where the processing tasks may include operations such as arithmetic, vector, or matrix operations. The computational elements may be coupled to multiplier units, such as bottom multiplier units 352 and top multiplier units 354. The multiplier units can be used to perform high-speed multiplications associated with general processing tasks, multiplications associated with neural networks such as deep learning networks, multiplications associated with vector operations, etc. The computational elements may be coupled to load queues, such as load queues 356 and load queues 358. The load queues can be coupled to the L1 data caches as previously described. The load queues can be used to queue storage access requests from the computational elements. The load queues can track expected load latencies and can notify the control unit if a load latency exceeds a threshold. Notifying the control unit can signal that a load may not arrive within the expected time frame. The load queues can also be used to pause the array of computational elements. The load queues can send a pause request to the control unit, which pauses the entire array, while individual elements can be idled under control of the control word. When an element is not explicitly controlled, it can be placed in an idle (or low-power) state. No operation is performed, but the ring buses can continue to operate in a "pass thru" mode so that the rest of the array operates properly. A computational element that is used only to route data unchanged through its ALU, that is, one acting as a routing element, is still considered active.
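The interaction between a load queue and the control unit can be modeled in a few lines. This is an illustrative sketch only: the class names, the method names, and the rule that the array remains paused while any load is overdue are assumptions made for the example; the description states only that load queues track expected latencies, notify the control unit when a threshold is exceeded, and can request a pause of the whole array.

```python
class LoadQueue:
    """Tracks outstanding loads and reports those exceeding a latency threshold."""

    def __init__(self, latency_threshold):
        self.latency_threshold = latency_threshold
        self.pending = {}  # load id -> cycle at which the load was issued

    def issue(self, load_id, cycle):
        self.pending[load_id] = cycle

    def complete(self, load_id):
        self.pending.pop(load_id, None)

    def late_loads(self, cycle):
        """Loads whose latency so far exceeds the threshold."""
        return sorted(load_id for load_id, issued in self.pending.items()
                      if cycle - issued > self.latency_threshold)


class ControlUnit:
    """Pauses the whole array while any load is overdue (assumed policy)."""

    def __init__(self):
        self.paused = False

    def tick(self, queue, cycle):
        self.paused = bool(queue.late_loads(cycle))
        return self.paused
```

A short usage run: a load issued at cycle 0 with a threshold of 3 leaves the array running through cycle 3, triggers a pause at cycle 4, and releases the pause once the load completes.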

While the array is paused, background loading of the array from the memories (data and control word) can be performed. The memory systems can be free running and can continue to operate while the array is paused. Because multi-cycle latency can occur due to control signal transport, which results in additional "dead time", it can be beneficial to allow the memory system to "reach into" the array and deliver load data to the appropriate scratchpad memories while the array is paused. This mechanism can operate such that the array state is known, as far as the compiler is concerned. When array operation resumes after a pause, the new load data will have arrived at a scratchpad, as required for the compiler to maintain the statically scheduled model.

Figure 4 illustrates branches in code. Code, programs, applications, apps, etc. can contain one or more branches. The branches may include conditional branches, which can be based on the value of a variable, flag, signal, etc., and unconditional branches, which can transfer control, perform a certain number of operations, etc. One or more operations may be associated with each branch. In embodiments, two or more operations may be merged into a control word, where the control word may include a branch decision and operations associated with the branch decision. The merged control word can control a parallel processing architecture using speculative encoding. Diagram 400 illustrates seven segments within an execution tree. The seven segments include segment A 410, segment B 412, segment C 414, segment D 416, segment E 418, segment F 420, and segment G 422. Each of the seven segments may include operations, such as the four operations shown for each segment. The determination of which of the seven segments will be executed can be based on branch points, such as the three branch points shown: branch point 424, branch point 426, and branch point 428. Based on the branch decisions made at the branch points, there are four potential paths that can be taken through the seven segments of the tree. The paths may include ABD, ABE, ACF, and ACG. In the figure, the path ACG taken through the tree is indicated as path 430. Execution of the operations associated with the segments of tree 400 may be based on processing cycles. In example 400, the operations shown may be executed in 13 cycles.
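The four potential paths through the seven-segment tree can be enumerated mechanically. The sketch below assumes the tree shape implied by the listed paths (A branches to B or C, B to D or E, C to F or G); the `TREE` table and both helper functions are hypothetical names used only for illustration.

```python
# Execution tree implied by the paths ABD, ABE, ACF, and ACG (assumed shape).
TREE = {"A": ("B", "C"), "B": ("D", "E"), "C": ("F", "G")}


def paths(segment="A"):
    """Enumerate every root-to-leaf path starting at the given segment."""
    children = TREE.get(segment)
    if not children:
        return [segment]
    return [segment + tail for child in children for tail in paths(child)]


def follow(decisions):
    """Walk the tree from A, choosing child 0 or 1 at each branch point."""
    segment = path = "A"
    for pick in decisions:
        segment = TREE[segment][pick]
        path += segment
    return path


print(paths())         # the four potential paths
print(follow([1, 1]))  # the path taken when both decisions pick the second side
```

Although the tree has three branch points, any single path evaluates only two of them, because the first decision removes one of the two downstream branch points from consideration.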

Figure 5A shows code blocks merged by compiler 500. As discussed throughout, program execution may be based on branch operations, and the branch operations may include conditional branches or unconditional branches. A conditional branch can be based on a value, flag, signal, etc., and can be described using a variety of techniques such as an if-then-else technique, a case or switch technique, etc. An unconditional branch may include an "always taken" branch that can transfer control, order of operations, etc. The merging may be based on speculative encoding, where the merged operations can be executed on a parallel processing architecture. A merged control word may include two or more operations. In embodiments, the merged control word may further include speculatively encoded operations for at least two possible branch paths. For two or more operations associated with branch paths to be merged into a control word, various conditions may apply. In embodiments, the at least two possible branch paths may generate independent side effects. The side effects may include program execution stalls, increased functional density of the compressed control words, etc. A program execution stall may occur while waiting for requested data, while a compressed control word is being fetched and decompressed, etc.

In other embodiments, the at least two possible branch paths may generate computational element actions that must be committed. The computational element actions that must be committed may include writing to storage, where the storage may be external to the array of computational elements. For the purposes of this example, a commit write may include writing data that is used by a downstream operation, data that may be provided to other computational elements within the array, and so on. A commit write may include an indication such as ready, data valid, data complete, etc. Because which side of a branch will be taken is not known a priori, writing data before determining which side of the branch is taken can present a race condition, provide invalid data, etc. A commit write can store data in one or more registers, a register file, a cache, storage, etc. In embodiments, the commit write may include a commit write to data storage. The data storage may be located within the array of elements, coupled to the array, accessible by the array through a network such as a computer network, etc. In embodiments, the data storage resides outside of the computational element associated with the branch. Examples of code blocks that can be merged by the compiler are shown. The merged blocks may include block 510, block 512, block 514, block 516, block 518, block 520, etc. A merged control block may include one or more operations. In embodiments, the merged control word may be a single control word. The operations within a control block can be executed sequentially. In further embodiments, operations associated with the at least two possible branch paths may be executed in parallel by the 2D array of computational elements.

Furthermore, the merging (coalescing) may span multiple control words, including one or more branch decisions as well as subsequent control words that control subsequent operations, which may or may not be executed (or whose results may be ignored) depending on which branch path is selected. For example, block 516 may include control word 8, which includes branch decision paths D, E, F, and G, and control word 9, which includes subsequent operations for each of the possible branch decision paths; these operations may be referred to as the branch shadow. Thus, the merging may include speculative encoding of a control word that includes a branch decision, together with one or more additional control words. The one or more additional control words can control operations that follow the control word containing the branch decision.

Figure 5B shows a programming loop merged by a compiler. Branch decisions that can be made as part of program execution may support a programming loop 562. Programming loop 562 may be controlled by a branch such as a conditional branch, execution of a subroutine or procedure, etc. To support the programming loop, operations from the end of the loop and the beginning of the loop can be merged into a control word. A control word that contains operations from both the end of the loop and the beginning of the loop is enabled by a parallel processing architecture using speculative encoding. The compiler is used to merge the code words represented by tree 502. The tree includes seven segments: segment A, segment B, segment C, segment D, segment E, segment F, and segment G. Merged control word 550 includes operations associated with the beginning of the loop and the end of the loop. The beginning of the loop is known a priori to include operations from segment A, but which other operations will be executed is not known until the branch decisions are made. Accordingly, merged control word 556 includes operations from segments D, E, F, and G. Other operations from the seven segments that can be merged into control words may be included in control words 552, 554, 556, 558, and 560.

Figure 6 shows a compiler view of a code image. As previously discussed, multiple operations can be merged into a control word. The merging may include speculatively encoded operations for at least two possible branch paths. Because the merging can support parallel execution of sets of operations, the number of cycles required to execute the operations associated with a path through the branches of a program tree can be reduced. In example 600, the number of cycles needed to proceed along a path through the tree may be reduced to 7 cycles. The compiler view of the code image can be enabled by a parallel processing architecture using speculative encoding. A two-dimensional (2D) array of computational elements is accessed, where each computational element in the array is known to the compiler and is coupled to its neighboring computational elements in the array. Control of the array of computational elements is provided on a cycle-by-cycle basis, where the control is enabled by a stream of wide, variable length control words generated by the compiler. Two or more operations are merged into a control word, where the control word includes a branch decision and operations associated with the branch decision.

A compiler view of a code image is shown, based on operations that can be merged into control words. The operations within the control words can be executed sequentially or in parallel on a cycle-by-cycle basis. In the example shown, operations A0 and A1 may be executed in parallel during a first cycle, and operation A2 may be executed during a second cycle. Recall that a merged control word may include a branch decision (e.g., which side of the branch is taken) and operations associated with the branch decision. In the figure, there are three branch decisions, shown as operations A4, B4, and C4. The three branch decisions may include conditional branches. An upstream branch decision, which can be determined in a previous cycle, can determine which branch decision is evaluated in a subsequent cycle. In a usage example, a branch decision at A4 that selects branch B eliminates the need to evaluate the branch decision at C4, because branch C has been ruled out by the branch decision at A4.

Figure 7 shows suppressed operations for an expected branch path 700. Branch decisions, and the merging of branch decisions with operations, are described throughout. Further, a branch outcome determined in a previous cycle (e.g., which side of the branch is taken) can determine which branch decision is selected in a subsequent cycle. In part, this "determinism" can be attributed to the one-cycle latency of branch decision information leaving the array of computational elements. The operations for the expected branch path are enabled by a parallel processing architecture using speculative encoding. Recall that a branch decision, which is used to determine the set of operations to perform and the set of data on which the operations act, can be injected using a multiplexer (MUX). Consequently, a compressed control word (CCW) may include one or more bits or fields that can be used as input select lines for the MUX. The input select lines can select the computational element from which a potential branch decision signal is received. In embodiments, two or more potential branches may be supported per cycle. Based on the branch decision, the one or more MUX input select lines, etc., operations associated with the taken side of the branch are processed or executed, while operations associated with the one or more untaken sides of the branch are not. Operations associated with the one or more untaken sides of the branch can be ignored, deleted, flushed, etc.
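The role of the MUX select field can be sketched as below. The field position and width (`SELECT_SHIFT`, `SELECT_BITS`), the integer encoding of a control word, and the helper names are all hypothetical; the description says only that a compressed control word may carry one or more bits or fields used as MUX input select lines.

```python
# Assumed layout: the low two bits of the control word form the select field.
SELECT_SHIFT = 0
SELECT_BITS = 2


def select_field(control_word: int) -> int:
    """Extract the MUX input select value from the control word."""
    return (control_word >> SELECT_SHIFT) & ((1 << SELECT_BITS) - 1)


def branch_decision(control_word: int, element_signals):
    """Pick the decision signal of the computational element named by the field."""
    return element_signals[select_field(control_word)]


# One potential decision signal per candidate computational element.
signals = [False, True, False, False]
print(branch_decision(0b1101, signals))  # select field is 0b01, so element 1
```

Because the select value travels inside the control word, the compiler, rather than the hardware, determines which element's result steers each branch in this model.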

Additional embodiments include suppressing one or more operations associated with the branch decision that are not indicated by the branch decision. The suppression may include halting execution of the operations, stopping the operations, etc. In embodiments, the suppression is accomplished dynamically. Dynamic suppression may be based on a current branch decision, a previous branch decision, etc. In embodiments, the suppression may prevent speculative branch execution delays. A delay may include waiting for fetches of compressed control words, fetches of data, etc., to complete. Suppressing operations associated with the one or more branch sides that were not indicated can reduce the number of operations performed during a given cycle, reduce data requests, etc. In embodiments, the suppression may enable power reduction in the 2D array of computational elements. Power reduction can be achieved using techniques such as idling one or more processing elements that are not needed for processing operations during a given cycle. In other embodiments, the suppression may prevent data from being committed. Committing data may include writing data from the 2D array of computational elements to storage, such as external storage.

In a usage example, possible branch decisions A4 and either B4 or C4 may be made while performing operations associated with a compiler view of the code image. Branch or path C may be the predicted branch or path. The branch decision at A4 may be to proceed along branch C, so operations associated with branches B, D, and E may be suppressed. That is, the operations B3, B4, C1, D1, D2, E2, D3, D4, E3, E4, F3, and F4 may be suppressed. Other operations associated with branches D, C, and E may be performed as part of speculative encoding. Further, no branch decision will be required at B4. Suppression of the operations associated with B, D, and E may be accomplished by idling the computational elements that would have been assigned to perform those operations. Idling computational elements may enable power reduction in the 2D array.
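The suppression described above can be sketched in a few lines of Python. This is only an illustrative model, not the disclosed hardware: the `Operation` class, the `apply_branch_decision` function, and the operation labels are hypothetical and do not correspond to the labels in the figures. Each operation in a merged control word is tagged with the branch path it belongs to, and operations on paths not indicated by the branch decision are suppressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str  # hypothetical label, not a label from the figures
    path: str  # branch path the operation belongs to; "" means unconditional


def apply_branch_decision(ops, taken_path):
    """Split the operations of one merged control word into executed and suppressed."""
    executed, suppressed = [], []
    for op in ops:
        if op.path in ("", taken_path):
            executed.append(op.name)
        else:
            # Not indicated by the branch decision: the element that would
            # have run this operation can be idled instead.
            suppressed.append(op.name)
    return executed, suppressed


# A merged control word holding speculatively encoded operations for two paths.
control_word = [
    Operation("decide", ""),
    Operation("b_op", "B"),
    Operation("c_op", "C"),
]
executed, suppressed = apply_branch_decision(control_word, "C")
print(executed, suppressed)
```

In this toy model the suppressed list is what would translate into idled computational elements, and therefore into the power reduction mentioned in the text.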

Figure 8 shows a compressed code word fetch and decompression pipeline. Program execution may be based on control words, such as compressed control words. The compressed control words may be based on merged operations and may include a branch decision and one or more operations associated with the branch decision. The compressed control words may be fetched from storage and decompressed before the decompressed control word or words are provided to the 2D array of computational elements. Fetching and decompressing compressed control words enables a parallel processing architecture using speculative encoding. A two-dimensional (2D) array of computational elements is accessed. Control for the array of computational elements is provided on a cycle-by-cycle basis, where the control is enabled by a stream of wide, variable length control words generated by the compiler. Two or more operations are merged into a control word, where the control word includes a branch decision and operations associated with the branch decision.

A compressed code word fetch and decompression pipeline is shown (800). The operations performed within the pipeline are shown along with the cycles in which the various operations may be performed. Operations in the pipeline may begin by driving a branch decision out of the array. Referring to the previously discussed examples, the branch decision may include selecting the B branch and the D or E branch, the C branch and the F or G branch, etc. Based on the branch decision, one or more fetch cycles may be executed. In the example shown, four fetch cycles are performed. The fetch cycles may be used to fetch one or more compressed control words associated with the branch decision. One or more decompression cycles may then be performed to decompress the one or more compressed control words associated with the taken branch. The decompressed control word is provided, or driven, into the array of computational elements. The decompressed control word may be executed by the 2D array of computational elements.

Figure 9 shows code word encoding and a naive demand-driven fetch pipeline overlay. In the absence of speculative encoding in the encoded control word stream, a fetch for a non-predicted branch can cause a significant delay in program execution. An example of a naive demand-driven fetch, such as a fetch for a non-predicted branch, is shown at 900. Code word encodings 910 are shown along with the cycles associated with a code word demand fetch 920. The code word demand fetch may be based on the branch decision at A4. In a usage example based on the previously described code trees and fetch and decode pipeline, the branch decision at A4 may occur during cycle 3. Control words associated with the predicted or typical branch decision, which proceeds along branch C, are shown (912).
However, if the branch decision at A4 is to proceed along branch B, a fetch cycle may be initiated for the non-predicted branch. The control words shown at 912 may be suppressed, where the suppressing may include idling computational elements to reduce power consumption by the array. The fetch initiated for branch B would include driving the branch decision out of the array, four fetch cycles, two decode cycles, and driving the decompressed control word for the taken branch into the array. Execution of the decompressed control word may then begin with B1 (cycle 12). Execution of operation B1 during the twelfth cycle is predicated on the compressed control word for branch B being available in an L1 cache accessible to the 2D array of computational elements. If the compressed control word is not available in the L1 cache, additional delays can be expected. In embodiments, operations that may be performed can be speculatively encoded into subsequent compressed control words. The operations may be associated with taken paths and untaken paths. Speculative encoding of compressed control words may mitigate the generation of side effects, such as computational element actions that may require a commit. Committing may include writing data to storage external to the 2D array.
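The cycle count in this example can be checked with a short calculation. The sketch below is a rough model under stated assumptions: the four fetch cycles and two decode cycles come from the text, while the one-cycle durations for driving the decision out of the array and driving the control word into the array, the function name `first_execute_cycle`, and the `l1_miss_penalty` parameter are assumptions made for illustration.

```python
# Stage lengths for a naive demand-driven fetch on a non-predicted branch.
DRIVE_OUT_CYCLES = 1  # assumption: one cycle to drive the decision out of the array
FETCH_CYCLES = 4      # from the text: four fetch cycles
DECODE_CYCLES = 2     # from the text: two decode cycles
DRIVE_IN_CYCLES = 1   # assumption: one cycle to drive the control word into the array


def first_execute_cycle(decision_cycle, l1_miss_penalty=0):
    """Return the cycle in which the first operation of the new branch can execute.

    l1_miss_penalty is a hypothetical count of extra cycles added when the
    compressed control word is not already in the L1 cache.
    """
    last_pipeline_cycle = (
        decision_cycle
        + DRIVE_OUT_CYCLES
        + FETCH_CYCLES
        + l1_miss_penalty
        + DECODE_CYCLES
        + DRIVE_IN_CYCLES
    )
    # Execution starts in the cycle after the control word reaches the array.
    return last_pipeline_cycle + 1


print(first_execute_cycle(3))                     # decision in cycle 3, L1 hit
print(first_execute_cycle(3, l1_miss_penalty=5))  # same decision with an assumed L1 miss
```

Under these assumptions a decision in cycle 3 leads to B1 executing in cycle 12, which matches the figure described in the text. Any L1 miss pushes that cycle later by the miss penalty.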

Figure 10 shows a compiler coarse branch prefetch hint. As previously discussed, compiled code may be represented as the branches of a tree, where one or more operations may be associated with each branch of the tree. Whether the operations of a given branch are executed may depend on a branch decision. The branch decision may be determined based on a value, an expression, etc. Two or more operations may be merged into a control word, where the control word may include a branch decision and operations associated with the branch decision. As the number of branches in the tree increases, the width of the possible execution cone may also increase. That is, the number of branch decisions and the number of operations associated with each branch decision also increase, resulting in larger single compressed control words. The sizes of the compressed control words can be reduced by allowing the stream of compressed control words to diverge into separate streams based on the actual branch decisions made and the branches executed.
However, if the fetch of the new compressed control words is not successfully anticipated, the 2D array of computational elements may stall while the new control word stream is fetched. A compiler-driven prefetch hint within the stream of compressed control words may be used to fetch and decode the start of a likely new decompressed control word stream. The compiler-driven prefetch hint enables a parallel processing architecture using speculative encoding. The fetching and decoding may be accomplished using a 2-read, 1-write (2R1W) level 1 (L1) compressed control word store (e.g., an L1 cache) and an additional decompressor.

An example execution cone and compiler-driven prefetch hint are shown (1000). The execution cone 1010 may include a branch decision. The branch decision may cause execution to process operations based on a first side of the branch 1012 or a second side of the branch 1014. Execution of the first side of the branch 1012 may be based on the expected branch decision, but a compiler-driven prefetch hint 1020 may be introduced into the compressed control word stream to prefetch the compressed control words associated with the second side of the branch 1014. Thus, when the branch decision in the cone 1010 is determined, execution of the decompressed control words for either the first side 1012 or the second side 1014 may begin without stalling the 2D array of computational elements.
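The benefit of the prefetch hint can be illustrated with a toy latency model. This is a simplification under assumptions, not the disclosed mechanism: the function `stall_cycles` and its parameters are hypothetical, and the model assumes the fetch and decode of the alternate stream start when the hint is seen and overlap fully with execution of the expected side.

```python
def stall_cycles(fetch_decode_latency, hint_lead_cycles=None):
    """Cycles the array waits for the non-expected side of a branch.

    fetch_decode_latency: cycles needed to fetch and decompress the new stream.
    hint_lead_cycles: how many cycles before the branch decision the
        compiler-inserted hint appears, or None when no hint is present.
    """
    if hint_lead_cycles is None:
        # Naive demand-driven fetch: the whole latency is exposed as a stall.
        return fetch_decode_latency
    # With a hint, only the part of the latency not yet hidden remains.
    return max(0, fetch_decode_latency - hint_lead_cycles)


print(stall_cycles(8))                       # no hint
print(stall_cycles(8, hint_lead_cycles=5))   # hint placed too late to hide everything
print(stall_cycles(8, hint_lead_cycles=10))  # hint placed early enough
```

In this model the stall disappears once the hint is placed at least as many cycles ahead of the decision as the fetch and decode take, which is the situation the paragraph above describes.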

Figure 11 shows a system block diagram for compiler interactions. As discussed throughout, the computational elements within the 2D array are known to a compiler, which can compile tasks and subtasks for execution on the array. The compiled tasks and subtasks are executed to accomplish task processing. A variety of interactions, such as placement of tasks, routing of data, etc., may be associated with the compiler. The compiler interactions enable a parallel processing architecture using speculative encoding. A two-dimensional (2D) array of computational elements is accessed. Each computational element within the array of computational elements is known to the compiler and is coupled to its neighboring computational elements within the array. Control for the array of computational elements is provided on a cycle-by-cycle basis. The control is enabled by a stream of wide, variable length control words generated by the compiler. The control may include a two-or-more-way branch operation. Two or more operations are merged into a control word. The control word includes a branch decision and operations associated with the branch decision.

The system block diagram 1100 includes a compiler 1110. The compiler may include a high-level compiler, such as a C, C++, Python, or similar compiler. The compiler may include a compiler implemented for a hardware description language, such as a VHDL™ or Verilog™ compiler. The compiler may include a compiler for a portable, language-independent intermediate representation, such as a low-level virtual machine (LLVM) intermediate representation (IR). The compiler may generate a set of directions that can be provided to the computational elements and other elements within the array. The compiler may be used to compile tasks 1120. The tasks may include a plurality of tasks associated with a processing task. The tasks may further include a plurality of subtasks. The tasks may be based on an application, such as a video processing or audio processing application. In embodiments, the tasks may be associated with machine learning functionality.
The compiler may generate directions 1130 for handling computational element results. The computational element results may include results derived from arithmetic, vector, array, and matrix operations; Boolean operations; etc. In embodiments, the computational element results are generated in parallel in the array of computational elements. Parallel results may be generated by the computational elements when the elements can share input data, use independent data, etc. The compiler may generate a set of directions that controls data movement 1132 for the array of computational elements. The control of data movement may include movement of data to, from, and among computational elements within the array. The control of data movement may include loading and storing data, such as temporary data storage, during data movement. In other embodiments, the data movement may include intra-array data movement.

As with a general-purpose compiler used to generate tasks and subtasks for execution on one or more processors, the compiler may provide directions for handling the tasks and subtasks, input data, intermediate and result data, etc. The compiler may further generate directions for configuring the computational elements, storage elements, control units, ALUs, etc. associated with the array. As previously discussed, the compiler generates directions for data handling to support the task handling. In the system block diagram, the data movement may include loads and stores 1140 with a memory array. The loads and stores may include handling various data types such as integer, real or float, double-precision, character, and other data types. The loads and stores may load and store data in local storage such as registers, register files, caches, etc. The caches may include one or more levels of cache such as a level 1 (L1) cache, a level 2 (L2) cache, a level 3 (L3) cache, etc. The loads and stores may also be associated with storage such as shared memory, distributed memory, etc.
In addition to the loads and stores, the compiler may handle other memory and storage management operations, including memory priorities. In the system block diagram, the memory access priority may enable ordering of memory data 1142. The memory data may be ordered based on task data requirements, subtask data requirements, etc. The memory data ordering may enable parallel execution of tasks and subtasks.

In the system block diagram 1100, the ordering of memory data may enable computational element result sequencing 1144. For task processing to be accomplished successfully, the tasks and subtasks must be executed in an order that accommodates task priority, task precedence, the schedule of operations, etc. The memory data may be ordered so that the data required by the tasks and subtasks is available for processing when the tasks and subtasks are scheduled to be executed. The results of the processing of the data by the tasks and subtasks may therefore be ordered to optimize task execution, to reduce or eliminate memory contention conflicts, etc. The system block diagram includes enabling simultaneous execution 1146 of two or more potential compiled task outcomes based on the set of directions. The code compiled by the compiler may include branch points, where the branch points may include computations or flow control. Flow control transfers instruction execution to a different sequence of instructions. Because the result of a branch decision, for example, is not known a priori, the sequences of instructions associated with the two or more potential task outcomes may be fetched, and each sequence of instructions may begin execution.
When the correct result of the branch is determined, the sequence of instructions associated with the correct branch result continues execution, while the branches not taken are halted and the associated instructions are flushed. In embodiments, the two or more potential compiled outcomes may be executed on spatially separate computational elements within the array of computational elements.
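The idea of starting both potential outcomes and keeping only the correct one can be sketched as follows. This is a software analogy under assumptions, not the array's actual behavior: the function `run_speculatively`, the path names, and the use of private result buffers are all hypothetical, and the paths run one after another here rather than on spatially separate elements.

```python
def run_speculatively(paths, value, decide):
    """Run every candidate path, then keep only the result of the taken one.

    paths: mapping of path name to a side-effect-free function of the input.
    decide: function returning the name of the path the branch actually takes.
    """
    # Each path writes into its own private buffer, so nothing is committed yet.
    pending = {name: fn(value) for name, fn in paths.items()}
    taken = decide(value)
    committed = pending[taken]  # only the correct outcome is committed
    flushed = sorted(name for name in pending if name != taken)
    return taken, committed, flushed


paths = {
    "then": lambda x: x + 1,
    "else": lambda x: x - 1,
}
result = run_speculatively(paths, 5, lambda x: "then" if x > 0 else "else")
print(result)
```

Keeping results in private buffers until the decision is known mirrors the requirement that the candidate paths produce independent side effects and that uncommitted results from untaken paths can simply be discarded.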

The system block diagram includes computational element idling 1148. In embodiments, the set of directions from the compiler may idle an unneeded computational element within a row of computational elements located in the array of computational elements. Not all of the computational elements may be needed for processing, depending on the tasks, subtasks, etc. being processed. The computational elements may not be needed simply because there are fewer tasks to execute than there are computational elements available within the array. In embodiments, the idling may be controlled by a single bit in the control word generated by the compiler. In the system block diagram, computational elements within the array may be configured for various computational element functionalities 1150. The computational element functionality may enable various types of compute architectures, processing configurations, etc.
In embodiments, the set of directions may enable machine learning functionality. The machine learning functionality may be trained to process various types of data such as image data, audio data, medical data, etc. In embodiments, the machine learning functionality may include a neural network implementation. The neural network may include a convolutional neural network, a recurrent neural network, a deep learning network, etc. The system block diagram may include computational element placement, results routing, and computation wave-front propagation 1152 within the array of computational elements. The compiler may generate directions or instructions that place tasks and subtasks on computational elements within the array. The placement may include placing tasks and subtasks based on data dependencies between or among the tasks or subtasks, placing tasks to avoid memory conflicts or communication conflicts, etc. The directions may also enable computation wave-front propagation. The compiler may direct a 2D wave-front execution flow in the array so as to foster computation wave-front propagation that supports multiple threads of execution exchanging results or operands by intersecting in both 2D space and time.
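The single-bit idle control mentioned above can be pictured with a small decoding sketch. The bit layout is an assumption made purely for illustration: the text states only that idling may be controlled by a single bit in the control word, so the per-element mask, the function `active_elements`, and the bit ordering below are hypothetical.

```python
def active_elements(idle_mask, num_elements):
    """Return the indices of the elements in a row that are not idled.

    idle_mask: integer in which bit i set to 1 means element i is idle
        (hypothetical layout, one idle bit per element of the row).
    """
    return [i for i in range(num_elements) if not (idle_mask >> i) & 1]


# Row of four elements where elements 1 and 2 are not needed this cycle.
row_active = active_elements(0b0110, 4)
print(row_active)
```

A compiler that knows there are fewer tasks than elements could set the idle bits for the unused elements in each control word, which is one way the power reduction described in the text could be expressed.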

In the system block diagram, the compiler may control architectural cycles 1160. An architectural cycle may include an abstract cycle that is associated with the elements within the array of elements.
The elements of the array may include computational elements, storage elements, control elements, ALUs, etc. An architectural cycle may include an "abstract" cycle, where an abstract cycle may refer to a variety of architecture-level operations such as a load cycle, an execute cycle, a write cycle, etc. The architectural cycles may refer to macro-operations of the architecture rather than to low-level operations. One or more architectural cycles are controlled by the compiler. An architectural cycle is a cycle controlled by a single control word. When the array is paused, the memory system continues to run based on "wall clock" time (or "wall clock" cycles). An architectural cycle is therefore not the same as a wall clock cycle. The closer the number of architectural cycles is to the number of wall clock cycles, the better the architectural efficiency of the system, which includes both compiler and hardware efficiency. Execution of an architectural cycle may depend on two or more conditions. In embodiments, an architectural cycle may occur when a control word is available to be pipelined into the array of computational elements and when all data dependencies are met. That is, the array of computational elements does not have to wait until dependent data is loaded or the entire memory queue is cleared. In the system block diagram, an architectural cycle may include one or more physical cycles 1162. A physical cycle may refer to one or more cycles at the element level required to implement a load, an execute, a write, etc. In embodiments, the set of directions may control the array of computational elements on a physical cycle-by-cycle basis. The physical cycles may be based on a clock, such as a local, module, or system clock, or on some other timing or synchronization technique. In embodiments, the physical cycle-by-cycle basis may include an architectural cycle. The physical cycles may be based on an enable signal for each element of the array of elements, while the architectural cycle may be based on a global, architectural signal.
In embodiments, the compiler may provide, via the control word, valid bits for each column of the array of computational elements on a cycle-by-cycle basis. A valid bit may indicate that data is valid and ready for processing, that an address such as a jump address is valid, etc. In embodiments, the valid bits may indicate that a valid memory load access is emerging from the array. The valid memory load access from the array may be used to access data within a memory or storage element. In other embodiments, the compiler may provide, via the control word, operand size information for each column of the array of computational elements. The operand size is used to determine how many load operations may be required to obtain the data, since an operand may span multiple data banks or multiple cache lines and thus require multiple accesses. Various operand sizes may be used. In embodiments, the operand size may include bytes, half-words, words, and double-words.
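Two of the quantities discussed above lend themselves to a short worked example: the number of accesses an operand needs when it spans cache lines, and the ratio of architectural cycles to wall clock cycles. The sketch below rests on assumptions that are not in the text: the byte widths assigned to each operand size, the 64-byte line size, and the function names are all illustrative.

```python
# Assumed operand widths in bytes; the text names the sizes but not their widths.
OPERAND_BYTES = {"byte": 1, "half-word": 2, "word": 4, "double-word": 8}


def loads_required(address, operand, line_bytes=64):
    """Count the cache-line accesses needed to read one operand.

    An operand that straddles a line boundary needs one access per line touched.
    """
    size = OPERAND_BYTES[operand]
    first_line = address // line_bytes
    last_line = (address + size - 1) // line_bytes
    return last_line - first_line + 1


def architectural_efficiency(architectural_cycles, wall_clock_cycles):
    """Ratio of architectural cycles to wall clock cycles; 1.0 means no stalls."""
    return architectural_cycles / wall_clock_cycles


print(loads_required(0, "word"))          # aligned word inside one line
print(loads_required(62, "word"))         # word crossing the boundary at byte 64
print(architectural_efficiency(50, 100))  # array advanced in half of the wall clock cycles
```

Knowing the operand size ahead of time lets the number of accesses be determined before the load is issued, which is the use of operand size information the text describes.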

Figure 12 is a system diagram for parallel processing. The parallel processing is performed on a parallel processing architecture that uses speculative encoding. System 1200 may include one or more processors 1210 attached to a memory 1212 that stores instructions. System 1200 may further include a display 1214 coupled to the one or more processors 1210 for displaying data; intermediate steps; directions; control words; compressed control words; control words implementing Very Long Instruction Word (VLIW) functionality; topologies including systolic, vector, cyclic, spatial, streaming, or VLIW topologies; and so on. In embodiments, the one or more processors 1210 are coupled to the memory 1212, and the one or more processors, when executing the stored instructions, are configured to: access a two-dimensional (2D) array of computational elements, wherein each computational element within the array of computational elements is known to a compiler and is coupled to its neighboring computational elements within the array of computational elements; provide control for the array of computational elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length control words generated by the compiler; and merge two or more operations into a control word, wherein the control word includes a branch decision and operations associated with the branch decision. In embodiments, the merged control word includes speculatively encoded operations for at least two possible branch paths. The at least two possible branch paths may produce independent side effects. In other embodiments, the at least two possible branch paths may generate computational element actions that must be committed. Committing a computational element action may include storing the results of the action in storage. An operation may be performed on data that can be promoted, and the promoted data may be used by a downstream operation. The downstream operation may include an arithmetic or Boolean operation, a matrix operation, and so on. The two or more possible branch paths may include the indicated branch decision and one or more branch decisions that are not indicated. One or more operations associated with the branch decision that are not indicated by the branch decision may be suppressed. Suppressing operations may enable power reduction in the 2D array of computational elements. The computational elements may include computational elements within one or more integrated circuits or chips; computational elements or cores configured within one or more programmable chips such as application specific integrated circuits (ASICs); field programmable gate arrays (FPGAs); heterogeneous processors configured as a mesh; standalone processors; and so on.
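
By way of illustration only, the relationship between a merged control word, its speculatively encoded operations, and the suppression of operations that the branch decision does not indicate may be sketched in software. The sketch below is a simplified model under stated assumptions: the names `Operation`, `MergedControlWord`, `resolve`, the `idle` flag, and the operation labels are hypothetical and are not taken from the disclosure, and a real array would perform these steps in hardware.

```python
from dataclasses import dataclass, field


@dataclass
class Operation:
    name: str           # hypothetical operation label
    path: str           # branch path the operation belongs to
    idle: bool = False  # hypothetical flag: a suppressed operation does no work


@dataclass
class MergedControlWord:
    branch_condition: str                          # label of the branch decision
    operations: list = field(default_factory=list)  # operations for all encoded paths


def resolve(word, indicated_path):
    """Mark operations on non-indicated paths idle; return the names that proceed."""
    for op in word.operations:
        op.idle = op.path != indicated_path
    return [op.name for op in word.operations if not op.idle]


# One control word carries operations for both possible branch paths.
word = MergedControlWord(
    branch_condition="x > 0",
    operations=[
        Operation("add_r1_r2", "taken"),
        Operation("store_r1", "taken"),
        Operation("sub_r1_r2", "not_taken"),
    ],
)
committed = resolve(word, "taken")
```

In this model only the operations on the indicated path remain active, which mirrors the statement that suppressed operations are neither executed to completion nor committed.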

System 1200 may include a cache 1220. The cache 1220 may be used to store data such as data associated with branch decisions, directions to computational elements, control words, intermediate results, microcode, and so on. The cache may comprise a small, local, easily accessible memory available to one or more computational elements. In embodiments, the stored data may include data associated with two or more branch decisions. Data associated with the indicated branch decision may be promoted for downstream operations, while data associated with a branch decision that is not indicated may be ignored. Embodiments include storing relevant portions of a direction or a control word within the cache associated with the array of computational elements. The cache may be accessible to one or more computational elements. The cache, if present, may include a dual read, single write (2R1W) cache. That is, the 2R1W cache may enable two read operations and one write operation to occur simultaneously without the read and write operations interfering with one another. System 1200 may include an accessing component 1230. The accessing component 1230 may include control logic and functions for accessing a two-dimensional (2D) array of computational elements, wherein each computational element within the array of computational elements is known to a compiler and is coupled to its neighboring computational elements within the array of computational elements. A computational element may include one or more processors, processor cores, processor macros, and so on. Each computational element may include an amount of local storage. The local storage may be accessible to one or more computational elements. Each computational element may communicate with neighbors, where the neighbors may include nearest neighbors or more remote "neighbors". Communication between and among computational elements may be accomplished using a bus such as an industry standard bus, a ring bus, a network such as a wired or wireless computer network, and so on. In embodiments, the ring bus is implemented as a distributed multiplexor (MUX). As discussed below, operations associated with the indicated branch decision may be executed, while operations associated with a branch decision that is not indicated may be suppressed.
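
The access pattern of a 2R1W cache may be illustrated with a small behavioral model. This is a sketch under stated assumptions rather than a description of the disclosed cache: the class name `Cache2R1W`, the `cycle` method, and the policy that reads observe the state from before the same-cycle write are all hypothetical choices made for the example.

```python
class Cache2R1W:
    """Toy model: each cycle accepts at most two reads and at most one write."""

    def __init__(self):
        self._store = {}

    def cycle(self, reads=(), write=None):
        if len(reads) > 2:
            raise ValueError("a 2R1W cache accepts at most two reads per cycle")
        # Assumed policy: reads see the contents from before this cycle's write.
        results = [self._store.get(address) for address in reads]
        if write is not None:
            address, value = write
            self._store[address] = value
        return results


cache = Cache2R1W()
cache.cycle(write=(0, "cw_path_a"))
cache.cycle(write=(1, "cw_path_b"))
# Two reads and one write are serviced in the same cycle.
fetched = cache.cycle(reads=(0, 1), write=(2, "cw_next"))
```

The final call shows the property described above: both stored entries are read while a third entry is written, with no port conflict in the model.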

The indicated branch decision may be based on a code conditionality, where the conditionality may be established by the compiler through the control word or words. A decision reached at a computational element is selected for examination by the control unit. A code conditionality may include a branch point, a decision point, a condition, and so on. In embodiments, the conditionality may determine code jumps. A code jump may change code execution from sequential execution of instructions to execution of a different set of instructions. In a usage example, a 2R1W cache may support simultaneous fetching of operations associated with merged code words. Because the branch decision indicated by a direction or control word that contains a branch may be data dependent, and is therefore not known a priori, control words associated with more than one branch decision may be fetched before the branch control word is executed (a prefetch). As discussed elsewhere, the initial portions of two or more branch paths that are based on branch decisions may be instantiated in a series of merged control words. Once the correct branch decision has been determined, computation associated with a branch decision that is not indicated may be suppressed.
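
The prefetch behavior described above may be sketched as follows. The example assumes a hypothetical program representation (a dictionary with a `branch` entry and two named paths) and a hypothetical function `prefetch_then_resolve`; none of these names come from the disclosure. It shows control words for both paths being fetched before the data-dependent condition is evaluated.

```python
# Hypothetical program: one data-dependent branch and two candidate paths.
program = {
    "branch": {
        "condition": lambda data: data > 0,
        "taken": "path_a",
        "not_taken": "path_b",
    },
    "path_a": ["a0", "a1"],
    "path_b": ["b0", "b1"],
}


def prefetch_then_resolve(program, data):
    """Fetch both paths first, then pick the indicated one once data is known."""
    branch = program["branch"]
    # Prefetch: control words for both outcomes are fetched before the decision.
    prefetched = {
        outcome: list(program[branch[outcome]])
        for outcome in ("taken", "not_taken")
    }
    # The condition depends on run-time data, so it is evaluated only now.
    indicated = "taken" if branch["condition"](data) else "not_taken"
    other = "not_taken" if indicated == "taken" else "taken"
    return prefetched[indicated], prefetched[other]


executed, suppressed = prefetch_then_resolve(program, 5)
```

The first returned list corresponds to the indicated path, and the second to the path whose computation would be suppressed.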

System 1200 may include a providing component 1240. The providing component 1240 may include control and functions for providing control for the array of computational elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length control words generated by the compiler. The control words may be based on low-level control words such as assembly language words, microcode words, and so on. Control of the array of computational elements on a cycle-by-cycle basis may include configuring the array to perform various compute operations. In embodiments, the stream of wide, variable length control words generated by the compiler provides direct, fine-grained control of the 2D array of computational elements. The compute operations may enable audio or video processing, artificial intelligence processing, machine learning, deep learning, and the like. The providing of control may be based on microcode control words, where the microcode control words may include opcode fields, data fields, compute array configuration fields, and so on. The compiler that generates the control may include a general-purpose compiler, a parallelizing compiler, a compiler optimized for the array of computational elements, a compiler specialized to perform one or more processing tasks, and so on. The providing of control may implement one or more topologies, such as processing topologies, within the array of computational elements. In embodiments, the topologies implemented within the array of computational elements may include a systolic, vector, cyclic, spatial, streaming, or Very Long Instruction Word (VLIW) topology. Other topologies may include a neural network topology. The control may enable machine learning functionality for the neural network topology.
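
As an illustration of a variable length control word with opcode, configuration, and data fields, the following sketch packs and unpacks a small stream. The field layout (a one-byte opcode, a one-byte array configuration field, a two-byte data length, then the data) is purely an assumption made for the example; the disclosure does not specify a bit-level format, and the function names are hypothetical.

```python
import struct

HEADER = ">BBH"                       # opcode, config, data length (assumed layout)
HEADER_SIZE = struct.calcsize(HEADER)  # 4 bytes


def encode_control_word(opcode, config, data):
    """Pack one control word; its total length varies with the data field."""
    return struct.pack(HEADER, opcode, config, len(data)) + bytes(data)


def decode_stream(stream):
    """Yield (opcode, config, data) for each variable length word in the stream."""
    offset = 0
    while offset < len(stream):
        opcode, config, length = struct.unpack_from(HEADER, stream, offset)
        offset += HEADER_SIZE
        yield opcode, config, stream[offset:offset + length]
        offset += length


stream = (
    encode_control_word(0x01, 0x0F, b"\x10\x20")
    + encode_control_word(0x02, 0x00, b"")
)
words = list(decode_stream(stream))
```

Because each word carries its own length, words with an empty data field occupy only the header, which is one simple way a stream can hold words of differing widths.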

The branch decision may be part of a compiled task, which may be one of many tasks associated with a processing job. The compiled task may be executed on one or more computational elements within the array of computational elements. In embodiments, the execution of the compiled task may be distributed across computational elements in order to parallelize the execution. Executing the compiled task may include executing tasks for processing multiple datasets (e.g., single instruction multiple data, or SIMD, execution). Embodiments may include providing simultaneous execution of two or more potential compiled task outcomes. Recall that the provided control word or words may control a code conditionality for the array of computational elements. In embodiments, the two or more potential compiled task outcomes comprise a computation result or a flow control. The code conditionality, which may be based on computing a condition such as a value, a Boolean equation, and so on, may cause execution of one of two or more sequences of instructions based on the condition. In embodiments, the two or more potential compiled outcomes may be controlled by the same control word. In other embodiments, the conditionality may determine code jumps. The two or more potential compiled task outcomes may be based on one or more branch paths, data, and so on. The execution may be based on one or more directions or control words. Because the potential compiled task outcomes are not known before the condition is evaluated, the set of directions may enable simultaneous execution of the two or more potential compiled task outcomes. When the condition is evaluated, execution of the set of directions associated with the condition may proceed, while the set of directions not associated with the condition (e.g., the path not taken) may be halted, flushed, and so on. In embodiments, the same direction or control word may be executed on a given cycle across the array of computational elements. The executing of tasks may be performed by computational elements located throughout the array of computational elements. In embodiments, the two or more potential compiled outcomes may be executed on spatially separate computational elements within the array of computational elements. Using spatially separate computational elements may enable reduced storage, bus, and network contention; reduced power dissipation by the computational elements; and so on.
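
The simultaneous execution of two potential outcomes on spatially separate elements, followed by a flush of the path not taken, may be sketched as follows. The two-element array, the `result` field, the condition `x > y`, and the function name `run_speculatively` are hypothetical and chosen only to make the example concrete.

```python
def run_speculatively(x, y):
    """Compute both outcomes on separate elements, then flush the one not taken."""
    # A hypothetical two-element array; each element has its own local storage.
    array = [{"result": None}, {"result": None}]
    # Both potential outcomes are computed before the condition is known.
    array[0]["result"] = x + y  # outcome if the condition holds
    array[1]["result"] = x - y  # outcome if the condition does not hold
    # The condition is evaluated; the element on the path not taken is flushed.
    keep, flush = (0, 1) if x > y else (1, 0)
    array[flush]["result"] = None
    return array[keep]["result"], array


result, array = run_speculatively(5, 3)
```

Since each outcome lives in a different element's local storage, discarding one of them does not disturb the other, which is the independence that spatial separation is meant to provide.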

System 1200 may include a merging component 1250. The merging component 1250 may include control logic and functions for merging two or more operations into a control word, wherein the control word includes a branch decision and operations associated with the branch decision. In embodiments, the merged control word is a single control word. Recall that code, a program, an application, an app, and so on is executed on the array of computational elements. The execution of the code, program, and so on may be viewed as a tree, where each branch of the tree may be associated with a branch decision. The exact path through the tree associated with a given execution is not known a priori. In embodiments, the merged control word may include speculatively encoded operations for at least two possible branch paths. A branch may include other numbers of possible branch paths. The merging of encoded operations may be based on one or more conditions. In embodiments, the at least two possible branch paths may produce independent side effects. The side effects may include a program execution interruption, an increase in the apparent functional density of compressed control words, and so on. In other embodiments, the at least two possible branch paths may generate computational element actions that must be committed. The actions that must be committed may include accesses, such as writes, to storage, where the storage may be external to the array of computational elements. In further embodiments, the operations for the at least two possible branch paths may be performed in parallel by the 2D array of computational elements. The parallel execution may be performed when the at least two possible branch paths operate on the same data, or when they operate on independent data (e.g., single instruction multiple data (SIMD) execution). In some code, programs, and so on, the branch decision may support a programming loop. The programming loop may include a conditional loop or an unconditional loop. In embodiments, the merging may include operations from both the end of the loop and the beginning of the loop. As discussed throughout, the execution of code, a program, and so on may be associated with operation cycles of the 2D array of computational elements. In embodiments, the merging may encompass two or more operation cycles of the cycle-by-cycle basis. The merging may be used to control multiple operation cycles. In embodiments, the merging may enable a reduction in the operation cycles of the cycle-by-cycle basis. The reduction in operation cycles may be accomplished by performing two or more operations in a given cycle, by combining read or write operations within a cycle, and so on.
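
The cycle reduction obtained by merging operations from the end of a loop with operations from the beginning of the loop may be illustrated with a simple scheduling sketch. The three-operation loop body (`load`, `compute`, `store`), the one-operation-per-cycle baseline, and the function name `schedule` are assumptions made for the example; the trip count is fixed here, so the question of suppressing a speculative load after loop exit does not arise in this model.

```python
def schedule(iterations, merge):
    """Return a list of cycles; each cycle lists the operations in one control word."""
    cycles = []
    for i in range(iterations):
        load, compute, store = f"load[{i}]", f"compute[{i}]", f"store[{i}]"
        if merge and cycles:
            # Merge the start of this iteration into the control word that
            # already holds the end of the previous iteration.
            cycles[-1].append(load)
        else:
            cycles.append([load])
        cycles.append([compute])
        cycles.append([store])
    return cycles


unmerged = schedule(4, merge=False)
merged = schedule(4, merge=True)
```

For four iterations the unmerged schedule takes twelve cycles, while the merged schedule takes nine, because each loop boundary shares one control word between the outgoing store and the incoming load.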

System 1200 may include a computer program product embodied in a computer-readable medium for task processing, the computer program product comprising code which causes one or more processors to perform operations of: accessing a two-dimensional (2D) array of computational elements, wherein each computational element within the array of computational elements is known to a compiler and is coupled to its neighboring computational elements within the array of computational elements; providing control for the array of computational elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length control words generated by the compiler; and merging two or more operations into a control word, wherein the control word includes a branch decision and operations associated with the branch decision.

Each of the above methods may be executed on one or more processors on one or more computer systems. Embodiments may include various forms of distributed computing, client/server computing, and cloud-based computing. Further, it will be understood that the depicted steps or boxes contained in this disclosure's flow charts are solely illustrative and explanatory. The steps may be modified, omitted, repeated, or re-ordered without departing from the scope of this disclosure. Further, each step may contain one or more sub-steps. While the foregoing drawings and description set forth functional aspects of the disclosed systems, no particular implementation or arrangement of software and/or hardware should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. All such arrangements of software and/or hardware are intended to fall within the scope of this disclosure.

The block diagrams and flowchart illustrations depict methods, apparatus, systems, and computer program products. The elements and combinations of elements in the block diagrams and flow diagrams show functions, steps, or groups of steps of the methods, apparatus, systems, computer program products, and/or computer-implemented methods. Any and all such functions, generally referred to herein as a "circuit," "module," or "system," may be implemented by computer program instructions, by special-purpose hardware-based computer systems, by combinations of special-purpose hardware and computer instructions, by combinations of general-purpose hardware and computer instructions, and so on.

A programmable apparatus which executes any of the above-mentioned computer program products or computer-implemented methods may include one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors, programmable devices, programmable gate arrays, programmable array logic, memory devices, application specific integrated circuits, or the like. Each may be suitably employed or configured to process computer program instructions, execute computer logic, store computer data, and so on.

It will be understood that a computer may include a computer program product from a computer-readable storage medium, and that this medium may be internal or external, removable and replaceable, or fixed. In addition, a computer may include a Basic Input/Output System (BIOS), firmware, an operating system, a database, or the like that may include, interface with, or support the software and hardware described herein.

Embodiments of the present invention are limited neither to conventional computer applications nor to the programmable apparatus that run them. To illustrate: embodiments of the present invention may include an optical computer, a quantum computer, an analog computer, or the like. A computer program may be loaded onto a computer to produce a particular machine that may perform any and all of the depicted functions. This particular machine provides a means for carrying out any and all of the depicted functions.

Any combination of one or more computer-readable media may be utilized, including but not limited to: a non-transitory computer-readable medium for storage; an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor computer-readable storage medium, or any suitable combination of the foregoing; a portable computer diskette; a hard disk; a random access memory (RAM); a read-only memory (ROM); an erasable programmable read-only memory (EPROM, Flash, MRAM, FeRAM, or phase change memory); an optical fiber; a portable compact disc; an optical storage device; a magnetic storage device; or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

It will be appreciated that computer program instructions may include computer executable code. A variety of languages for expressing computer program instructions may include, without limitation, C, C++, Java, JavaScript™, ActionScript™, assembly language, Lisp, Perl, Tcl, Python, Ruby, hardware description languages, database programming languages, functional programming languages, imperative programming languages, and so on. In embodiments, computer program instructions may be stored, compiled, or interpreted to run on a computer, a programmable data processing apparatus, a heterogeneous combination of processors or processor architectures, and so on. Without limitation, embodiments of the present invention may take the form of web-based computer software, which includes client/server software, software-as-a-service, peer-to-peer software, or the like.

In embodiments, a computer may enable execution of computer program instructions including multiple programs or threads. The multiple programs or threads may be processed approximately simultaneously to enhance utilization of the processor and to facilitate substantially simultaneous functions. By way of implementation, any and all methods, program codes, program instructions, and the like described herein may be implemented in one or more threads, which may in turn spawn other threads, which may themselves have priorities associated with them. In some embodiments, a computer may process these threads based on priority or other order.

Unless explicitly stated or otherwise clear from the context, the verbs "execute" and "process" may be used interchangeably to indicate execute, process, interpret, compile, assemble, link, load, or a combination of the foregoing. Therefore, embodiments that execute or process computer program instructions, computer-executable code, or the like may act upon the instructions or code in any and all of the ways described. Further, the method steps shown are intended to include any suitable method of causing one or more parties or entities to perform the steps. The parties performing a step, or a portion of a step, need not be located within a particular geographic location or country boundary. For instance, if an entity located within the United States causes a method step, or a portion thereof, to be performed outside of the United States, then the method is considered to be performed in the United States by virtue of the entity that caused the step to be performed.

While the invention has been disclosed in connection with preferred embodiments shown and described in detail, various modifications and improvements thereon will become apparent to those skilled in the art. Accordingly, the foregoing examples should not limit the spirit and scope of the present invention; rather, it should be understood in the broadest sense allowable by law.

Claims

1. A processor-implemented method for program execution, comprising:
accessing a two-dimensional (2D) array of computational elements, wherein each computational element within the array of computational elements is known to a compiler and is coupled to its neighboring computational elements within the array of computational elements;
providing control for the array of computational elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length control words generated by the compiler; and
merging two or more operations into a control word, wherein the control word includes a branch decision and operations associated with the branch decision.

2. The method of claim 1, wherein the merged control word includes speculatively encoded operations for at least two possible branch paths.

3. The method of claim 2, wherein the at least two possible branch paths produce independent side effects.

4. The method of claim 2, wherein the at least two possible branch paths generate computational element actions that must be committed.

5. The method of claim 2, wherein operations for the at least two possible branch paths can be performed in parallel by the 2D array of computational elements.

6. The method of claim 1, wherein the branch decision supports subroutine execution.

7. The method of claim 1, wherein the branch decision supports a programming loop.

8. The method of claim 7, wherein the merging includes operations from both the end of the loop and the beginning of the loop.

9. The method of claim 1, wherein the two or more operations control data flow within the 2D array of computational elements.

10. The method of claim 1, wherein the merged control word is a single control word.

11. The method of claim 1, further comprising suppressing one or more operations associated with the branch decision that are not indicated by the branch decision.

12. The method of claim 11, wherein the suppressing is accomplished dynamically.

13. The method of claim 11, wherein the suppressing enables power reduction in the 2D array of computational elements.

14. The method of claim 11, wherein the suppressing prevents data from being committed.

15. The method of claim 1, wherein the merging includes speculative encoding of the control word that includes the branch decision and of one or more additional control words.

16. The method of claim 15, wherein the one or more additional control words control operations subsequent to the control word including a branch decision.

17. The method of claim 1, further comprising ignoring one or more operations associated with the branch decision that are not indicated by the branch decision.

18. The method of claim 17, wherein the ignoring is accomplished by setting an idle bit in the control word.

19. The method of claim 1, wherein the merging encompasses two or more operation cycles of the cycle-by-cycle basis.

20. The method of claim 19, wherein the merging enables a reduction in the operation cycles of the cycle-by-cycle basis.

21. The method of claim 1, wherein each of the operations represents an instruction equivalent.

22. The method of claim 21, wherein the instruction equivalents comprise building blocks for a high-level language.

23. The method of claim 1, wherein the stream of wide, variable length control words generated by the compiler provides direct, fine-grained control of the 2D array of computational elements.

24. A computer program product embodied in a computer-readable medium for program execution, the computer program product comprising code which causes one or more processors to perform operations of:
accessing a two-dimensional (2D) array of computational elements, wherein each computational element within the array of computational elements is known to a compiler and is coupled to its neighboring computational elements within the array of computational elements;
providing control for the array of computational elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length control words generated by the compiler; and
merging two or more operations into a control word, wherein the control word includes a branch decision and operations associated with the branch decision.

25. The computer program product of claim 24, wherein the merged control word includes speculatively encoded operations for at least two possible branch paths.

26. The computer program product of claim 25, wherein the at least two possible branch paths produce independent side effects.

27. The computer program product of claim 25, wherein the at least two possible branch paths generate computational element actions that must be committed.

28. The computer program product of claim 25, wherein operations for the at least two possible branch paths can be performed in parallel by the 2D array of computational elements.

29. The computer program product of claim 24, wherein the branch decision supports a programming loop.

30. The computer program product of claim 25, further comprising code for suppressing one or more operations associated with the branch decision that are not indicated by the branch decision.

31. The computer program product of claim 25, further comprising code for ignoring one or more operations associated with the branch decision that are not indicated by the branch decision.

32. The computer program product of claim 31, wherein the ignoring is accomplished by setting an idle bit in the control word.

33. A computer system for program execution, comprising:
a memory that stores instructions; and
one or more processors coupled to the memory, wherein the one or more processors, when executing the stored instructions, are configured to:
access a two-dimensional (2D) array of computational elements, each computational element in the array of computational elements being known to a compiler and coupled to neighboring computational elements in the array of computational elements;
provide control of the array of computational elements on a cycle-by-cycle basis, wherein the control is enabled by a stream of wide, variable length control words generated by the compiler; and
merge two or more operations into a control word, wherein the control word includes a branch decision and operations associated with the branch decision.

34. The computer system of claim 33, wherein the merged control word includes speculatively encoded operations for at least two possible branch paths.

35. The computer system of claim 34, wherein the at least two possible branch paths produce independent side effects.

36. The computer system of claim 34, wherein the at least two possible branch paths generate computational element actions that must be committed.

37. The computer system of claim 34, wherein operations on the at least two possible branch paths can be performed in parallel by the 2D array of computational elements.

38. The computer system of claim 33, wherein the branch decision supports a programming loop.

39. The computer system of claim 33, further configured to suppress one or more operations associated with the branch decision that are not indicated by the branch decision.

40. The computer system of claim 33, further configured to ignore one or more operations associated with the branch decision that are not indicated by the branch decision.

41. The computer system of claim 40, wherein the ignoring is accomplished by setting an idle bit in the control word.
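The speculative encoding and idle-bit suppression recited in the claims above can be sketched as follows. This is an illustrative model only, not the claimed implementation: the `Operation` class, the `resolve_branch` and `commit` helpers, the per-operation placement of the idle bit, and the register names are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    path: str           # branch path the operation belongs to: "taken" or "not_taken"
    target: str         # hypothetical destination register
    value: int          # value the operation would write
    idle: bool = False  # assumed idle bit: when set, the operation is ignored


def resolve_branch(control_word, branch_taken):
    """Set the idle bit on every operation not indicated by the branch decision."""
    chosen = "taken" if branch_taken else "not_taken"
    for op in control_word:
        op.idle = op.path != chosen
    return control_word


def commit(control_word, registers):
    """Commit only the operations whose idle bit is clear."""
    for op in control_word:
        if not op.idle:
            registers[op.target] = op.value
    return registers


# One control word speculatively encodes operations for both branch paths.
# The two paths write different registers, so their side effects are independent.
word = [Operation("taken", "r1", 10), Operation("not_taken", "r2", 20)]

regs = commit(resolve_branch(word, branch_taken=True), {})
print(regs)
```

In this sketch both paths are encoded before the branch decision is known, and only the operations on the indicated path are committed once it resolves.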