KR102628658B1

KR102628658B1 - Neural processor and control method of neural processor

Info

Publication number: KR102628658B1
Application number: KR1020210035736A
Authority: KR
Inventors: 유승주
Original assignee: 삼성전자주식회사; 서울대학교산학협력단
Priority date: 2021-03-04
Filing date: 2021-03-19
Publication date: 2024-01-25
Also published as: KR20220125116A

Abstract

A neural processor and a method of controlling the neural processor are disclosed. The neural processor includes a plurality of processing element groups. Each of the processing element groups includes a plurality of processing elements that perform a vector operation; an overflow accumulator that is shared by the plurality of processing elements and is occupied by a processing element, among the plurality of processing elements, in which an overflow or an underflow occurs; and a register that stores information indicating the occupying processing element that occupies the overflow accumulator.

Description

Neural processor and control method of neural processor {NEURAL PROCESSOR AND CONTROL METHOD OF NEURAL PROCESSOR}

The following embodiments relate to a neural processor and a method of controlling the neural processor.

For applications in the field of artificial intelligence, a hardware accelerator configured as a neural processing unit (NPU) may basically implement a dot product operation between two vectors. The NPU may use an accumulator and an adder with a large number of bits to perform the dot product operation and store the operation result. A multiplier, an accumulator, and an adder may be used to implement the dot product operation. For example, when the NPU performs operations with a precision of 8 bits or more, the cost of the multiplier is generally large, whereas when the precision of the operations is lowered to less than 8 bits, the cost of the accumulator and the adder may become relatively large compared to that of the multiplier. Therefore, a method of reducing the cost of the accumulator and the adder is required for an efficient implementation of an NPU that performs low-precision operations.

According to one embodiment, a neural processor includes a plurality of processing element groups, and each of the processing element groups includes a plurality of processing elements that perform a vector operation; an overflow accumulator that is shared by the plurality of processing elements and is occupied by a processing element, among the plurality of processing elements, in which an overflow or an underflow occurs; and a register that stores information indicating the occupying processing element that occupies the overflow accumulator.

The overflow accumulator may accumulate the operation result of the accumulator of the occupying processing element based on information, received from the accumulators included in the respective processing elements, that indicates whether an overflow or an underflow has occurred.

The information may indicate at least one of overflow, underflow, and none.

The overflow accumulator may be connected to each of the plurality of processing elements through a pipelined interconnect.

The neural processor may check, based on the information indicating the occupying processing element, whether the overflow accumulator shared by the plurality of processing elements performing the vector operation is occupied by at least one of the plurality of processing elements.

When the overflow accumulator is occupied, the neural processor may control the overflow accumulator to add 1 in response to an overflow signal output from the occupying processing element, and to subtract 1 in response to an underflow signal output from the occupying processing element.

When the overflow accumulator is not occupied, the neural processor may set, as the occupying processing element, the processing element that has output an overflow signal or an underflow signal among the processing elements.

When the vector operation ends, the occupying processing element may output the information indicating the occupying processing element together with the operation result of the overflow accumulator and the operation result of the occupying processing element.

When the vector operation ends, the non-occupying processing elements, that is, the processing elements other than the occupying processing element, may output the operation results of their own accumulators.

When the neural processor simultaneously receives the overflow signal or the underflow signal from two or more processing elements among the plurality of processing elements, the neural processor may randomly set any one of the two or more processing elements as the occupying processing element.

The register may further store information indicating whether an overflow or an underflow has occurred in the occupying processing element.

Each of the processing elements may include a plurality of multipliers, a plurality of adders, and an accumulator.

Each of the processing elements may include a multiplier-adder tree (MAT), an adder, and an accumulator.

The overflow accumulator may include an accumulator and an adder.

According to one embodiment, a method of controlling a neural processor includes checking whether an overflow accumulator shared by a plurality of processing elements that perform a vector operation is occupied by at least one of the plurality of processing elements; when the overflow accumulator is not occupied, setting, as an occupying processing element, a processing element that has output an overflow signal or an underflow signal among the processing elements; when the overflow accumulator is occupied, controlling the overflow accumulator to add or subtract according to a signal output from the occupying processing element that occupies the overflow accumulator; and, when the vector operation ends, outputting information indicating the occupying processing element together with an operation result of the overflow accumulator and an operation result of the occupying processing element.

The checking may include checking, based on the information indicating the occupying processing element, whether the overflow accumulator is occupied by at least one of the plurality of processing elements.

The controlling may include, when the overflow accumulator is occupied, adding 1 to the overflow accumulator in response to an overflow signal output from the processing element occupying the overflow accumulator; and subtracting 1 from the overflow accumulator in response to an underflow signal output from the processing element occupying the overflow accumulator.

The outputting may include outputting a result of summing the data of the overflow accumulator and the operation result of the occupying processing element through a pipelined interconnect that runs vertically between the occupying processing element and the overflow accumulator.

The method of controlling the neural processor may further include, when the overflow signal or the underflow signal is simultaneously received from two or more processing elements among the plurality of processing elements, randomly setting any one of the two or more processing elements as the occupying processing element.
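
As a non-limiting illustration, the control method summarized above may be sketched in Python as follows. The function name `run_group`, its parameters, and the signal values +1/-1/0 are hypothetical and do not appear in the disclosure. The sketch assumes signed two's-complement accumulators that wrap on overflow, assumes that the signal which claims the overflow accumulator is also counted in the same cycle, and assumes that wrap-arounds in non-occupying processing elements are simply not tracked.

```python
import random


def run_group(addend_streams, n_bits, seed=0):
    """Simulate one processing element group (hypothetical model).

    addend_streams[i] is the per-cycle sequence of values added to the
    N-bit accumulator of processing element i.
    Returns (accumulators, owner index or None, overflow accumulator value).
    """
    rng = random.Random(seed)
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    accs = [0] * len(addend_streams)
    owner, oa = None, 0  # owner models the register; oa models the shared OA

    for addends in zip(*addend_streams):  # one tuple per clock cycle
        signals = []
        for i, a in enumerate(addends):
            total = accs[i] + a
            if total > hi:        # overflow: wrap and raise +1
                accs[i], sig = total - (1 << n_bits), +1
            elif total < lo:      # underflow: wrap and raise -1
                accs[i], sig = total + (1 << n_bits), -1
            else:
                accs[i], sig = total, 0
            signals.append(sig)

        # Check step: if unoccupied, the first signalling PE claims the OA
        # (random choice when several signal in the same cycle).
        if owner is None:
            candidates = [i for i, s in enumerate(signals) if s != 0]
            if candidates:
                owner = rng.choice(candidates)

        # Control step: only the occupying PE's signal updates the OA.
        if owner is not None:
            oa += signals[owner]

    return accs, owner, oa
```

Under these assumptions, the full-precision result of the occupying processing element can be recovered as `(oa << n_bits) + accs[owner]`, while the other elements report only their N-bit accumulator values.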

FIGS. 1 and 2 are diagrams illustrating configurations of a neural processor according to embodiments.
FIGS. 3 and 4 are flowcharts illustrating a method of controlling a neural processor according to embodiments.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, the actual implementation is not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, and substitutes that fall within the technical idea described by the embodiments.

Terms such as "first" and "second" may be used to describe various components, but these terms should be interpreted only for the purpose of distinguishing one component from another. For example, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component.

When a component is referred to as being "connected" to another component, it should be understood that the component may be directly connected or coupled to the other component, or that intervening components may be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description with reference to the accompanying drawings, identical components are assigned the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted.

FIG. 1 is a diagram illustrating the configuration of a neural processor according to one embodiment. Referring to FIG. 1, the structure of a neural processor 10 including a plurality of processing element groups (PEGs) arranged in a systolic array structure is shown.

Each of the processing element groups (PEG_i-1, PEG_i, PEG_i+1) has a systolic array structure, in which cells having the same function form an interconnected network and are designed to perform one operation in step with a global synchronization signal, and therefore the groups may have the same function and operation. Hereinafter, the configuration and operation are described with a focus on one processing element group (PEG_i) 100 among the processing element groups (PEG_i-1, PEG_i, PEG_i+1). The configuration and operation of the processing element group (PEG_i) 100 may be equally applied to the other processing element groups (PEG_i-1, PEG_i+1).

The processing element group (PEG_i) 100 may include a plurality of processing elements (PEs) 110, 120, 130, and 140, an overflow accumulator (OA) 150, and a register 160. In one embodiment, the plurality of processing elements (PEs) 110, 120, 130, and 140 that share one overflow accumulator (OA) 150 may be referred to as a "processing element group (PEG)".

The plurality of processing elements (PEs) 110, 120, 130, and 140 may perform a vector operation. The vector operation may include, for example, a dot product operation between two vectors, but is not necessarily limited thereto. For convenience of explanation, FIG. 1 is described using an example in which the processing element group (PEG_i) 100 includes four processing elements (PEs) 110, 120, 130, and 140, but embodiments are not necessarily limited thereto. The number of processing elements included in one processing element group (PEG_i) 100 may be configured variously, for example, as 2, 8, or 16.

A processing element (PE) may also be referred to as a "dot product engine (DPE)" or an "accelerator", in that it is a hardware implementation of a vector operation, more specifically a dot product operation between vectors.

Each of the processing elements (PEs) 110, 120, 130, and 140 may include an accumulator (ACC) 115, 125, 135, and 145. Although not shown in the drawing, each of the processing elements (PEs) 110, 120, 130, and 140 may further include a plurality of multipliers and a plurality of adders in addition to the accumulator (ACC) 115, 125, 135, and 145. Each of the processing elements (PEs) 110, 120, 130, and 140 may transmit the operation result of its accumulator (ACC) 115, 125, 135, and 145 (for example, the final result of a dot product or a partial sum) to the outside through a vertically running pipelined interconnect. Here, the accumulator (ACC) 115, 125, 135, and 145 and the plurality of adders of each of the processing elements (PEs) 110, 120, 130, and 140 may be implemented with a small number of bits.

Depending on the embodiment, each of the processing elements 110, 120, 130, and 140 may include a multiplier-adder tree (MAT), an adder, and an accumulator. An embodiment in which the processing elements include a MAT, an adder, and an accumulator is described in more detail below with reference to FIG. 2.

The overflow accumulator (OA) 150 is shared by the processing elements (PEs) 110, 120, 130, and 140, and may be occupied by a processing element, among the processing elements (PEs) 110, 120, 130, and 140, in which an overflow or an underflow occurs during the vector operation.

The overflow accumulator (OA) 150 may receive, for example, 2 bits of information (for example, information indicating overflow, underflow, or none) from the N-bit accumulators 115, 125, 135, and 145 included in the respective processing elements (PEs) 110, 120, 130, and 140. The overflow accumulator (OA) 150 may accumulate the operation result of any one of the accumulators 115, 125, 135, and 145 according to the received 2 bits of information.
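
For illustration only, the following Python sketch shows how an N-bit accumulator might produce such 2 bits of information. The function `accumulate_nbit` and the constant names are hypothetical. The sketch assumes a signed two's-complement accumulator that wraps around, and assumes that each addend is small enough that at most one wrap occurs per cycle. The 2-bit codes reuse the values given elsewhere in this description for the register ("01" overflow, "10" underflow, "00" none), which is also an assumption for this signal.

```python
# 2-bit status codes (assumed to match the register encoding)
NONE, OVERFLOW, UNDERFLOW = 0b00, 0b01, 0b10


def accumulate_nbit(acc, addend, n_bits):
    """Add addend to a signed n_bits-wide accumulator.

    Returns (new accumulator value, 2-bit status code).
    Assumes abs(addend) < 2**n_bits so that at most one wrap occurs.
    """
    lo = -(1 << (n_bits - 1))
    hi = (1 << (n_bits - 1)) - 1
    total = acc + addend
    if total > hi:
        # Value no longer fits: keep the wrapped low part, flag overflow.
        return total - (1 << n_bits), OVERFLOW
    if total < lo:
        # Value fell below the range: wrap upward, flag underflow.
        return total + (1 << n_bits), UNDERFLOW
    return total, NONE
```

With a 4-bit accumulator (range -8 to 7), adding 3 to 7 wraps to -6 and raises the overflow code, which the shared overflow accumulator can then count.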

The overflow accumulator (OA) 150 may be connected to each of the processing elements (PEs) 110, 120, 130, and 140 through a pipelined interconnect indicated by a dotted line. The pipelined interconnect may be connected to an interconnect line 170.

When each of the processing elements (PEs) 110, 120, 130, and 140 sends the N-bit data held in its N-bit accumulator 115, 125, 135, and 145 to the outside through the pipelined interconnect, the overflow accumulator (OA) 150 may output its own M-bit operation result in the same clock cycle in which the N-bit operation results of the N-bit accumulators 115, 125, 135, and 145 are output, so that M+N bits of data are transmitted to the outside through the interconnect line 170.

For example, when each of the processing elements (PEs) 110, 120, 130, and 140 has a size of 10 bits and the overflow accumulator (OA) 150 has a size of 10 bits, the interconnect line 170 may transmit a total of 54 bits of data: the 50 bits of operation results (4 x 10 bits plus 10 bits), plus the information indicating the occupying processing element (2 bits) and the information indicating whether an overflow or an underflow has occurred (2 bits). Here, the "occupying processing element" may be understood as the one processing element that occupies the overflow accumulator (OA) 150 among the processing elements (PEs) 110, 120, 130, and 140 constituting one processing element group. The remaining processing elements that do not occupy the overflow accumulator (OA) 150 among the processing elements (PEs) 110, 120, 130, and 140 may be referred to as "non-occupying processing element(s)".
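
The 54-bit figure in this example can be checked with a short calculation. The helper `interconnect_width` below is a hypothetical name introduced only for this illustration. Deriving the width of the occupying-element field from the number of processing elements is an assumption that happens to give 2 bits for four elements, as in the example.

```python
def interconnect_width(num_pes, n_bits, m_bits, status_bits=2):
    """Total bits carried on the interconnect line for one group.

    num_pes     : processing elements sharing one overflow accumulator
    n_bits      : width of each per-element accumulator result
    m_bits      : width of the overflow accumulator result
    status_bits : width of the overflow/underflow/none field
    """
    # Bits needed to name one of num_pes elements (assumed binary index).
    owner_bits = max(1, (num_pes - 1).bit_length())
    return num_pes * n_bits + m_bits + owner_bits + status_bits


# Values from the example: four 10-bit results plus a 10-bit OA result.
print(interconnect_width(num_pes=4, n_bits=10, m_bits=10))  # 54
```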

Although not shown in the drawing, the overflow accumulator 150 may include an accumulator and an adder.

The register (Reg) 160 may store information indicating the occupying processing element that occupies the overflow accumulator (OA) 150. The information indicating the occupying processing element may consist of 2 bits, such as "00", "01", "10", and "11", and may indicate any one of the processing elements 110, 120, 130, and 140. For example, "00" may indicate that the processing element 110 is the occupying processing element, "01" may indicate that the processing element 120 is the occupying processing element, "10" may indicate that the processing element 130 is the occupying processing element, and "11" may indicate that the processing element 140 is the occupying processing element.

In addition, the register (Reg) 160 may further store 2 bits of information indicating whether an overflow or an underflow has occurred in the occupying processing element. For example, "01" may indicate that an overflow has occurred, "10" may indicate that an underflow has occurred, and "00" may indicate none, in other words, that neither an overflow nor an underflow has occurred.
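
As a minimal sketch, the two 2-bit fields of the register may be modeled in Python as follows. The 2-bit codes are taken from the description above. The packing order (occupying-element field in the upper bits), the names `Status`, `PE_REFERENCE`, `pack_register`, and `unpack_register`, and the treatment of the code "11" in the status field as unused are assumptions made only for this illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    NONE = 0b00       # neither overflow nor underflow
    OVERFLOW = 0b01
    UNDERFLOW = 0b10  # 0b11 is not described and is treated as invalid


# Occupying-element code -> reference numeral used in FIG. 1
PE_REFERENCE = {0b00: 110, 0b01: 120, 0b10: 130, 0b11: 140}


def pack_register(owner_code, status):
    """Pack the two fields into 4 bits: [owner(2) | status(2)] (assumed layout)."""
    if owner_code not in PE_REFERENCE:
        raise ValueError("owner_code must be a 2-bit value")
    return (owner_code << 2) | int(status)


def unpack_register(value):
    """Split a 4-bit register value back into (owner_code, Status)."""
    return (value >> 2) & 0b11, Status(value & 0b11)
```

For example, `pack_register(0b10, Status.OVERFLOW)` yields `0b1001`, meaning that processing element 130 occupies the overflow accumulator and has overflowed. How an unoccupied overflow accumulator is represented is not stated in this passage, so the sketch does not model it.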

On the premise that the processing element that first generates an overflow signal or an underflow signal occupies the overflow accumulator, the neural processor 10 may check in every clock cycle whether the overflow accumulator (OA) 150 is occupied by a processing element. The neural processor 10 may check, based on the information indicating the occupying processing element, whether the overflow accumulator (OA) 150 shared by the plurality of processing elements (PEs) 110, 120, 130, and 140 performing the vector operation is occupied by at least one of the plurality of processing elements (PEs) 110, 120, 130, and 140.

When the overflow accumulator (OA) 150 is not occupied, the neural processor 10 may set, as the occupying processing element, the processing element that first outputs an overflow signal or an underflow signal among the processing elements (PEs) 110, 120, 130, and 140.

When the overflow accumulator (OA) 150 is occupied, the neural processor 10 may control the overflow accumulator (OA) 150 to add 1 in response to an overflow signal output from the occupying processing element. The neural processor 10 may control the overflow accumulator (OA) 150 to subtract 1 in response to an underflow signal output from the occupying processing element. Here, the signal that controls the overflow accumulator (OA) 150 to add or subtract 1 may be delivered through control logic (not shown) of the neural processor 10.

The control logic may generate an output control signal when the vector operation ends. The occupying processing element that receives the output control signal may transmit its own operation result through the interconnect line 170 in step with the clock cycle, and the operation result of the overflow accumulator (OA) 150 may be transmitted through the interconnect line 170 in the same clock cycle.

When the vector operation ends, the occupying processing element may output the information indicating the occupying processing element together with the operation result of the overflow accumulator (OA) 150 and the operation result of the occupying processing element.

When the vector operation ends, the non-occupying processing elements, that is, the processing elements other than the occupying processing element, may output the operation results of their accumulators through the interconnect line 170.

When an overflow signal or an underflow signal is simultaneously received from two or more processing elements among the plurality of processing elements (PEs) 110, 120, 130, and 140, the neural processor 10 may randomly set any one of the two or more processing elements as the occupying processing element.
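
The random selection among simultaneously signalling processing elements may be sketched as follows. The function `choose_owner`, the string signal values, and the use of Python's `random` module as the source of randomness are assumptions for illustration; the description does not specify how the random choice is realized in hardware.

```python
import random

NONE, OVERFLOW, UNDERFLOW = "none", "overflow", "underflow"


def choose_owner(signals, rng=random):
    """Pick the occupying processing element for an unoccupied OA.

    signals[i] is the signal raised by processing element i in this cycle.
    Returns the index of the chosen element, or None if no element
    raised an overflow or underflow signal.
    """
    candidates = [i for i, s in enumerate(signals) if s != NONE]
    if not candidates:
        return None
    # With a single candidate this is deterministic; with several,
    # one of them is chosen at random.
    return rng.choice(candidates)
```

For instance, `choose_owner([NONE, OVERFLOW, NONE, UNDERFLOW])` returns either 1 or 3, and passing a seeded `random.Random` instance as `rng` makes the choice reproducible in a simulation.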

In one embodiment, considering that a large number of bits is rarely needed to store the result of a dot product operation, the cost of the accumulators and adders (in other words, the number of gates of the accumulators and adders) may be reduced by using accumulators and adders with a small number of bits. For example, when each of the four processing elements (PEs) 110, 120, 130, and 140 has an accumulator and an adder with a size of (M+N) bits, their cost may be 4(M+N) bits. In contrast, when the four processing elements (PEs) 110, 120, 130, and 140 share one M-bit overflow accumulator (OA) 150 and each of the four processing elements (PEs) 110, 120, 130, and 140 has an N-bit accumulator, the total cost of each of the accumulators and the adders included in one processing element group may be expressed as 4N+M bits. Compared to the case in which each of the four processing elements (PEs) 110, 120, 130, and 140 has an accumulator and an adder with a size of (M+N) bits, the cost may be reduced by 3M bits. The cost of the interconnect line 170 (for example, the number of wires in the pipelined interconnect) may also be reduced at the same rate.
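
The cost comparison above can be reproduced with the following short calculation. The function `accumulator_bits` is a hypothetical helper, and the values M = N = 10 are chosen only as an example. Generalizing from four processing elements to `num_pes` elements is an extrapolation of the arithmetic in the text.

```python
def accumulator_bits(num_pes, n_bits, m_bits):
    """Compare accumulator (or adder) bit counts for one group.

    Returns (baseline, shared, saved), where
      baseline = every element has its own (M+N)-bit accumulator,
      shared   = every element has an N-bit accumulator and the group
                 shares one M-bit overflow accumulator.
    """
    baseline = num_pes * (m_bits + n_bits)
    shared = num_pes * n_bits + m_bits
    return baseline, shared, baseline - shared


# Four elements as in FIG. 1: 4(M+N) versus 4N+M, a saving of 3M bits.
print(accumulator_bits(num_pes=4, n_bits=10, m_bits=10))  # (80, 50, 30)
```

In general the saving is (P - 1) x M bits for a group of P processing elements, so larger groups save more, at the price of more elements competing for the single overflow accumulator.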

In addition, in one embodiment, the plurality of processing elements (PEs) 110, 120, 130, and 140 are configured with accumulators and adders that use a small number of bits, and the occupying processing element, in which an overflow or an underflow occurs among the plurality of processing elements (PEs) 110, 120, 130, and 140, occupies and uses the overflow accumulator (OA) 150, so that the cost of the accumulators and adders may be lowered while efficiency is improved.

FIG. 2 is a diagram showing the configuration of a neural processor according to other embodiments. Referring to FIG. 2, a neural processor 200 is shown in which two processing elements 210 and 230 share one overflow accumulator 270.

For example, suppose that each of the two processing elements 210 and 230 has an N-bit accumulator (ACC) 217, 237 and that the overflow accumulator 270 has M bits. Each of the two processing elements 210 and 230 may then include a multiplier-adder tree (MAT) 213, 233, an N-bit adder 215, 235, and the N-bit accumulator (ACC) 217, 237.

The overflow accumulator 270 may include an M-bit adder 260 and an M-bit accumulator.

For example, assume that, of the two processing elements 210 and 230, the processing element 210 is the occupying processing element that occupies the overflow accumulator 270, and the processing element 230 is a non-occupying processing element that does not occupy it.

The occupying processing element 210 may output an (M+N)-bit operation result by combining the N-bit operation result of the accumulator 217 with the M-bit operation result of the overflow accumulator 270. The occupying processing element 210 can therefore output a high-precision operation result. Here, the M-bit operation result provided to the overflow accumulator 270 may correspond to the operation result of the occupying processing element 210, which the 2-to-1 MUX 250 selects according to the information indicating the occupying processing element.

The occupying processing element 210 may output the (M+N)-bit operation result through the pipelined interconnect 280.

In contrast, the non-occupying processing element 230 may output the N-bit operation result of the accumulator 237. Of its (M+N)-bit output, only the N-bit operation result of the accumulator 237 is valid, so the non-occupying processing element 230 outputs a lower-precision operation result. The upper M bits (the most significant bits) may be filled by, for example, a sign extension operation, which increases the number of bits of a binary number while preserving its sign (positive or negative) and its value. The non-occupying processing element 230 may output the N-bit operation result through the pipelined interconnect 290.
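The two output paths described for FIG. 2 can be sketched as follows. This is a hedged illustration, not the circuit of the specification: the text only states that the N-bit and M-bit results are combined, so the arithmetic model used here (the overflow accumulator holds the net number of wrap-arounds, overflows minus underflows, of an N-bit two's-complement accumulator) is an assumption, as are the helper names `to_signed`, `sign_extend`, and `combine_with_oa`.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def sign_extend(acc_bits: int, n_bits: int, m_bits: int) -> int:
    """Non-occupying PE: widen an N-bit pattern to (M+N) bits by copying the sign bit."""
    return to_signed(acc_bits, n_bits) & ((1 << (n_bits + m_bits)) - 1)


def combine_with_oa(acc_bits: int, oa_count: int, n_bits: int, m_bits: int) -> int:
    """Occupying PE: rebuild the (M+N)-bit result from the N-bit ACC and the OA.

    Assumed model: oa_count is the net number of wrap-arounds of the N-bit ACC.
    """
    full = to_signed(acc_bits, n_bits) + oa_count * (1 << n_bits)
    return full & ((1 << (n_bits + m_bits)) - 1)


# Hypothetical widths N = 4, M = 4. The pattern 0b1010 is -6 as a 4-bit value.
extended = sign_extend(0b1010, n_bits=4, m_bits=4)
combined = combine_with_oa(0b1010, oa_count=1, n_bits=4, m_bits=4)
print(f"{extended:08b}", to_signed(extended, 8))
print(f"{combined:08b}", to_signed(combined, 8))
```

In this example the same 4-bit pattern reads as -6 for a non-occupying PE, whereas for an occupying PE with one recorded overflow it reads as 10 (for instance 7 + 3 wrapping to -6 in four bits).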

FIG. 3 is a flowchart showing a control method of a neural processor according to an embodiment. Referring to FIG. 3, a process in which the neural processor outputs operation results through steps 310 to 340 is shown.

In step 310, the neural processor checks whether an overflow accumulator shared by a plurality of processing elements performing a vector operation is occupied by at least one of the plurality of processing elements. The neural processor may perform this check based on, for example, the information indicating the occupying processing element.

If the check in step 310 shows that the overflow accumulator is not occupied, then in step 320 the neural processor sets, as the occupying processing element, the processing element that has output an overflow signal or an underflow signal.

When overflow signals or underflow signals are received simultaneously from two or more of the plurality of processing elements, the neural processor may randomly set any one of those processing elements as the occupying processing element.

If the check in step 310 shows that the overflow accumulator is occupied, then in step 330 the neural processor controls the overflow accumulator to add or subtract according to the signal output from the occupying processing element. This signal may be an overflow signal or an underflow signal. For example, when the overflow accumulator is occupied, the neural processor may add 1 to the overflow accumulator in response to an overflow signal output from the occupying processing element, or subtract 1 from the overflow accumulator in response to an underflow signal output from the occupying processing element.

In step 340, when the vector operation ends, the neural processor outputs the information indicating the occupying processing element together with the operation result of the overflow accumulator and the operation result of the occupying processing element. The neural processor may output the sum of the data of the overflow accumulator and the operation result of the occupying processing element to the outside through a pipelined interconnect that runs vertically between the occupying processing element and the overflow accumulator.
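Steps 310 to 340 can be summarized as a small software model of the control logic. This is a minimal sketch under stated assumptions, not the hardware of the specification: the class `OverflowAccumulator`, its fields, and the string signal values are hypothetical names. The sketch also assumes that a newly chosen owner's signal is applied in the same step, and that signals from non-owner PEs are simply ignored once an owner exists.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

OVERFLOW, UNDERFLOW, NO_SIGNAL = "overflow", "underflow", "none"


@dataclass
class OverflowAccumulator:
    owner: Optional[int] = None  # register naming the occupying PE; None means free
    value: int = 0               # contents of the shared accumulator

    def step(self, signals: List[str], rng: random.Random) -> None:
        """Apply one control step to the signals raised by each PE."""
        if self.owner is None:
            # Steps 310/320: not occupied, so a signaling PE becomes the owner.
            candidates = [i for i, s in enumerate(signals) if s != NO_SIGNAL]
            if not candidates:
                return
            # Random choice when several PEs signal at the same time.
            self.owner = rng.choice(candidates)
        # Step 330: only the owner's signal changes the shared accumulator.
        signal = signals[self.owner]
        if signal == OVERFLOW:
            self.value += 1
        elif signal == UNDERFLOW:
            self.value -= 1

    def finish(self, pe_results: List[int]) -> Tuple[Optional[int], int, Optional[int]]:
        """Step 340: return (owner index, OA value, owner's accumulator result)."""
        owner_result = None if self.owner is None else pe_results[self.owner]
        return self.owner, self.value, owner_result


oa = OverflowAccumulator()
rng = random.Random(0)
oa.step(["none", "overflow", "none", "none"], rng)      # PE 1 becomes the owner
oa.step(["overflow", "overflow", "none", "none"], rng)  # PE 0 is ignored (assumption)
oa.step(["none", "underflow", "none", "none"], rng)
result = oa.finish([5, -3, 7, 1])
print(result)
```

Keeping the owner index in a single field mirrors the register that stores the information indicating the occupying processing element, and returning it from `finish` reflects that this information is output together with the two operation results.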

FIG. 4 is a flowchart showing a control method of a neural processor according to another embodiment. Referring to FIG. 4, the operations that the overflow accumulator of the neural processor performs in every cycle are shown as steps 410 to 460.

In step 410, the neural processor may check whether an overflow accumulator shared by a plurality of processing elements performing a vector operation is occupied by at least one of the plurality of processing elements.

If step 410 finds that the overflow accumulator is occupied by at least one processing element, then in step 440 the neural processor may add 1 to, or subtract 1 from, the overflow accumulator (more specifically, the accumulator (OACC) included in the overflow accumulator) according to the overflow signal or underflow signal output from the occupying processing element. The neural processor may add 1 to the overflow accumulator when the occupying processing element outputs an overflow signal, and subtract 1 from the overflow accumulator when it outputs an underflow signal.

If step 410 finds that the overflow accumulator is not occupied by any processing element, then in step 420 the neural processor may determine whether an underflow signal or an overflow signal has been received.

If step 420 determines that no underflow signal or overflow signal has been received, the neural processor may wait until one is received. Depending on the embodiment, the neural processor may instead return to step 410 and check again whether the overflow accumulator is occupied by at least one of the plurality of processing elements.

If step 420 determines that an underflow signal or an overflow signal has been received, then in step 430 the neural processor may set the processing element that output the signal as the occupying processing element (owner PE).

If step 420 determines that two or more underflow signals or overflow signals have been received at the same time, the neural processor may arbitrarily set one of the processing elements that output those signals as the occupying processing element.

In step 440, the neural processor may add 1 to or subtract 1 from the overflow accumulator according to the overflow signal or underflow signal output from the occupying processing element set in step 430.

In step 450, the neural processor may determine whether the control logic has generated an output control signal upon completion of the vector operation. If step 450 determines that no output control signal has been generated, the neural processor may end the operation.

If step 450 determines that an output control signal has been generated, then in step 460 the neural processor may output the operation result of the overflow accumulator through the pipelined interconnect in step with the clock cycle, and in the same clock cycle also output the operation result of the occupying processing element through the pipelined interconnect. At this time, the neural processor may output the information indicating the occupying processing element together with the operation result of the overflow accumulator and the operation result of the occupying processing element.
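To see why adding or subtracting 1 per signal in step 440 is enough to recover a full-precision result at step 460, the occupying PE's side can be modeled in software. The sketch below is illustrative and rests on assumptions not stated in the specification: the PE accumulator is modeled as an N-bit two's-complement register that wraps around, each addend fits in N bits so that at most one wrap occurs per addition, and the names `accumulate_wrapping` and `run_owner_pe` are hypothetical.

```python
from typing import Iterable, Tuple


def accumulate_wrapping(acc: int, addend: int, n_bits: int) -> Tuple[int, int]:
    """Add to an N-bit two's-complement accumulator.

    Returns (new_acc, signal) where signal is +1 on overflow, -1 on
    underflow, and 0 otherwise. Assumes addend itself fits in N bits.
    """
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    total = acc + addend
    if total > hi:
        return total - (1 << n_bits), +1  # wrapped past the maximum
    if total < lo:
        return total + (1 << n_bits), -1  # wrapped past the minimum
    return total, 0


def run_owner_pe(addends: Iterable[int], n_bits: int) -> Tuple[int, int]:
    """Accumulate all addends; the overflow accumulator tracks the signals."""
    acc, oa = 0, 0
    for addend in addends:
        acc, signal = accumulate_wrapping(acc, addend, n_bits)
        oa += signal  # step 440: add 1 on overflow, subtract 1 on underflow
    return acc, oa


# Hypothetical partial sums fed to an 8-bit accumulator.
addends = [100, 90, 120, -30, 110, 50]
acc, oa = run_owner_pe(addends, n_bits=8)
reconstructed = oa * (1 << 8) + acc
print(acc, oa, reconstructed, sum(addends))
```

Under these assumptions the overflow accumulator value times 2^N plus the N-bit accumulator value equals the exact sum, which is the sense in which the occupying PE obtains a higher-precision result than its own N-bit accumulator could hold.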

The embodiments described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and software applications that execute on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but those skilled in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or command the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and constructed for the embodiment or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, those skilled in the art can apply various technical modifications and variations based on them. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A neural processor comprising a plurality of processing element groups,
wherein each of the processing element groups comprises:
a plurality of processing elements that perform vector operations;
an overflow accumulator that is shared by the plurality of processing elements and is occupied by a processing element, among the plurality of processing elements, in which an overflow or underflow occurs; and
a register that stores information indicating the occupying processing element that occupies the overflow accumulator.

The neural processor of claim 1,
wherein the overflow accumulator accumulates the operation result of the accumulator of the occupying processing element based on information, received from the accumulator included in each of the processing elements, indicating whether an overflow or underflow has occurred.

The neural processor of claim 2,
wherein the information indicates at least one of overflow, underflow, and none.

The neural processor of claim 1,
wherein the overflow accumulator is connected to each of the plurality of processing elements through a pipelined interconnect.

The neural processor of claim 1,
wherein the neural processor checks, based on the information indicating the occupying processing element, whether the overflow accumulator shared by the plurality of processing elements performing the vector operation is occupied by at least one of the plurality of processing elements.

The neural processor of claim 5,
wherein, when the overflow accumulator is occupied, the neural processor
controls the overflow accumulator to add 1 according to an overflow signal output from the occupying processing element, and
controls the overflow accumulator to subtract 1 according to an underflow signal output from the occupying processing element.

The neural processor of claim 5,
wherein, when the overflow accumulator is not occupied, the neural processor sets, as the occupying processing element, a processing element among the processing elements that outputs an overflow signal or an underflow signal.

The neural processor of claim 1,
wherein the occupying processing element outputs, when the vector operation ends, the information indicating the occupying processing element together with an operation result of the overflow accumulator and an operation result of the occupying processing element.

The neural processor of claim 1,
wherein non-occupying processing elements, which are the processing elements other than the occupying processing element, each output an operation result of their own accumulator when the vector operation ends.

The neural processor of claim 1,
wherein, when an overflow signal or an underflow signal is received simultaneously from two or more of the plurality of processing elements, the neural processor randomly sets one of the two or more processing elements as the occupying processing element.

The neural processor of claim 1,
wherein the register further stores information indicating whether an overflow or an underflow occurred in the occupying processing element.

The neural processor of claim 1,
wherein each of the processing elements comprises a plurality of multipliers, a plurality of adders, and an accumulator.

The neural processor of claim 1,
wherein each of the processing elements comprises a multiplier-adder tree (MAT), an adder, and an accumulator.

The neural processor of claim 1,
wherein the overflow accumulator comprises an accumulator and an adder.

A control method of a neural processor, the method comprising:
checking whether an overflow accumulator shared by a plurality of processing elements performing a vector operation is occupied by at least one of the plurality of processing elements;
when the overflow accumulator is not occupied, setting, as an occupying processing element, a processing element among the processing elements that outputs an overflow signal or an underflow signal;
when the overflow accumulator is occupied, controlling the overflow accumulator to add or subtract according to a signal output from the occupying processing element that occupies the overflow accumulator; and
when the vector operation ends, outputting information indicating the occupying processing element together with an operation result of the overflow accumulator and an operation result of the occupying processing element.

The method of claim 15,
wherein the checking comprises checking, based on the information indicating the occupying processing element, whether the overflow accumulator is occupied by at least one of the plurality of processing elements.

The method of claim 15,
wherein the controlling comprises, when the overflow accumulator is occupied:
adding 1 to the overflow accumulator according to an overflow signal output from the processing element occupying the overflow accumulator; and
subtracting 1 from the overflow accumulator according to an underflow signal output from the processing element occupying the overflow accumulator.

The method of claim 15,
wherein the outputting comprises outputting a result of adding the data of the overflow accumulator and the operation result of the occupying processing element through a pipelined interconnect that runs vertically between the occupying processing element and the overflow accumulator.

The method of claim 15, further comprising:
when the overflow signal or the underflow signal is received simultaneously from two or more of the plurality of processing elements, randomly setting one of the two or more processing elements as the occupying processing element.

A computer program stored in a computer-readable recording medium in combination with hardware to execute the method of any one of claims 15 to 19.