KR20200089302A

KR20200089302A - Systems and methods for converting matrix inputs to vectorized inputs for matrix processors

Info

Publication number: KR20200089302A
Application number: KR1020207017612A
Authority: KR
Inventors: Peter Joseph Bannon; William A. McGee; Emil Talpes
Original assignee: Tesla, Inc.
Priority date: 2017-12-12
Filing date: 2018-12-11
Publication date: 2020-07-24
Also published as: CN111465924A; CN111465924B; EP3724771A4; EP3724771A1; JP7101776B2; US20190179870A1; WO2019118517A1; JP2021507335A; KR102451239B1; US10747844B2

Abstract

Systems and methods are provided that accelerate the convolution of images and similar arithmetic operations by using hardware-specific circuitry that allows many operations to be performed in parallel across a large data set. In various embodiments, arithmetic operations are further improved by reusing data and by eliminating the redundant steps of storing and fetching intermediate results from registers and memory when performing arithmetic operations.

Description

Systems and methods for converting matrix inputs to vectorized inputs for matrix processors

This application claims priority to commonly owned U.S. Patent Application No. 15/839,234 (Docket No. 20150-2166), filed on December 12, 2017, entitled "SYSTEMS AND METHODS FOR CONVERTING A MATRIX INPUT TO A VECTORIZED INPUT FOR A MATRIX PROCESSOR," listing Peter Joseph Bannon, William A. McGee, and Emil Talpes as inventors. Each of the aforementioned patent documents is incorporated herein by reference in its entirety.

The present disclosure relates to converting an MxN data matrix into a vectorized input for a high-performance matrix processor and, more specifically, to methods and systems for mapping an MxN matrix to a one-dimensional vector that is aligned to the matrix processor so that complex mathematical operations can be performed on large two-dimensional and three-dimensional matrices.

Those skilled in the art will recognize the ever-increasing demands on the speed and performance of general-purpose processors and systems that are used to implement time-sensitive and complex mathematical operations. Because these general-purpose systems are used to process large amounts of data and perform complex mathematical operations, computational resources and the rate of calculation are limited by the capabilities of the existing general-purpose hardware designs that perform those calculations. For example, general-purpose computing devices and processors that execute matrix operations may be unable to perform these operations in a timely manner under certain circumstances. Many conventional multipliers that perform digital signal processing operations rely on a series of software and hardware matrix manipulation steps and may become a bottleneck within a time-sensitive system. Often these steps involve arithmetic functions of a processor that generate intermediate results, at the cost of computing time spent on the added steps of storing and fetching intermediate results from various locations to complete an operation. In one example, an input matrix often needs to be converted to a different format or architecture before it can be input into a processing element. This conversion process can cause significant delay within the system because the inputs need to be formatted into a particular structure that is subsequently processed by a computational processor or other arithmetic logic.

FIG. 1 shows an example of a conventional matrix multiplication system. System 100 is a scalar machine that comprises computation unit 102, registers 104, cache 106, and memory 108. In operation, computation unit 102 uses registers 104 and cache 106 to process data stored in memory 108. The processed data is, for example, image data and the weights used in convolution operations to process an image. Typically, computation unit 102 is a microprocessor, such as a CPU or GPU, capable of performing a matrix multiplication on an input matrix to obtain a resulting matrix, e.g., by converting multiplications into additions and outputting the result into some internal register.

For example, a dot product that represents an output pixel of an image is typically generated by dot-multiplying individual matrix elements from two matrices to obtain partial results, which are then added to obtain the final dot product. A multiplication of individual matrix elements (i.e., a scalar multiplication) is typically performed on individual data elements by breaking up the dot multiplication into a series of individual sub-operations. As a result, partial products have to be stored and fetched from one or more of registers 104, cache 106, and memory 108 to complete a single arithmetic operation.

Computationally demanding applications, such as convolution, oftentimes require that a software function be embedded in computation unit 102 and used to convert the convolution operation into an alternate matrix-multiply operation. This is accomplished by rearranging and reformatting the image data and the weight data into two matrices that can then be raw matrix-multiplied. However, no mechanism exists to efficiently share or reuse data in scalar machine 100, so the data needed to execute each scalar operation has to be re-fetched from the registers every time. The complexity of these operations grows significantly as the amount of image data subject to convolution operations increases.
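By way of illustration only, the rearrangement described above can be sketched in a few lines of Python. The function names `im2col` and `conv_as_matmul` are hypothetical and do not appear in the disclosure; the sketch assumes a single-channel image, a square filter, no padding, and the cross-correlation convention (no filter flip) that is common in neural networks.

```python
def im2col(image, k, stride=1):
    """Unroll every k x k window of a 2-D image into one row of a matrix."""
    height, width = len(image), len(image[0])
    rows = []
    for r in range(0, height - k + 1, stride):
        for c in range(0, width - k + 1, stride):
            rows.append([image[r + i][c + j] for i in range(k) for j in range(k)])
    return rows


def conv_as_matmul(image, kernel, stride=1):
    """Compute the convolution as (unrolled windows) x (flattened kernel)."""
    k = len(kernel)
    flat_kernel = [value for row in kernel for value in row]
    windows = im2col(image, k, stride)
    # Each output pixel is the dot product of one window row with the kernel.
    return [sum(a * b for a, b in zip(window, flat_kernel)) for window in windows]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv_as_matmul(image, kernel))  # [6, 8, 12, 14]
```

Note that neighbouring rows of the unrolled matrix repeat most of their elements, which is the data duplication that a scalar machine has to fetch again for every operation.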

The inability to reuse much of the data in scalar machine 100, coupled with the added steps of storing and fetching intermediate results from registers 104, cache 106, and memory 108 to complete an arithmetic operation, are only some of the shortcomings of existing systems such as multiplier system 100.

Accordingly, what is needed are efficient conversion methods and devices that convert a matrix into an input structure appropriate for high-performance computational processing systems and methods that can perform matrix multiplication operations.

Reference will be made to embodiments of the invention, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the invention is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the invention to these particular embodiments.

FIG. 1 shows an example of a conventional matrix multiplication system.
FIG. 2 is a flowchart of an illustrative process for mapping convolutions to a matrix multiplication circuit, according to various embodiments of the present disclosure.
FIG. 3 illustrates an exemplary matrix multiplication system that utilizes a data formatter to map convolutions to a multiplication circuit, according to various embodiments of the present disclosure.
FIG. 4 is an exemplary diagram that illustrates a process for mapping convolutions to a matrix multiplication circuit, according to various embodiments of the present disclosure.

In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these details. Furthermore, one skilled in the art will recognize that the embodiments of the present invention described below may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.

Components, or modules, shown in the diagrams are illustrative of exemplary embodiments of the invention and are meant to avoid obscuring the invention. It shall also be understood that, throughout this description, components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including being integrated within a single system or component. It should be noted that the functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.

Furthermore, connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data passing between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms "coupled," "connected," or "communicatively coupled" shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.

Reference in this specification to "one embodiment," "preferred embodiment," "an embodiment," or "embodiments" means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the invention and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in this specification are not necessarily all referring to the same embodiment or embodiments.

The use of certain terms in various places in this specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated. Furthermore, terms such as memory, database, information base, data store, table, and hardware may be used herein to refer to a system component or components into which information may be entered or otherwise recorded.

Furthermore, it shall be noted that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be performed concurrently.

Furthermore, one skilled in the art will recognize that: (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be performed concurrently.

Although embodiments herein are discussed mainly in the context of convolutions, one skilled in the art will appreciate that a deconvolution can also be structured as a matrix-matrix type multiply operation and, thus, the principles of the present invention are equally applicable to deconvolutions. Furthermore, other types of mathematical operations may be implemented in accordance with various embodiments of this disclosure.

Although this description discusses converting at least one input matrix into a vectorized input for a convolution matrix multiplication circuit, one skilled in the art will recognize that the described methods and structures may be used to convert a wide range of input data types from one format to another. All of these different input data structures fall within the scope of this description.

FIG. 2 is a flowchart of an illustrative process for mapping convolutions to a matrix multiplication circuit, according to various embodiments of the present disclosure. Process 200 begins at step 202, when convolution instructions that comprise convolution parameters are received, e.g., at a data formatting device. The convolution instructions or parameters may be, for example, a filter dimension, a stride, an address of an input, or a number of output channels.

At step 204, based on the convolution parameters, the inputs of a convolution operation are identified. In embodiments, the inputs are stored in one or more storage locations at different addresses that may, in embodiments, represent a three-dimensional data matrix. In embodiments, the matrix has a first dimension that corresponds to a number of pixels that are calculated in parallel for a given output channel, a second dimension that corresponds to a number of output channels that operate in parallel, and a third dimension that corresponds to the different colors of each pixel.

At step 206, overlapping (i.e., redundant) input data is identified, e.g., by a state machine. In embodiments, the state machine uses filter parameters, such as the filter size and the stride, to determine which operands are reusable.
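As a minimal sketch of how the filter size and the stride determine which operands can be reused, the following Python fragment intersects the input coordinates covered by two horizontally adjacent filter positions. The helper names are hypothetical, and set intersection is used here only for illustration in place of the state machine that the disclosure mentions.

```python
def window_coords(row, col, k):
    """Input coordinates covered by a k x k filter whose top-left corner is (row, col)."""
    return {(row + i, col + j) for i in range(k) for j in range(k)}


def reusable_operands(k, stride):
    """Coordinates shared by two horizontally adjacent filter positions."""
    return window_coords(0, 0, k) & window_coords(0, stride, k)


# A 3x3 filter with stride 1 shares two of its three columns with its neighbour.
print(len(reusable_operands(k=3, stride=1)))  # 6
# With a stride equal to the filter size, nothing overlaps.
print(len(reusable_operands(k=3, stride=3)))  # 0
```

For a k x k filter, the overlap between horizontal neighbours is k * max(k - stride, 0) operands, so a smaller stride leaves more data available for reuse.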

At step 208, in order to reduce computation time and power consumption, local copies of the overlapping input data may be retrieved from a cache or buffer, which avoids having to fetch the same data again, e.g., from SRAM.

At step 210, for each convolution operation, a sequencer may be used to arrange the inputs according to positions in a matrix processor, e.g., based on a predetermined input format of the matrix processor, so that the matrix processor can perform high-performance arithmetic operations (e.g., large matrix multiplications). In embodiments, the sequencer arranges the inputs for each cycle of the convolution operation. In certain embodiments, the sequencer maps data from a two-dimensional matrix to a vector that is used to input the data into the matrix processor. For example, the sequencer may perform an element-by-element (e.g., bit-by-bit or byte-by-byte) mapping that effectively moves each data item or operand, based on its position in the two-dimensional matrix, to a corresponding position in the vector. Once the inputs are arranged, in embodiments, an aligner may align the arranged data and feed it to the matrix processor according to a predefined sequence.
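The element-by-element mapping can be sketched as an explicit table from matrix positions to vector positions. The row-major order used below is an assumption made for illustration; the disclosure only states that each operand is moved to a corresponding position in the vector, and the function names are hypothetical.

```python
def build_position_map(rows, cols):
    """Map each (row, col) position to a vector index (row-major order assumed)."""
    return {(r, c): r * cols + c for r in range(rows) for c in range(cols)}


def matrix_to_vector(matrix):
    """Move every element to the vector slot assigned by the position map."""
    rows, cols = len(matrix), len(matrix[0])
    position = build_position_map(rows, cols)
    vector = [None] * (rows * cols)
    for (r, c), index in position.items():
        vector[index] = matrix[r][c]
    return vector


print(matrix_to_vector([[1, 2, 3],
                        [4, 5, 6]]))  # [1, 2, 3, 4, 5, 6]
```

Keeping the position map separate from the data movement mirrors the idea that the target layout is fixed by the processor's input format, while the data changes from cycle to cycle.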

At step 212, based on the input format, hardware elements of the matrix processor may be dynamically configured to perform a matrix multiplication. In embodiments, the input of the matrix processor accommodates a first vector that comprises operands formatted according to a data input matrix and a second vector that comprises operands formatted according to a weight input matrix.

Finally, at step 214, the matrix multiplication result may be used, e.g., to convolve an input image with a filter to produce a convolution result. In embodiments, the convolution result may be further processed to enhance the image output, e.g., by using a non-linear function, a normalization operation, and/or a pooling operation.
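The disclosure does not say which non-linear function or pooling operation is used. As one possible illustration, the sketch below applies a rectified linear unit (ReLU) followed by 2x2 max pooling to a small convolution result; both choices, and the function names, are assumptions.

```python
def relu(feature_map):
    """Non-linear function: clamp negative values to zero."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Pooling operation: keep the maximum of each non-overlapping 2x2 block."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


conv_result = [[-1, 2, 0, -3],
               [4, -5, 1, 1],
               [0, 0, -2, -2],
               [3, -1, -4, -6]]
print(max_pool_2x2(relu(conv_result)))  # [[4, 1], [3, 0]]
```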

It is noted that although the processes and systems disclosed herein are described in the context of convolution methods and systems, this is not intended to limit the scope of the present disclosure, since the teachings of the present disclosure can be equally applied to deconvolution methods and systems.

FIG. 3 illustrates an exemplary matrix multiplication system that utilizes a data formatter to map convolutions to a multiplication circuit, according to various embodiments of the present disclosure. System 300 comprises memory device 302, data formatter 306, control logic 308, cache (or buffer) 312, logic circuit 314, matrix processor 316, and sequencer 310. It is understood that one or more components in FIG. 3, e.g., data formatter 306 and sequencer 310, may be implemented as an integrated component that performs at least some of the functions associated with each of the individual components.

Data formatter 306 may be implemented as an in-line formatter and, in embodiments, is used to convert a data input matrix into a vector format that can be efficiently processed by matrix processor 316.

Memory device 302 in FIG. 3 is any memory device known in the art and may hold any type of data, such as audio data, sensor data, and ultrasonic data. In embodiments, memory 302 comprises image data 304 that is stored in one or more locations and at different addresses. In embodiments, the addresses may represent a multi-dimensional data matrix.

In embodiments, control logic 308 may manage sequencer 310 and other components within matrix multiplication system 300. Control logic 308 may be embedded into data formatter 306 or sequencer 310 and, in embodiments, may select data from memory 302 locations according to filter parameters (e.g., stride) such that data formatter 306 can format the data to match the expected locations in matrix multiplier 316. In embodiments, cache/buffer 312 is a local buffer that may be used to store a local copy of data for reuse by a convolution, which avoids having to re-access and re-read the data from memory 302.

Logic circuit 314 may comprise circuitry that represents any number of input operands and data registers. In embodiments, logic circuit 314 may comprise circuitry that inputs M weight operands and N image data operands into matrix processor 316. The weight data and image data 304 may be stored in various types of memory 302 (e.g., SRAM).

Matrix processor 316 may comprise any number of circuits and sub-circuits (e.g., arithmetic logic units, registers, encoders, etc.) that can perform simultaneous multiplication, accumulation, and shift operations. Each sub-circuit may be a cell capable of performing arithmetic operations. In embodiments, matrix processor 316 comprises multiply-and-add circuitry that, for example, receives and multiplies data matrices and pre-formatted weight matrices to generate a multiplication product for a convolution operation.

A convolution operation may apply a set of individual filters (i.e., a set of weights that have been pre-formatted and stored in memory 302) to the image data for a number of input channels that may contain different sets of information. This makes it possible to detect a wide variety of features in an input image.

In addition, by analyzing a sequence of different features in a different order, macro features may be identified in the image. In embodiments, the convolution involves the multiplication of a rectangular input matrix with a rectangular weight matrix to obtain partial dot products that are summed (e.g., by an adder) to generate an accumulated dot product (i.e., an integer) that represents an output pixel in an output image.

In embodiments, the arithmetic operations are parallelized by utilizing a plurality of rows and columns of matrix processor 316 to generate an NxN tile output. For example, a given row size of 96 and a corresponding column size of 96 facilitate an output of 9216 pixels. Matrix processor 316 may accommodate any vector length on each axis and, in embodiments, is dimensioned as an MxN matrix processor that comprises PxQ tiles arranged in a matrix format.
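The 9216 figure is simply 96 x 96, one accumulator per row and column pair. The sketch below illustrates the idea with a small tile: on each cycle a weight vector drives the rows and a data vector drives the columns, and every cell accumulates one product. Feeding one vector pair per cycle in this way is an assumption for illustration, and `tile_multiply_accumulate` is a hypothetical name.

```python
def tile_multiply_accumulate(data_vectors, weight_vectors):
    """Accumulate one outer product (weights x data) per cycle into a tile."""
    n_rows = len(weight_vectors[0])
    n_cols = len(data_vectors[0])
    acc = [[0] * n_cols for _ in range(n_rows)]
    for data, weights in zip(data_vectors, weight_vectors):
        for r, w in enumerate(weights):
            for c, d in enumerate(data):
                acc[r][c] += w * d  # each cell performs one multiply-accumulate
    return acc


# Two cycles on a 2x2 tile.
data_vectors = [[1, 2], [3, 4]]
weight_vectors = [[1, 1], [2, 0]]
print(tile_multiply_accumulate(data_vectors, weight_vectors))  # [[7, 10], [1, 2]]

# A 96-row by 96-column array holds 96 * 96 accumulators.
print(96 * 96)  # 9216
```

In hardware the two inner loops would run in parallel across all cells, so each cycle updates the whole tile at once.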

In embodiments, sequencer 310 receives, e.g., from control logic 308, convolution instructions that comprise any number of convolution parameters. In embodiments, the convolution instructions may comprise filter parameters (e.g., a filter size), a number of output channels, and the like. In embodiments, sequencer 310 uses the convolution parameters to identify the inputs, or the addresses of the inputs, of a convolution operation and receives or fetches those inputs from the corresponding address locations in memory 302.

In embodiments, data formatter 306 may convert two-dimensional or three-dimensional data that comprises image information into a single vector or string that may be represented by a row or a column, thereby linearizing or vectorizing the image data. In embodiments, formatter 306 prepares image data 304 for processing by matrix processor 316 by mapping image data 304, according to the convolution parameters, into a suitable format that meets the hardware requirements of matrix processor 316, such that matrix processor 316 can perform a matrix multiplication as part of a convolution calculation, e.g., to generate output pixels.
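For three-dimensional data, linearization also has to interleave the color channels. The following sketch flattens one filter window of an image stored as `image[row][col][channel]` into a single vector. The storage layout, the ordering (row, then column, then channel), and the name `flatten_window` are assumptions made for illustration.

```python
def flatten_window(image, row, col, k):
    """Linearize a k x k window of a (height x width x channels) image."""
    return [
        channel_value
        for i in range(k)
        for j in range(k)
        for channel_value in image[row + i][col + j]
    ]


# A 2x2 image with two channels per pixel.
image = [[[1, 10], [2, 20]],
         [[3, 30], [4, 40]]]
print(flatten_window(image, 0, 0, 2))  # [1, 10, 2, 20, 3, 30, 4, 40]
```

The resulting vector has k * k * channels elements, which is the quantity that would need to match the number of data inputs on the matrix processor.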

In embodiments, the size of the vectorized image data is directly related to the number of inputs into matrix processor 316, such that the operands are aligned with the inputs into matrix processor 316.

In embodiments, data formatter 306 identifies, e.g., via a state machine, inputs that are overlapping (i.e., identical or redundant) and that may exist in one or more locations that would have to be accessed twice or more for a given convolution operation. The state machine may be configured to use filter parameters, such as the filter size and the stride, to identify the overlapping data as reusable data, such that matrix processor 316 can reuse operands without having to re-access and transfer the data from memory 302. Instead, in embodiments, the reusable data may be loaded from local copies stored, e.g., in cache 312, thereby reducing computational effort, time, and power consumption.
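The saving from local copies can be estimated with a simple counting model. The sketch below slides a one-dimensional filter of width `k` along a row of `width` elements and counts how many elements are fetched from main memory with and without a local cache. The unbounded cache and the one-dimensional layout are simplifying assumptions, and `count_memory_fetches` is a hypothetical helper.

```python
def count_memory_fetches(width, k, stride, use_cache):
    """Count main-memory fetches while a k-wide filter slides along one row."""
    cache = set()  # addresses whose local copy is already available
    fetches = 0
    for start in range(0, width - k + 1, stride):
        for addr in range(start, start + k):
            if use_cache and addr in cache:
                continue  # reuse the local copy instead of re-reading memory
            fetches += 1
            cache.add(addr)
    return fetches


print(count_memory_fetches(width=8, k=3, stride=1, use_cache=False))  # 18
print(count_memory_fetches(width=8, k=3, stride=1, use_cache=True))   # 8
```

With the cache, each input element is read from memory once, so the fetch count drops from (number of windows) x k to the number of distinct elements touched.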

In embodiments, e.g., for each convolution operation, data sequencer 310 may arrange the retrieved inputs according to the positions expected by matrix processor 316 in each cycle of the convolution operation, e.g., to match a given input format of matrix processor 316 when carrying out a dot multiplication. Sequencer 310 may generate addresses for reading data, write the results, and keep track of the state of system 300 while a convolution operation is performed. In embodiments, sequencer 310 uses some or all of this information to determine from which addresses within memory 302 to obtain data and how to process it so that it can be properly used by matrix processor 316, e.g., in a subsequent convolution step. In embodiments, sequencer 310 is coupled to data formatter 306, which may be used to align the retrieved and synchronized image data with matrix processor 316 in a predefined order according to the given input format.

In embodiments, based on the input format as determined by various input parameters or configurations, hardware elements of matrix processor 316 may be dynamically configured to perform any type of arithmetic operation (e.g., Multiply Accumulate Multiply Add). In embodiments, the input to matrix processor 316 comprises a vector that is formatted according to a data input matrix. In embodiments, the input comprises a second vector that is formatted according to a weight input matrix, e.g., such that a matrix multiplication can be used to convolve image data 304 for a neural network.

In embodiments, formatting is performed dynamically to accommodate the processing of matrices having different input sizes. In embodiments, the reformatted matrices comprising input channels are fed into cache/buffer 312. In embodiments, cache/buffer 312 fetches data from memory 302 and stores a local copy of data that may be reused by a convolution without the need to re-access memory 302 to read the data.

Unlike common software implementations of formatting functions, which are performed by a CPU or GPU to convert a convolution operation into a matrix-multiply operation by rearranging data into an alternate format suitable for fast matrix multiplication, various hardware implementations of the present disclosure re-format data on the fly and make it available for execution, e.g., 96 pieces of data every cycle. In effect, this allows a relatively large number of elements of a matrix to be processed in parallel, thus efficiently mapping a convolution to a matrix operation. In embodiments, for 2N fetched input data, N² compute data may be obtained.
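One way to read the 2N-to-N² relationship is sketched below, under the assumption (not stated explicitly in the text) that the 2N fetched inputs consist of N data operands and N weight operands, each pair of which meets in one multiply-accumulate cell. The names `outer_product_step`, `data`, and `weights` are illustrative.

```python
from typing import List


def outer_product_step(data: List[int], weights: List[int],
                       acc: List[List[int]]) -> List[List[int]]:
    """Accumulate one product for every (weight, data) pair: N x N products from 2N inputs."""
    for i, w in enumerate(weights):
        for j, d in enumerate(data):
            acc[i][j] += w * d
    return acc


N = 4
data = [1, 2, 3, 4]            # N data operands
weights = [10, 20, 30, 40]     # N weight operands
acc = [[0] * N for _ in range(N)]
outer_product_step(data, weights, acc)

fetched = len(data) + len(weights)          # 2N inputs
products = sum(len(row) for row in acc)     # N squared accumulator cells updated
print(fetched, products)
```

With N = 96 the same counting gives 192 fetched operands and 96 x 96 = 9216 products per step.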

It is understood that system 300 may comprise additional components that fall within the scope of the present disclosure, such as post-processing elements, e.g., a hardware-accelerated pooling unit, a DRAM that may be part of a DMA that retrieves data from memory 302 and stores data (e.g., weights and results) in SRAM, and other post-processing elements known in the art.

FIG. 4 is an exemplary diagram illustrating a process for mapping convolutions to a matrix multiplication circuit, according to various embodiments of the present disclosure. In diagram 400, each matrix 402A, 402B is depicted with a highlighted section that comprises a sub-matrix 406, 408, e.g., sub-matrix 406 for an input channel associated with a certain color. Matrix 418 represents one or more input matrices, such as weight data matrices. Matrix processor 316 is preferably a processor of the kind discussed with reference to FIG. 3.

In embodiments, data matrix 402A, 402B forms part of a three-dimensional data matrix that comprises any number of rows, columns, and channels that may hold input data, such as image data for three different colors. Similarly, matrix 418 is a rectangular input weight data matrix that may comprise any number of rows and columns to hold weight data for three different colors. One of skill in the art will recognize that the sizes of the input matrices and the number of input channels may vary for different applications.

FIG. 4 further depicts arrays 410, 412 that, in embodiments, may be obtained by linearizing each data sub-matrix 406, 408 to generate a linearized version thereof. Conversely, weight data matrix 418 may be vectorized to obtain array 414. As an example, in embodiments, a 3x3 sub-matrix 406 for a first input channel associated with a first color, a 3x3 sub-matrix 406 for a second input channel associated with a second color, and a third 3x3 sub-matrix 406 for a third input channel associated with a third color are assembled to form array 410. Assuming a stride of 2, another 3x3 sub-matrix 408 is located within data matrix 402B to form array 412, and so on, until all input columns of matrix processor 316 can be filled with input data. Similarly, input weight data may be reformatted to form additional output channels similar to array 414.
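The assembly of successive arrays such as 410 and 412 can be sketched in software as follows. This is a minimal illustration, not the disclosed circuit: the function `build_input_columns`, the `volume[channel][row][col]` layout, and the row-major scan order of window positions are assumptions.

```python
from typing import List

Volume = List[List[List[int]]]  # assumed layout: volume[channel][row][col]


def build_input_columns(volume: Volume, k: int, stride: int) -> List[List[int]]:
    """Return one linearized column per k x k window position, stepping by `stride`."""
    rows, cols = len(volume[0]), len(volume[0][0])
    columns: List[List[int]] = []
    for top in range(0, rows - k + 1, stride):          # scan order is an assumption
        for left in range(0, cols - k + 1, stride):
            column: List[int] = []
            for channel in volume:                       # one k x k block per input channel
                for r in range(top, top + k):
                    column.extend(channel[r][left:left + k])
            columns.append(column)
    return columns


# Three 5x5 channels; 3x3 windows with a stride of 2 give four window positions.
volume = [[[100 * c + 10 * r + col for col in range(5)] for r in range(5)]
          for c in range(3)]
columns = build_input_columns(volume, k=3, stride=2)
print(len(columns), len(columns[0]))
```

Each returned column holds 3 x 3 x 3 = 27 elements, matching the three assembled 3x3 sub-matrices described above.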

In embodiments, three 3x3 weight data matrices 418, which may represent three separate input channels, may be reformatted (e.g., by the formatter discussed with reference to FIG. 3) into a vector that comprises a total of 27 elements. From these 27 elements, a 27-element array (e.g., array 414) may be produced for use in a convolution operation performed by matrix processor 316.

In detail, data matrix 402A, 402B may be multiplied with weight data matrix 418, e.g., as part of a convolution operation, by dot-multiplying each element of row or array 414, corresponding to, e.g., the 3x3x3 elements of weight matrix 418, with each element of column or array 410, corresponding to, e.g., the 3x3x3 elements of sub-matrix 406. This generates partial dot products that may be accumulated by matrix processor architecture 316 to produce accumulated dot product 430, which may represent, e.g., an output pixel in an output image. The next dot product may be obtained by dot-multiplying row 414 with the vector corresponding to column 412, and so on, such that the values of the partial dot products correspond to the application of weight matrix 418 to input data matrix 402A, 402B, and the accumulator output represents the entire convolution. In embodiments, the matrix processor is 96 x 96 in size, which allows 9216 multiply-accumulate operations to be performed in parallel, e.g., in a single clock cycle.
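The equivalence between the accumulated dot product of two linearized 27-element arrays and the direct application of a 3x3x3 filter to a 3x3x3 sub-matrix can be checked with a small sketch. The sample values and the helper names `flatten`, `dot`, and `direct_filter_response` are assumptions for illustration only.

```python
from typing import List

Block = List[List[List[int]]]  # assumed layout: block[channel][row][col]


def flatten(block: Block) -> List[int]:
    """Linearize a 3-D block in channel, row, column order."""
    return [v for channel in block for row in channel for v in row]


def dot(a: List[int], b: List[int]) -> int:
    """Accumulate the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def direct_filter_response(patch: Block, weights: Block) -> int:
    """Reference result computed with explicit loops over channel, row, and column."""
    total = 0
    for c in range(len(patch)):
        for r in range(len(patch[c])):
            for col in range(len(patch[c][r])):
                total += patch[c][r][col] * weights[c][r][col]
    return total


patch = [[[c + r + col for col in range(3)] for r in range(3)] for c in range(3)]
weights = [[[c + 1 for _ in range(3)] for _ in range(3)] for c in range(3)]

output_pixel = dot(flatten(weights), flatten(patch))
print(output_pixel == direct_filter_response(patch, weights))
```

The two results agree because both vectors are linearized in the same element order; a 96 x 96 arrangement of such multiply-accumulate cells accounts for the 9216 parallel operations mentioned above.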

In embodiments, in the calculation of output channels, the matrix processor may produce the same output pixels using the same set of input data from matrix 402A, 402B but a different set of weights (i.e., filters) from matrix 418, such that by reading the input data once, many output channels can be generated at once. One or more input channels, e.g., one for each color (e.g., RGB), may be used. For example, each convolution may use weights 418 that represent three different matrices, one for each color.
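A hedged sketch of the read-once idea follows: each linearized input column is visited a single time and combined with every filter, producing one value per output channel. The function `all_output_channels`, the shortened 4-element vectors, and the `reads` counter are illustrative assumptions rather than the disclosed hardware.

```python
from typing import List, Tuple


def all_output_channels(filters: List[List[int]],
                        columns: List[List[int]]) -> Tuple[List[List[int]], int]:
    """Compute out[f][p] for every filter f and pixel position p, reading each column once."""
    out = [[0] * len(columns) for _ in filters]
    reads = 0
    for p, column in enumerate(columns):
        reads += 1                                   # one read of the input data per position
        for f, weight_row in enumerate(filters):     # every filter reuses the same column
            out[f][p] = sum(w * d for w, d in zip(weight_row, column))
    return out, reads


filters = [[1, 0, 0, 0],      # three filters, i.e. three output channels
           [0, 1, 0, 0],
           [1, 1, 1, 1]]
columns = [[1, 2, 3, 4],      # two linearized input columns, i.e. two pixel positions
           [5, 6, 7, 8]]
out, reads = all_output_channels(filters, columns)
print(out, reads)
```

The number of reads depends only on the number of pixel positions, not on the number of filters, which is the property the paragraph above relies on.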

Each output channel 436 may be generated using a different filter or weight 418 that represents a different feature in the input data. The number of output channels may depend on the number of features. In embodiments, the number of convolutions equals the number of output channels 436 times the number of input channels, and each convolution may have N convolutions for each input channel.

It will be appreciated that input data matrix 402A, 402B, weight data matrix 418, and arrays 410-414 may have numbers of rows and columns different from those depicted in FIG. 4. As indicated above, the numbers of input and output channels may be chosen arbitrarily. In embodiments, when weight data matrix 418 is known, array 414 may be generated and stored in a vectorized format without the use of a formatter. In embodiments, the dot-multiplications may be performed simultaneously to generate a one-shot matrix-matrix multiply operation.

Embodiments of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the term "means" in any claim is intended to cover both software and hardware implementations. Similarly, the term "computer-readable medium or media" as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.

It shall be noted that embodiments of the present invention may further relate to computer products with a non-transitory, tangible computer-readable medium that has computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices. Examples of computer code include machine code, such as that produced by a compiler, and files containing higher level code that is executed by a computer using an interpreter. Embodiments of the present invention may be implemented in whole or in part as machine-executable instructions that may be in program modules executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.

One skilled in the art will recognize that no computing system or programming language is critical to the practice of the present invention. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into sub-modules or combined together.

The elements of the claims below may be arranged differently, including having multiple dependencies, configurations, and combinations. For example, in embodiments, the subject matter of various claims may be combined with that of other claims.

It will be appreciated by those skilled in the art that the preceding examples and embodiments are exemplary and do not limit the scope of the present invention. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present invention.

Claims

A method for mapping matrix data to an input of a matrix processor, the method comprising:
receiving first matrix data to be processed by the matrix processor;
identifying a length of an input vector associated with the matrix processor;
mapping the first matrix data to the input vector using a first element-by-element sequence operation;
identifying at least one element within the first matrix data that overlaps with second matrix data to be processed by the matrix processor;
storing the at least one overlapping element in a cache; and
mapping the second matrix data to the input vector using a second element-by-element sequence operation such that the at least one overlapping element is retrieved from the cache.

The method of claim 1, wherein the first matrix data and the second matrix data are associated with one or more convolution operations.

The method of claim 1, wherein the first element-by-element sequence operation is equivalent to the second element-by-element sequence operation.

The method of claim 1, wherein the cache is a local storage location such that retrieving the at least one overlapping element from the cache requires less computation time than retrieving the data from SRAM.

The method of claim 1, further comprising identifying at least one hardware configuration of the matrix processor in connection with identifying the length of the input vector.

The method of claim 5, wherein the at least one hardware configuration of the matrix processor comprises convolution parameters.

The method of claim 6, wherein the convolution parameters comprise one or more addresses that represent a three-dimensional data matrix.

The method of claim 7, wherein the convolution parameters are at least one of a filter size, a number of weights, and a stride.

The method of claim 1, wherein mapping the first matrix data to the input vector comprises aligning the first matrix data with the input vector based at least in part on a hardware configuration of the matrix processor.

The method of claim 1, wherein elements of the input vector are identified for at least one of each position in the matrix processor and each cycle of a convolution operation.

The method of claim 1, wherein the element-by-element sequence operation is performed once for each convolution operation.

The method of claim 1, further comprising a state machine that uses at least one of a stride and a filter size to determine the overlapping elements.

The method of claim 1, further comprising using the method for mapping to convolve an input image.

The method of claim 1, wherein a first dimension of the matrix processor corresponds to a number of pixels that are computed in parallel for a given output channel.

The method of claim 1, wherein a second dimension of the matrix processor corresponds to a number of output channels that operate in parallel.

A system for mapping convolution data to a matrix multiplication circuit to increase computational speed, the system comprising:
a memory device that holds image data;
control logic; and
a data formatter coupled to the control logic and the memory device,
wherein the data formatter is configured to perform steps comprising:
in response to receiving a convolution instruction, identifying first matrix data and second matrix data associated with a convolution operation;
identifying a length of an input vector associated with a matrix processor;
mapping the first matrix data to the input vector using an element-by-element sequence operation;
identifying at least one element within the first matrix data that overlaps with the second matrix data to be processed by the matrix processor;
storing the at least one overlapping element in a cache; and
mapping the second matrix data to the input vector using the element-by-element sequence operation such that the at least one overlapping element is retrieved from the cache.

The system of claim 16, further comprising a logic circuit that generates the input vector.

The system of claim 16, wherein the matrix processor comprises a plurality of sub-circuits that perform dot-multiplications using the first matrix data and the second matrix data to generate dot products for outputting a convolution result.

The system of claim 18, wherein the convolution result is an output matrix that corresponds to an application of a filter to a region of an image.

The system of claim 16, further comprising a sequencer that performs the sequence operation.

The system of claim 16, wherein the data formatter comprises a state machine that uses at least one of a stride and a filter size to determine the overlapping elements.